1. April 18, 2019 (2 commits)
    • J
      VersionSet: optimize GetOverlappingInputsRangeBinarySearch (#4987) · 5b7e09bd
      Committed by JiYou
      Summary:
      `GetOverlappingInputsRangeBinarySearch` first uses binary search
      to find an index in the given range `[begin, end]`. But after finding
      the index, it uses linear search to find the `start_index` and
      `end_index`, so the search process degrades to linear time.
      
      This patch optimizes the search process with the following changes:
      
      - use `std::lower_bound` and `std::upper_bound` to get
        `lg(n)` search complexity.
      - use a unified lambda for the search process.
      - simplify the handling of `within_interval` true/false.
      - remove function `ExtendFileRangeWithinInterval`
        and `ExtendFileRangeOverlappingInterval`.
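      As a rough self-contained sketch of the technique (not the actual VersionSet code; `FileRange` and `OverlapRange` are hypothetical names, and keys are modeled as ints), the `std::lower_bound`/`std::upper_bound` approach looks like this:

      ```cpp
      #include <algorithm>
      #include <cstddef>
      #include <utility>
      #include <vector>

      // Hypothetical simplified model of sorted, non-overlapping files on a
      // level: each file covers the key range [smallest, largest].
      struct FileRange {
        int smallest;
        int largest;
      };

      // Return [start_index, end_index) of the files overlapping [begin, end]
      // in O(log n), with no trailing linear scan.
      std::pair<size_t, size_t> OverlapRange(const std::vector<FileRange>& files,
                                             int begin, int end) {
        // First file whose largest key reaches begin.
        auto lo = std::lower_bound(
            files.begin(), files.end(), begin,
            [](const FileRange& f, int key) { return f.largest < key; });
        // One past the last file whose smallest key is still <= end.
        auto hi = std::upper_bound(
            lo, files.end(), end,
            [](int key, const FileRange& f) { return key < f.smallest; });
        return {static_cast<size_t>(lo - files.begin()),
                static_cast<size_t>(hi - files.begin())};
      }
      ```

      For example, with files covering [0,9], [10,19], [20,29], [30,39], a query for [12, 25] selects the index range [1, 3).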
      Signed-off-by: JiYou <jiyou09@gmail.com>
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4987
      
      Differential Revision: D14984192
      
      Pulled By: riversand963
      
      fbshipit-source-id: fae4b8e59a21b7e350718d60cdc94dd55ac81e89
      5b7e09bd
    • Z
      rename variable to avoid shadowing (#5204) · 248b6b55
      Committed by Zhongyi Xie
      Summary:
      this PR fixes the following compile warning:
      ```
      db/memtable.cc: In member function ‘virtual void rocksdb::MemTableIterator::Seek(const rocksdb::Slice&)’:
      db/memtable.cc:321:22: error: declaration of ‘user_key’ shadows a member of 'this' [-Werror=shadow]
             Slice user_key(ExtractUserKey(k));
                            ^
      db/memtable.cc: In member function ‘virtual void rocksdb::MemTableIterator::SeekForPrev(const rocksdb::Slice&)’:
      db/memtable.cc:338:22: error: declaration of ‘user_key’ shadows a member of 'this' [-Werror=shadow]
             Slice user_key(ExtractUserKey(k));
                            ^
      ```
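      A minimal reproduction of the warning and the rename fix (class and member names here are illustrative, not the actual MemTableIterator code):

      ```cpp
      #include <string>

      class IteratorSketch {
        std::string user_key_;  // the member the local used to clash with

       public:
        // Before the fix, the local below was also named like the member,
        // which -Werror=shadow rejects; renaming the local resolves it.
        void Seek(const std::string& k) {
          std::string seek_user_key = k;  // renamed local, no shadowing
          user_key_ = seek_user_key;
        }
        const std::string& user_key() const { return user_key_; }
      };
      ```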
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5204
      
      Differential Revision: D14970160
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 388eb089f90c4528cc6d615dd4607fb53ceac705
      248b6b55
  2. April 17, 2019 (4 commits)
    • Z
      Avoid double-compacting data in bottom level in manual compactions (#5138) · baa53024
      Committed by Zhongyi Xie
      Summary:
      Depending on the config, manual compaction (leveled compaction style) does following compactions:
      L0->L1
      L1->L2
      ...
      Ln-1 -> Ln
      Ln -> Ln
      The final Ln -> Ln compaction is largely unnecessary, as it recompacts all the files that were just generated by the Ln-1 -> Ln step. We should avoid recompacting such files. This rule should be applied to Lmax only.
      Resolves issue https://github.com/facebook/rocksdb/issues/4995
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138
      
      Differential Revision: D14940106
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
      baa53024
    • Y
      Add back NewEmptyIterator (#5203) · d9280ff2
      Committed by Yanqin Jin
      Summary:
      #4905 removed the implementation of `NewEmptyIterator` but kept its
      declaration in the public header. This breaks some systems that depend on
      RocksDB if they use `NewEmptyIterator`. Therefore, add it back as a fix. cc maysamyabandeh please remind me if I missed anything here. Thanks
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5203
      
      Differential Revision: D14968382
      
      Pulled By: riversand963
      
      fbshipit-source-id: 5fb86e99c8cfaf9f7a9473cdb1355d7558ff6e01
      d9280ff2
    • S
      WriteBufferManager's dummy entry size to block cache 1MB -> 256KB (#5175) · beb44ec3
      Committed by Siying Dong
      Summary:
      Dummy cache size of 1MB is too large for small block sizes. Our GetDefaultCacheShardBits() uses min_shard_size = 512L * 1024L to determine the number of shards, so 1MB exceeds the size of a whole shard and makes the cache exceed its budget.
      Change it to 256KB accordingly.
      There shouldn't be an obvious performance impact, since inserting a cache entry for every 256KB of memtable inserts is still infrequent enough.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175
      
      Differential Revision: D14954289
      
      Pulled By: siying
      
      fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94
      beb44ec3
    • Y
      Avoid per-key upper bound check in BlockBasedTableIterator (#5142) · f1239d5f
      Committed by yiwu-arbug
      Summary:
      This is the second attempt for #5101. Original commit message:
      `BlockBasedTableIterator` avoids reading the next block on `Next()` if it detects that the iterator will be out of bound, by checking against the index key. The optimization was added in #2239, and at that time it only checked the bound once per block. A later change seems to have made it a per-key check, which introduces unnecessary key comparisons.
      
      This patch comes with two fixes:
      
      Fix 1: To optimize checking for bounds, we need to compare the bounds with the index key as well. However, BlockBasedTableIterator doesn't know whether its index iterator internally uses user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden in IndexBlockIter.
      
      Fix 2: In #5101 we return `IsOutOfBound()=true` when the block index key is out of bound. But the index key can be larger than the smallest key of the next file on the level. That file can be within the upper bound and should not be filtered out.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142
      
      Differential Revision: D14907113
      
      Pulled By: siying
      
      fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb
      f1239d5f
  3. April 16, 2019 (5 commits)
    • V
      Consolidating WAL creation which currently has duplicate logic in... · 71a82a0a
      Committed by Vijay Nadimpalli
      Consolidating WAL creation which currently has duplicate logic in db_impl_write.cc and db_impl_open.cc (#5188)
      
      Summary:
      Right now, two separate pieces of code are used to create WAL files: the DBImpl::Open function of db_impl_open.cc and the DBImpl::SwitchMemtable function of db_impl_write.cc. This change simply creates one function, DBImpl::CreateWAL, in db_impl_open.cc, which replaces the existing WAL creation logic in both DBImpl::Open and DBImpl::SwitchMemtable.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5188
      
      Differential Revision: D14942832
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: d49230e04c36176015c8c1b422575872f92157fb
      71a82a0a
    • Y
      Fix MultiGet ASSERT bug when passing unsorted result (#5195) · 3e63e553
      Committed by Yi Zhang
      Summary:
      Found this when test-driving the new MultiGet. If you pass an unsorted result with sorted_result = false, the ASSERT triggers incorrectly, even though we sort further down.
      
      I've also added a simple test covering the sorted_result=true/false scenarios, copied from MultiGetSimple.
      
      anand1976
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5195
      
      Differential Revision: D14935475
      
      Pulled By: yizhang82
      
      fbshipit-source-id: 1d2af5e3a003847d965066a16e3b19da68acf170
      3e63e553
    • Y
      db_bench: support seek to non-exist prefix (#5163) · b70967aa
      Committed by Yi Wu
      Summary:
      Add a `--seek_missing_prefix` flag to db_bench to allow benchmarking seeks to a non-existing prefix. Usage example:
      ```
      ./db_bench --db=/dev/shm/db_bench --use_existing_db=false --benchmarks=fillrandom --num=100000000 --prefix_size=9 --keys_per_prefix=10
      ./db_bench --db=/dev/shm/db_bench --use_existing_db=true --benchmarks=seekrandom --disable_auto_compactions=true --num=100000000 --prefix_size=9 --keys_per_prefix=10 --reads=1000 --prefix_same_as_start=true --seek_missing_prefix=true
      ```
      Also adding `--total_order_seek` and `--prefix_same_as_start` flags.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5163
      
      Differential Revision: D14935724
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7c41023f007febe373eb1589861f215432a9e18a
      b70967aa
    • F
      Update history and version to 6.1.1 (#5171) · b5cad5c9
      Committed by Fosco Marotto
      Summary:
      Including latest fixes.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5171
      
      Differential Revision: D14875157
      
      Pulled By: gfosco
      
      fbshipit-source-id: 86ec7ee3553a9b25ab71ed98966ce08a16322e2c
      b5cad5c9
    • J
      Improve transaction lock details (#5193) · 8295d364
      Committed by jsteemann
      Summary:
      This branch contains two small improvements:
      * Create `LockMap` entries using `std::make_shared`. This saves one heap allocation per LockMap entry but also locates the control block and the LockMap object closely together in memory, which can help with caching
      * Reorder the members of `TrackedTrxInfo`, so that the resulting struct uses less memory (at least on 64bit systems)
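      The effect of member ordering on struct size can be sketched as follows (illustrative structs, not the actual TrackedTrxInfo layout; the exact sizes assume a typical 64-bit ABI):

      ```cpp
      #include <cstdint>

      // Small and large members interleaved: each uint64_t must be 8-byte
      // aligned, so padding is inserted after each 1-byte member.
      struct Interleaved {
        uint8_t flag1;   // 1 byte + 7 bytes padding
        uint64_t seq;    // 8 bytes
        uint8_t flag2;   // 1 byte + 7 bytes tail padding
      };                 // typically sizeof == 24

      // Grouping members from largest to smallest removes most padding.
      struct Grouped {
        uint64_t seq;    // 8 bytes
        uint8_t flag1;   // 1 byte
        uint8_t flag2;   // 1 byte + 6 bytes tail padding
      };                 // typically sizeof == 16
      ```

      Similarly, `std::make_shared<T>(...)` performs a single allocation holding both the control block and the object, whereas `std::shared_ptr<T>(new T(...))` performs two.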
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5193
      
      Differential Revision: D14934536
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f7b49812bb4b6029eef9d131e7cd56260df5b28e
      8295d364
  4. April 13, 2019 (7 commits)
    • A
      Add bounds check in FilePickerMultiGet::PrepareNextLevel() (#5189) · 29111e92
      Committed by anand76
      Summary:
      Add bounds check when looping through empty levels in FilePickerMultiGet
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5189
      
      Differential Revision: D14925334
      
      Pulled By: anand1976
      
      fbshipit-source-id: 65d53247cf443153e28ce2b8b753fa51c6ae4566
      29111e92
    • Y
      Fix crash with memtable prefix bloom and key out of prefix extractor domain (#5190) · cca141ec
      Committed by yiwu-arbug
      Summary:
      Before using the prefix extractor, `InDomain()` should be checked. The uses in memtable.cc didn't check `InDomain()`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5190
      
      Differential Revision: D14923773
      
      Pulled By: miasantreble
      
      fbshipit-source-id: b3ad60bcca5f3a1a2b929a6eb34b0b7ba6326f04
      cca141ec
    • M
      Remove extraneous call to TrackKey (#5173) · d655a3aa
      Committed by Manuel Ung
      Summary:
      In `PessimisticTransaction::TryLock`, we were calling `TrackKey` even when assume_tracked=true, which defeats the purpose of assume_tracked. Remove this.
      
      For keys that are already tracked, TrackKey will actually bump some counters (num_reads/num_writes) which are consumed in `TransactionBaseImpl::GetTrackedKeysSinceSavePoint`, and this is used to determine which keys were tracked since the last savepoint. I believe this functionality should still work, since I think the user should not call GetForUpdate/Put(assume_tracked=true) across savepoints, and if they do, they should not expect the Put(assume_tracked=true) to show up as a tracked key in the second savepoint.
      
      This is another 2-3% cpu improvement.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5173
      
      Differential Revision: D14883809
      
      Pulled By: lth
      
      fbshipit-source-id: 7d09f0772da422384af0519773e310c22b0cbca3
      d655a3aa
    • M
      WritePrepared: fix race condition in reading batch with duplicate keys (#5147) · fe642cbe
      Committed by Maysam Yabandeh
      Summary:
      When ReadOptions doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (which requires sync over db_mutex). This creates a race condition when reading the writes of a transaction that had duplicate keys: each instance of a duplicate key is inserted with a different sequence number, and depending on the ordering, ::Get might skip the newer one and read the older one, which is obsolete.
      The patch fixes that by using the last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that max_evicted_seq has not advanced past the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid, since the seq is not backed by an actual snapshot that would let IsInSnapshot handle it properly when an overlapping commit is evicted from the commit cache.
      A unit test is added to reproduce the race condition with duplicate keys.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147
      
      Differential Revision: D14758815
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719
      fe642cbe
    • A
      Expose JavaAPI for getting the filter policy of a BlockBasedTableConfig (#5186) · 1966a7c0
      Committed by ableegoldman
      Summary:
      I would like to be able to read out the current Filter that has been set (or not) for a BlockBasedTableConfig. Added one public method to BlockBasedTableConfig:
      
      public Filter filterPolicy() {
          return filterPolicy;
      }
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5186
      
      Differential Revision: D14921415
      
      Pulled By: siying
      
      fbshipit-source-id: 2a63c8685480197862b49fc48916c757cd6daf95
      1966a7c0
    • S
      Still implement StatisticsImpl::measureTime() (#5181) · 85b2bde3
      Committed by Siying Dong
      Summary:
      Since Statistics::measureTime() is deprecated, StatisticsImpl::measureTime() was not implemented. We realized that users might have a wrapped Statistics implementation in which measureTime() is implemented by forwarding to StatisticsImpl, which then causes an assert failure. In order to make the change less intrusive, we implement StatisticsImpl::measureTime(). We will revisit whether to remove it after several releases.
      
      Also, add a test to make sure that a Statistics implementation using the old interface still works.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5181
      
      Differential Revision: D14907089
      
      Pulled By: siying
      
      fbshipit-source-id: 29b6202fd04e30ed6f6adcaeb1000e87f10d1e1a
      85b2bde3
    • Y
      Fix bugs detected by clang analyzer (#5185) · 3189398c
      Committed by Yanqin Jin
      Summary:
      As titled. A false positive is included; it is fixed anyway to make the check pass.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5185
      
      Differential Revision: D14909384
      
      Pulled By: riversand963
      
      fbshipit-source-id: dc5177e72b1929ccfd6175a60e2cd7bdb9bd80f3
      3189398c
  5. April 12, 2019 (3 commits)
    • V
      Added missing table properties in log (#5168) · f49e12b8
      Committed by vijaynadimpalli
      Summary:
      When a new SST file is created via flush or compaction, we dump out the table properties; however, only a few table properties are logged. The change here is to log all the table properties.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5168
      
      Differential Revision: D14876928
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: 1aca42ad00f9f650761d39e187f8beeb8700149b
      f49e12b8
    • A
      Introduce a new MultiGet batching implementation (#5011) · fefd4b98
      Committed by anand76
      Summary:
      This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
      
      Batching is useful when there is some spatial locality to the keys being queried, as well as larger batch sizes. The main benefits are due to:
      1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
      2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
      
      The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
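      The grouping principle behind the batching can be sketched in a self-contained form (this illustrates the idea only; `GroupByFile` is a made-up helper, and the real implementation resolves files through the version's metadata):

      ```cpp
      #include <map>
      #include <string>
      #include <utility>
      #include <vector>

      using FileId = int;

      // Bucket the batch's keys by the SST file that covers them, so the
      // per-file work (filter probe, index search) is done once per file
      // per batch instead of once per key.
      std::map<FileId, std::vector<std::string>> GroupByFile(
          const std::vector<std::pair<FileId, std::string>>& keys_with_file) {
        std::map<FileId, std::vector<std::string>> batches;
        for (const auto& kv : keys_with_file) {
          batches[kv.first].push_back(kv.second);
        }
        return batches;
      }
      ```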
      
      Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
      
      Batch sizes: 1 | 2 | 4 | 8 | 16 | 32 (micros/op; rows in each group are Get, MultiGet without batching, MultiGet with batching)
      
      Random pattern (stride length 0)
      4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074
      4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075
      4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14
      
      Good locality (stride length 16)
      4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
      4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
      4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
      
      Good locality (stride length 256)
      4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
      4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
      4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
      
      Medium locality (stride length 4096)
      4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
      4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
      4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
      
      db_bench command used (on a DB with 4 levels, 12 million keys):
      TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
      
      Differential Revision: D14348703
      
      Pulled By: anand1976
      
      fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
      fefd4b98
    • S
      Change OptimizeForPointLookup() and OptimizeForSmallDb() (#5165) · ed9f5e21
      Committed by Siying Dong
      Summary:
      Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory.
      Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165
      
      Differential Revision: D14880709
      
      Pulled By: siying
      
      fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d
      ed9f5e21
  6. April 11, 2019 (2 commits)
    • S
      Periodic Compactions (#5166) · d3d20dcd
      Committed by Sagar Vemuri
      Summary:
      Introducing Periodic Compactions.
      
      This feature allows all the files in a CF to be periodically compacted. It could help in proactively catching any corruption that could creep into the DB, as every file is constantly getting re-compacted. And also, of course, it helps to clean up data older than a certain threshold.
      
      - Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
      - This works across all levels.
      - The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
      - Compaction filters, if any, are invoked as usual.
      - A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
      
      This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on levels other than the bottom one, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time`, all files in the non-bottom levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
      
      Differential Revision: D14884441
      
      Pulled By: sagar0
      
      fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
      d3d20dcd
    • M
      Reduce copies of LockInfo (#5172) · ef0fc1b4
      Committed by Manuel Ung
      Summary:
      The LockInfo struct is not easy to copy because it contains std::vector. Reduce copies by using move constructor and `unordered_map::emplace`.
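      The copy-avoidance pattern can be sketched like this (struct and field names are illustrative, not the actual LockInfo definition):

      ```cpp
      #include <cstdint>
      #include <string>
      #include <unordered_map>
      #include <utility>
      #include <vector>

      // Stand-in for a struct that is expensive to copy because it owns a vector.
      struct LockInfoSketch {
        std::vector<std::string> txn_ids;
        uint64_t expiration_time = 0;
      };

      // emplace constructs the map entry in place, and std::move transfers
      // the vector's buffer instead of duplicating its contents.
      void InsertLock(std::unordered_map<std::string, LockInfoSketch>& lock_map,
                      const std::string& key, LockInfoSketch&& info) {
        lock_map.emplace(key, std::move(info));
      }
      ```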
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5172
      
      Differential Revision: D14882053
      
      Pulled By: lth
      
      fbshipit-source-id: 93999ec6ab1a5841fb5115abb764b6c1831a6de1
      ef0fc1b4
  7. April 9, 2019 (3 commits)
    • J
      fix reading encrypted files beyond file boundaries (#5160) · 313e8772
      Committed by jsteemann
      Summary:
      This fix should help reading from encrypted files if the file-to-be-read
      is smaller than expected. For example, when using the encrypted env and
      making it read a journal file of exactly 0 bytes size, the encrypted env
      code crashes with SIGSEGV in its Decrypt function, as there is no check
      if the read attempts to read over the file's boundaries (as specified
      originally by the `dataSize` parameter).
      
      The most important problem this patch addresses, however, is that there is
      no size underflow check in `CTREncryptionProvider::CreateCipherStream`:
      
      The stream to be read will always be initialized to a size of
      `prefix.size() - (2 * blockSize)`. If the prefix is smaller than
      twice the block size, this subtraction underflows, assumes a _very_
      large stream, and reads over the bounds. The patch adds a check here as follows:
      
          // If the prefix is smaller than twice the block size, we would below read a
          // very large chunk of the file (and very likely read over the bounds)
          assert(prefix.size() >= 2 * blockSize);
          if (prefix.size() < 2 * blockSize) {
            return Status::Corruption("Unable to read from file " + fname + ": read attempt would read beyond file bounds");
          }
      
      so embedders can catch the error in their release builds.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5160
      
      Differential Revision: D14834633
      
      Pulled By: sagar0
      
      fbshipit-source-id: 47aa39a6db8977252cede054c7eb9a663b9a3484
      313e8772
    • S
      Consolidate hash function used for non-persistent data in a new function (#5155) · 0bb55563
      Committed by Siying Dong
      Summary:
      Create new functions NPHash64() and GetSliceNPHash64(), which are currently
      implemented using murmurhash.
      Replace the current direct calls to murmurhash() with the new functions
      wherever the hash results are not used in an on-disk format.
      This will make it easier to try out or switch to alternative functions
      in the uses where data format compatibility doesn't need to be considered.
      This part shouldn't have any performance impact.
      
      Also, the sharded cache hash function is changed to the new one, because
      it falls into this category. It doesn't show a visible performance impact
      in db_bench results. CPU shown by perf increased from about 0.2% to 0.4%
      in an extreme benchmark setting (4KB blocks, no compression, everything
      cached in block cache). We've known that the currently used hash function,
      our own Hash(), has serious hash-quality problems. It can generate a lot of
      conflicts with similar inputs. In this use case, that means extra lock contention
      for reads from the same file. This slight CPU regression is worth it to me
      to counter the potentially bad performance with hot keys. And hopefully this
      will get further improved in the future with a better hash function.
      
      cache_test's condition is relaxed a little bit too. The new hash is slightly
      more skewed in this use case, but I manually checked the data and the
      hash results are still in a reasonable range.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5155
      
      Differential Revision: D14834821
      
      Pulled By: siying
      
      fbshipit-source-id: ec9a2c0a2f8ae4b54d08b13a5c2e9cc97aa80cb5
      0bb55563
    • Y
      Refactor ExternalSSTFileTest (#5129) · de00f281
      Committed by Yanqin Jin
      Summary:
      remove an unnecessary function `GenerateAndAddFileIngestBehind`
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5129
      
      Differential Revision: D14686710
      
      Pulled By: riversand963
      
      fbshipit-source-id: 5698ae63e10f8ef76c2da753bbb07a36024ac065
      de00f281
  8. April 6, 2019 (3 commits)
    • S
      Expose DB methods to lock and unlock the WAL (#5146) · 39c6c5fc
      Committed by Sergei Glushchenko
      Summary:
      Expose DB methods to lock and unlock the WAL.
      
      These methods are intended for use by MyRocks in order to obtain WAL
      coordinates in a consistent way.
      
      The usage scenario is as follows:
      
      MySQL has performance_schema.log_status, which provides information that
      enables a backup tool to copy the required log files without locking for
      the duration of the copy. To populate this table, MySQL does the following:
      
      1. Lock the binary log. Transactions are not allowed to commit now
      2. Save the binary log coordinates
      3. Walk through the storage engines and lock writes on each engine. For
         InnoDB, redo log is locked. For MyRocks, WAL should be locked.
      4. Ask storage engines for their coordinates. InnoDB reports its current
         LSN and checkpoint LSN. MyRocks should report active WAL files names
         and sizes.
      5. Release storage engine's locks
      6. Unlock binary log
      
      Backup tool will then use this information to copy InnoDB, RocksDB and
      MySQL binary logs up to specified positions to end up with consistent DB
      state after restore.
      
      Currently, RocksDB allows obtaining the list of WAL files. The only
      missing bit is a method to lock writes to the WAL files.
      
      The LockWAL method must flush the WAL in order for the reported size to be
      accurate (GetSortedWALFiles uses a file system stat call to return the
      file size); also, since the backup tool is going to copy the WAL, it is
      better for it to be flushed.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5146
      
      Differential Revision: D14815447
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: eec9535a6025229ed471119f19fe7b3d8ae888a3
      39c6c5fc
    • S
      Add final annotations to some cache functions (#5156) · 479c5667
      Committed by Siying Dong
      Summary:
      Cache functions heavily use virtual functions.
      Add some "final" annotations to give compilers more information
      to optimize with. The compiler doesn't seem to take advantage of it,
      though, but it doesn't hurt.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5156
      
      Differential Revision: D14814837
      
      Pulled By: siying
      
      fbshipit-source-id: 4423f58eafc93f7dd3c5f04b02b5c993dba2ea94
      479c5667
    • H
      Removed const fields in copyable classes (#5095) · 8d1e5216
      Committed by Harry Wong
      Summary:
      This fixes the following compile error in Clang 8:
      ```
      error: explicitly defaulted copy assignment operator is implicitly deleted [-Werror,-Wdefaulted-function-deleted]
      ```
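      A minimal reproduction of the diagnostic and the fix (illustrative types, not the actual RocksDB classes):

      ```cpp
      // A const data member implicitly deletes the copy assignment operator,
      // so explicitly defaulting it trips -Wdefaulted-function-deleted.
      struct BeforeFix {
        const int level;  // const member blocks assignment
        // BeforeFix& operator=(const BeforeFix&) = default;  // would not compile
      };

      // Dropping const restores the defaulted copy assignment; immutability
      // can instead be enforced through the class interface.
      struct AfterFix {
        int level;
        AfterFix& operator=(const AfterFix&) = default;
      };
      ```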
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5095
      
      Differential Revision: D14811961
      
      Pulled By: riversand963
      
      fbshipit-source-id: d935d1f85a4e8694dca10033fb5af92d8777eca0
      8d1e5216
  9. April 5, 2019 (4 commits)
    • L
      Evict the uncompression dictionary from the block cache upon table close (#5150) · 59ef2ba5
      Committed by Levi Tamasi
      Summary:
      The uncompression dictionary object has a Statistics pointer that might
      dangle if the database is closed. This patch evicts the dictionary from the
      block cache when a table is closed, similarly to how index and filter
      readers are handled.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5150
      
      Differential Revision: D14782422
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 0cec9336c742c479aa92206e04521767f1aa9622
      59ef2ba5
    • M
      Add missing methods to EnvWrapper, and more wrappers in Env.h (#5131) · 306b9adf
      Committed by Mike Kolupaev
      Summary:
      - Some newer methods of Env weren't wrapped in EnvWrapper. Fixed.
       - Added more wrapper classes similar to WritableFileWrapper: SequentialFileWrapper, RandomAccessFileWrapper, RandomRWFileWrapper, DirectoryWrapper, LoggerWrapper.
       - Moved the code around a bit, removed some unused friendships, added some comments.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5131
      
      Differential Revision: D14738932
      
      Pulled By: al13n321
      
      fbshipit-source-id: 99a9b1af28f2c629e7b7501389fa920b5ce30218
      306b9adf
    • A
      Fix many bugs in log statement arguments (#5089) · c06c4c01
      Committed by Adam Simpkins
      Summary:
      Annotate all of the logging functions to inform the compiler that these
      use printf-style formatting arguments.  This allows the compiler to emit
      warnings if the format arguments are incorrect.
      
      This also fixes many problems reported now that format string checking
      is enabled.  Many of these are simply mix-ups in the argument type (e.g.,
      int vs. uint64_t), but in several cases the wrong number of arguments
      was being passed in, which can cause the code to crash.
      
      The primary motivation for this was to fix the log message in
      `DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s
      format parameter with no argument supplied.
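      The annotation technique can be sketched as follows (the macro and function names are made up; the underlying mechanism is the GCC/Clang `format` function attribute):

      ```cpp
      #include <cstdarg>
      #include <cstdio>
      #include <string>

      // On GCC/Clang, the format attribute makes the compiler check the
      // variadic arguments against the printf-style format string.
      #if defined(__GNUC__) || defined(__clang__)
      #define PRINTF_LIKE(fmt_idx, first_arg_idx) \
        __attribute__((format(printf, fmt_idx, first_arg_idx)))
      #else
      #define PRINTF_LIKE(fmt_idx, first_arg_idx)
      #endif

      std::string FormatLog(const char* fmt, ...) PRINTF_LIKE(1, 2);

      std::string FormatLog(const char* fmt, ...) {
        char buf[256];
        va_list ap;
        va_start(ap, fmt);
        vsnprintf(buf, sizeof(buf), fmt, ap);
        va_end(ap);
        return std::string(buf);
      }

      // FormatLog("%s has %d entries", name, n);  // checked: OK
      // FormatLog("%s has %d entries", name);     // now a compile-time warning
      ```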
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089
      
      Differential Revision: D14574795
      
      Pulled By: simpkins
      
      fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a
      c06c4c01
    • D
      #5145 , rename port/dirent.h to port/port_dirent.h to avoid compile err when... · f0edf9d5
      Committed by datonli
      #5145 , rename port/dirent.h to port/port_dirent.h to avoid compile err when use port dir as header dir output (#5152)
      
      Summary:
      Move port/dirent.h to port/port_dirent.h to avoid a compile error when using the port dir as a header output dir.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5152
      
      Differential Revision: D14779409
      
      Pulled By: siying
      
      fbshipit-source-id: d4162c47c979c6e8cc6a9e601802864ab3768ecb
      f0edf9d5
  10. 04 Apr, 2019 3 commits
  11. 03 Apr, 2019 4 commits
    • Z
      add assert to silence clang analyzer and fix variable shadowing (#5140) · e8480d4d
      Zhongyi Xie committed
      Summary:
      This PR addresses two open issues:
      
      1. The clang analyzer is paranoid about db_ being nullptr after DB::Open calls in the test.
      See https://github.com/facebook/rocksdb/pull/5043#discussion_r271394579
      An assert is added to keep clang happy.
      2. PR https://github.com/facebook/rocksdb/pull/5049 introduced variable shadowing:
      ```
      db/db_iterator_test.cc: In constructor ‘rocksdb::DBIteratorWithReadCallbackTest_ReadCallback_Test::TestBody()::TestReadCallback::TestReadCallback(rocksdb::SequenceNumber)’:
      db/db_iterator_test.cc:2484:9: error: declaration of ‘max_visible_seq’ shadows a member of 'this' [-Werror=shadow]
               : ReadCallback(max_visible_seq) {}
               ^
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5140
      
      Differential Revision: D14735497
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 3219ea75cf4ae04f64d889323f6779e84be98144
      e8480d4d
    • M
      Mark logs with prepare in PreReleaseCallback (#5121) · 5234fc1b
      Maysam Yabandeh committed
      Summary:
      In the prepare phase of 2PC, the db promises to remember the prepared data for possible future commits. To fulfill the promise, the prepared data must be persisted in the WAL so that it can be recovered after a crash. A log that contains a prepare batch that is not yet committed is marked so that it is not garbage collected before the transaction commits/rollbacks. The bug was that the write to the log file and the marking of the file were not atomic, and WAL GC could have happened before the WAL log was actually marked. This patch moves the marking logic to PreReleaseCallback so that the WAL GC logic that joins both write threads sees the WAL write and the WAL mark atomically.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121
      
      Differential Revision: D14665210
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534
      5234fc1b
    • Z
      add compression options to table properties (#5081) · 26015f3b
      Zhongyi Xie committed
      Summary:
      Since we are planning to use dictionary compression and different compression levels, it is quite useful to add compression options to TableProperties. For example, in MyRocks, if the feature is available, we can query information_schema.rocksdb_sst_props to see whether all sst files have been converted to ZSTD dictionary compression. Resolves https://github.com/facebook/rocksdb/issues/4992
      
      With this PR, user can query table properties through `GetPropertiesOfAllTables` API and get compression options as std::string:
      `window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0;`
      or table_properties->ToString() will also contain it
      `# data blocks=1; # entries=13; # deletions=0; # merge operands=0; # range deletions=0; raw key size=143; raw average key size=11.000000; raw value size=39; raw average value size=3.000000; data block size=120; index block size (user-key? 0, delta-value? 0)=27; filter block size=0; (estimated) table size=147; filter policy name=N/A; prefix extractor name=nullptr; column family ID=0; column family name=default; comparator name=leveldb.BytewiseComparator; merge operator name=nullptr; property collectors names=[]; SST file compression algo=Snappy; SST file compression options=window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ; creation time=1552946632; time stamp of earliest key=1552946632;`
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5081
      
      Differential Revision: D14716692
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 7d2f2cf84e052bff876e71b4212cfdebf5be32dd
      26015f3b
    • M
      WriteUnPrepared: less virtual in iterator callback (#5049) · 14b3f683
      Maysam Yabandeh committed
      Summary:
      WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with the snapshot sequence number, this determines the last sequence that is visible to reads.
      The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and makes use of that to statically initialize the read callback, thus avoiding the virtual call.
      Furthermore, it increases the minimum value of min_uncommitted from 0 to 1, as seq 0 is used only for last-level keys that are committed in all snapshots.
      
      The following benchmark shows +0.26% higher throughput in seekrandom benchmark.
      
      Benchmark:
      ./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench
      
      ./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
      seekrandom [AVG    10 runs] : 20355 ops/sec;  225.2 MB/sec
      seekrandom [MEDIAN 10 runs] : 20425 ops/sec;  225.9 MB/sec
      
      ./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
      seekrandom [AVG    10 runs] : 20409 ops/sec;  225.8 MB/sec
      seekrandom [MEDIAN 10 runs] : 20487 ops/sec;  226.6 MB/sec
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049
      
      Differential Revision: D14366459
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef
      14b3f683