1. 22 Oct 2019 (3 commits)
    • LevelIterator to avoid gap after prefix bloom filters out a file (#5861) · a0cd9200
      Committed by sdong
      Summary:
      Currently, when LevelIterator::Seek() is called and a file is filtered out by the prefix bloom filter, the position is set to the beginning of the next file. This is a confusing internal interface, because many keys in the level are silently skipped. Avoid this behavior by checking the next file's key against the seek key and invalidating the whole iterator if the prefix doesn't match.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5861
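
      A minimal sketch of the new check, with simplified, hypothetical types (not the actual LevelIterator code):

      ```
      #include <string>

      // Stand-ins for illustration only.
      struct FileMeta { std::string smallest_key; };

      struct FixedPrefixExtractor {
        size_t len = 4;  // hypothetical fixed prefix length
        std::string Transform(const std::string& key) const {
          return key.substr(0, key.size() < len ? key.size() : len);
        }
      };

      // After the prefix bloom filter rejects the file covering the seek key,
      // compare the seek key's prefix with the prefix of the next file's
      // smallest key. If they differ, no key with this prefix can exist in the
      // level, so the iterator should become invalid instead of silently
      // landing at the beginning of the next file.
      bool ShouldInvalidateAfterBloomSkip(const std::string& seek_key,
                                          const FileMeta& next_file,
                                          const FixedPrefixExtractor& prefix) {
        return prefix.Transform(seek_key) !=
               prefix.Transform(next_file.smallest_key);
      }
      ```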
      
      Test Plan: Add a new unit test to validate the behavior; run all existing tests; run crash_test
      
      Differential Revision: D17918213
      
      fbshipit-source-id: f06b47d937c7cc8919001f18dcc3af5b28c9cdac
    • Fix VerifyChecksum readahead with mmap mode (#5945) · 30e2dc02
      Committed by sdong
      Summary:
      A recent change introduced readahead inside VerifyChecksum(). However, it is not compatible with mmap mode and generated spurious checksum verification failures. Fix it by not enabling readahead in mmap mode.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5945
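
      The gist of the fix is a guard along these lines (illustrative only; the option and function names here are hypothetical):

      ```
      #include <cstddef>

      struct ImmutableOptions { bool allow_mmap_reads = false; };

      // Only use a readahead buffer when the file is accessed through regular
      // reads; with mmap reads the readahead path produced spurious checksum
      // verification failures, so it is disabled there.
      size_t ChecksumReadaheadSize(const ImmutableOptions& opts,
                                   size_t configured_readahead) {
        return opts.allow_mmap_reads ? 0 : configured_readahead;
      }
      ```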
      
      Test Plan: Add a unit test that used to fail.
      
      Differential Revision: D18021443
      
      fbshipit-source-id: 6f2eb600f81b26edb02222563a4006869d576bff
    • Fix some dependency paths (#5946) · 1a21afa7
      Committed by sdong
      Summary:
      Some dependency paths are not correct, so ASAN cannot be run with CLANG. Fix them.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5946
      
      Test Plan: Run ASAN with CLANG
      
      Differential Revision: D18040933
      
      fbshipit-source-id: 1d82be9d350485cf1df1c792dad765188958641f
  2. 19 Oct 2019 (7 commits)
    • Store the filter bits reader alongside the filter block contents (#5936) · 29ccf207
      Committed by Levi Tamasi
      Summary:
      Amongst other things, PR https://github.com/facebook/rocksdb/issues/5504 refactored the filter block readers so that
      only the filter block contents are stored in the block cache (as opposed to the
      earlier design where the cache stored the filter block reader itself, leading to
      potentially dangling pointers and concurrency bugs). However, this change
      introduced a performance hit since with the new code, the metadata fields are
      re-parsed upon every access. This patch reunites the block contents with the
      filter bits reader to eliminate this overhead; since this is still a self-contained
      pure data object, it is safe to store it in the cache. (Note: this is similar to how
      the zstd digest is handled.)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5936
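
      Conceptually, the cached object now bundles the raw filter block with the reader parsed from it, roughly like the following simplified sketch (hypothetical names, not the actual class layout):

      ```
      #include <memory>
      #include <string>

      // Stand-ins for the real RocksDB types, for illustration only.
      struct BlockContents { std::string data; };

      class FilterBitsReader {
       public:
        virtual ~FilterBitsReader() = default;
        virtual bool MayMatch(const std::string& key) = 0;
      };

      // The object stored in the block cache: the filter block bytes plus the
      // reader parsed from them once, so metadata does not have to be
      // re-parsed on every lookup. It remains a self-contained data object,
      // so caching it is safe.
      struct CachedFilterBlock {
        BlockContents contents;                    // owns the raw filter bytes
        std::unique_ptr<FilterBitsReader> reader;  // parsed view over contents
      };
      ```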
      
      Test Plan:
      make asan_check
      
      filter_bench results for the old code:
      
      ```
      $ ./filter_bench -quick
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 26.7153
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 33.4258
        Single filter ns/op: 42.5974
        Random filter ns/op: 217.861
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 32.4217
        Single filter ns/op: 50.9855
        Random filter ns/op: 219.167
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      
      $ ./filter_bench -quick -use_full_block_reader
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 26.5172
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 32.3556
        Single filter ns/op: 83.2239
        Random filter ns/op: 370.676
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 32.2265
        Single filter ns/op: 93.5651
        Random filter ns/op: 408.393
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      ```
      
      With the new code:
      
      ```
      $ ./filter_bench -quick
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 25.4285
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 31.0594
        Single filter ns/op: 43.8974
        Random filter ns/op: 226.075
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 31.0295
        Single filter ns/op: 50.3824
        Random filter ns/op: 226.805
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      
      $ ./filter_bench -quick -use_full_block_reader
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 26.5308
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 33.2968
        Single filter ns/op: 58.6163
        Random filter ns/op: 291.434
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 32.1839
        Single filter ns/op: 66.9039
        Random filter ns/op: 292.828
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      ```
      
      Differential Revision: D17991712
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 7ea205550217bfaaa1d5158ebd658e5832e60f29
    • Fix TestIterate for HashSkipList in db_stress (#5942) · c53db172
      Committed by Yanqin Jin
      Summary:
      Since SeekForPrev (used by Prev) is not supported by HashSkipList when prefix is used, we disable it when stress testing HashSkipList.
      
      - Change the default memtablerep to skip list.
      - Avoid Prev() when memtablerep is HashSkipList and prefix is used (see the sketch below).
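
      A minimal sketch of the guard (hypothetical names, not the actual db_stress code):

      ```
      #include <cstddef>
      #include <string>

      // Skip backward iteration when the memtable is the prefix-hash skip list
      // and a prefix is in use, since SeekForPrev()/Prev() are not supported
      // in that configuration.
      bool CanTestPrev(const std::string& memtablerep, size_t prefix_size) {
        return !(memtablerep == "prefix_hash" && prefix_size > 0);
      }
      ```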
      
      Test Plan (on devserver):
      ```
      $make db_stress
      $./db_stress -ops_per_thread=10000 -reopen=1 -destroy_db_initially=true -column_families=1 -threads=1 -column_families=1 -memtablerep=prefix_hash
      $# or simply
      $./db_stress
      $./db_stress -memtablerep=prefix_hash
      ```
      Results must print "Verification successful".
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5942
      
      Differential Revision: D18017062
      
      Pulled By: riversand963
      
      fbshipit-source-id: af867e59aa9e6f533143c984d7d529febf232fd7
    • Refactor / clean up / optimize FullFilterBitsReader (#5941) · 5f8f2fda
      Committed by Peter Dillinger
      Summary:
      FullFilterBitsReader, after creating in BloomFilterPolicy, was
      responsible for decoding metadata bits. This meant that
      FullFilterBitsReader::MayMatch had some metadata checks in order to
      implement "always true" or "always false" functionality in the case
      of inconsistent or trivial metadata. This made for ugly
      mixing-of-concerns code and probably had some runtime cost. It also
      didn't really support plugging in alternative filter implementations
      with extensions to the existing metadata schema.
      
      BloomFilterPolicy::GetFilterBitsReader is now (exclusively) responsible
      for decoding filter metadata bits and constructing appropriate instances
      deriving from FilterBitsReader. "Always false" and "always true" derived
      classes allow FullFilterBitsReader not to be concerned with handling of
      trivial or inconsistent metadata. This also makes for easy expansion
      to alternative filter implementations in new, alternative derived
      classes. This change makes calls to FilterBitsReader::MayMatch
      *necessarily* virtual because there's now more than one built-in
      implementation. Compared with the previous implementation's extra
      'if' checks in MayMatch, there's no consistent performance difference,
      measured by (an older revision of) filter_bench (differences here seem
      to be within noise):
      
          Inside queries...
          -  Dry run (407) ns/op: 35.9996
          +  Dry run (407) ns/op: 35.2034
          -  Single filter ns/op: 47.5483
          +  Single filter ns/op: 47.4034
          -  Batched, prepared ns/op: 43.1559
          +  Batched, prepared ns/op: 42.2923
          ...
          -  Random filter ns/op: 150.697
          +  Random filter ns/op: 149.403
          ----------------------------
          Outside queries...
          -  Dry run (980) ns/op: 34.6114
          +  Dry run (980) ns/op: 34.0405
          -  Single filter ns/op: 56.8326
          +  Single filter ns/op: 55.8414
          -  Batched, prepared ns/op: 48.2346
          +  Batched, prepared ns/op: 47.5667
          -  Random filter ns/op: 155.377
          +  Random filter ns/op: 153.942
               Average FP rate %: 1.1386
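
      A rough sketch of the factory described above (simplified; the metadata decoding shown here is illustrative, not the exact on-disk schema):

      ```
      #include <memory>
      #include <string>

      class FilterBitsReader {
       public:
        virtual ~FilterBitsReader() = default;
        virtual bool MayMatch(const std::string& key) = 0;
      };

      // Trivial readers let FullFilterBitsReader assume well-formed metadata.
      class AlwaysTrueFilter : public FilterBitsReader {
       public:
        bool MayMatch(const std::string&) override { return true; }
      };

      class AlwaysFalseFilter : public FilterBitsReader {
       public:
        bool MayMatch(const std::string&) override { return false; }
      };

      class FullFilterBitsReader : public FilterBitsReader {
       public:
        explicit FullFilterBitsReader(std::string contents)
            : contents_(std::move(contents)) {}
        bool MayMatch(const std::string& /*key*/) override {
          return true;  // real bit probing elided in this sketch
        }
       private:
        std::string contents_;
      };

      // The policy decodes metadata once and picks the right derived class,
      // so MayMatch() itself never has to check for trivial/broken filters.
      std::unique_ptr<FilterBitsReader> GetFilterBitsReader(
          const std::string& contents) {
        if (contents.size() <= 5) {
          // No room for metadata: treat like a filter with zero keys added,
          // which can safely answer "definitely not present".
          return std::unique_ptr<FilterBitsReader>(new AlwaysFalseFilter);
        }
        const int num_probes =
            static_cast<signed char>(contents[contents.size() - 5]);
        if (num_probes < 1) {
          // Trivial or unrecognized metadata: never filter anything out.
          return std::unique_ptr<FilterBitsReader>(new AlwaysTrueFilter);
        }
        return std::unique_ptr<FilterBitsReader>(
            new FullFilterBitsReader(contents));
      }
      ```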
      
      Also, the FullFilterBitsReader ctor was responsible for a surprising
      amount of CPU in production, due in part to inefficient determination of
      the CACHE_LINE_SIZE used to construct the filter being read. The
      overwhelmingly common case (same as my CACHE_LINE_SIZE) is now
      substantially optimized, as shown with filter_bench with
      -new_reader_every=1 (old option - see below) (repeatable result):
      
          Inside queries...
          -  Dry run (453) ns/op: 118.799
          +  Dry run (453) ns/op: 105.869
          -  Single filter ns/op: 82.5831
          +  Single filter ns/op: 74.2509
          ...
          -  Random filter ns/op: 224.936
          +  Random filter ns/op: 194.833
          ----------------------------
          Outside queries...
          -  Dry run (aa1) ns/op: 118.503
          +  Dry run (aa1) ns/op: 104.925
          -  Single filter ns/op: 90.3023
          +  Single filter ns/op: 83.425
          ...
          -  Random filter ns/op: 220.455
          +  Random filter ns/op: 175.7
               Average FP rate %: 1.13886
      
      However, PR #5936 has reclaimed, or will reclaim, most of this cost. After that PR, the optimization of this code path is likely negligible, but nonetheless it's clear we aren't making performance any worse.
      
      Also fixed inadequate check of consistency between filter data size and
      num_lines. (Unit test updated.)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5941
      
      Test Plan:
      previously added unit tests FullBloomTest.CorruptFilters and
      FullBloomTest.RawSchema
      
      Differential Revision: D18018353
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 8e04c2b4a7d93223f49a237fd52ef2483929ed9c
    • Fix PlainTableReader not to crash sst_dump (#5940) · fe464bca
      Committed by Peter Dillinger
      Summary:
      Plain table SSTs could crash sst_dump because of a bug in
      PlainTableReader that can leave table_properties_ as null. Even if it
      was intended not to keep the table properties in some cases, they were
      leaked on the offending code path.
      
      Steps to reproduce:
      
          $ db_bench --benchmarks=fillrandom --num=2000000 --use_plain_table --prefix-size=12
          $ sst_dump --file=0000xx.sst --show_properties
          from [] to []
          Process /dev/shm/dbbench/000014.sst
          Sst file format: plain table
          Raw user collected properties
          ------------------------------
          Segmentation fault (core dumped)
      
      Also added missing unit testing of plain table full_scan_mode, and
      an assertion in NewIterator to check for regression.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5940
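
      The crash reduces to dereferencing a table-properties pointer that can be null; a defensive sketch of the dump path (hypothetical, simplified types):

      ```
      #include <iostream>
      #include <map>
      #include <memory>
      #include <string>

      struct TableProperties {
        std::map<std::string, std::string> user_collected_properties;
      };

      // Guard against a reader that failed to load (or keep) the table
      // properties instead of dereferencing a null pointer.
      void DumpUserProperties(
          const std::shared_ptr<const TableProperties>& props) {
        if (!props) {
          std::cout << "  (table properties unavailable)\n";
          return;
        }
        for (const auto& kv : props->user_collected_properties) {
          std::cout << "  " << kv.first << ": " << kv.second << "\n";
        }
      }
      ```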
      
      Test Plan: new unit test, manual, make check
      
      Differential Revision: D18018145
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 4310c755e824c4cd6f3f86a3abc20dfa417c5e07
    • Enable trace_replay with multi-threads (#5934) · 526e3b97
      Committed by Zhichao Cao
      Summary:
      In the current trace replay, all queries are serialized and executed by a single thread, which may not closely simulate the query patterns of the original application. This PR implements multi-threaded replay: users can set the number of threads used to replay the trace, and the queries generated from the trace records are scheduled in the thread pool's job queue.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5934
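
      In spirit, the replay now pushes each decoded trace record into a worker pool instead of executing it inline; a simplified, generic sketch (not the actual Replayer code):

      ```
      #include <condition_variable>
      #include <functional>
      #include <mutex>
      #include <queue>
      #include <thread>
      #include <vector>

      // A minimal job queue: worker threads pull decoded trace records
      // (wrapped as callables) and execute them, so queries from the trace
      // can run concurrently instead of strictly one after another.
      class ReplayPool {
       public:
        explicit ReplayPool(int num_threads) {
          for (int i = 0; i < num_threads; ++i) {
            workers_.emplace_back([this] { Work(); });
          }
        }
        ~ReplayPool() {
          {
            std::lock_guard<std::mutex> lock(mu_);
            done_ = true;
          }
          cv_.notify_all();
          for (auto& t : workers_) t.join();
        }
        void Schedule(std::function<void()> job) {
          {
            std::lock_guard<std::mutex> lock(mu_);
            jobs_.push(std::move(job));
          }
          cv_.notify_one();
        }

       private:
        void Work() {
          for (;;) {
            std::function<void()> job;
            {
              std::unique_lock<std::mutex> lock(mu_);
              cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
              if (jobs_.empty()) return;  // done_ set and queue drained
              job = std::move(jobs_.front());
              jobs_.pop();
            }
            job();  // e.g. replay one Get/Put/iterator operation from the trace
          }
        }

        std::mutex mu_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> jobs_;
        std::vector<std::thread> workers_;
        bool done_ = false;
      };
      ```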
      
      Test Plan: test with make check and real trace replay.
      
      Differential Revision: D17998098
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 87eecf6f7c17a9dc9d7ab29dd2af74f6f60212c8
    • Update HISTORY.md with recent BlobDB adjacent changes · 69bd8a28
      Committed by Levi Tamasi
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5939
      
      Differential Revision: D18009096
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 032a48a302f9da38aecf4055b5a8d4e1dffd9dc7
    • Expose db stress tests (#5937) · e60cc092
      Committed by Yanqin Jin
      Summary:
      Expose the db_stress test by providing db_stress_tool.h as a public header.
      This PR does the following:
      - adds a new header, db_stress_tool.h, in include/rocksdb/
      - renames db_stress.cc to db_stress_tool.cc
      - adds a db_stress.cc which simply invokes a test function (see the sketch below)
      - updates the Makefile accordingly
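
      A minimal sketch of the resulting thin db_stress.cc (the exact entry-point name and signature are an assumption here; see include/rocksdb/db_stress_tool.h for the real declaration):

      ```
      // db_stress.cc (sketch): the binary becomes a shim that forwards to the
      // library entry point exposed by the new public header.
      #include "rocksdb/db_stress_tool.h"

      int main(int argc, char** argv) {
        // Assumed entry point declared in include/rocksdb/db_stress_tool.h.
        return rocksdb::db_stress_tool(argc, argv);
      }
      ```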
      
      Test Plan (dev server):
      ```
      make db_stress
      ./db_stress
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5937
      
      Differential Revision: D17997647
      
      Pulled By: riversand963
      
      fbshipit-source-id: 1a8d9994f89ce198935566756947c518f0052410
  3. 18 Oct 2019 (1 commit)
    • Support decoding blob indexes in sst_dump (#5926) · fdc1cb43
      Committed by Levi Tamasi
      Summary:
      The patch adds a new command line parameter --decode_blob_index to sst_dump.
      If this switch is specified, sst_dump prints blob indexes in a human readable format,
      printing the blob file number, offset, size, and expiration (if applicable) for blob
      references, and the blob value (and expiration) for inlined blobs.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5926
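
      Illustrative only (the real decoding uses the internal BlobIndex class): a sketch of the kind of human-readable output described above, with hypothetical types:

      ```
      #include <cstdint>
      #include <iostream>
      #include <string>

      // Hypothetical, simplified view of one decoded blob index entry.
      struct DecodedBlobIndex {
        bool inlined = false;
        uint64_t file_number = 0;
        uint64_t offset = 0;
        uint64_t size = 0;
        bool has_expiration = false;
        uint64_t expiration = 0;
        std::string inlined_value;
      };

      // Print a blob index in a human-readable form, similar in spirit to the
      // output described above.
      void PrintBlobIndex(const DecodedBlobIndex& idx) {
        if (idx.inlined) {
          std::cout << "[inlined blob] value:" << idx.inlined_value;
        } else {
          std::cout << "[blob ref] file:" << idx.file_number
                    << " offset:" << idx.offset << " size:" << idx.size;
        }
        if (idx.has_expiration) {
          std::cout << " exp:" << idx.expiration;
        }
        std::cout << "\n";
      }
      ```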
      
      Test Plan:
      Used db_bench's BlobDB mode to generate SST files containing blob references with
      and without expiration, as well as inlined blobs with and without expiration (note: the
      latter are stored as plain values), and confirmed sst_dump correctly prints all four types
      of records.
      
      Differential Revision: D17939077
      
      Pulled By: ltamasi
      
      fbshipit-source-id: edc5f58fee94ba35f6699c6a042d5758f5b3963d
  4. 17 Oct 2019 (1 commit)
  5. 16 Oct 2019 (1 commit)
  6. 15 Oct 2019 (8 commits)
  7. 12 Oct 2019 (2 commits)
    • Fix SeekForPrev bug with Partitioned Filters and Prefix (#5907) · 4e729f90
      Committed by Maysam Yabandeh
      Summary:
      Partitioned filters make use of a top-level index to find the partition that might contain the bloom hash of the key. The index uses the internal key format (before format version 3). Each partition contains i) the blooms of the keys in that range, ii) the blooms of the prefixes of keys in that range, and iii) the bloom of the prefix of the last key in the previous partition.
      When ::SeekForPrev(key) is called, we first perform a prefix bloom test on the SST file. The partition, however, is identified using the full internal key rather than the prefix, in order to be compatible with the internal key format of the top-level index. This creates a corner case. Example:
      - SST k, Partition N: P1K1, P1K2
      - SST k, top-level index: P1K2
      - SST k+1, Partition 1: P2K1, P3K1
      - SST k+1, top-level index: P3K1
      When we call SeekForPrev(P1K3), it should point us to P1K2. However, the SST k top-level index would reject P1K3 since it is out of range.
      One possible fix would be to search with the prefix P1 (instead of the full internal key P1K3), but the details of properly comparing a prefix with a full internal key could get complicated. The fix applied in this PR is to look into the last partition anyway, even if the key is out of range (see the sketch below).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5907
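
      The essence of the fix, as a simplified sketch with a hypothetical index-iterator interface:

      ```
      #include <string>

      // Stand-in for an iterator over the top-level index of filter partitions.
      class TopLevelIndexIter {
       public:
        virtual ~TopLevelIndexIter() = default;
        virtual void Seek(const std::string& internal_key) = 0;
        virtual void SeekToLast() = 0;
        virtual bool Valid() const = 0;
        virtual std::string PartitionHandle() const = 0;
      };

      // For the SeekForPrev prefix check: if the key falls past the range
      // covered by the top-level index, do not give up; the last partition may
      // still hold the bloom of the previous partition's last prefix.
      std::string FindFilterPartitionForPrev(TopLevelIndexIter* index,
                                             const std::string& internal_key) {
        index->Seek(internal_key);
        if (!index->Valid()) {
          index->SeekToLast();  // out of range: fall back to the last partition
        }
        return index->PartitionHandle();
      }
      ```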
      
      Differential Revision: D17889918
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 169fd7b3c71dbc08808eae5a8340611ebe5bdc1e
    • Fix block cache ID uniqueness for Windows builds (#5844) · b00761ee
      Committed by Andrew Kryczka
      Summary:
      Since we do not evict a file's blocks from block cache before that file
      is deleted, we require a file's cache ID prefix is both unique and
      non-reusable. However, the Windows functionality we were relying on only
      guaranteed uniqueness. That meant a newly created file could be assigned
      the same cache ID prefix as a deleted file. If the newly created file
      had block offsets matching the deleted file, full cache keys could be
      exactly the same, resulting in obsolete data blocks returned from cache
      when trying to read from the new file.
      
      We noticed this when running on FAT32 where compaction was writing out
      of order keys due to reading obsolete blocks from its input files. The
      functionality is documented as behaving the same on NTFS, although I
      wasn't able to repro it there.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5844
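
      For context, a block's cache key is roughly the file's cache ID prefix concatenated with the block offset, so a reused prefix plus matching offsets yields colliding keys; a schematic sketch (not the exact encoding):

      ```
      #include <cstdint>
      #include <string>

      // Schematic only: the real encoding appends a varint-encoded offset to
      // the file's cache ID prefix. If the OS reuses a deleted file's ID for a
      // new file, identical block offsets produce identical cache keys, and
      // stale blocks are served from the cache.
      std::string MakeBlockCacheKey(const std::string& file_cache_id_prefix,
                                    uint64_t block_offset) {
        return file_cache_id_prefix + "#" + std::to_string(block_offset);
      }
      ```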
      
      Test Plan:
      we had a reliable repro of out-of-order keys on FAT32 that
      was fixed by this change
      
      Differential Revision: D17752442
      
      fbshipit-source-id: 95d983f9196cf415f269e19293b97341edbf7e00
  8. 11 Oct 2019 (4 commits)
    • Revert "Enable partitioned index/filter in stress tests (#5895)" (#5904) · bc8b05cb
      Committed by Yanqin Jin
      Summary:
      This reverts commit 2f4e2881.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5904
      
      Differential Revision: D17871282
      
      Pulled By: riversand963
      
      fbshipit-source-id: d210725f8f3b26d8eac25892094da09d9694337e
    • Remove a webhook due to potential security concern (#5902) · ddb62d1f
      Committed by Yanqin Jin
      Summary:
      As title.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5902
      
      Differential Revision: D17858150
      
      Pulled By: riversand963
      
      fbshipit-source-id: db2cd8a756faf7b9751b2651a22e1b29ca9fecec
    • Fix the rocksjava release Vagrant build on CentOS (#5901) · 1e9c8d42
      Committed by Adam Retter
      Summary:
      Closes https://github.com/facebook/rocksdb/issues/5873
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5901
      
      Differential Revision: D17869585
      
      fbshipit-source-id: 559472486f1d3ac80c0c7df6c421c4b612b9b7f9
    • MultiGet batching in memtable (#5818) · 4c49e38f
      Committed by Vijay Nadimpalli
      Summary:
      RocksDB has a MultiGet() API that implements batched key lookup for higher performance (https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L468). Currently, batching is implemented in BlockBasedTableReader::MultiGet() for SST file lookups. One of the ways it improves performance is by pipelining bloom filter lookups (by prefetching required cachelines for all the keys in the batch, and then doing the probe) and thus hiding the cache miss latency. The same concept can be extended to the memtable as well. This PR involves implementing a pipelined bloom filter lookup in DynamicBloom, and implementing MemTable::MultiGet() that can leverage it.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5818
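
      The pipelining idea in a stripped-down sketch (a hypothetical bloom structure, not DynamicBloom itself): hash and prefetch the cache line for every key in the batch first, then do the probes, so the cache misses overlap instead of being paid one at a time.

      ```
      #include <cstddef>
      #include <cstdint>
      #include <vector>

      // Hypothetical single-probe bloom filter; data.size() must equal
      // num_words.
      struct SimpleBloom {
        std::vector<uint64_t> data;  // bit array, one 64-bit word per probe
        size_t num_words = 0;

        size_t WordIndex(uint64_t hash) const { return hash % num_words; }

        // Phase 1: prefetch the cache line holding each key's probe word.
        void PrefetchHashes(const std::vector<uint64_t>& hashes) const {
          for (uint64_t h : hashes) {
      #if defined(__GNUC__) || defined(__clang__)
            __builtin_prefetch(&data[WordIndex(h)], 0 /* read */, 3 /* locality */);
      #endif
          }
        }

        // Phase 2: probe; by now the needed cache lines are likely resident,
        // so the per-key misses overlap rather than serialize.
        void MayContainHashes(const std::vector<uint64_t>& hashes,
                              std::vector<bool>* may_match) const {
          may_match->clear();
          for (uint64_t h : hashes) {
            const uint64_t word = data[WordIndex(h)];
            may_match->push_back(((word >> (h & 63)) & 1) != 0);
          }
        }
      };
      ```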
      
      Test Plan:
      Existing tests
      
      Performance Test:
      Ran the command below, which fills up the memtable, makes sure there are no flushes, and then calls MultiGet. Ran it on master and on the new change and saw at least a 1% performance improvement across all the test runs I did. Sometimes the improvement was up to 5%.
      
      TEST_TMPDIR=/data/users/$USER/benchmarks/feature/ numactl -C 10 ./db_bench -benchmarks="fillseq,multireadrandom" -num=600000 -compression_type="none" -level_compaction_dynamic_level_bytes -write_buffer_size=200000000 -target_file_size_base=200000000 -max_bytes_for_level_base=16777216 -reads=90000 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 -statistics -memtable_whole_key_filtering=true -memtable_bloom_size_ratio=10
      
      Differential Revision: D17578869
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: 23dc651d9bf49db11d22375bf435708875a1f192
  9. 10 Oct 2019 (1 commit)
    • Make the db_stress reopen loop in OperateDb() more robust (#5893) · 80ad996b
      Committed by anand76
      Summary:
      The loop in OperateDb() is getting quite complicated with the introduction of multiple key operations such as MultiGet and reseeks. This results in a number of corner cases that hang db_stress due to synchronization problems during reopen (i.e., when the -reopen=<> option is specified). This PR makes it more robust by ensuring all db_stress threads vote to reopen the DB the exact same number of times.
      Most of the changes in this diff are due to indentation.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5893
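
      The voting scheme, reduced to its core (illustrative; the shared-state struct and names are hypothetical):

      ```
      #include <condition_variable>
      #include <mutex>

      struct ReopenSync {
        std::mutex mu;
        std::condition_variable cv;
        int num_threads = 0;
        int votes = 0;
        int reopens_done = 0;
      };

      // Each stress thread calls this exactly once per "reopen round", no
      // matter how many Gets/MultiGets/iterator seeks it performed in that
      // round. The last voter reopens the DB; everyone else waits for it.
      void VoteAndMaybeReopen(ReopenSync* sync, void (*reopen_db)()) {
        std::unique_lock<std::mutex> lock(sync->mu);
        const int my_round = sync->reopens_done;
        if (++sync->votes == sync->num_threads) {
          sync->votes = 0;
          reopen_db();  // only one thread performs the reopen
          ++sync->reopens_done;
          sync->cv.notify_all();
        } else {
          sync->cv.wait(lock, [&] { return sync->reopens_done > my_round; });
        }
      }
      ```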
      
      Test Plan: Run crash test
      
      Differential Revision: D17823827
      
      Pulled By: anand1976
      
      fbshipit-source-id: ec893829f611ac7cac4057c0d3d99f9ffb6a6dd9
  10. 09 Oct 2019 (5 commits)
  11. 08 Oct 2019 (4 commits)
  12. 04 Oct 2019 (3 commits)
    • Fix data block upper bound checking for iterator reseek case (#5883) · 19a97dd1
      Committed by anand76
      Summary:
      When an iterator reseek happens with the user specifying a new iterate_upper_bound in ReadOptions, and the new seek position is at the end of the same data block, the Seek() ends up using a stale value of data_block_within_upper_bound_ and may return incorrect results.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5883
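
      Conceptually, the fix is to re-derive the "data block is entirely within the upper bound" flag on every (re)seek from the current ReadOptions, rather than reusing the value computed for a previous upper bound; a small sketch with hypothetical names:

      ```
      #include <string>

      // Recompute on every (re)seek: a data block is known to be entirely
      // below the iterate_upper_bound only if the block's boundary (index)
      // key, which is >= every key in the block, is itself below the current
      // upper bound. A new bound supplied via ReadOptions invalidates any
      // previously cached answer.
      bool BlockWithinUpperBound(const std::string* iterate_upper_bound,
                                 const std::string& block_boundary_user_key) {
        if (iterate_upper_bound == nullptr) {
          return true;  // no bound configured: every block qualifies
        }
        return block_boundary_user_key < *iterate_upper_bound;
      }
      ```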
      
      Test Plan: Added a new test case DBIteratorTest.IterReseekNewUpperBound. Verified that it failed due to the assertion failure without the fix, and passes with the fix.
      
      Differential Revision: D17752740
      
      Pulled By: anand1976
      
      fbshipit-source-id: f9b635ff5d6aeb0e1bef102cf8b2f900efd378e3
    • Fix type in shift operation in bloom_test (#5882) · 9f544465
      Committed by Peter Dillinger
      Summary:
      The type used in a shift operation in PR #5834 was broken. Fixing the code means fixing the expected values in the test.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5882
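
      For reference, the underlying bug class looks like this (not the exact expression from bloom_test):

      ```
      #include <cstdint>

      uint64_t BitMaskNarrow(int bit) {
        return 1 << bit;            // BUG: `1` is a 32-bit int; wrong/undefined once bit reaches 31
      }

      uint64_t BitMaskWide(int bit) {
        return uint64_t{1} << bit;  // well-defined for bit in [0, 63]
      }
      ```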
      
      Test Plan: thisisthetest
      
      Differential Revision: D17746136
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d3c456ed30b433d55fcab6fc7d836940fe3b46b8
    • Fix reopen voting logic in db_stress to prevent hangs (#5876) · cca87d77
      Committed by anand76
      Summary:
      When multiple operations are performed in a db_stress thread in one loop
      iteration, the reopen voting logic needs to take that into account. It
      was doing that for MultiGet, but a new option was introduced recently to
      do multiple iterator seeks per iteration, which broke it again. Fix the
      logic to be more robust and agnostic of the type of operation performed.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5876
      
      Test Plan: Run db_stress
      
      Differential Revision: D17733590
      
      Pulled By: anand1976
      
      fbshipit-source-id: 787f01abefa1e83bba43e0b4f4abb26699b2089e