1. 25 Oct 2019 (1 commit)
    • Add test showing range tombstones can create excessively large compactions (#5956) · 25095311
      Committed by Dan Lambright
      Summary:
      For more information on the original problem see this [link](https://github.com/facebook/rocksdb/issues/3977).
      
      This change adds two new tests. They are identical except that one uses range tombstones and the other does not. Each test generates sub-files at L2 that overlap with keys at L3. The test that uses range tombstones generates a single file at L2; this single file produces a very large range overlap with L3, which in turn creates an excessively large compaction.
      
      1: T001 - T005
      2:  000 -  005
      
      In contrast, the test that uses key ranges generates 3 files at L2. As only a single file is compacted at a time, those 3 files will generate less work per compaction iteration.
      
      1:  001 - 002
      1:  003 - 004
      1:  005
      2:  000 - 005
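      The asymmetry can be sketched with a toy model (hypothetical code, not from RocksDB: `FileRange` and `OverlapCount` are made-up names) showing how a single wide file pulls every overlapping next-level file into one compaction, while narrower files each pull in only a subset:

      ```cpp
      #include <string>
      #include <vector>

      // Hypothetical model: a file covers an inclusive key range.
      struct FileRange {
        std::string smallest, largest;
      };

      // Count how many next-level files overlap a single input file; this is
      // roughly the amount of work one compaction of that file must take on.
      int OverlapCount(const FileRange& input,
                       const std::vector<FileRange>& next_level) {
        int n = 0;
        for (const auto& f : next_level) {
          // Ranges overlap unless one ends before the other begins.
          if (!(f.largest < input.smallest || input.largest < f.smallest)) {
            ++n;
          }
        }
        return n;
      }
      ```

      One wide tombstone-covering file overlaps every next-level file at once, whereas each narrow file overlaps only its neighbors.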
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5956
      
      Differential Revision: D18071631
      
      Pulled By: dlambrig
      
      fbshipit-source-id: 12abae75fb3e0b022d228c6371698aa5e53385df
  2. 24 Oct 2019 (3 commits)
    • CfConsistencyStressTest to validate key consistent across CFs in TestGet() (#5863) · 9f1e5a0b
      Committed by sdong
      Summary:
      Right now, in the CF consistency stress test's TestGet(), keys are fetched without validation. With this change, half of the time we validate that all CFs hold the same value for the same key.
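      A minimal sketch of the cross-CF validation idea (hypothetical; the real check lives in db_stress, and `KeyConsistentAcrossCFs` is a made-up helper operating on in-memory maps rather than column family handles):

      ```cpp
      #include <map>
      #include <string>
      #include <vector>

      // Given one key-value snapshot per column family, verify every CF agrees
      // on a key: either all CFs are missing it, or all hold the same value.
      bool KeyConsistentAcrossCFs(
          const std::vector<std::map<std::string, std::string>>& cfs,
          const std::string& key) {
        if (cfs.empty()) return true;
        auto first = cfs[0].find(key);
        bool present = (first != cfs[0].end());
        for (size_t i = 1; i < cfs.size(); ++i) {
          auto it = cfs[i].find(key);
          if ((it != cfs[i].end()) != present) return false;  // presence differs
          if (present && it->second != first->second) return false;  // value differs
        }
        return true;
      }
      ```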
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5863
      
      Test Plan: Ran "make crash_test_with_atomic_flush" and saw the tests pass. Hacked the code to generate an inconsistency and observed that the test fails as expected.
      
      Differential Revision: D17934206
      
      fbshipit-source-id: 00ba1a130391f28785737b677f80f366fb83cced
    • Remove unused BloomFilterPolicy::hash_func_ (#5961) · 6a32e3b5
      Committed by Peter Dillinger
      Summary:
      This is an internal, file-local "feature" that is not used and
      potentially confusing.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5961
      
      Test Plan: make check
      
      Differential Revision: D18099018
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7870627eeed09941d12538ec55d10d2e164fc716
    • Make buckifier python3 compatible (#5922) · b4ebda7a
      Committed by Yanqin Jin
      Summary:
      Make buckifier/buckify_rocksdb.py run on both Python 3 and 2
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5922
      
      Test Plan:
      ```
      $python3 buckifier/buckify_rocksdb.py
      $python3 buckifier/buckify_rocksdb.py '{"fake": {"extra_deps": [":test_dep", "//fakes/module:mock1"], "extra_compiler_flags": ["-DROCKSDB_LITE", "-Os"]}}'
      $python2 buckifier/buckify_rocksdb.py
      $python2 buckifier/buckify_rocksdb.py '{"fake": {"extra_deps": [":test_dep", "//fakes/module:mock1"], "extra_compiler_flags": ["-DROCKSDB_LITE", "-Os"]}}'
      ```
      
      Differential Revision: D17920611
      
      Pulled By: riversand963
      
      fbshipit-source-id: cc6e2f36013a88a710d96098f6ca18cbe85e3f62
  3. 23 Oct 2019 (2 commits)
  4. 22 Oct 2019 (8 commits)
  5. 19 Oct 2019 (7 commits)
    • Store the filter bits reader alongside the filter block contents (#5936) · 29ccf207
      Committed by Levi Tamasi
      Summary:
      Amongst other things, PR https://github.com/facebook/rocksdb/issues/5504 refactored the filter block readers so that
      only the filter block contents are stored in the block cache (as opposed to the
      earlier design where the cache stored the filter block reader itself, leading to
      potentially dangling pointers and concurrency bugs). However, this change
      introduced a performance hit since with the new code, the metadata fields are
      re-parsed upon every access. This patch reunites the block contents with the
      filter bits reader to eliminate this overhead; since this is still a self-contained
      pure data object, it is safe to store it in the cache. (Note: this is similar to how
      the zstd digest is handled.)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5936
      
      Test Plan:
      make asan_check
      
      filter_bench results for the old code:
      
      ```
      $ ./filter_bench -quick
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 26.7153
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 33.4258
        Single filter ns/op: 42.5974
        Random filter ns/op: 217.861
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 32.4217
        Single filter ns/op: 50.9855
        Random filter ns/op: 219.167
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      
      $ ./filter_bench -quick -use_full_block_reader
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 26.5172
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 32.3556
        Single filter ns/op: 83.2239
        Random filter ns/op: 370.676
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 32.2265
        Single filter ns/op: 93.5651
        Random filter ns/op: 408.393
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      ```
      
      With the new code:
      
      ```
      $ ./filter_bench -quick
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 25.4285
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 31.0594
        Single filter ns/op: 43.8974
        Random filter ns/op: 226.075
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 31.0295
        Single filter ns/op: 50.3824
        Random filter ns/op: 226.805
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      
      $ ./filter_bench -quick -use_full_block_reader
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 26.5308
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 33.2968
        Single filter ns/op: 58.6163
        Random filter ns/op: 291.434
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 32.1839
        Single filter ns/op: 66.9039
        Random filter ns/op: 292.828
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      ```
      
      Differential Revision: D17991712
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 7ea205550217bfaaa1d5158ebd658e5832e60f29
    • Fix TestIterate for HashSkipList in db_stress (#5942) · c53db172
      Committed by Yanqin Jin
      Summary:
      Since SeekForPrev() (used by Prev()) is not supported by HashSkipList when a prefix is used, we disable it when stress-testing HashSkipList.
      
      - Change the default memtablerep to skip list.
      - Avoid Prev() when memtablerep is HashSkipList and prefix is used.
      
      Test Plan (on devserver):
      ```
      $make db_stress
      $./db_stress -ops_per_thread=10000 -reopen=1 -destroy_db_initially=true -column_families=1 -threads=1 -column_families=1 -memtablerep=prefix_hash
      $# or simply
      $./db_stress
      $./db_stress -memtablerep=prefix_hash
      ```
      Results must print "Verification successful".
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5942
      
      Differential Revision: D18017062
      
      Pulled By: riversand963
      
      fbshipit-source-id: af867e59aa9e6f533143c984d7d529febf232fd7
    • Refactor / clean up / optimize FullFilterBitsReader (#5941) · 5f8f2fda
      Committed by Peter Dillinger
      Summary:
      FullFilterBitsReader, after creating in BloomFilterPolicy, was
      responsible for decoding metadata bits. This meant that
      FullFilterBitsReader::MayMatch had some metadata checks in order to
      implement "always true" or "always false" functionality in the case
      of inconsistent or trivial metadata. This made for ugly
      mixing-of-concerns code and probably had some runtime cost. It also
      didn't really support plugging in alternative filter implementations
      with extensions to the existing metadata schema.
      
      BloomFilterPolicy::GetFilterBitsReader is now (exclusively) responsible
      for decoding filter metadata bits and constructing appropriate instances
      deriving from FilterBitsReader. "Always false" and "always true" derived
      classes allow FullFilterBitsReader not to be concerned with handling of
      trivial or inconsistent metadata. This also makes for easy expansion
      to alternative filter implementations in new, alternative derived
      classes. This change makes calls to FilterBitsReader::MayMatch
      *necessarily* virtual because there's now more than one built-in
      implementation. Compared with the previous implementation's extra
      'if' checks in MayMatch, there's no consistent performance difference,
      measured by (an older revision of) filter_bench (differences here seem
      to be within noise):
      
          Inside queries...
          -  Dry run (407) ns/op: 35.9996
          +  Dry run (407) ns/op: 35.2034
          -  Single filter ns/op: 47.5483
          +  Single filter ns/op: 47.4034
          -  Batched, prepared ns/op: 43.1559
          +  Batched, prepared ns/op: 42.2923
          ...
          -  Random filter ns/op: 150.697
          +  Random filter ns/op: 149.403
          ----------------------------
          Outside queries...
          -  Dry run (980) ns/op: 34.6114
          +  Dry run (980) ns/op: 34.0405
          -  Single filter ns/op: 56.8326
          +  Single filter ns/op: 55.8414
          -  Batched, prepared ns/op: 48.2346
          +  Batched, prepared ns/op: 47.5667
          -  Random filter ns/op: 155.377
          +  Random filter ns/op: 153.942
               Average FP rate %: 1.1386
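      The dispatch structure described above might be sketched as follows (a hypothetical simplification; the real classes live in RocksDB's filter code and carry actual bloom data, cache-line parameters, and metadata decoding):

      ```cpp
      #include <memory>
      #include <string>

      // Metadata is decoded once, at construction time, and selects a derived
      // class; MayMatch() itself no longer re-checks metadata on every call.
      class FilterBitsReader {
       public:
        virtual ~FilterBitsReader() = default;
        virtual bool MayMatch(const std::string& key) const = 0;  // necessarily virtual
      };

      class AlwaysTrueFilter : public FilterBitsReader {
       public:
        bool MayMatch(const std::string&) const override { return true; }
      };

      class AlwaysFalseFilter : public FilterBitsReader {
       public:
        bool MayMatch(const std::string&) const override { return false; }
      };

      // Stand-in for the real bloom reader; here it just scans for a byte.
      class FullFilterBitsReader : public FilterBitsReader {
       public:
        explicit FullFilterBitsReader(const std::string& data) : data_(data) {}
        bool MayMatch(const std::string& key) const override {
          return !key.empty() && data_.find(key[0]) != std::string::npos;
        }
       private:
        std::string data_;
      };

      // The policy owns metadata decoding and constructs the right reader.
      // "corrupt" is a toy marker for unparseable metadata: a filter must never
      // produce false negatives, so it degrades to "always true".
      std::unique_ptr<FilterBitsReader> GetFilterBitsReader(
          const std::string& contents) {
        if (contents.empty()) return std::make_unique<AlwaysFalseFilter>();
        if (contents == "corrupt") return std::make_unique<AlwaysTrueFilter>();
        return std::make_unique<FullFilterBitsReader>(contents);
      }
      ```

      New filter implementations with extended metadata schemas then slot in as additional derived classes.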
      
      Also, the FullFilterBitsReader ctor was responsible for a surprising
      amount of CPU in production, due in part to inefficient determination of
      the CACHE_LINE_SIZE used to construct the filter being read. The
      overwhelming common case (same as my CACHE_LINE_SIZE) is now
      substantially optimized, as shown with filter_bench with
      -new_reader_every=1 (old option - see below) (repeatable result):
      
          Inside queries...
          -  Dry run (453) ns/op: 118.799
          +  Dry run (453) ns/op: 105.869
          -  Single filter ns/op: 82.5831
          +  Single filter ns/op: 74.2509
          ...
          -  Random filter ns/op: 224.936
          +  Random filter ns/op: 194.833
          ----------------------------
          Outside queries...
          -  Dry run (aa1) ns/op: 118.503
          +  Dry run (aa1) ns/op: 104.925
          -  Single filter ns/op: 90.3023
          +  Single filter ns/op: 83.425
          ...
          -  Random filter ns/op: 220.455
          +  Random filter ns/op: 175.7
               Average FP rate %: 1.13886
      
      However, PR #5936 has reclaimed (or will reclaim) most of this cost. After that PR, the optimization of this code path is likely negligible, but it is nonetheless clear we aren't making performance any worse.
      
      Also fixed inadequate check of consistency between filter data size and
      num_lines. (Unit test updated.)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5941
      
      Test Plan:
      previously added unit tests FullBloomTest.CorruptFilters and
      FullBloomTest.RawSchema
      
      Differential Revision: D18018353
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 8e04c2b4a7d93223f49a237fd52ef2483929ed9c
    • Fix PlainTableReader not to crash sst_dump (#5940) · fe464bca
      Committed by Peter Dillinger
      Summary:
      Plain table SSTs could crash sst_dump because of a bug in
      PlainTableReader that can leave table_properties_ as null. Even if it
      was intended not to keep the table properties in some cases, they were
      leaked on the offending code path.
      
      Steps to reproduce:
      
          $ db_bench --benchmarks=fillrandom --num=2000000 --use_plain_table --prefix-size=12
          $ sst_dump --file=0000xx.sst --show_properties
          from [] to []
          Process /dev/shm/dbbench/000014.sst
          Sst file format: plain table
          Raw user collected properties
          ------------------------------
          Segmentation fault (core dumped)
      
      Also added missing unit testing of plain table full_scan_mode, and
      an assertion in NewIterator to check for regression.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5940
      
      Test Plan: new unit test, manual, make check
      
      Differential Revision: D18018145
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 4310c755e824c4cd6f3f86a3abc20dfa417c5e07
    • Enable trace_replay with multi-threads (#5934) · 526e3b97
      Committed by Zhichao Cao
      Summary:
      In the current trace replay, all the queries are serialized and executed by a single thread, which may not closely simulate the original application's query pattern. This PR implements multi-threaded replay: users can set the number of threads used to replay the trace, and the queries generated from the trace records are scheduled onto the thread pool's job queue.
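      A minimal thread-pool sketch of the scheduling idea (hypothetical; `ReplayPool` is a made-up name, and the real replayer decodes trace records into DB queries before scheduling them):

      ```cpp
      #include <condition_variable>
      #include <functional>
      #include <mutex>
      #include <queue>
      #include <thread>
      #include <vector>

      // Replayed queries are pushed onto a shared job queue and executed by a
      // configurable number of worker threads instead of one serialized thread.
      class ReplayPool {
       public:
        explicit ReplayPool(int num_threads) {
          for (int i = 0; i < num_threads; ++i) {
            workers_.emplace_back([this] {
              for (;;) {
                std::function<void()> job;
                {
                  std::unique_lock<std::mutex> lk(mu_);
                  cv_.wait(lk, [this] { return stop_ || !jobs_.empty(); });
                  if (stop_ && jobs_.empty()) return;  // drained and shut down
                  job = std::move(jobs_.front());
                  jobs_.pop();
                }
                job();  // execute one replayed query
              }
            });
          }
        }

        void Schedule(std::function<void()> job) {
          {
            std::lock_guard<std::mutex> lk(mu_);
            jobs_.push(std::move(job));
          }
          cv_.notify_one();
        }

        ~ReplayPool() {
          {
            std::lock_guard<std::mutex> lk(mu_);
            stop_ = true;
          }
          cv_.notify_all();
          for (auto& w : workers_) w.join();
        }

       private:
        std::mutex mu_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> jobs_;
        bool stop_ = false;
        std::vector<std::thread> workers_;
      };
      ```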
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5934
      
      Test Plan: test with make check and real trace replay.
      
      Differential Revision: D17998098
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 87eecf6f7c17a9dc9d7ab29dd2af74f6f60212c8
    • Update HISTORY.md with recent BlobDB adjacent changes · 69bd8a28
      Committed by Levi Tamasi
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5939
      
      Differential Revision: D18009096
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 032a48a302f9da38aecf4055b5a8d4e1dffd9dc7
    • Expose db stress tests (#5937) · e60cc092
      Committed by Yanqin Jin
      Summary:
      Expose the db_stress test by providing db_stress_tool.h as a public header.
      This PR does the following:
      - adds a new header, db_stress_tool.h, under include/rocksdb/
      - renames db_stress.cc to db_stress_tool.cc
      - adds a db_stress.cc which simply invokes a test function
      - updates the Makefile accordingly
      
      Test Plan (dev server):
      ```
      make db_stress
      ./db_stress
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5937
      
      Differential Revision: D17997647
      
      Pulled By: riversand963
      
      fbshipit-source-id: 1a8d9994f89ce198935566756947c518f0052410
  6. 18 Oct 2019 (1 commit)
    • Support decoding blob indexes in sst_dump (#5926) · fdc1cb43
      Committed by Levi Tamasi
      Summary:
      The patch adds a new command line parameter --decode_blob_index to sst_dump.
      If this switch is specified, sst_dump prints blob indexes in a human readable format,
      printing the blob file number, offset, size, and expiration (if applicable) for blob
      references, and the blob value (and expiration) for inlined blobs.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5926
      
      Test Plan:
      Used db_bench's BlobDB mode to generate SST files containing blob references with
      and without expiration, as well as inlined blobs with and without expiration (note: the
      latter are stored as plain values), and confirmed sst_dump correctly prints all four types
      of records.
      
      Differential Revision: D17939077
      
      Pulled By: ltamasi
      
      fbshipit-source-id: edc5f58fee94ba35f6699c6a042d5758f5b3963d
  7. 17 Oct 2019 (1 commit)
  8. 16 Oct 2019 (1 commit)
  9. 15 Oct 2019 (8 commits)
  10. 12 Oct 2019 (2 commits)
    • Fix SeekForPrev bug with Partitioned Filters and Prefix (#5907) · 4e729f90
      Committed by Maysam Yabandeh
      Summary:
      Partitioned filters make use of a top-level index to find the partition that might contain the bloom hash of the key. The index uses the internal key format (before format version 3). Each partition contains i) the blooms of the keys in that range, ii) the blooms of the prefixes of keys in that range, and iii) the bloom of the prefix of the last key in the previous partition.
      When ::SeekForPrev(key), we first perform a prefix bloom test on the SST file. The partition however is identified using the full internal key, rather than the prefix key. The reason is to be compatible with the internal key format of the top-level index. This creates a corner case. Example:
      - SST k, Partition N: P1K1, P1K2
      - SST k, top-level index: P1K2
      - SST k+1, Partition 1: P2K1, P3K1
      - SST k+1 top-level index: P3K1
      When SeekForPrev(P1K3) is called, it should point us to P1K2. However, SST k's top-level index would reject P1K3 since it is out of range.
      One possible fix would be to search with the prefix P1 (instead of full internal key P1K3) however the details of properly comparing prefix with full internal key might get complicated. The fix we apply in this PR is to look into the last partition anyway even if the key is out of range.
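      The fix can be illustrated with a small model of the top-level index lookup (hypothetical code; `FindPartitionForSeekForPrev` is a made-up name, and the real index stores internal keys, not plain strings):

      ```cpp
      #include <algorithm>
      #include <string>
      #include <vector>

      // The top-level index records the last key of each partition. A plain
      // binary search rejects a target that sorts past the last partition's
      // key; the fix is to fall back to the last partition anyway, since it
      // may still contain the key preceding the target.
      int FindPartitionForSeekForPrev(const std::vector<std::string>& last_keys,
                                      const std::string& target) {
        auto it = std::lower_bound(last_keys.begin(), last_keys.end(), target);
        if (it == last_keys.end()) {
          // Out of range: look into the last partition instead of rejecting.
          return static_cast<int>(last_keys.size()) - 1;
        }
        return static_cast<int>(it - last_keys.begin());
      }
      ```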
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5907
      
      Differential Revision: D17889918
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 169fd7b3c71dbc08808eae5a8340611ebe5bdc1e
    • Fix block cache ID uniqueness for Windows builds (#5844) · b00761ee
      Committed by Andrew Kryczka
      Summary:
      Since we do not evict a file's blocks from block cache before that file
      is deleted, we require a file's cache ID prefix is both unique and
      non-reusable. However, the Windows functionality we were relying on only
      guaranteed uniqueness. That meant a newly created file could be assigned
      the same cache ID prefix as a deleted file. If the newly created file
      had block offsets matching the deleted file, full cache keys could be
      exactly the same, resulting in obsolete data blocks returned from cache
      when trying to read from the new file.
      
      We noticed this when running on FAT32 where compaction was writing out
      of order keys due to reading obsolete blocks from its input files. The
      functionality is documented as behaving the same on NTFS, although I
      wasn't able to repro it there.
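      A toy model of why unique-but-reusable prefixes are dangerous (hypothetical; the real cache key layout differs, but the concatenation idea is the same):

      ```cpp
      #include <cstdint>
      #include <string>

      // A block cache key is the file's cache ID prefix concatenated with the
      // block offset. If a deleted file's prefix can be reassigned to a new
      // file, identical offsets produce identical keys, and stale blocks of
      // the deleted file are served for reads of the new file.
      std::string BlockCacheKey(const std::string& cache_id_prefix,
                                uint64_t offset) {
        return cache_id_prefix + "#" + std::to_string(offset);
      }
      ```

      Non-reusable prefixes guarantee distinct keys even when offsets coincide, which is why uniqueness alone is insufficient.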
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5844
      
      Test Plan:
      we had a reliable repro of out-of-order keys on FAT32 that
      was fixed by this change
      
      Differential Revision: D17752442
      
      fbshipit-source-id: 95d983f9196cf415f269e19293b97341edbf7e00
  11. 11 Oct 2019 (4 commits)
    • Revert "Enable partitioned index/filter in stress tests (#5895)" (#5904) · bc8b05cb
      Committed by Yanqin Jin
      Summary:
      This reverts commit 2f4e2881.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5904
      
      Differential Revision: D17871282
      
      Pulled By: riversand963
      
      fbshipit-source-id: d210725f8f3b26d8eac25892094da09d9694337e
    • Remove a webhook due to potential security concern (#5902) · ddb62d1f
      Committed by Yanqin Jin
      Summary:
      As title.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5902
      
      Differential Revision: D17858150
      
      Pulled By: riversand963
      
      fbshipit-source-id: db2cd8a756faf7b9751b2651a22e1b29ca9fecec
    • Fix the rocksjava release Vagrant build on CentOS (#5901) · 1e9c8d42
      Committed by Adam Retter
      Summary:
      Closes https://github.com/facebook/rocksdb/issues/5873
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5901
      
      Differential Revision: D17869585
      
      fbshipit-source-id: 559472486f1d3ac80c0c7df6c421c4b612b9b7f9
    • MultiGet batching in memtable (#5818) · 4c49e38f
      Committed by Vijay Nadimpalli
      Summary:
      RocksDB has a MultiGet() API that implements batched key lookup for higher performance (https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L468). Currently, batching is implemented in BlockBasedTableReader::MultiGet() for SST file lookups. One of the ways it improves performance is by pipelining bloom filter lookups (by prefetching required cachelines for all the keys in the batch, and then doing the probe) and thus hiding the cache miss latency. The same concept can be extended to the memtable as well. This PR involves implementing a pipelined bloom filter lookup in DynamicBloom, and implementing MemTable::MultiGet() that can leverage it.
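      The two-phase idea can be sketched with a toy one-probe bloom filter (hypothetical; the real implementation lives in DynamicBloom and MemTable::MultiGet and uses multiple probes per key):

      ```cpp
      #include <functional>
      #include <string>
      #include <vector>

      // Phase 1 hashes every key in the batch and issues prefetch hints; phase 2
      // does the membership probes, by which time the needed cachelines are
      // likely resident, hiding the cache-miss latency.
      class TinyBloom {
       public:
        explicit TinyBloom(size_t bits) : bits_(bits, 0) {}

        void Add(const std::string& key) { bits_[Pos(key)] = 1; }

        std::vector<bool> MultiMayContain(
            const std::vector<std::string>& keys) const {
          std::vector<size_t> pos(keys.size());
          for (size_t i = 0; i < keys.size(); ++i) {
            pos[i] = Pos(keys[i]);  // phase 1: hash every key up front
      #if defined(__GNUC__)
            __builtin_prefetch(bits_.data() + pos[i]);  // warm the cacheline
      #endif
          }
          std::vector<bool> result(keys.size());
          for (size_t i = 0; i < keys.size(); ++i) {
            result[i] = bits_[pos[i]] != 0;  // phase 2: probe
          }
          return result;
        }

       private:
        size_t Pos(const std::string& key) const {
          return std::hash<std::string>{}(key) % bits_.size();
        }
        std::vector<char> bits_;
      };
      ```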
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5818
      
      Test Plan:
      Existing tests
      
      Performance Test:
      Ran the below command, which fills up the memtable, makes sure there are no flushes, and then calls MultiGet. Ran it on master and on the new change and saw at least a 1% performance improvement across all the test runs I did. Sometimes the improvement was up to 5%.
      
      TEST_TMPDIR=/data/users/$USER/benchmarks/feature/ numactl -C 10 ./db_bench -benchmarks="fillseq,multireadrandom" -num=600000 -compression_type="none" -level_compaction_dynamic_level_bytes -write_buffer_size=200000000 -target_file_size_base=200000000 -max_bytes_for_level_base=16777216 -reads=90000 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 -statistics -memtable_whole_key_filtering=true -memtable_bloom_size_ratio=10
      
      Differential Revision: D17578869
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: 23dc651d9bf49db11d22375bf435708875a1f192
  12. 10 Oct 2019 (1 commit)
    • Make the db_stress reopen loop in OperateDb() more robust (#5893) · 80ad996b
      Committed by anand76
      Summary:
      The loop in OperateDb() is getting quite complicated with the introduction of multiple-key operations such as MultiGet and reseeks. This results in a number of corner cases that hang db_stress due to synchronization problems during reopen (i.e., when the -reopen=<> option is specified). This PR makes the loop more robust by ensuring all db_stress threads vote to reopen the DB the exact same number of times.
      Most of the changes in this diff are due to indentation.
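      The voting idea might be sketched as a generation-counting barrier (hypothetical; `ReopenGate` is a made-up name and db_stress's actual synchronization is more involved):

      ```cpp
      #include <condition_variable>
      #include <mutex>
      #include <thread>
      #include <vector>

      // Every worker announces that it reached the reopen point; the last
      // voter "reopens the DB" and releases the others. Because all threads
      // pass through this gate together, each observes the same reopen count.
      class ReopenGate {
       public:
        explicit ReopenGate(int num_threads) : num_threads_(num_threads) {}

        void VoteAndWait() {
          std::unique_lock<std::mutex> lk(mu_);
          int my_generation = generation_;
          if (++arrived_ == num_threads_) {
            ++reopens_;  // last voter performs the reopen here
            arrived_ = 0;
            ++generation_;
            cv_.notify_all();
          } else {
            cv_.wait(lk, [&] { return generation_ != my_generation; });
          }
        }

        int reopens() const { return reopens_; }

       private:
        std::mutex mu_;
        std::condition_variable cv_;
        int num_threads_;
        int arrived_ = 0;
        int generation_ = 0;
        int reopens_ = 0;
      };
      ```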
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5893
      
      Test Plan: Run crash test
      
      Differential Revision: D17823827
      
      Pulled By: anand1976
      
      fbshipit-source-id: ec893829f611ac7cac4057c0d3d99f9ffb6a6dd9
  13. 09 Oct 2019 (1 commit)