1. May 20, 2023 (1 commit)
  2. May 18, 2023 (1 commit)
      Change internal headers with duplicate names (#11408) · 206fdea3
      Peter Dillinger committed
      Summary:
      In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement.
      
      No public API changes.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408
      
      Test Plan: existing tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D45456696
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373
  3. May 10, 2023 (1 commit)
      Add hash_seed to Caches (#11391) · f4a02f2c
      Peter Dillinger committed
      Summary:
      See the motivation and description in the new `ShardedCacheOptions::hash_seed` option.
      
      Updated db_bench so that its seed param is used for the cache hash seed.
      Made its code safer by ensuring the seed is set before use.
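
      To illustrate what a per-cache hash seed changes, here is a minimal standalone sketch, not RocksDB's implementation: the FNV-1a hash, the way the seed is mixed in, and the names `SeededHash` and `ShardOf` are assumptions for illustration. For a fixed seed, shard selection is deterministic; under a different seed, the same key can land in a different shard.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <string_view>

      // Hypothetical 64-bit FNV-1a hash with the seed folded into the initial
      // state. RocksDB uses its own hash functions; FNV-1a just keeps this short.
      uint64_t SeededHash(std::string_view key, uint32_t seed) {
        uint64_t h = 14695981039346656037ULL ^ seed;  // FNV offset basis mixed with seed
        for (unsigned char c : key) {
          h ^= c;
          h *= 1099511628211ULL;  // FNV prime
        }
        return h;
      }

      // Picks a shard from the top `num_shard_bits` bits of the seeded hash,
      // e.g. num_shard_bits == 6 gives shards numbered 0..63.
      uint32_t ShardOf(std::string_view key, uint32_t seed, int num_shard_bits) {
        assert(num_shard_bits >= 0 && num_shard_bits < 32);
        if (num_shard_bits == 0) return 0;  // single shard; avoids shifting by 64
        return static_cast<uint32_t>(SeededHash(key, seed) >> (64 - num_shard_bits));
      }
      ```

      Because the seed only changes the initial hash state, it must stay fixed while a cache holds entries; changing it would send lookups for existing keys to the wrong shard.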
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11391
      
      Test Plan:
      unit tests added / updated
      
      **Performance** - no discernible difference when running cache_bench repeatedly before and after, with both lru_cache and hyper_clock_cache.
      
      Reviewed By: hx235
      
      Differential Revision: D45557797
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 40bf4da6d66f9d41a8a0eb8e5cf4246a4aa07934
  4. May 4, 2023 (1 commit)
      remove redundant move (#11418) · 50b33ebb
      clundro committed
      Summary:
      When I run `make all` with g++-13, the build emits the following warnings (treated as errors):
      ```
      db/compaction/compaction_job_test.cc: In member function ‘void rocksdb::CompactionJobTestBase::AddMockFile(const rocksdb::mock::KVVector&, int)’:
      db/compaction/compaction_job_test.cc:376:57: error: redundant move in initialization [-Werror=redundant-move]
        376 |           env_, GenerateFileName(file_number), std::move(contents)));
            |                                                ~~~~~~~~~^~~~~~~~~~
      db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
        375 |       EXPECT_OK(mock_table_factory_->CreateMockTable(
            |       ^~~~~~~~~
      db/compaction/compaction_job_test.cc:376:57: note: remove ‘std::move’ call
        376 |           env_, GenerateFileName(file_number), std::move(contents)));
            |                                                ~~~~~~~~~^~~~~~~~~~
      db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
        375 |       EXPECT_OK(mock_table_factory_->CreateMockTable(
            |       ^~~~~~~~~
      cc1plus: all warnings being treated as errors
      make: *** [Makefile:2507: db/compaction/compaction_job_test.o] Error 1
      ```
      
      I also added some `(void)unused_variable` statements because the CMake build passes `-Wunused-but-set-variable`.
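
      Both warnings can be reproduced in a few lines. In `AddFile` below, `contents` is a `const` reference, so `std::move(contents)` would yield a `const` rvalue that still selects the copy constructor; the move does nothing, which is why g++-13 asks for it to be removed. In `LastKey`, a variable that is only read inside `assert()` is set but never read once `NDEBUG` disables assertions, and a `(void)` cast marks it as intentionally unused. `KVVector`, `StoreTable`, `AddFile`, and `LastKey` are hypothetical stand-ins, not the RocksDB test code.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <string>
      #include <utility>
      #include <vector>

      // Hypothetical stand-in for a vector of key/value pairs.
      using KVVector = std::vector<std::pair<std::string, std::string>>;

      // Takes its argument by value, so the caller's object is copied into it.
      std::size_t StoreTable(KVVector table) { return table.size(); }

      std::size_t AddFile(const KVVector& contents) {
        // Before: StoreTable(std::move(contents)). Moving from a const reference
        // still copies, so the std::move is redundant and g++-13 flags it.
        return StoreTable(contents);
      }

      // `found` is only read by assert(), so in NDEBUG builds it would be set but
      // never read; the (void) cast keeps -Wunused-but-set-variable quiet.
      std::string LastKey(const KVVector& contents) {
        std::string last;
        bool found = false;
        for (const auto& kv : contents) {
          last = kv.first;
          found = true;
        }
        assert(found);
        (void)found;
        return last;
      }
      ```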
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11418
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D45528223
      
      Pulled By: ajkr
      
      fbshipit-source-id: fee1a77c30039a56b481de953f0a834cc788abbc
  5. April 26, 2023 (1 commit)
      Block per key-value checksum (#11287) · 62fc15f0
      Changyu Bi committed
      Summary:
      Adds the option `block_protection_bytes_per_key` and an implementation of per key-value checksums for blocks. The main changes are:
      1. checksum construction and verification in block.cc/h
      2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
      3. unit tests/crash test updates
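
      The core idea of a per key-value checksum can be sketched as follows. This is not the code in block.cc/h: the FNV-1a hash, the low-byte truncation, and the names `HashKv`, `KvChecksum`, and `VerifyKv` are assumptions. Each entry gets a checksum of `protection_bytes` bytes (1, 2, 4, or 8 in this sketch) computed over its key and value, which is recomputed and compared when the entry is read.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <string_view>

      // One FNV-1a step: XOR in a byte, then multiply by the (odd) FNV prime.
      uint64_t FnvStep(uint64_t h, unsigned char c) { return (h ^ c) * 1099511628211ULL; }

      // Hypothetical hash over key length, key, and value. Mixing in the key
      // length keeps ("ab", "c") and ("a", "bc") from hashing identically.
      uint64_t HashKv(std::string_view key, std::string_view value) {
        uint64_t h = 14695981039346656037ULL;
        uint64_t klen = key.size();
        for (int i = 0; i < 8; ++i) h = FnvStep(h, static_cast<unsigned char>(klen >> (8 * i)));
        for (unsigned char c : key) h = FnvStep(h, c);
        for (unsigned char c : value) h = FnvStep(h, c);
        return h;
      }

      // Keeps only the low `protection_bytes` bytes of the hash.
      uint64_t KvChecksum(std::string_view key, std::string_view value, int protection_bytes) {
        assert(protection_bytes == 1 || protection_bytes == 2 ||
               protection_bytes == 4 || protection_bytes == 8);
        uint64_t h = HashKv(key, value);
        if (protection_bytes == 8) return h;
        return h & ((uint64_t{1} << (8 * protection_bytes)) - 1);
      }

      // Recomputes the checksum on read and compares it with the stored one.
      bool VerifyKv(std::string_view key, std::string_view value, int protection_bytes,
                    uint64_t stored) {
        return KvChecksum(key, value, protection_bytes) == stored;
      }
      ```

      With this particular hash, any single-byte substitution changes the low byte of the result (each FNV-1a step is a bijection modulo 256), so even 1 protection byte catches it; random multi-byte corruption slips through with probability of roughly 2^-(8 * protection_bytes).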
      
      Tests:
      * Added unit tests
      * Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`
      
      Follow-up (possibly a separate PR): make sure corruption statuses returned from BlockIters are handled correctly.
      
      Performance:
      Turning on block per-KV protection has a non-trivial negative impact on read performance and costs additional memory.
      For memory, each block includes an additional 24 bytes of checksum-related state besides the checksums themselves. For CPU, I set up a DB of size ~1.2GB with 5M keys (32-byte keys and 200-byte values) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache sizes (to mimic various cache hit rates):
      
      ```
      SETUP
      make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
      ./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none
      
      BENCHMARK
      ./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE
      
      The readrandom ops/sec results look like the following:
      Block cache size:        2GB   1.2GB * 0.9   1.2GB * 0.8   1.2GB * 0.5       8MB
      Main                  240805        223604        198176        161653    139040
      PR prot_bytes=0       238691        226693        200127        161082    141153
      PR prot_bytes=1       214983        193199        178532        137013    108211
      prot_bytes=1 vs 0       -10%          -15%        -10.8%          -15%      -23%
      ```
      
      The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287
      
      Reviewed By: ajkr
      
      Differential Revision: D43970708
      
      Pulled By: cbi42
      
      fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
  6. April 25, 2023 (1 commit)
  7. April 22, 2023 (1 commit)
      Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) · 151242ce
      Hui Xiao committed
      Summary:
      **Context:**
      The existing stat rocksdb.sst.read.micros aggregates compaction and flush reads instead of reporting them separately, which makes it hard to understand the IO read behavior of each.
      
      **Summary**
      - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.read.{flush|compaction}.micros`
         - Fixed the default histogram in `RandomAccessFileReader`
      - New field `ReadOptions/IOOptions::io_activity`; pass `ReadOptions` through the paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
      - Use `thread_status_util` for assertions in `DbStressFSWrapper` to continuously test that we pass the correct `io_activity` under db open, flush and compaction
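
      A minimal sketch of the recording scheme: each SST read goes into the aggregate histogram and, when the read is tagged with an activity, also into that activity's histogram. The enumerators, the `ReadStats` struct, and the count/sum stand-in for a histogram are assumptions for illustration, not RocksDB's `Statistics` API.

      ```cpp
      #include <array>
      #include <atomic>
      #include <cassert>
      #include <cstddef>
      #include <cstdint>

      // Hypothetical activity tag carried with each read (the commit adds a field
      // named io_activity to ReadOptions/IOOptions; these enumerators are illustrative).
      enum class IOActivity : std::size_t { kFlush = 0, kCompaction = 1, kUnknown = 2 };

      // Stand-in for a histogram that tracks only count and sum.
      struct Histo {
        std::atomic<uint64_t> count{0};
        std::atomic<uint64_t> sum{0};
        void Add(uint64_t micros) {
          count.fetch_add(1, std::memory_order_relaxed);
          sum.fetch_add(micros, std::memory_order_relaxed);
        }
      };

      struct ReadStats {
        Histo sst_read_micros;              // aggregate, like rocksdb.sst.read.micros
        std::array<Histo, 2> per_activity;  // [0] = flush, [1] = compaction

        void RecordRead(uint64_t micros, IOActivity activity) {
          sst_read_micros.Add(micros);
          if (activity != IOActivity::kUnknown) {
            per_activity[static_cast<std::size_t>(activity)].Add(micros);
          }
        }
      };
      ```

      In this sketch, untagged reads (`kUnknown`) appear only in the aggregate, so the aggregate count equals the sum of the per-activity counts only when every read carries an activity.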
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288
      
      Test Plan:
      - **Stress test**
      - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.**  (without blob)
           - May not match exactly because `HistogramStat::Add` only guarantees atomicity, not accuracy, across threads.
      ```
      ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
      ```
      ```
      // BlockBasedTable
      rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
      rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
      rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689
      
      // PlainTable
      Does not apply
      ```
      - **Db bench 2: performance**
      
      **Read**
      
      SETUP: db with 900 files
      ```
      ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
      ```
      Run till convergence:
      ```
      ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
      ```
      Pre-change
      `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
      Post-change (no regression, -0.4%)
      `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`
      
      **Compaction/Flush** (run till convergence)
      ```
      ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none
      
      rocksdb.sst.read.micros  COUNT : 33820
      rocksdb.sst.read.flush.micros COUNT : 1800
      rocksdb.sst.read.compaction.micros COUNT : 32020
      ```
      Pre-change
      `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec;    0.7 (± 0.1) MB/sec`
      
      Post-change (no regression, ~-0.4%)
      `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec;    0.7 (± 0.1) MB/sec`
      
      Reviewed By: ajkr
      
      Differential Revision: D44007011
      
      Pulled By: hx235
      
      fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
  8. April 5, 2023 (1 commit)
      Change default block cache from 8MB to 32MB (#11350) · 3c17930e
      Peter Dillinger committed
      Summary:
      ... which increases the default number of shards from 16 to 64. Although the default block cache size is only recommended for applications where RocksDB is not performance-critical, block cache mutex contention could become a performance bottleneck under stress conditions. This change of default should alleviate that.
      
      Note that reducing the size of cache shards (recommended minimum 512KB) could cause thrashing, e.g. on filter blocks, so capacity must increase in order to safely increase the number of shards.
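
      The arithmetic behind "16 to 64 shards" can be modeled as a function that derives the shard count from capacity, assuming a minimum shard size of 512KB (which the 8MB/16-shard and 32MB/64-shard figures imply) and a cap of 6 shard bits (64 shards), which is an assumption here. `DefaultShardBits` is a hypothetical name; treat this as a model of the rule, not RocksDB's source.

      ```cpp
      #include <cassert>
      #include <cstddef>

      // Returns log2 of the shard count: one shard per 512KB of capacity,
      // rounded down to a power of two and capped at 2^6 = 64 shards.
      int DefaultShardBits(std::size_t capacity) {
        const std::size_t kMinShardSize = 512 * 1024;
        const int kMaxShardBits = 6;
        std::size_t num_shards = capacity / kMinShardSize;
        int bits = 0;
        while (num_shards >>= 1) {
          if (++bits >= kMaxShardBits) return kMaxShardBits;
        }
        return bits;
      }
      ```

      With 8MB this yields 16 shards of 512KB each; at 32MB the same rule yields 64 shards, again 512KB each, so quadrupling the capacity is what allows four times as many shards without shrinking them.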
      
      The 8MB default dates back to 2011 or earlier (f779e7a5), when the most simultaneous threads you could get from a single CPU socket was 20 (e.g. Intel Xeon E7-8870). Now more than 100 are available.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11350
      
      Test Plan: unit tests updated
      
      Reviewed By: cbi42
      
      Differential Revision: D44674873
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 91ed3070789b42679283c7e6dc97c41a6a97bdf4
  9. March 30, 2023 (1 commit)
  10. March 28, 2023 (1 commit)
      Trivially move files down when opening db with level_compaction_dynamic_level_bytes (#11321) · 60132016
      Changyu Bi committed
      Summary:
      
      During DB open, if a column family uses level compaction with level_compaction_dynamic_level_bytes=true, trivially move its files down in the LSM such that the bottommost files are in Lmax, the files from the second-from-bottom level are in Lmax-1, and so on. This aims to make it easier to migrate level_compaction_dynamic_level_bytes from false to true. Before this change, a full manual compaction was suggested for such a migration. After this change, users can simply restart the DB to turn on this option. db_crashtest.py is updated to randomly choose a value for level_compaction_dynamic_level_bytes.
      
      Note that there may still be too many unnecessary levels if a user is migrating from universal compaction or from level compaction with a smaller level multiplier. A full manual compaction may still be needed in that case until a PR that automatically drains unnecessary levels, like https://github.com/facebook/rocksdb/issues/3921, lands. Eventually we may want to change the default value of option level_compaction_dynamic_level_bytes to true.
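
      The level mapping described above can be sketched as follows. `TargetLevels` is a hypothetical helper that only computes where each level's files would go; it leaves out the actual file moves and everything else RocksDB does at DB open.

      ```cpp
      #include <cassert>
      #include <vector>

      // files_per_level[i] is the number of SST files in level i (L0 is index 0).
      // Returns, for every level, the level its files would be moved to: the
      // bottommost non-empty level goes to Lmax, the next non-empty level above it
      // to Lmax-1, and so on. L0 and empty levels map to themselves.
      std::vector<int> TargetLevels(const std::vector<int>& files_per_level) {
        const int num_levels = static_cast<int>(files_per_level.size());
        std::vector<int> target(num_levels);
        for (int i = 0; i < num_levels; ++i) target[i] = i;
        int next = num_levels - 1;  // Lmax
        for (int level = num_levels - 1; level >= 1; --level) {
          if (files_per_level[level] > 0) target[level] = next--;
        }
        return target;
      }
      ```

      Each non-empty level maps to a distinct target at or below its original level, and the relative order of levels is preserved, so in this model every move can be a trivial move that changes only file metadata.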
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11321
      
      Test Plan:
      1. Added unit tests.
      2. Crash test: ran a variation of db_crashtest.py (like 32516507e77521ae887e45091b69139e32e8efb7) that turns level_compaction_dynamic_level_bytes on and off and switches between leveled compaction (LC) and universal compaction (UC) for the same DB.
      
      TODO: Update `OptionChangeMigration`, either after this PR or https://github.com/facebook/rocksdb/issues/3921.
      
      Reviewed By: ajkr
      
      Differential Revision: D44341930
      
      Pulled By: cbi42
      
      fbshipit-source-id: 013de19a915c6a0502be569f07c4cc8f1c3c6be2
  11. March 19, 2023 (1 commit)
  12. March 18, 2023 (2 commits)
      HyperClockCache support for SecondaryCache, with refactoring (#11301) · 204fcff7
      Peter Dillinger committed
      Summary:
      Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.
      
      This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.
      
      It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.
      
      ## cache.h (public API)
      Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.
      
      ## advanced_cache.h (advanced public API)
      * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
      * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.
      
      These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.
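
      The commit supports a single eviction callback per cache and suggests a broadcast wrapper if more are ever needed. A minimal sketch of such a wrapper, assuming a simplified, hypothetical callback type `std::function<void(const std::string& key)>`; RocksDB's actual eviction callback signature may differ.

      ```cpp
      #include <cassert>
      #include <functional>
      #include <string>
      #include <utility>
      #include <vector>

      // Hypothetical, simplified callback type.
      using EvictionCallback = std::function<void(const std::string& key)>;

      // Combines several callbacks into the single callback a cache accepts;
      // each evicted key is passed to every callback in order.
      EvictionCallback Broadcast(std::vector<EvictionCallback> callbacks) {
        return [cbs = std::move(callbacks)](const std::string& key) {
          for (const auto& cb : cbs) cb(key);
        };
      }
      ```

      With this design, the common single-callback case pays for one indirect call per eviction, and only users who need fan-out pay for the loop.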
      
      * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)
      
      I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.
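
      The handle design described above can be sketched with a toy type. `ToyAsyncLookupHandle`, `ToyWaitAll`, and the use of a `std::function` to stand in for an in-flight lookup are assumptions for illustration; RocksDB's `AsyncLookupHandle` has different fields. The sketch shows a handle that can live on the stack but is neither copyable nor movable, a wait function that takes a pointer and a count instead of a `std::vector`, and a default-constructed handle behaving like a completed cache miss.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <functional>
      #include <string>

      struct ToyAsyncLookupHandle {
        std::string key;
        // Stand-in for an in-flight lookup; non-empty only while pending.
        std::function<const std::string*()> pending;
        const std::string* result = nullptr;  // nullptr means miss (or never started)

        ToyAsyncLookupHandle() = default;
        // A cache might keep a pointer to the handle, so forbid copies and moves.
        ToyAsyncLookupHandle(const ToyAsyncLookupHandle&) = delete;
        ToyAsyncLookupHandle& operator=(const ToyAsyncLookupHandle&) = delete;
        ToyAsyncLookupHandle(ToyAsyncLookupHandle&&) = delete;
        ToyAsyncLookupHandle& operator=(ToyAsyncLookupHandle&&) = delete;

        bool IsPending() const { return static_cast<bool>(pending); }
      };

      // Completes every pending lookup in a contiguous array of handles.
      // Default-constructed handles have nothing pending and stay completed misses.
      void ToyWaitAll(ToyAsyncLookupHandle* handles, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
          if (handles[i].IsPending()) {
            handles[i].result = handles[i].pending();
            handles[i].pending = nullptr;  // no longer pending
          }
        }
      }
      ```

      Because the type cannot be moved, it cannot be stored in a growing `std::vector`; a plain stack array of handles works with `ToyWaitAll`.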
      
      ## cacheable_entry.h
      A couple of functions are obsolete because Cache::Handle can no longer be pending.
      
      ## cache.cc
      Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.
      
      ## secondary_cache_adapter.{h,cc}
      The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.
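
      The adapter's lookup path can be sketched with two toy tiers: consult the primary cache first and, on a miss, the secondary; a secondary hit is inserted into the primary. `ToyCache`, `ToySecondaryAdapter`, and the promote-on-hit policy are assumptions for illustration. The sketch leaves out capacity limits, eviction (where, presumably, the eviction callback would demote entries to the secondary), and async lookups.

      ```cpp
      #include <cassert>
      #include <optional>
      #include <string>
      #include <unordered_map>

      // Stand-in for either tier: an unbounded key/value map.
      class ToyCache {
       public:
        void Insert(const std::string& key, const std::string& value) { map_[key] = value; }
        std::optional<std::string> Lookup(const std::string& key) const {
          auto it = map_.find(key);
          if (it == map_.end()) return std::nullopt;
          return it->second;
        }

       private:
        std::unordered_map<std::string, std::string> map_;
      };

      // Wraps a primary cache and adds secondary-cache handling on top of it,
      // so the primary implementation needs no secondary-specific logic.
      class ToySecondaryAdapter {
       public:
        ToySecondaryAdapter(ToyCache& primary, ToyCache& secondary)
            : primary_(primary), secondary_(secondary) {}

        std::optional<std::string> Lookup(const std::string& key) {
          if (auto v = primary_.Lookup(key)) return v;  // primary hit
          auto v = secondary_.Lookup(key);
          if (v) primary_.Insert(key, *v);  // promote the secondary hit
          return v;
        }

       private:
        ToyCache& primary_;
        ToyCache& secondary_;
      };
      ```

      Keeping this logic in a wrapper is what lets the primary cache stay free of secondary-cache code, as the commit does for LRUCache.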
      
      ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
      Simply updated for Cache API changes.
      
      ## lru_cache.{h,cc}
      Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.
      
      ## clock_cache.{h,cc}
      Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.
      
      ## block_based_table_reader*
      Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.
      
      Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).
      
      Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.
      
      ## Intended follow-up work
      * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
      * Stacked secondary caches (see above discussion)
      * See if we can make up for the small MultiGet performance regression.
      * Study more performance with SecondaryCache
      * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
      * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
      * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301
      
      Test Plan:
      ## Unit tests
      Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that lead to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.
      
      ## Crash/stress test
      Updated to use the new combination.
      
      ## Performance
      First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.
      
      ```
      (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
      ```
      
      **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
      HyperClockCache: 3168662 (average parallel ops/sec)
      LRUCache: 2940127
      
      **After** this and #11299, running for about an hour:
      HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
      LRUCache: 2940928 (0.03% faster)
      
      This is an acceptable difference IMHO.
      
      Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.
      
      Create DB and test (before and after tests running simultaneously):
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
      TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
      ```
      
      **Before**:
      multireadrandom [AVG    30 runs] : 3444202 (± 57049) ops/sec;  240.9 (± 4.0) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec;  245.8 MB/sec
      **After**:
      multireadrandom [AVG    30 runs] : 3291022 (± 58851) ops/sec;  230.2 (± 4.1) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec;  235.4 MB/sec
      
      So that's roughly a 4% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:
      
      **Before**:
      multireadrandom [AVG    30 runs] : 3933777 (± 41840) ops/sec;  275.1 (± 2.9) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec;  277.7 MB/sec
      **After**:
      multireadrandom [AVG    30 runs] : 3755338 (± 30391) ops/sec;  262.6 (± 2.1) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec;  264.8 MB/sec
      
      Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.
      
      Let's also look at Get() in db_bench:
      
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
      ```
      
      **Before**:
      readrandom [AVG    30 runs] : 2198685 (± 13412) ops/sec;  153.8 (± 0.9) MB/sec
      readrandom [MEDIAN 30 runs] : 2209498 ops/sec;  154.5 MB/sec
      **After**:
      readrandom [AVG    30 runs] : 2292814 (± 43508) ops/sec;  160.3 (± 3.0) MB/sec
      readrandom [MEDIAN 30 runs] : 2365181 ops/sec;  165.4 MB/sec
      
      That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:
      
      **Before**:
      readrandom [AVG    30 runs] : 2272333 (± 9992) ops/sec;  158.9 (± 0.7) MB/sec
      readrandom [MEDIAN 30 runs] : 2273239 ops/sec;  159.0 MB/sec
      **After**:
      readrandom [AVG    30 runs] : 2332407 (± 11252) ops/sec;  163.1 (± 0.8) MB/sec
      readrandom [MEDIAN 30 runs] : 2335329 ops/sec;  163.3 MB/sec
      
      Reviewed By: ltamasi
      
      Differential Revision: D44177044
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
      Increase the stress test coverage of GetEntity (#11303) · a72d55c9
      Levi Tamasi committed
      Summary:
      The `GetEntity` API is currently used in the stress tests for verification purposes;
      this patch extends the coverage by adding a mode where all point lookups in
      the non-batched, batched, and CF consistency stress tests are done using this API.
      The PR also includes a bit of refactoring to eliminate some boilerplate code around
      the wide-column consistency checks.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11303
      
      Test Plan: Ran stress tests of the batched, non-batched, and CF consistency varieties.
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D44148503
      
      Pulled By: ltamasi
      
      fbshipit-source-id: fecdbfd3e65a459bbf16ab7aa7b9173e19240077
  13. March 10, 2023 (1 commit)
  14. March 8, 2023 (1 commit)
  15. March 7, 2023 (1 commit)
      Add support for parameters setting related to async_io benchmarks (#11262) · 13357de0
      akankshamahajan committed
      Summary:
      Adds support in the benchmark regression runs for options that apply only to the async_io benchmarks: `$MAX_READAHEAD_SIZE`, `$INITIAL_READAHEAD_SIZE`, and `$NUM_READS_FOR_READAHEAD_SIZE`.
      If users want to set these parameters for all benchmarks, they need to set them in the OPTIONS file instead.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11262
      
      Test Plan: Ran manually
      
      Reviewed By: anand1976
      
      Differential Revision: D43725567
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 28c3462dd785ffd646d44560fa9c92bc6a8066e5
  16. February 15, 2023 (1 commit)
  17. February 11, 2023 (1 commit)
  18. February 10, 2023 (1 commit)
      Extend existing benchmarks seekrandom and multiread to run with async_io (#11170) · ab2157fa
      akankshamahajan committed
      Summary:
      ```
      =======================================================================
      Benchmark seekrandom_asyncio
      =======================================================================

      db_bench_cmd=$(which time) -p ./db_bench --benchmarks=seekrandom
      --db=/tmp/rocksdb/regression_test/db --wal_dir=
      --use_existing_db=0 --perf_level=1
      --disable_auto_compactions --threads=1 --num=1073741824
      --reads=1073741824 --writes=1073741824 --deletes=1073741824
      --key_size=100 --value_size=900 --cache_size=1073741824
      --statistics=0 --compression_ratio=0.5 --histogram=1
      --seek_nexts=10 --stats_per_interval=1
      --stats_interval_seconds=600 --max_background_flushes=4
      --num_multi_db=1 --max_background_compactions=16
      --num_high_pri_threads=4 --num_low_pri_threads=16
      --seed=1675181789 --multiread_batched=true --batch_size=128
      --multiread_stride=12 --async_io=true
      --optimize_multiget_for_io=false 2>&1
      RocksDB:    version 8.0.0

      =======================================================================
      Benchmark multireadrandom_asyncio
      =======================================================================

      db_bench_cmd=$(which time) -p ./db_bench
      --benchmarks=multireadrandom --db=/tmp/rocksdb/regression_test/db
      --wal_dir= --use_existing_db=0 --perf_level=1
      --disable_auto_compactions --threads=1 --num=1073741824
      --reads=1073741824 --writes=1073741824 --deletes=1073741824
      --key_size=100 --value_size=900 --cache_size=1073741824
      --statistics=0 --compression_ratio=0.5 --histogram=1
      --seek_nexts=10 --stats_per_interval=1
      --stats_interval_seconds=600 --max_background_flushes=4
      --num_multi_db=1 --max_background_compactions=16
      --num_high_pri_threads=4 --num_low_pri_threads=16
      --seed=1675181841 --multiread_batched=true --batch_size=128
      --multiread_stride=12 --async_io=true
      --optimize_multiget_for_io=true 2>&1
      RocksDB:    version 8.0.0
      Date:       Tue Jan 31 08:17:22 2023
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11170
      
      Reviewed By: ajkr, anand1976
      
      Differential Revision: D42889107
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: b819be2bd5f00d1db654b9e829b84f11e6bcab92
  19. February 7, 2023 (1 commit)
  20. February 3, 2023 (3 commits)
      Enable crash test for user-defined timestamp and BlobDB combination (#11163) · 701a19cc
      Yu Zhang committed
      Summary:
      Enable the crash tests for user-defined timestamps in combination with BlobDB.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11163
      
      Test Plan: `make check` and `db_stress`/`db_crashtest.py` with various combinations.
      
      Reviewed By: ltamasi
      
      Differential Revision: D42906457
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 6bec6449a4213b536c787420ff30a7d17b676deb
      Remove NUMA setting for benchmark-linux (#11180) · fec5c8de
      changyubi committed
      Summary:
      benchmark-linux is failing on the main branch after https://github.com/facebook/rocksdb/issues/11074 with the following error message:
      ```
      /usr/bin/time -f '%e %U %S' -o /tmp/benchmark-results/8.0.0/benchmark_overwriteandwait.t1.s0.log.time numactl --interleave=all timeout 1200 ./db_bench --benchmarks=overwrite,waitforcompaction,stats --use_existing_db=1 --sync=0 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=4 --max_write_buffer_number=8 --undefok=use_blob_cache,use_shared_block_and_blob_cache,blob_cache_size,blob_cache_numshardbits,prepopulate_blob_cache,multiread_batched,cache_low_pri_pool_ratio,prepopulate_block_cache --db=/tmp/rocksdb-benchmark-datadir --wal_dir=/tmp/rocksdb-benchmark-datadir --num=20000000 --key_size=20 --value_size=400 --block_size=8192 --cache_size=10737418240 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=none --bytes_per_sync=1048576 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --cache_low_pri_pool_ratio=0 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --report_interval_seconds=1 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --num_levels=8 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --duration=600 --threads=1 --merge_operator="put" --seed=1675372532 --report_file=/tmp/benchmark-results/8.0.0/benchmark_overwriteandwait.t1.s0.log.r.csv 2>&1 | tee -a /tmp/benchmark-results/8.0.0/benchmark_overwriteandwait.t1.s0.log
      /usr/bin/time: cannot run numactl: No such file or directory
      ```
      This PR removes the newly added NUMA setting, since `numactl` is not installed on the benchmark host.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11180
      
      Test Plan: check next main branch run for benchmark-linux
      
      Reviewed By: ajkr
      
      Differential Revision: D42975930
      
      Pulled By: cbi42
      
      fbshipit-source-id: f084d39aeba9877c0752502e879c5e612b507653
      fec5c8de
    • A
      CI Benchmarking. Small configuration changes based on performance analysis. (#11074) · 6781009e
      Alan Paxton committed
      Summary:
      First, we made a small reduction in DURATION_RW as runs were exceeding 1 hour and colliding with subsequent runs.
      
      See Mark Callaghan’s blog post at http://smalldatum.blogspot.com/2023/01/variance-in-rocksdb-benchmarks-on-cloud.html
      
      Configuration parameters that were inconsistent with the following email from Mark (see the blog post for more context) have been updated. Where Mark defines a parameter and we did not, we now define it explicitly. We will need to keep monitoring for the expected reduction in the variance of test times:
      
      To match what I did:
       ---
      
      nsecs=1800
      dbdir=/data/m/rx
      resultdir=bm.lc.nt1.cm1.d0
      
      env WRITE_BUFFER_SIZE_MB=16 TARGET_FILE_SIZE_BASE_MB=16 MAX_BYTES_FOR_LEVEL_BASE_MB=64 MAX_BACKGROUND_JOBS=4 NUM_KEYS=20000000 CACHE_SIZE_MB=10240 DURATION_RW=$nsecs DURATION_RO=$nsecs MB_WRITE_PER_SEC=2 NUM_THREADS=1 COMPRESSION_TYPE=none CACHE_INDEX_AND_FILTER_BLOCKS=1 VALUE_SIZE=400 NUMA=1 MIN_LEVEL_TO_COMPRESS=3 COMPACTION_STYLE=leveled bash benchmark_compare.sh $dbdir $resultdir 7.8.fb
      
      env WRITE_BUFFER_SIZE_MB=16 TARGET_FILE_SIZE_BASE_MB=16 MAX_BYTES_FOR_LEVEL_BASE_MB=64 MAX_BACKGROUND_JOBS=4 NUM_KEYS=200000000 CACHE_SIZE_MB=10240 DURATION_RW=$nsecs DURATION_RO=$nsecs MB_WRITE_PER_SEC=2 NUM_THREADS=1 COMPRESSION_TYPE=lz4 CACHE_INDEX_AND_FILTER_BLOCKS=1 VALUE_SIZE=400 NUMA=1 MIN_LEVEL_TO_COMPRESS=3 COMPACTION_STYLE=leveled bash benchmark_compare.sh $dbdir $resultdir 7.8.fb
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11074
      
      Reviewed By: ajkr
      
      Differential Revision: D42969668
      
      Pulled By: cbi42
      
      fbshipit-source-id: 1ea4e6a3901be4016108f93817eb58f74baac21a
      6781009e
  21. 31 Jan 2023, 1 commit
    • P
      Cleanup, improve, stress test LockWAL() (#11143) · 94e3beec
      Peter Dillinger committed
      Summary:
      The previous API comments for LockWAL didn't explain much about why you might want to use it, and the behavior didn't really match the contract one would infer from them. Also, LockWAL was not covered by db_stress / the crash test. In this change:
      
      * Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use.
      * Make sure no UnlockWAL() is needed after a non-OK LockWAL() (to match RocksDB conventions)
      * Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debuggability)
      * Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but rather about which operations might block each other.
      * Add LockWAL()/UnlockWAL() to db_stress and the crash test, mostly to check for assertion failures, but also to check that the latest seqno doesn't change while the WAL is locked. This is simpler to add now that LockWAL() is allowed from multiple threads.
      * Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of the above changes that caused this test to fail sporadically, with and without this sync point change.
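      A minimal sketch of the counting semantics described above, assuming a single mutex-protected counter. `CountingWalLock`, `SketchStatus`, and the `prepare_ok` parameter are hypothetical names for illustration, not RocksDB's implementation. Blocking writes while the WAL is locked is not modeled.

      ```cpp
      #include <cstdint>
      #include <mutex>
      #include <string>
      #include <utility>

      // Hypothetical stand-in for rocksdb::Status, reduced to what the sketch needs.
      struct SketchStatus {
        bool ok_ = true;
        std::string msg_;
        static SketchStatus OK() { return SketchStatus{}; }
        static SketchStatus Error(std::string msg) {
          return SketchStatus{false, std::move(msg)};
        }
        bool ok() const { return ok_; }
      };

      // Hypothetical counting WAL lock: each successful LockWAL() must be matched
      // by exactly one UnlockWAL(), from any thread or from a nested call.
      class CountingWalLock {
       public:
        // `prepare_ok` (hypothetical) simulates whether the work done before
        // locking succeeded. On failure the count is not incremented, so the
        // caller must not call UnlockWAL().
        SketchStatus LockWAL(bool prepare_ok = true) {
          if (!prepare_ok) {
            return SketchStatus::Error("LockWAL failed; no UnlockWAL() needed");
          }
          std::lock_guard<std::mutex> guard(mu_);
          ++lock_count_;
          return SketchStatus::OK();
        }

        // Returns non-OK when there is no outstanding LockWAL() to match.
        SketchStatus UnlockWAL() {
          std::lock_guard<std::mutex> guard(mu_);
          if (lock_count_ == 0) {
            return SketchStatus::Error("UnlockWAL() without matching LockWAL()");
          }
          --lock_count_;
          return SketchStatus::OK();
        }

        // In a real DB, writes would block while this is true (not modeled here).
        bool IsLocked() const {
          std::lock_guard<std::mutex> guard(mu_);
          return lock_count_ > 0;
        }

       private:
        mutable std::mutex mu_;
        uint64_t lock_count_ = 0;
      };
      ```

      Because the lock is a count rather than a flag, two threads (or nested calls in one thread) can each pair LockWAL() with UnlockWAL() independently. The WAL stays locked until the last outstanding UnlockWAL().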
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143
      
      Test Plan: unit tests added / updated, added to stress/crash test
      
      Reviewed By: ajkr
      
      Differential Revision: D42848627
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4
      94e3beec
  22. 28 Jan 2023, 2 commits
    • S
      Remove RocksDB LITE (#11147) · 4720ba43
      sdong committed
      Summary:
      We haven't been actively maintaining RocksDB LITE recently, and its size has likely grown significantly. We are removing support for it.
      
      Most of the changes were made by Peter Dillinger with the following command:
      
      unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
      
      Other changes were applied manually to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and the file db_stress_test_base.cc.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
      
      Test Plan: See CI
      
      Reviewed By: pdillinger
      
      Differential Revision: D42796341
      
      fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
      4720ba43
    • Y
      Remove deprecated util functions in options_util.h (#11126) · 6943ff6e
      Yu Zhang committed
      Summary:
      Remove the util functions in options_util.h that have previously been marked deprecated.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11126
      
      Test Plan: `make check`
      
      Reviewed By: ltamasi
      
      Differential Revision: D42757496
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 2a138a3c207d0e0e0bbb4d99548cf2cadb44bcfb
      6943ff6e
  23. 26 Jan 2023, 1 commit
    • A
      Support PutEntity in trace analyzer (#11127) · 6a5071ce
      Andrew Kryczka committed
      Summary:
      Add the most basic support such that trace_analyzer commands no longer fail with
      ```
      Cannot process the write batch in the trace
      Cannot process the TraceRecord
      PutEntityCF not implemented
      Cannot process the trace
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11127
      
      Reviewed By: cbi42
      
      Differential Revision: D42732319
      
      Pulled By: ajkr
      
      fbshipit-source-id: 162d8a31318672a46539b1b042ec25f69b25c4ed
      6a5071ce
  24. 25 Jan 2023, 1 commit
  25. 24 Jan 2023, 1 commit
  26. 19 Jan 2023, 1 commit
  27. 14 Jan 2023, 1 commit
    • L
      db_bench: let -benchmark=compact respect -subcompactions (#11077) · 3941c349
      leipeng committed
      Summary:
      When running `-benchmarks=compact`, `-subcompactions` does not take effect.
      
      The comment for the `-subcompactions` option says it applies to L0-L1 compactions, so it is natural to extend it to CompactRangeOptions.max_subcompactions.
      
      This PR sets CompactRangeOptions.max_subcompactions = FLAGS_subcompactions.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11077
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D42506251
      
      Pulled By: ajkr
      
      fbshipit-source-id: f77c9a99d32ff7af59f3c452c9e16aaeb0360304
      3941c349
  28. 04 Jan 2023, 1 commit
    • H
      Add back Options::CompactionOptionsFIFO::allow_compaction to stress/crash test (#11063) · b965a5a8
      Hui Xiao committed
      Summary:
      **Context/Summary:**
      https://github.com/facebook/rocksdb/pull/10777 was reverted (https://github.com/facebook/rocksdb/pull/10999) due to an internal blocker and replaced with a better fix, https://github.com/facebook/rocksdb/pull/10922. However, the revert also removed the `Options::CompactionOptionsFIFO::allow_compaction` stress/crash test coverage added by that PR.
      
      This coverage is useful because setting `Options::CompactionOptionsFIFO::allow_compaction=true` [increases](https://github.com/facebook/rocksdb/blob/7.8.fb/db/version_set.cc#L3255) the compaction score of L0 files under FIFO and therefore triggers more FIFO compactions. This speeds up the discovery of FIFO compaction bugs such as https://github.com/facebook/rocksdb/pull/10955. To see the speedup, compare how often the following command fails with `Options::CompactionOptionsFIFO::allow_compaction=true` versus `false`:
      
      ```
      --fifo_allow_compaction=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=8.869062094789008 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=100 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=15 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=65
      ```
      
      Therefore, this PR adds it back to the stress/crash test.
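      A simplified model of the FIFO score boost described above. `FifoL0CompactionScore` is a hypothetical helper. It assumes, based on our reading of the linked version_set.cc line, that the L0 score is a size ratio against `max_table_files_size`, and that `allow_compaction` additionally lets the L0 file count raise it. RocksDB's actual code may differ in detail, for example in how other FIFO options interact.

      ```cpp
      #include <algorithm>
      #include <cstdint>

      // Hypothetical, simplified FIFO L0 compaction score. A score >= 1.0 means
      // "compaction is needed". Assumes max_table_files_size > 0 and
      // level0_file_num_compaction_trigger > 0.
      double FifoL0CompactionScore(uint64_t total_l0_bytes,
                                   uint64_t max_table_files_size,
                                   int num_l0_files,
                                   int level0_file_num_compaction_trigger,
                                   bool allow_compaction) {
        // Size-based component: reaches 1.0 once L0 hits the FIFO size cap.
        double score = static_cast<double>(total_l0_bytes) /
                       static_cast<double>(max_table_files_size);
        if (allow_compaction) {
          // File-count component: many small L0 files can trigger an intra-L0
          // compaction even while L0 is well under the size cap.
          score = std::max(score, static_cast<double>(num_l0_files) /
                                      level0_file_num_compaction_trigger);
        }
        return score;
      }
      ```

      With 8 small L0 files and a trigger of 4, the score jumps from a size ratio well below 1.0 to 2.0 once `allow_compaction` is set. This is why enabling the option in the crash test exercises the intra-L0 compaction paths far more often.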
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11063
      
      Test Plan: Rehearsal stress test to make sure the stress/crash test is stable
      
      Reviewed By: ajkr
      
      Differential Revision: D42283650
      
      Pulled By: hx235
      
      fbshipit-source-id: 132e6396ab6e24d8dcb8fe51c62dd5211cdf53ef
      b965a5a8
  29. 30 Nov 2022, 1 commit
  30. 22 Nov 2022, 1 commit
  31. 17 Nov 2022, 1 commit
  32. 05 Nov 2022, 1 commit
  33. 28 Oct 2022, 1 commit
  34. 26 Oct 2022, 2 commits
    • S
      Run clang format against files under tools/ and db_stress_tool/ (#10868) · 48fe9217
      sdong committed
      Summary:
      Some lines of .h and .cc files are not properly formatted. Clean them up with clang-format.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10868
      
      Test Plan: Watch existing CI to pass
      
      Reviewed By: ajkr
      
      Differential Revision: D40683485
      
      fbshipit-source-id: 491fbb78b2cdcb948164f306829909ad816d5d0b
      48fe9217
    • H
      Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos... · fc74abb4
      Hui Xiao committed
      Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's (#10777)
      
      Summary:
      **Context:**
      Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930, but applying the fix to the FIFO compaction case.
      Repro:
      ```
      COERCE_CONTEXT_SWITCH=1 make -j56 db_stress
      
      ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35
      
      put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049
      ```
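      The corruption message above reports two L0 files whose seqno ranges, [23711, 29070] and [27138, 29049], overlap. Below is a simplified sketch of that kind of consistency check. It assumes L0 files are ordered newest first by largest seqno. `L0File` and `HasOverlappingL0Seqnos` are hypothetical names, and RocksDB's actual check in `VersionBuilder` may handle additional cases.

      ```cpp
      #include <cstddef>
      #include <cstdint>
      #include <vector>

      // Hypothetical, simplified L0 file metadata.
      struct L0File {
        uint64_t number;
        uint64_t smallest_seqno;
        uint64_t largest_seqno;
      };

      // With L0 files ordered newest first, each file's whole seqno range should
      // lie strictly above the next (older) file's range. Returns true if any
      // adjacent pair overlaps.
      bool HasOverlappingL0Seqnos(const std::vector<L0File>& newest_first) {
        for (std::size_t i = 0; i + 1 < newest_first.size(); ++i) {
          const L0File& newer = newest_first[i];
          const L0File& older = newest_first[i + 1];
          if (newer.smallest_seqno <= older.largest_seqno) {
            return true;  // e.g. file 479 [23711, 29070] vs. file 482 [27138, 29049]
          }
        }
        return false;
      }
      ```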
      
      **Summary:**
      FIFO only does intra-L0 compaction in the following four cases. In all other cases, FIFO drops data instead of compacting it, which is irrelevant to the overlapping seqno issue we are solving.
      -  [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
         - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
         - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here.
      - [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
        - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
        - This path was not stress-tested at all. However, covering the `age_for_warm` option deserves a separate PR to deal with db_stress compatibility, so we tested this path manually for this PR.
      - [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
      - [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
          - Since `SanitizeCompactionInputFiles()` is called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles`, we simply replicate the idea of https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify the implementation, we return `Status::Aborted()` when we encounter a seqno-overlapped file while compacting to L0, instead of skipping the file and proceeding with the compaction.
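      The picking rule in the cases above can be sketched as follows: skip any L0 file whose largest seqno is above `earliest_mem_seqno`, assuming files are ordered newest first. `L0FileMeta`, `FirstSafeIntraL0Input`, and `InputsSafeForL0Output` are hypothetical names. The second function mirrors the choice to abort rather than skip in `SanitizeCompactionInputFiles()`, but it returns a bool instead of a RocksDB `Status`.

      ```cpp
      #include <cstddef>
      #include <cstdint>
      #include <vector>

      // Hypothetical, simplified L0 file metadata.
      struct L0FileMeta {
        uint64_t number;
        uint64_t largest_seqno;
      };

      // Returns the index of the first file (newest first) that may join an
      // intra-L0 compaction, or newest_first.size() if none may. Newer files
      // whose largest seqno is above earliest_mem_seqno (e.g. ingested files)
      // are skipped: compacting them with older files could produce an output
      // whose seqno range straddles data still in the memtable.
      std::size_t FirstSafeIntraL0Input(const std::vector<L0FileMeta>& newest_first,
                                        uint64_t earliest_mem_seqno) {
        std::size_t start = 0;
        for (; start < newest_first.size(); ++start) {
          if (newest_first[start].largest_seqno <= earliest_mem_seqno) {
            break;  // this file and every older file are safe
          }
        }
        return start;
      }

      // Abort-style check for an explicit L0-output compaction: reject the
      // whole input set if any file is unsafe (a real caller would return an
      // Aborted status here).
      bool InputsSafeForL0Output(const std::vector<L0FileMeta>& inputs,
                                 uint64_t earliest_mem_seqno) {
        for (const L0FileMeta& f : inputs) {
          if (f.largest_seqno > earliest_mem_seqno) {
            return false;
          }
        }
        return true;
      }
      ```

      Because the files are sorted by descending largest seqno, only a prefix of the list can exceed `earliest_mem_seqno`. The scan can therefore stop at the first safe file.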
      
      Some additional clean-up included in this PR:
      - Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
      - Added comment about `earliest_memtable_seqno` in related APIs
      - Made parameter `earliest_memtable_seqno` constant and required
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777
      
      Test Plan:
      - make check
      - New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)` corresponding to the above 4 cases, which fails accordingly without the fix
      - Regular CI stress run on this PR, plus a stress test with aggressive values (https://github.com/facebook/rocksdb/pull/10761) and on FIFO compaction only
      
      Reviewed By: ajkr
      
      Differential Revision: D40090485
      
      Pulled By: hx235
      
      fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
      fc74abb4
  35. 24 Oct 2022, 1 commit