1. 06 Sep 2023 (1 commit)
    • Added compaction read errors to `db_stress` (#11789) · 392d6957
      Committed by Andrew Kryczka
      Summary:
      - Fixed misspellings of "inject"
      - Made user read errors retryable when `FLAGS_inject_error_severity == 1`
      - Added compaction read errors when `FLAGS_read_fault_one_in > 0`. These are always retryable so that the DB will keep accepting writes
      - Reenabled setting `compaction_readahead_size` in crash test. The reason for disabling it was to "keep the test clean", which is not a good enough reason to skip testing it
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11789
      
      Test Plan:
      With https://github.com/facebook/rocksdb/issues/11782 reverted, reproduced the bug:
      - Build: `make -j56 db_stress`
      - Command: `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --interval=10 --max_key=1000000`
      - Output:
      ```
      stderr has error message:
      ***put or merge error: Corruption: Compaction number of input keys does not match number of keys processed.***
      ```
      
      Reviewed By: cbi42
      
      Differential Revision: D48939994
      
      Pulled By: ajkr
      
      fbshipit-source-id: a1efb799efecdfd5d9cfd185e4a6321db8fccfbb
  2. 25 Aug 2023 (1 commit)
  3. 24 Aug 2023 (1 commit)
    • Run db_stress for final time to ensure un-interrupted validation (#11592) · bc448e9c
      Committed by Fuat Basik
      Summary:
In blackbox tests, the db_stress command always runs with a timeout. The timeout can fire during validation, leaving some keys unchecked. Since key validation is done in order, keys toward the end of the set are quite likely to never be validated. This PR adds a final execution, without a timeout, to ensure validation is executed for all keys at least once.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11592
      
      Reviewed By: cbi42
      
      Differential Revision: D48003998
      
      Pulled By: hx235
      
      fbshipit-source-id: 72543475a932f12cf0f57534b7e3b6e07e87080f
  4. 23 Aug 2023 (1 commit)
    • Do not drop unsynced data during reopen in stress test (#11731) · 5e0584bd
      Committed by Changyu Bi
      Summary:
Currently the stress test does not support restoring the expected state (to a specific sequence number) when there is unsynced data loss during the reopen phase. This causes a few internal stress test failures with errors like inconsistent value. This PR disables dropping unsynced data during reopen to avoid failures due to this issue. We can re-enable it later once we decide to support unsynced data loss during DB reopen in the stress test.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11731
      
      Test Plan:
      * Running this test a few times can fail for inconsistent value before this change
      ```
      ./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=20.57166126835524 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --enable_thread_tracking=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=6 --index_type=3 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --long_running_snapshots=1 --manual_wal_flush_one_in=1000000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=5 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=20 --ribbon_starting_level=0 --secondary_cache_fault_one_in=32 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=3 --sync=0 --sync_fault_injection=1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 
--top_level_index_pinning=2 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35
```
      
      Reviewed By: hx235
      
      Differential Revision: D48537494
      
      Pulled By: cbi42
      
      fbshipit-source-id: ddae21b9bb6ee8d67229121f58513e95f7ef6d8d
  5. 19 Aug 2023 (1 commit)
    • Add `CompressionOptions::checksum` for enabling ZSTD checksum (#11666) · c2aad555
      Committed by Changyu Bi
      Summary:
      Optionally enable zstd checksum flag (https://github.com/facebook/zstd/blob/d857369028d997c92ff1f1861a4d7f679a125464/lib/zstd.h#L428) to detect corruption during decompression. Main changes are in compression.h:
      * User can set CompressionOptions::checksum to true to enable this feature.
      * We enable this feature in ZSTD by setting the checksum flag in ZSTD compression context: `ZSTD_CCtx`.
* Uses `ZSTD_compress2()` to do compression since it supports frame parameters like the checksum flag. The compression level is also set in the compression context as a flag.
* Added error handling during decompression to propagate error messages from ZSTD.
      * Updated microbench to test read performance impact.
      
Regarding compatibility, the current compression decoders should continue to work with data created by the new compression API `ZSTD_compress2()`: https://github.com/facebook/zstd/issues/3711.
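For illustration, a minimal sketch of the two ZSTD calls involved, using only the public ZSTD C API (the buffer handling and the level value 3 are arbitrary choices for this example, not taken from compression.h):

```
#include <zstd.h>

#include <string>
#include <vector>

// Sketch: compress `input` with the ZSTD checksum frame flag enabled,
// mirroring the approach described above for CompressionOptions::checksum.
std::vector<char> CompressWithChecksum(const std::string& input) {
  ZSTD_CCtx* cctx = ZSTD_createCCtx();
  // Level and checksum flag are both set on the compression context, which
  // is why the frame-parameter-aware ZSTD_compress2() is needed.
  ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 3);
  ZSTD_CCtx_setParameter(cctx, ZSTD_c_checksumFlag, 1);
  std::vector<char> output(ZSTD_compressBound(input.size()));
  size_t ret = ZSTD_compress2(cctx, output.data(), output.size(),
                              input.data(), input.size());
  ZSTD_freeCCtx(cctx);
  output.resize(ZSTD_isError(ret) ? 0 : ret);
  return output;
}
```

On the RocksDB side, per the summary above, opting in is just `options.compression = kZSTD; options.compression_opts.checksum = true;`.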
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11666
      
      Test Plan:
      * Existing unit tests for zstd compression
      * Add unit test `DBTest2.ZSTDChecksum` to test the corruption case
      * Manually tested that compression levels, parallel compression, dictionary compression, index compression all work with the new ZSTD_compress2() API.
      * Manually tested with `sst_dump --command=recompress` that different compression levels and dictionary compression settings all work.
      * Manually tested compiling with older versions of ZSTD: v1.3.8, v1.1.0, v0.6.2.
      * Perf impact: from public benchmark data: http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html for checksum and https://github.com/facebook/zstd#benchmarks, if decompression is 1700MB/s and checksum computation is 70000MB/s, checksum computation is an additional ~2.4% time for decompression. Compression is slower and checksumming should be less noticeable.
      * Microbench:
      ```
      TEST_TMPDIR=/dev/shm ./branch_db_basic_bench --benchmark_filter=DBGet/comp_style:0/max_data:1048576/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:0/compression_type:7/compression_checksum:1/no_blockcache:1/iterations:10000/threads:1 --benchmark_repetitions=100
      
      Min out of 100 runs:
      Main:
      10390 10436 10456 10484 10499 10535 10544 10545 10565 10568
      
      After this PR, checksum=false
      10285 10397 10503 10508 10515 10557 10562 10635 10640 10660
      
      After this PR, checksum=true
      10827 10876 10925 10949 10971 11052 11061 11063 11100 11109
      ```
      * db_bench:
      ```
      Write perf
      TEST_TMPDIR=/dev/shm/ ./db_bench_ichecksum --benchmarks=fillseq[-X10] --compression_type=zstd --num=10000000 --compression_checksum=..
      
      [FillSeq checksum=0]
      fillseq [AVG    10 runs] : 281635 (± 31711) ops/sec;   31.2 (± 3.5) MB/sec
      fillseq [MEDIAN 10 runs] : 294027 ops/sec;   32.5 MB/sec
      
      [FillSeq checksum=1]
      fillseq [AVG    10 runs] : 286961 (± 34700) ops/sec;   31.7 (± 3.8) MB/sec
      fillseq [MEDIAN 10 runs] : 283278 ops/sec;   31.3 MB/sec
      
      Read perf
      TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=readrandom[-X20] --num=100000000 --reads=1000000 --use_existing_db=true --readonly=1
      
      [Readrandom checksum=1]
      readrandom [AVG    20 runs] : 360928 (± 3579) ops/sec;    4.0 (± 0.0) MB/sec
      readrandom [MEDIAN 20 runs] : 362468 ops/sec;    4.0 MB/sec
      
      [Readrandom checksum=0]
      readrandom [AVG    20 runs] : 380365 (± 2384) ops/sec;    4.2 (± 0.0) MB/sec
      readrandom [MEDIAN 20 runs] : 379800 ops/sec;    4.2 MB/sec
      
      Compression
      TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=compress[-X20] --compression_type=zstd --num=100000000 --compression_checksum=1
      
      checksum=1
      compress [AVG    20 runs] : 54074 (± 634) ops/sec;  211.2 (± 2.5) MB/sec
      compress [MEDIAN 20 runs] : 54396 ops/sec;  212.5 MB/sec
      
      checksum=0
      compress [AVG    20 runs] : 54598 (± 393) ops/sec;  213.3 (± 1.5) MB/sec
      compress [MEDIAN 20 runs] : 54592 ops/sec;  213.3 MB/sec
      
      Decompression:
      TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=uncompress[-X20] --compression_type=zstd --compression_checksum=1
      
      checksum = 0
      uncompress [AVG    20 runs] : 167499 (± 962) ops/sec;  654.3 (± 3.8) MB/sec
      uncompress [MEDIAN 20 runs] : 167210 ops/sec;  653.2 MB/sec
      checksum = 1
      uncompress [AVG    20 runs] : 167980 (± 924) ops/sec;  656.2 (± 3.6) MB/sec
      uncompress [MEDIAN 20 runs] : 168465 ops/sec;  658.1 MB/sec
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D48019378
      
      Pulled By: cbi42
      
      fbshipit-source-id: 674120c6e1853c2ced1436ac8138559d0204feba
  6. 17 Aug 2023 (1 commit)
    • Delay bottommost level single file compactions (#11701) · d1ff4014
      Committed by Changyu Bi
      Summary:
For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small number of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions; a usage sketch follows the change description below.
      
* The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()`, where we only add a file to `bottommost_files_marked_for_compaction_` if the oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and the compaction score has to be re-calculated.
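A minimal sketch of setting the new option (assuming, per the option name and the delay semantics above, that the value is a duration in seconds, with 0 keeping the current behavior):

```
#include <rocksdb/options.h>

rocksdb::Options MakeDelayedBottommostOptions() {
  rocksdb::Options options;
  // Assumed to be seconds: only schedule a bottommost single-file
  // compaction once the file has been eligible for at least this long.
  options.bottommost_file_compaction_delay = 3600;
  return options;
}
```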
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701
      
      Test Plan:
      * Add two unit tests to test when bottommost_file_compaction_delay > 0.
      * Ran crash test with the new option.
      
      Reviewed By: jaykorean, ajkr
      
      Differential Revision: D48331564
      
      Pulled By: cbi42
      
      fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8
  7. 12 Aug 2023 (1 commit)
    • Placeholder for AutoHyperClockCache, more (#11692) · ef6f0255
      Committed by Peter Dillinger
      Summary:
      * The plan is for AutoHyperClockCache to be selected when HyperClockCacheOptions::estimated_entry_charge == 0, and in that case to use a new configuration option min_avg_entry_charge for determining an extreme case maximum size for the hash table. For the placeholder, a hack is in place in HyperClockCacheOptions::MakeSharedCache() to make the unit tests happy despite the new options not really making sense with the current implementation.
      * Mostly updating and refactoring tests to test both the current HCC (internal name FixedHyperClockCache) and a placeholder for the new version (internal name AutoHyperClockCache).
      * Simplify some existing tests not to depend directly on cache type.
      * Type-parameterize the shard-level unit tests, which unfortunately requires more syntax like `this->` in places for disambiguation.
      * Added means of choosing auto_hyper_clock_cache to cache_bench, db_bench, and db_stress, including add to crash test.
      * Add another templated class BaseHyperClockCache to reduce future copy-paste
      * Added ReportProblems support to cache_bench
      * Added a DEBUG-level diagnostic to ReportProblems for the variance in load factor throughout the table, which will become more of a concern with linear hashing to be used in the Auto implementation. Example with current Fixed HCC:
      ```
      2023/08/10-13:41:41.602450 6ac36 [DEBUG] [che/clock_cache.cc:1507] Slot occupancy stats: Overall 49% (129008/262144), Min/Max/Window = 39%/60%/500, MaxRun{Pos/Neg} = 18/17
      ```
      
      In other words, with overall occupancy of 49%, the lowest across any 500 contiguous cells is 39% and highest 60%. Longest run of occupied is 18 and longest run of unoccupied is 17. This seems consistent with random samples from a uniform distribution.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11692
      
      Test Plan: Shouldn't be any meaningful changes yet to production code or to what is tested, but there is temporary redundancy in testing until the new implementation is plugged in.
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D48247413
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 11541f996d97af403c2e43c92fb67ff22dd0b5da
  8. 11 Aug 2023 (1 commit)
    • Adjust db_stress handling of TryAgain from optimistic txn (#11691) · a85eccc6
      Committed by Peter Dillinger
      Summary:
We're still getting some rare cases of 5x TryAgains in a row. Here I'm boosting the failure threshold to 10 in a row and adding more info to the output, to help us manually verify whether there's anything suspicious about the sequence of TryAgains, such as whether Rollback failed to reset to new sequence numbers.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11691
      
      Test Plan: By lowering the threshold to 2 and adjusting some other db_crashtest parameters, I was able to hit my new code and saw fresh sequence number on the subsequent TryAgain.
      
      Reviewed By: cbi42
      
      Differential Revision: D48236153
      
      Pulled By: pdillinger
      
      fbshipit-source-id: c0530e969ddcf8de7348e5cf7daf5d6d5dec24f4
  9. 09 Aug 2023 (2 commits)
    • Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444) · 9a034801
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      - Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
         - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
   - The new user read stats breakdown is guarded by `kExceptDetailedTimers`, since measurement shows a 4-5% regression relative to upstream/main.
      
      - Misc
   - More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we have completed passing `ReadOptions`/`IOOptions` down to the FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
         - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444
      
      Test Plan:
      - CI fake db crash/stress test
      - Microbenchmarking
      
      **Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
      - google benchmark version: https://github.com/google/benchmark/commit/604f6fd3f4b34a84ec4eb4db81d842fa4db829cd
      - db_basic_bench_base: upstream
      - db_basic_bench_pr: db_basic_bench_base + this PR
      - asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
      - asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR
      
      **Test**
      
      Get
      ```
      TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
      ```
      
      Result
      ```
      Coming soon
      ```
      
      AsyncRead
      ```
      TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
      ```
      
      Result
      ```
      Base:
      1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,
      
      PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
      1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D45918925
      
      Pulled By: hx235
      
      fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
    • Log user_defined_timestamps_persisted flag in event logger (#11683) · 9c2ebcc2
      Committed by Yu Zhang
      Summary:
As titled; also removed an undefined and unused member function in ColumnFamilyData.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11683
      
      Reviewed By: ajkr
      
      Differential Revision: D48156290
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: cc99aaafe69db6611af3854cb2b2ebc5044941f7
  10. 08 Aug 2023 (1 commit)
    • Prepare tests for new HCC naming (#11676) · 99daea34
      Committed by Peter Dillinger
      Summary:
      I'm anticipating using the public name HyperClockCache for both the current version with a fixed-size table and the upcoming version with an automatically growing table. However, for simplicity of testing them as substantially distinct implementations, I want to give them distinct internal names, like FixedHyperClockCache and AutoHyperClockCache.
      
      This change anticipates that by renaming to FixedHyperClockCache and assuming for now that all the unit tests run on HCC will run and behave similarly for the automatic HCC. Obviously updates will need to be made, but I'm trying to avoid uninteresting find & replace updates in what will be a large and engineering-heavy PR for AutoHCC
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11676
      
      Test Plan: no behavior change intended, except logging will now use the name FixedHyperClockCache
      
      Reviewed By: ajkr
      
      Differential Revision: D48103165
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a33f1901488fea102164c2318e2f2b156aaba736
  11. 03 Aug 2023 (1 commit)
    • Add an option to trigger flush when the number of range deletions reach a threshold (#11358) · 87a21d08
      Committed by Vardhan
      Summary:
      Add a mutable column family option `memtable_max_range_deletions`. When non-zero, RocksDB will try to flush the current memtable after it has at least `memtable_max_range_deletions` range deletions. Java API is added and crash test is updated accordingly to randomly enable this option.
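Since the option is mutable, it can also be flipped on a live DB; a minimal sketch (the value 20 mirrors the crash test command below and is otherwise arbitrary):

```
#include <rocksdb/db.h>

// Sketch: flush after ~20 range deletions, either set at open time via
// Options or changed online via SetOptions() since the option is mutable.
rocksdb::Status EnableRangeDelTriggeredFlush(rocksdb::DB* db) {
  return db->SetOptions({{"memtable_max_range_deletions", "20"}});
}
```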
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11358
      
      Test Plan:
      * New unit test: `DBRangeDelTest.MemtableMaxRangeDeletions`
      * Ran crash test `python3 ./tools/db_crashtest.py whitebox --simple --memtable_max_range_deletions=20` and saw logs showing flushed memtables usually with 20 range deletions.
      
      Reviewed By: ajkr
      
      Differential Revision: D46582680
      
      Pulled By: cbi42
      
      fbshipit-source-id: f23d6fa8d8264ecf0a18d55c113ba03f5e2504da
  12. 29 Jul 2023 (1 commit)
    • Allow TryAgain in db_stress with optimistic txn, and refactoring (#11653) · b3c54186
      Committed by Peter Dillinger
      Summary:
      In rare cases, optimistic transaction commit returns TryAgain. This change tolerates that intentional behavior in db_stress, up to a small limit in a row. This way, we don't miss a possible regression with excessive TryAgain, and trying again (rolling back the transaction) should have a well renewed chance of success as the writes will be associated with fresh sequence numbers.
      
      Also, some of the APIs were not clear about Transaction semantics, so I have clarified:
      * (Best I can tell....) Destroying a Transaction is safe without calling Rollback() (or at least should be). I don't know why it's a common pattern in our test code and examples to rollback before unconditional destruction. Stress test updated not to call Rollback unnecessarily (to test safe destruction).
* Despite essentially doing what is asked, simply retrying Commit() when it returns TryAgain has no chance of success, because the transaction is bound to the DB state at the time of the operations before Commit. Similar logic applies to Busy AFAIK. Commit() API comments updated, and expanded unit test in optimistic_transaction_test. (See the retry sketch below.)
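To make that implication concrete, a hedged sketch of the retry pattern (the helper name, the writes callback, and the retry limit are illustrative, not the actual db_stress code):

```
#include <rocksdb/utilities/optimistic_transaction_db.h>

#include <functional>
#include <memory>

using ROCKSDB_NAMESPACE::OptimisticTransactionDB;
using ROCKSDB_NAMESPACE::Status;
using ROCKSDB_NAMESPACE::Transaction;
using ROCKSDB_NAMESPACE::WriteOptions;

// Illustrative only: retrying Commit() on the same transaction cannot
// succeed after TryAgain, so roll back, begin a fresh transaction (bound
// to fresh sequence numbers), redo the writes, and commit again.
Status RunOptimisticTxn(OptimisticTransactionDB* db,
                        const std::function<Status(Transaction*)>& writes,
                        int max_try_agains) {
  std::unique_ptr<Transaction> txn(db->BeginTransaction(WriteOptions()));
  for (int i = 0; i <= max_try_agains; ++i) {
    Status s = writes(txn.get());
    if (!s.ok()) return s;
    s = txn->Commit();
    if (!s.IsTryAgain()) return s;  // ok, or a real error
    s = txn->Rollback();
    if (!s.ok()) return s;
    txn.reset(db->BeginTransaction(WriteOptions()));
  }
  return Status::TryAgain("too many TryAgains in a row");
}
```

Note that per the first bullet above, destroying the transaction without an explicit Rollback() should also be safe; the Rollback() here is only part of the retry path.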
      
      Also also, because I can't stop myself, I refactored a good portion of the transaction handling code in db_stress.
      * Avoid existing and new copy-paste for most transaction interactions with a new ExecuteTransaction (higher-order) function.
      * Use unique_ptr (nicely complements removing unnecessary Rollbacks)
      * Abstract out a pattern for safely calling std::terminate() and use it in more places. (The TryAgain errors we saw did not have stack traces because of "terminate called recursively".)
      
      Intended follow-up: resurrect use of `FLAGS_rollback_one_in` but also include non-trivial cases
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11653
      
      Test Plan:
      this is the test :)
      
      Also, temporarily bypassed the new retry logic and boosted the chance of hitting TryAgain. Quickly reproduced the TryAgain error. Then re-enabled the new retry logic, and was not able to hit the error after running for tens of minutes, even with the boosted chances.
      
      Reviewed By: cbi42
      
      Differential Revision: D47882995
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 21eadb1525423340dbf28d17cf166b9583311a0d
  13. 20 Jun 2023 (1 commit)
  14. 18 Jun 2023 (1 commit)
    • Stress/Crash Test for OptimisticTransactionDB (#11513) · 17d52005
      Committed by Jay Huh
      Summary:
      Context:
      OptimisticTransactionDB has not been covered by db_stress (including crash test) like TransactionDB.
1. Adding the following gflag options to test OptimisticTransactionDB
      - `use_optimistic_txn`: When true, open OptimisticTransactionDB to test
      - `occ_validation_policy`: `OccValidationPolicy::kValidateParallel = 1` by default.
      - `share_occ_lock_buckets`: Use shared occ locks
      - `occ_lock_bucket_count`: 500 by default. Number of buckets to use for shared occ lock.
      2. Opening OptimisticTransactionDB and NewTxn/Commit added per `use_optimistic_txn` flag in `db_stress_test_base.cc`
      3. OptimisticTransactionDB blackbox/whitebox test added in crash_test.mk
      
Please note that the existing flag `use_txn` is being used here. When `use_txn == true` and `use_optimistic_txn == false`, we use `TransactionDB` (a.k.a. pessimistic transaction db). When both `use_txn` and `use_optimistic_txn` are true, we use `OptimisticTransactionDB`. If `use_txn == false` but `use_optimistic_txn == true`, we throw an error with the message _"You cannot set use_optimistic_txn true while use_txn is false. Please set use_txn true if you want to use OptimisticTransactionDB"_.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11513
      
      Test Plan:
      **Crash Test**
      Serial Validation
      ```
      export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=0"
      make crash_test -j
      ```
      Parallel Validation (no share bucket)
      ```
      export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=0"
      make crash_test -j
      ```
      Parallel Validation (share bucket)
      ```
      export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=1 --occ_lock_bucket_count=500"
      make crash_test -j
      ```
      
      **Stress Test**
      ```
      ./db_stress -use_optimistic_txn -threads=32
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D46547387
      
      Pulled By: jaykorean
      
      fbshipit-source-id: ca19819ca6e0281694966998014b40d95d4e5960
  15. 26 May 2023 (1 commit)
    • Add WaitForCompact with WaitForCompactOptions to public API (#11436) · 81aeb159
      Committed by Jay Huh
      Summary:
      Context:
      
      This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions`
      
      1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB.
      2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing.
      
      1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct.
      2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h`
      3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option.
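A minimal usage sketch of the new public API as described above (assuming an open `rocksdb::DB* db`):

```
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Sketch: block until outstanding compactions finish, bailing out with
// Status::Aborted if background work has been paused.
rocksdb::Status WaitUntilCompactionsSettle(rocksdb::DB* db) {
  rocksdb::WaitForCompactOptions wait_opts;
  wait_opts.abort_on_pause = true;  // moved here from the internal API
  return db->WaitForCompact(wait_opts);
}
```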
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436
      
      Test Plan:
      Following tests added
      - `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish`
      - `DBCompactionTest::WaitForCompactAbortOnPauseAborted`
      - `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted`
      - `DBCompactionTest::WaitForCompactShutdownWhileWaiting`
      - `TransactionTest::WaitForCompactAbortOnPause`
      
      NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place.
      
      Reviewed By: pdillinger
      
      Differential Revision: D45799659
      
      Pulled By: jaykorean
      
      fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b
  16. 18 May 2023 (1 commit)
    • Remove wait_unscheduled from waitForCompact internal API (#11443) · 586d78b3
      Committed by Jay Huh
      Summary:
      Context:
      
      In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.
      
      In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.
      
      This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.
      
      Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.
      
      Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443
      
      Test Plan:
      Existing tests that involve a shutdown in progress:
      
      - DBCompactionTest::CompactRangeShutdownWhileDelayed
      - DBTestWithParam::PreShutdownMultipleCompaction
      - DBTestWithParam::PreShutdownCompactionMiddle
      
      Reviewed By: pdillinger
      
      Differential Revision: D45923426
      
      Pulled By: jaykorean
      
      fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
  17. 16 May 2023 (1 commit)
    • Support parallel read and write/delete to same key in NonBatchedOpsStressTest (#11058) · 5fc57eec
      Committed by Hui Xiao
      Summary:
      **Context:**
The current `NonBatchedOpsStressTest` does not allow multi-threaded reads (i.e., Get, Iterator) and writes (i.e., Put, Merge) or deletes to the same key. Every read or write/delete operation will acquire a lock (`GetLocksForKeyRange`) on the target key to gain exclusive access to it. This does not align with RocksDB's nature of allowing multi-threaded reads and writes/deletes to the same key, that is, concurrent threads can issue reads/writes/deletes to RocksDB without external locking. Therefore this is a gap in our testing coverage.
      
To close the gap, the biggest challenge is verifying db values against the expected state in the presence of parallel reads and writes/deletes. The challenge arises because a read/write/delete to the db and the corresponding read/write of the expected state do not form one atomic operation. Therefore we may not know the exact expected state of a certain db read: by the time we read the expected state for that db read, another write to the expected state, for another db write to the same key, might have changed it.
      
      **Summary:**
Credit to ajkr for the idea: we now solve this challenge by breaking the 32-bit expected value of a key into different parts that can be read and written in parallel.

Basically, we divide the 32-bit expected value into `value_base` (corresponding to the previous whole 32 bits, but now with a shrunken value base range), `pending_write` (i.e., whether there is an ongoing concurrent write), `del_counter` (i.e., the number of times a value has been deleted, analogous to value_base for writes), `pending_delete` (similar to pending_write) and `deleted` (i.e., whether the key is deleted).

Also, we need to use an incremental `value_base` instead of a random value base as before, because we want to control the range of value bases that a correct db read result can possibly fall in, in the presence of parallel reads and writes. That way, we can more easily verify the correctness of a read against the expected state. This comes at the cost of reduced randomness of the values generated in NonBatchedOpsStressTest, which we are willing to accept.
      
      (For detailed algorithm of how to use these parts to infer expected state of a key, see the PR)
      
Misc: hide the value_base details from callers of ExpectedState by abstracting the related logic into an ExpectedValue class. (A hypothetical sketch of the packing idea follows.)
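A hypothetical illustration of the packing idea (the field widths here are made up; the actual layout lives in the stress test's ExpectedValue class):

```
#include <cstdint>

// Hypothetical packing of the 32-bit expected value described above.
struct PackedExpected {
  uint32_t value_base : 14;    // shrunken range of the old 32-bit value
  uint32_t pending_write : 1;  // a concurrent write is in flight
  uint32_t del_counter : 14;   // times deleted; the value_base of deletes
  uint32_t pending_del : 1;    // a concurrent delete is in flight
  uint32_t deleted : 1;        // whether the key is currently deleted
};
static_assert(sizeof(PackedExpected) == sizeof(uint32_t),
              "all five parts must fit in one 32-bit word");
```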
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11058
      
      Test Plan:
- Manual test of a small number of keys (i.e., high chance of parallel read and write/delete to the same key) with equally distributed read/write/delete for 30 min
      ```
      python3 tools/db_crashtest.py --simple {blackbox|whitebox} --sync_fault_injection=1 --skip_verifydb=0 --continuous_verification_interval=1000 --clear_column_family_one_in=0 --max_key=10 --column_families=1 --threads=32 --readpercent=25 --writepercent=25 --nooverwritepercent=0 --iterpercent=25 --verify_iterator_with_expected_state_one_in=1 --num_iterations=5 --delpercent=15 --delrangepercent=10 --range_deletion_width=5 --use_merge={0|1} --use_put_entity_one_in=0 --use_txn=0 --verify_before_write=0 --user_timestamp_size=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --flush_one_in=1000 --get_property_one_in=1000 --ingest_external_file_one_in=100 --backup_one_in=100 --checkpoint_one_in=100 --approximate_size_one_in=0 --acquire_snapshot_one_in=100 --use_multiget=0 --prefixpercent=0 --get_live_files_one_in=1000 --manual_wal_flush_one_in=1000 --pause_background_one_in=1000 --target_file_size_base=524288 --write_buffer_size=524288 --verify_checksum_one_in=1000 --verify_db_one_in=1000
      ```
- Rehearsal stress tests with normal and aggressive parameters to check that this change finds what the existing stress test can find (i.e., no regression in testing capability)
- [Ongoing] Try to find new bugs with this change that are not found by the current NonBatchedOpsStressTest, which has no parallel read and write/delete to the same key
      
      Reviewed By: ajkr
      
      Differential Revision: D42257258
      
      Pulled By: hx235
      
      fbshipit-source-id: e6fdc18f1fad3753e5ac91731483a644d9b5b6eb
  18. 26 Apr 2023 (1 commit)
    • Block per key-value checksum (#11287) · 62fc15f0
      Committed by Changyu Bi
      Summary:
Add option `block_protection_bytes_per_key` and an implementation of block per key-value checksum (a usage sketch follows the list below). The main changes are
      1. checksum construction and verification in block.cc/h
      2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
      3. unit tests/crash test updates
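A minimal sketch of opting in (the option name and the values 1 and 8 come from this PR's tests; 0, the default, disables the protection):

```
#include <rocksdb/options.h>

rocksdb::Options MakeBlockProtectedOptions() {
  rocksdb::Options options;
  // Bytes of per key-value checksum within each block; the default 0
  // disables it. The tests here exercise 1 and 8.
  options.block_protection_bytes_per_key = 8;
  return options;
}
```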
      
      Tests:
      * Added unit tests
      * Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`
      
Follow up (maybe as a separate PR): make sure corruption statuses returned from BlockIters are correctly handled.
      
      Performance:
      Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory.
      For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates):
      
      ```
      SETUP
      make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
      ./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none
      
      BENCHMARK
      ./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE
      
      The readrandom ops/sec looks like the following:
      Block cache size:  2GB        1.2GB * 0.9    1.2GB * 0.8     1.2GB * 0.5   8MB
      Main              240805     223604         198176           161653       139040
      PR prot_bytes=0   238691     226693         200127           161082       141153
      PR prot_bytes=1   214983     193199         178532           137013       108211
      prot_bytes=1 vs    -10%        -15%          -10.8%          -15%        -23%
      prot_bytes=0
      ```
      
      The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287
      
      Reviewed By: ajkr
      
      Differential Revision: D43970708
      
      Pulled By: cbi42
      
      fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
  19. 22 Apr 2023 (1 commit)
    • Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) · 151242ce
      Committed by Hui Xiao
      Summary:
      **Context:**
The existing stat rocksdb.sst.read.micros does not distinguish between the compaction and flush cases but aggregates them, which is not so helpful for understanding the IO read behavior of each.
      
      **Summary**
      - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
         - Fixed the default histogram in `RandomAccessFileReader`
      - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
      - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction
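To observe the breakdown, statistics must be enabled; a sketch, assuming the new histograms are exposed through statistics enums named `FILE_READ_FLUSH_MICROS` and `FILE_READ_COMPACTION_MICROS` alongside the existing `SST_READ_MICROS`:

```
#include <rocksdb/options.h>
#include <rocksdb/statistics.h>

#include <iostream>

// Sketch: enable statistics, then (after running a workload) print the
// aggregate read histogram next to the flush/compaction breakdowns.
void DumpReadBreakdown(rocksdb::Options& options) {
  options.statistics = rocksdb::CreateDBStatistics();
  // ... open the DB and run flushes/compactions ...
  std::cout << options.statistics->getHistogramString(
                   rocksdb::SST_READ_MICROS)
            << options.statistics->getHistogramString(
                   rocksdb::FILE_READ_FLUSH_MICROS)
            << options.statistics->getHistogramString(
                   rocksdb::FILE_READ_COMPACTION_MICROS);
}
```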
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288
      
      Test Plan:
      - **Stress test**
      - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.**  (without blob)
   - May not be exactly the same because `HistogramStat::Add` only guarantees atomicity, not accuracy, across threads.
      ```
      ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
      ```
      ```
      // BlockBasedTable
      rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
      rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
      rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689
      
      // PlainTable
      Does not apply
      ```
      - **Db bench 2: performance**
      
      **Read**
      
      SETUP: db with 900 files
      ```
      ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
```
Run till convergence:
      ```
      ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
      ```
      Pre-change
      `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
      Post-change (no regression, -0.3%)
      `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`
      
**Compaction/Flush**

Run till convergence:
      ```
      ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none
      
      rocksdb.sst.read.micros  COUNT : 33820
      rocksdb.sst.read.flush.micros COUNT : 1800
      rocksdb.sst.read.compaction.micros COUNT : 32020
      ```
      Pre-change
      `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec;    0.7 (± 0.1) MB/sec`
      
      Post-change (no regression, ~-0.4%)
      `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec;    0.7 (± 0.1) MB/sec`
      
      Reviewed By: ajkr
      
      Differential Revision: D44007011
      
      Pulled By: hx235
      
      fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
  20. 21 Apr 2023 (1 commit)
    • Fix race condition in db_stress checkpoint cleanup (#11389) · 6cac4c79
      Committed by Andrew Kryczka
      Summary:
      The old cleanup code had a race condition:
      
      1. Test thread: DestroyDB() marked a file as trash
      2. DeleteScheduler thread: Got the file's size and decided to delete it in chunks
      3. Test thread: DestroyDir() deleted that trash file
      4. DeleteScheduler thread: Began deleting in chunks starting by calling ReopenWritableFile(). Unfortunately this recreates the deleted trash file
      5. Test thread: DestroyDir() fails to remove the parent directory because it contains the file created in 4.
      6. Test thread: Checkpoint::Create() fails due to the directory already existing
      
      It could be repro'd with the following patch/command.
      
      Patch:
      
      ```
diff --git a/file/delete_scheduler.cc b/file/delete_scheduler.cc
index 8a2d1615d..337d24a60 100644
--- a/file/delete_scheduler.cc
+++ b/file/delete_scheduler.cc
@@ -317,6 +317,12 @@ Status DeleteScheduler::DeleteTrashFile(const std::string& path_in_trash,
                                            &num_hard_links, nullptr);
       if (my_status.ok()) {
         if (num_hard_links == 1) {
+          // Give some time for DestroyDir() to delete file entries. Then, the
+          // below `ReopenWritableFile()` will recreate files, preventing the
+          // parent directory from being deleted.
+          if (rand() % 2 == 0) {
+            usleep(1000);
+          }
           std::unique_ptr<FSWritableFile> wf;
           my_status = fs_->ReopenWritableFile(path_in_trash, FileOptions(), &wf,
                                               nullptr);
diff --git a/file/file_util.cc b/file/file_util.cc
index 43608fcdc..2cee1ad8e 100644
--- a/file/file_util.cc
+++ b/file/file_util.cc
@@ -263,6 +263,13 @@ Status DestroyDir(Env* env, const std::string& dir) {
     }
   }
 
+  // Give some time for the DeleteScheduler thread's ReopenWritableFile() to
+  // recreate deleted files
+  if (dir.find("checkpoint") != std::string::npos) {
+    fprintf(stderr, "waiting to destroy %s\n", dir.c_str());
+    usleep(10000);
+  }
+
   if (s.ok()) {
     s = env->DeleteDir(dir);
     // DeleteDir might or might not report NotFound
      ```
      
      Command:
      
      ```
      TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=131072 --target_file_size_base=131072 --max_bytes_for_level_base=524288 --checkpoint_one_in=100 --clear_column_family_one_in=0  --max_key=1000 --value_size_mult=33 --sst_file_manager_bytes_per_truncate=4096 --sst_file_manager_bytes_per_sec=1048576  --interval=3 --compression_type=none --sync_fault_injection=1
      ```
      
      Obviously we don't want to use scheduled deletion here as we need the checkpoint directory deleted immediately. I suspect the DestroyDir() was an attempt to fixup incomplete DestroyDB()s. Now that we expect DestroyDB() to be complete I removed that code.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11389
      
      Reviewed By: hx235
      
      Differential Revision: D45137142
      
      Pulled By: ajkr
      
      fbshipit-source-id: 2af743d342c77cc414fd25fc4c9d7c9c6079ad24
  21. 30 Mar 2023 (1 commit)
  22. 18 Mar 2023 (2 commits)
    • HyperClockCache support for SecondaryCache, with refactoring (#11301) · 204fcff7
      Committed by Peter Dillinger
      Summary:
      Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.
      
      This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.
      
      It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.
      
      ## cache.h (public API)
      Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.
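Given that move, a configuration sketch (the compressed secondary cache is used purely as an example of a `SecondaryCache`, and the capacities are arbitrary):

```
#include <rocksdb/cache.h>

std::shared_ptr<rocksdb::Cache> MakeHccWithSecondaryCache() {
  rocksdb::HyperClockCacheOptions hcc_opts(
      /*_capacity=*/1 << 30, /*_estimated_entry_charge=*/8 * 1024);
  rocksdb::CompressedSecondaryCacheOptions sec_opts;
  sec_opts.capacity = 256 << 20;
  // secondary_cache now lives on ShardedCacheOptions, so it applies to
  // HyperClockCache as well as LRUCache.
  hcc_opts.secondary_cache = rocksdb::NewCompressedSecondaryCache(sec_opts);
  return hcc_opts.MakeSharedCache();
}
```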
      
      ## advanced_cache.h (advanced public API)
      * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
      * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.
      
      These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.
      
      * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)
      
      I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.
      
      ## cacheable_entry.h
      A couple of functions are obsolete because Cache::Handle can no longer be pending.
      
      ## cache.cc
      Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.
      
      ## secondary_cache_adapter.{h,cc}
      The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.
      
      ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
      Simply updated for Cache API changes.
      
      ## lru_cache.{h,cc}
      Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.
      
      ## clock_cache.{h,cc}
      Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.
      
      ## block_based_table_reader*
      Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.
      
      Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).
      
      Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.
      
      ## Intended follow-up work
      * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
      * Stacked secondary caches (see above discussion)
      * See if we can make up for the small MultiGet performance regression.
      * Study more performance with SecondaryCache
      * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
      * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
      * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301
      
      Test Plan:
      ## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that lead to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.
      
      ## Crash/stress test
      Updated to use the new combination.
      
      ## Performance
      First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.
      
      ```
      (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
      ```
      
      **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
      HyperClockCache: 3168662 (average parallel ops/sec)
      LRUCache: 2940127
      
      **After** this and #11299, running for about an hour:
      HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
      LRUCache: 2940928 (0.03% faster)
      
      This is an acceptable difference IMHO.
      
      Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.
      
      Create DB and test (before and after tests running simultaneously):
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
      TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
      ```
      
      **Before**:
      multireadrandom [AVG    30 runs] : 3444202 (± 57049) ops/sec;  240.9 (± 4.0) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec;  245.8 MB/sec
      **After**:
      multireadrandom [AVG    30 runs] : 3291022 (± 58851) ops/sec;  230.2 (± 4.1) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec;  235.4 MB/sec
      
      So that's roughly a 4% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:
      
      **Before**:
      multireadrandom [AVG    30 runs] : 3933777 (± 41840) ops/sec;  275.1 (± 2.9) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec;  277.7 MB/sec
      **After**:
      multireadrandom [AVG    30 runs] : 3755338 (± 30391) ops/sec;  262.6 (± 2.1) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec;  264.8 MB/sec
      
      Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.
      
      Let's also look at Get() in db_bench:
      
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
      ```
      
      **Before**:
      readrandom [AVG    30 runs] : 2198685 (± 13412) ops/sec;  153.8 (± 0.9) MB/sec
      readrandom [MEDIAN 30 runs] : 2209498 ops/sec;  154.5 MB/sec
      **After**:
      readrandom [AVG    30 runs] : 2292814 (± 43508) ops/sec;  160.3 (± 3.0) MB/sec
      readrandom [MEDIAN 30 runs] : 2365181 ops/sec;  165.4 MB/sec
      
      That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:
      
      **Before**:
      readrandom [AVG    30 runs] : 2272333 (± 9992) ops/sec;  158.9 (± 0.7) MB/sec
      readrandom [MEDIAN 30 runs] : 2273239 ops/sec;  159.0 MB/sec
      **After**:
      readrandom [AVG    30 runs] : 2332407 (± 11252) ops/sec;  163.1 (± 0.8) MB/sec
      readrandom [MEDIAN 30 runs] : 2335329 ops/sec;  163.3 MB/sec
      
      Reviewed By: ltamasi
      
      Differential Revision: D44177044
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
      204fcff7
    • L
      Increase the stress test coverage of GetEntity (#11303) · a72d55c9
      Levi Tamasi 提交于
      Summary:
      The `GetEntity` API is currently used in the stress tests for verification purposes;
      this patch extends the coverage by adding a mode where all point lookups in
      the non-batched, batched, and CF consistency stress tests are done using this API.
      The PR also includes a bit of refactoring to eliminate some boilerplate code around
      the wide-column consistency checks.
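
      For reference, here is a minimal sketch of such a point lookup done via `GetEntity`, assuming an open `db` and the default column family (error handling elided):

      ```
      #include <iostream>

      #include "rocksdb/db.h"
      #include "rocksdb/wide_columns.h"

      // Reads the wide-column entity stored under `key` and prints its columns.
      void PrintEntity(rocksdb::DB* db, const rocksdb::Slice& key) {
        rocksdb::PinnableWideColumns result;
        rocksdb::Status s = db->GetEntity(rocksdb::ReadOptions(),
                                          db->DefaultColumnFamily(), key, &result);
        if (!s.ok()) {
          return;  // not found or I/O error; a real caller would distinguish
        }
        for (const rocksdb::WideColumn& column : result.columns()) {
          std::cout << column.name().ToString() << ": "
                    << column.value().ToString() << '\n';
        }
      }
      ```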
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11303
      
      Test Plan: Ran stress tests of the batched, non-batched, and CF consistency varieties.
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D44148503
      
      Pulled By: ltamasi
      
      fbshipit-source-id: fecdbfd3e65a459bbf16ab7aa7b9173e19240077
      a72d55c9
  23. 23 2月, 2023 1 次提交
    • Y
      Support iter_start_ts in integrated BlobDB (#11244) · f007b8fd
      Yu Zhang 提交于
      Summary:
      Fixed an issue during backward iteration when `iter_start_ts` is set in an integrated BlobDB.
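
      A hedged sketch of the iteration pattern this fixes, assuming a DB opened with a user-defined timestamp comparator and blob files enabled; the helper name and timestamp encoding are illustrative:

      ```
      #include <memory>
      #include <string>

      #include "rocksdb/db.h"

      // Iterates backward over all key versions with timestamps in
      // [start_ts, read_ts]. Backward iteration over blob values with
      // iter_start_ts set was the broken path before this fix.
      void IterateBackwardWithStartTs(rocksdb::DB* db,
                                      const std::string& start_ts,
                                      const std::string& read_ts) {
        rocksdb::Slice lower_bound_ts(start_ts);
        rocksdb::Slice read_upper_ts(read_ts);
        rocksdb::ReadOptions opts;
        opts.iter_start_ts = &lower_bound_ts;  // lower bound on version timestamps
        opts.timestamp = &read_upper_ts;       // read as of this timestamp
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(opts));
        for (it->SeekToLast(); it->Valid(); it->Prev()) {
          // With iter_start_ts set, each visited version exposes its timestamp.
          rocksdb::Slice ts = it->timestamp();
          (void)ts;
        }
      }
      ```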
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11244
      
      Test Plan:
      ```
      make check
      ./db_blob_basic_test --gtest_filter="DBBlobWithTimestampTest.IterateBlobs"
      tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887 --enable_blob_files=1
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D43506726
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 2cdc19ebf8da909d8d43d621353905784949a9f0
      f007b8fd
  24. 14 2月, 2023 1 次提交
    • Y
      Enable crash test to run BlobDB together with user-defined timestamp (#11199) · c19672c1
      Yu Zhang 提交于
      Summary:
      I missed a stress test code sanity check when enabling this combination of tests. This PR addresses that: the `iter_start_ts` option for the user-defined timestamp feature is not supported when BlobDB is enabled, so it is disabled for now.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11199
      
      Test Plan:
      Locally always enable BlobDB and run
      tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887
      
      Reviewed By: ltamasi
      
      Differential Revision: D43245657
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 4cae19817bb1afd50a76f9e0e49f006fb5c0b211
      c19672c1
  25. 31 1月, 2023 2 次提交
    • P
      Cleanup, improve, stress test LockWAL() (#11143) · 94e3beec
      Peter Dillinger 提交于
      Summary:
      The previous API comments for LockWAL didn't provide much about why you might want to use it, and the behavior didn't really meet what one would infer its contract to be. Also, LockWAL was not covered in db_stress / crash test. In this change:
      
      * Implement counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread (see the sketch after this list). This should make the API much less bug-prone and easier to use.
      * Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions)
      * Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability)
      * Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments; better to describe which operations might block each other.
      * Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads.
      * Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change.
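
      A minimal sketch of the new counting semantics, assuming an open `db`; error handling is abbreviated:

      ```
      #include <cassert>

      #include "rocksdb/db.h"

      // LockWAL() calls may be nested (or issued from multiple threads);
      // WAL writes stay blocked until the matching number of UnlockWAL() calls.
      void LockedSection(rocksdb::DB* db) {
        if (!db->LockWAL().ok()) {
          return;  // per the new contract, no UnlockWAL() is needed on failure
        }
        rocksdb::SequenceNumber before = db->GetLatestSequenceNumber();

        // A nested lock is now safe.
        if (db->LockWAL().ok()) {
          db->UnlockWAL();  // count 2 -> 1: the WAL is still locked here
        }

        // While locked, the latest sequence number must not advance.
        assert(db->GetLatestSequenceNumber() == before);

        db->UnlockWAL();  // count 1 -> 0: WAL unlocked
        // A further UnlockWAL() here would return non-OK (no matching LockWAL()).
      }
      ```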
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143
      
      Test Plan: unit tests added / updated, added to stress/crash test
      
      Reviewed By: ajkr
      
      Differential Revision: D42848627
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4
      94e3beec
    • S
      DB Stress to fix a false assertion (#11164) · 36174d89
      sdong 提交于
      Summary:
      Seeing this error in stress test:

      db_stress: internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:2459: void rocksdb::StressTest::Open(rocksdb::SharedState *): Assertion `txn_db_ == nullptr' failed. Received signal 6 (Aborted)
      ......

      It doesn't appear that txn_db_ was being set to nullptr at all. We now set it here.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11164
      
      Test Plan: Run db_stress with transactions and without, with a low kill rate, and verify it restarts without hitting the assertion
      
      Reviewed By: ajkr
      
      Differential Revision: D42855662
      
      fbshipit-source-id: 06816d37cce9c94a81cb54ab238fb73aa102ed46
      36174d89
  26. 28 1月, 2023 2 次提交
    • S
      Remove RocksDB LITE (#11147) · 4720ba43
      sdong 提交于
      Summary:
      We haven't been actively maintaining RocksDB LITE recently, and its size must have gone up significantly. We are removing the support.

      Most of the changes were done with the following command:

      unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

      by Peter Dillinger. Other changes were applied manually to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and the file db_stress_test_base.cc.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
      
      Test Plan: See CI
      
      Reviewed By: pdillinger
      
      Differential Revision: D42796341
      
      fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
      4720ba43
    • Y
      Remove deprecated util functions in options_util.h (#11126) · 6943ff6e
      Yu Zhang 提交于
      Summary:
      Remove the util functions in options_util.h that have previously been marked deprecated.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11126
      
      Test Plan: `make check`
      
      Reviewed By: ltamasi
      
      Differential Revision: D42757496
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 2a138a3c207d0e0e0bbb4d99548cf2cadb44bcfb
      6943ff6e
  27. 25 1月, 2023 1 次提交
  28. 04 1月, 2023 1 次提交
    • H
      Add back Options::CompactionOptionsFIFO::allow_compaction to stress/crash test (#11063) · b965a5a8
      Hui Xiao 提交于
      Summary:
      **Context/Summary:**
      https://github.com/facebook/rocksdb/pull/10777 was reverted (https://github.com/facebook/rocksdb/pull/10999) due to an internal blocker and replaced with a better fix, https://github.com/facebook/rocksdb/pull/10922. However, the revert also reverted the `Options::CompactionOptionsFIFO::allow_compaction` stress/crash coverage added by the PR.

      It's useful coverage because setting `Options::CompactionOptionsFIFO::allow_compaction=true` will [increase](https://github.com/facebook/rocksdb/blob/7.8.fb/db/version_set.cc#L3255) the compaction score of L0 files for FIFO and then trigger more FIFO compaction. This speeds up discovery of bugs related to FIFO compaction like https://github.com/facebook/rocksdb/pull/10955. To see the speedup, compare the failure occurrence of the following command with `Options::CompactionOptionsFIFO::allow_compaction=true/false`
      
      ```
      --fifo_allow_compaction=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=8.869062094789008 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=100 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=15 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0  --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 
--use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=65
      ```
      
      Therefore this PR is adding it back to stress/crash test.
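
      For concreteness, here is a minimal sketch of the configuration this coverage exercises; the field values are illustrative:

      ```
      #include "rocksdb/options.h"

      // FIFO compaction that is allowed to run intra-L0 compactions instead
      // of only dropping the oldest files once the size threshold is hit.
      rocksdb::Options MakeFifoOptions() {
        rocksdb::Options options;
        options.compaction_style = rocksdb::kCompactionStyleFIFO;
        options.compaction_options_fifo.max_table_files_size = 1024 * 1024 * 1024;
        // The flag added back to stress/crash coverage; it raises the
        // compaction score of L0 files and so triggers more FIFO compaction.
        options.compaction_options_fifo.allow_compaction = true;
        return options;
      }
      ```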
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11063
      
      Test Plan: Rehearsal stress test to make sure stress/crash test is stable
      
      Reviewed By: ajkr
      
      Differential Revision: D42283650
      
      Pulled By: hx235
      
      fbshipit-source-id: 132e6396ab6e24d8dcb8fe51c62dd5211cdf53ef
      b965a5a8
  29. 16 12月, 2022 1 次提交
    • A
      Enable ReadAsync testing and fault injection in db_stress (#11037) · c3f720c6
      anand76 提交于
      Summary:
      The db_stress code uses a wrapper Env on top of the raw/fault injection Env. The wrapper, DbStressEnvWrapper, is a legacy Env and thus has a default implementation of ReadAsync that just does a sync read. As a result, the ReadAsync implementations of PosixFileSystem and other file systems weren't being tested. Also, the ReadAsync interface wasn't implemented in FaultInjectionTestFS. This change implements the necessary interfaces in FaultInjectionTestFS and derives DbStressEnvWrapper from FileSystemWrapper rather than EnvWrapper.
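
      A hedged sketch of the wrapper change, with an illustrative class name rather than the actual db_stress code: deriving from `FileSystemWrapper` means `ReadAsync` is forwarded to the wrapped file system instead of silently degrading to a synchronous read, as the legacy `EnvWrapper` default would.

      ```
      #include <memory>

      #include "rocksdb/file_system.h"

      // Forwards all FileSystem calls, including ReadAsync, to the wrapped
      // target, so the target's native async read path gets exercised.
      class StressFileSystemWrapper : public rocksdb::FileSystemWrapper {
       public:
        explicit StressFileSystemWrapper(
            const std::shared_ptr<rocksdb::FileSystem>& target)
            : rocksdb::FileSystemWrapper(target) {}

        const char* Name() const override { return "StressFileSystemWrapper"; }

        // Note: no ReadAsync override is needed here, because the base class
        // forwards it, unlike a legacy EnvWrapper whose default ReadAsync
        // performs a blocking Read.
      };
      ```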
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11037
      
      Test Plan: Run db_stress standalone and crash test. With this change, db_stress is able to repro the bug fixed in https://github.com/facebook/rocksdb/issues/10890.
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D42061290
      
      Pulled By: anand1976
      
      fbshipit-source-id: 7f0331fd15ee33fb4f7f0f4b22b206fe801ba074
      c3f720c6
  30. 30 11月, 2022 1 次提交
  31. 18 11月, 2022 1 次提交
  32. 17 11月, 2022 1 次提交
  33. 26 10月, 2022 2 次提交
    • L
      Adjust value generation in batched ops stress tests (#10872) · d4842752
      Levi Tamasi 提交于
      Summary:
      The patch adjusts the generation of values in batched ops stress tests so that the digits 0..9 are appended (instead of prepended) to the values written. This has the advantage of aligning the encoding of the "value base" into the value string across non-batched, batched, and CF consistency stress tests.
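
      An illustrative sketch of the adjusted encoding; this is a hypothetical helper, not the actual db_stress code:

      ```
      #include <string>

      // In the batched ops tests, each of the ten writes in a batch carries a
      // digit id 0..9. Appending it (rather than prepending) keeps the "value
      // base" encoding at the front of the value, as in the other stress tests.
      std::string BatchedOpsValue(const std::string& value_with_base,
                                  char digit_id) {
        return value_with_base + digit_id;  // the digit now goes at the end
      }
      ```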
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10872
      
      Test Plan: Tested using some black box stress test runs.
      
      Reviewed By: riversand963
      
      Differential Revision: D40692847
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 26bf8adff2944cbe416665f09c3bab89d80416b3
      d4842752
    • H
      Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos... · fc74abb4
      Hui Xiao 提交于
      Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's (#10777)
      
      Summary:
      **Context:**
      Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case
      Repro:
      ```
      COERCE_CONTEXT_SWITCH=1 make -j56 db_stress
      
      ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35
      
      put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file #479 with seqno 23711 29070 vs. file #482 with seqno 27138 29049
      ```
      
      **Summary:**
      FIFO only does intra-L0 compaction in the following four cases. In the other cases, FIFO drops data instead of compacting it, which is irrelevant to the overlapping seqno issue we are solving.
      -  [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
         - For this path, we simply reuse the fix in `FindIntraL0Compaction` (see the sketch after this list) https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
         - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here.
      - [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
        - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
        - This path was not stress-tested at all. However, covering the `age_for_warm` option is worth a separate PR to deal with db_stress compatibility, so we manually tested this path for this PR
      - [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
      - [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
          - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles`, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify the implementation, we return `Status::Aborted()` on encountering a seqno-overlapped file when doing compaction to L0, instead of skipping the file and proceeding with the compaction.
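
      The seqno guard replicated across the paths above looks roughly like this hedged sketch; the struct and function names are illustrative, not the exact RocksDB code:

      ```
      #include <cstddef>
      #include <cstdint>
      #include <vector>

      struct FileMeta {
        uint64_t largest_seqno;
      };

      // Returns how many of the given L0 files can be compacted together
      // without producing an output whose seqno range overlaps the earliest
      // seqno of the current memtables (which would break L0 seqno ordering).
      size_t SafeIntraL0InputCount(const std::vector<FileMeta>& level0_files,
                                   uint64_t earliest_mem_seqno) {
        size_t count = 0;
        for (const FileMeta& file : level0_files) {
          if (file.largest_seqno >= earliest_mem_seqno) {
            break;  // including this file could create overlapping seqnos
          }
          ++count;
        }
        return count;
      }
      ```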
      
      Some additional clean-up included in this PR:
      - Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
      - Added comment about `earliest_memtable_seqno` in related APIs
      - Made parameter `earliest_memtable_seqno` constant and required
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777
      
      Test Plan:
      - make check
      - New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
      - Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761  and on FIFO compaction only
      
      Reviewed By: ajkr
      
      Differential Revision: D40090485
      
      Pulled By: hx235
      
      fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
      fc74abb4
  34. 17 10月, 2022 1 次提交
  35. 14 10月, 2022 1 次提交