1. 22 Nov 2022 (5 commits)
    • Post 7.9.0 release branch cut updates (#10974) · f4cfcfe8
      Committed by anand76
      Summary:
      Update HISTORY.md, version.h, and check_format_compatible.sh
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10974
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D41455289
      
      Pulled By: anand1976
      
      fbshipit-source-id: 99888ebcb9109e5ced80584a66b20123f8783c0b
    • Update HISTORY.md for 7.9.0 (#10973) · 3ff6da6b
      Committed by anand76
      Summary:
      Update HISTORY.md for 7.9.0 release.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10973
      
      Reviewed By: pdillinger
      
      Differential Revision: D41453720
      
      Pulled By: anand1976
      
      fbshipit-source-id: 47a23d4b6539ec6a9a09c9e69c026f7c8b10afa7
    • Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945) · e079d562
      Committed by Peter Dillinger
      Summary:
      Can simplify some ugly code in cache_dump_load_impl.cc by having an API in SecondaryCache that can directly consume persisted data.
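      A hedged sketch of the API's shape (a map-backed stand-in, not the real `rocksdb::SecondaryCache`): `InsertSaved` lets the cache dumper hand the secondary cache already-persisted bytes directly, instead of first reconstructing an in-memory entry and re-serializing it.

      ```cpp
      #include <cassert>
      #include <string>
      #include <unordered_map>

      // Map-backed stand-in for illustration only; the real class has
      // Status returns, Slices, and helper callbacks.
      class SecondaryCacheSketch {
       public:
        void InsertSaved(const std::string& key, const std::string& saved) {
          store_[key] = saved;  // consume the persisted data as-is
        }
        bool Lookup(const std::string& key, std::string* out) const {
          auto it = store_.find(key);
          if (it == store_.end()) return false;
          *out = it->second;
          return true;
        }

       private:
        std::unordered_map<std::string, std::string> store_;
      };
      ```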
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10945
      
      Test Plan: existing tests for CacheDumper, added basic unit test
      
      Reviewed By: anand1976
      
      Differential Revision: D41231497
      
      Pulled By: pdillinger
      
      fbshipit-source-id: b8ec993ef7d3e7efd68aae8602fd3f858da58068
    • Fix CompactionIterator flag for penultimate level output (#10967) · 097f9f44
      Committed by Andrew Kryczka
      Summary:
      We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic.
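      The bug pattern can be sketched minimally (names are hypothetical, not the actual CompactionIterator members): a per-key flag that is set conditionally but never cleared stays true for every later key once it has been true.

      ```cpp
      #include <cassert>

      // Buggy shape: the flag is only ever set, never reset per key.
      struct IterStateBuggy {
        bool output_to_penultimate_level = false;
        void ProcessKey(bool key_qualifies) {
          if (key_qualifies) output_to_penultimate_level = true;  // no reset
        }
      };

      // Fixed shape: the flag is recomputed for every key.
      struct IterStateFixed {
        bool output_to_penultimate_level = false;
        void ProcessKey(bool key_qualifies) {
          output_to_penultimate_level = key_qualifies;  // reset each key
        }
      };
      ```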
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967
      
      Test Plan:
      - built `db_bench` with DEBUG_LEVEL=0
      - ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120`
      - compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output
        - Before this fix, Y was always zero
        - After this fix, Y gradually increased throughout the benchmark
      
      Reviewed By: riversand963
      
      Differential Revision: D41417726
      
      Pulled By: ajkr
      
      fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329
    • Observe and warn about misconfigured HyperClockCache (#10965) · 3182beef
      Committed by Peter Dillinger
      Summary:
      Background. One of the core risks of choosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to an under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to a larger working set size (more TLB misses). Too high leads to a fully populated hash table (at some limit, with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of the configured block cache memory (up to roughly 1/3) going unused.
      
      Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file).
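      The two failure directions above can be illustrated with back-of-the-envelope arithmetic (hypothetical sizing model, not the real HyperClockCache code): the table is sized from capacity / estimated_entry_charge, so how well it is used depends on the entries' actual charge.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstdint>

      // Simplified sizing model for illustration only.
      struct SizingOutcome {
        uint64_t table_slots;     // slots allocated from the estimate
        uint64_t usable_entries;  // entries that actually fit
        uint64_t memory_used;     // bytes of capacity actually exercised
        double occupancy;         // usable_entries / table_slots
      };

      SizingOutcome SizeTable(uint64_t capacity_bytes,
                              uint64_t estimated_entry_charge,
                              uint64_t actual_entry_charge) {
        SizingOutcome o;
        o.table_slots = capacity_bytes / estimated_entry_charge;
        // Entries are limited by both the memory budget and the slots.
        o.usable_entries =
            std::min(capacity_bytes / actual_entry_charge, o.table_slots);
        o.memory_used = o.usable_entries * actual_entry_charge;
        o.occupancy =
            static_cast<double>(o.usable_entries) / o.table_slots;
        return o;
      }
      ```

      In this toy model an estimate that is too low leaves the table sparse (low occupancy), while one that is too high makes the cache slot-limited so part of the configured memory is never used.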
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965
      
      Test Plan:
      unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not.
      
      Also manual testing with db_bench. Create db with
      ```
      ./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1
      ```
      Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge)
      ```
      ./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX
      ```
      Seeing warnings / errors or not as expected.
      
      Reviewed By: anand1976
      
      Differential Revision: D41406932
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7
  2. 18 Nov 2022 (1 commit)
    • Mark HyperClockCache as production-ready (#10963) · 8c0f5b1f
      Committed by Peter Dillinger
      Summary:
      After a couple of minor bug fixes and successful production roll-outs in a few places, I think we can mark this as production-ready. It has a clear value proposition for many workloads, even if we don't have clear advice for every workload yet.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10963
      
      Test Plan: existing tests, comment changes only
      
      Reviewed By: siying
      
      Differential Revision: D41384083
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 56359f01a57bb28de8697666b342382fac72ce6d
  3. 17 Nov 2022 (1 commit)
  4. 14 Nov 2022 (1 commit)
  5. 12 Nov 2022 (2 commits)
    • Don't attempt to use SecondaryCache on block_cache_compressed (#10944) · f321e8fc
      Committed by Peter Dillinger
      Summary:
      Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom!
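      The failure can be sketched with an illustrative block layout (sizes and names hypothetical): a compressed block in cache carries a trailer byte identifying the compression type, stored beyond the payload size, so persisting only the payload bytes loses the marker and the reloaded block cannot be interpreted.

      ```cpp
      #include <cassert>
      #include <string>

      constexpr size_t kTrailerSize = 1;  // compression-type marker

      // Buggy save: drops the marker byte beyond the payload -> boom.
      std::string SaveBuggy(const std::string& block, size_t payload_size) {
        return block.substr(0, payload_size);
      }

      // Correct save: keeps the compression marker with the payload.
      std::string SaveFixed(const std::string& block, size_t payload_size) {
        return block.substr(0, payload_size + kTrailerSize);
      }
      ```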
      
      This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed.
      
      Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache.
      
      This problem was discovered during refactoring, but I didn't want to combine the bug fix with that refactoring.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944
      
      Test Plan: test added that fails on base revision (at least with ASAN)
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D41205578
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f
    • Fix async_io regression in scans (#10939) · d1aca4a5
      Committed by akankshamahajan
      Summary:
      Fix an async_io regression in scans caused by an incorrect check that cleared valid data in the buffer during a seek.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10939
      
      Test Plan:
      - stress tests: `export CRASH_TEST_EXT_ARGS="--async_io=1"` followed by `make crash_test -j32`
      - Ran the db_bench command which caught the regression:
      ./db_bench --db=/rocksdb_async_io_testing/prefix_scan --disable_wal=1 --use_existing_db=true --benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=963 -duration=30 -ops_between_duration_checks=1 --async_io=true --compaction_readahead_size=4194304 --log_readahead_size=0 --blob_compaction_readahead_size=0 --initial_auto_readahead_size=65536 --num_file_reads_for_auto_readahead=0 --max_auto_readahead_size=524288
      
      seekrandom   :    3777.415 micros/op 264 ops/sec 30.000 seconds 7942 operations;  132.3 MB/s (7942 of 7942 found)
      
      Reviewed By: anand1976
      
      Differential Revision: D41173899
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 2d75b06457d65b1851c92382565d9c3fac329dfe
  6. 08 Nov 2022 (1 commit)
    • Fix a bug where GetContext does not update READ_NUM_MERGE_OPERANDS (#10925) · fbd9077d
      Committed by Levi Tamasi
      Summary:
      The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods.
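      The hazard being removed can be sketched with simplified signatures (not the real `MergeHelper` interface): a defaulted `update_num_ops_stats = false` parameter lets a call site silently skip the ticker, which is exactly the bug fixed here.

      ```cpp
      #include <cassert>

      struct Stats {
        long read_num_merge_operands = 0;
      };

      // Before: the default hides the decision at the call site.
      void TimedFullMergeWithDefault(Stats& stats,
                                     bool update_num_ops_stats = false) {
        if (update_num_ops_stats) ++stats.read_num_merge_operands;
      }

      // After: no default, so every caller must state its intent.
      void TimedFullMergeExplicit(Stats& stats, bool update_num_ops_stats) {
        if (update_num_ops_stats) ++stats.read_num_merge_operands;
      }
      ```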
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D41096453
      
      Pulled By: ltamasi
      
      fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8
  7. 05 Nov 2022 (1 commit)
  8. 02 Nov 2022 (1 commit)
    • Fix async_io failures in case there is error in reading data (#10890) · ff9ad2c3
      Committed by akankshamahajan
      Summary:
      Fix a memory corruption error in scans when async_io is enabled. The corruption happened when data overlapped between two buffers: an IOError while reading the data left one buffer empty, while the other buffer, with an async read already in progress, went to read again, causing the error.
      Fix: added a check to abort the IO in the second buffer if curr_ got empty.
      
      This PR also fixes db_stress failures which happened when buffers are not aligned.
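      The added check can be sketched as a simple predicate (structure hypothetical, not the real FilePrefetchBuffer): with two prefetch buffers, an IOError leaves the current buffer empty, and in that case the second buffer's in-flight async read must be aborted rather than reissued over a possibly overlapping range.

      ```cpp
      #include <cassert>
      #include <string>

      struct PrefetchBuffer {
        std::string data;
        bool async_read_in_progress = false;
      };

      // Abort the second buffer's IO when curr_ has become empty
      // (e.g. after an IOError) while an async read is outstanding.
      bool ShouldAbortSecondBufferIO(const PrefetchBuffer& curr,
                                     const PrefetchBuffer& second) {
        return curr.data.empty() && second.async_read_in_progress;
      }
      ```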
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10890
      
      Test Plan:
      - Ran make crash_test -j32 with async_io enabled.
      -  Ran benchmarks to make sure there is no regression.
      
      Reviewed By: anand1976
      
      Differential Revision: D40881731
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 39fcf2134c7b1bbb08415ede3e1ef261ac2dbc58
  9. 01 Nov 2022 (3 commits)
    • Basic Support for Merge with user-defined timestamp (#10819) · 7d26e4c5
      Committed by Yanqin Jin
      Summary:
      This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.
      
      Simplest usage:
      ```cpp
      // assume string append merge op is used with '.' as delimiter.
      // ts1 < ts2
      db->Put(WriteOptions(), "key", ts1, "v0");
      db->Merge(WriteOptions(), "key", ts2, "1");
      ReadOptions ro;
      ro.timestamp = &ts2;
      db->Get(ro, "key", &value);
      ASSERT_EQ("v0.1", value);
      ```
      
      Some code comments are added for clarity.
      
      Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D40603195
      
      Pulled By: riversand963
      
      fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
    • Move wrong history entry out of 7.8 release (#10898) · 7f5e438a
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      
      https://github.com/facebook/rocksdb/pull/10777 mistakenly added a history entry under the 7.8 release, but the PR is not included in 7.8. The mistake happened because a rebase-and-merge did not register a conflict when "## Unreleased" was changed to "## 7.8.0 (10/22/2022)".
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10898
      
      Test Plan: Make check
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D40861001
      
      Pulled By: hx235
      
      fbshipit-source-id: b2310c95490f6ebb90834a210c965a74c9560b51
    • Avoid repeat periodic stats printing when there is no change (#10891) · d989300a
      Committed by sdong
      Summary:
      When there is a column family that doesn't get any traffic, its stats are still dumped every time options.stats_dump_period_sec triggers, which sometimes spams the information logs. With this change, we skip the printing if there is no change, for up to 8 periods.
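      The skip logic can be sketched like this (member names hypothetical): dump a column family's stats only when they changed since the last dump, but force a dump after 8 skipped periods so the log never goes silent for good.

      ```cpp
      #include <cassert>
      #include <string>

      constexpr int kMaxSkippedPeriods = 8;

      struct CfStatsDumpState {
        std::string last_dumped;
        int skipped = 0;

        bool ShouldDump(const std::string& current_stats) {
          if (current_stats != last_dumped ||
              skipped >= kMaxSkippedPeriods) {
            last_dumped = current_stats;
            skipped = 0;
            return true;  // changed, or overdue: dump now
          }
          ++skipped;
          return false;  // unchanged and not yet overdue: skip
        }
      };
      ```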
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10891
      
      Test Plan: Manually test the behavior with hacked db_bench setups.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D40777183
      
      fbshipit-source-id: ef0b9a793e4f6282df099b464f01d1fb4c5a2cab
  10. 28 Oct 2022 (1 commit)
    • Reduce heap operations for range tombstone keys in iterator (#10877) · 56715350
      Committed by Changyu Bi
      Summary:
      Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases.
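      The core idea can be sketched outside of RocksDB (not the MergingIterator code itself): replacing the heap top with a single sift-down is cheaper than a full pop (down-heap) followed by a push (up-heap) when the new key lands near the top anyway.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <vector>

      // One-pass restore of the max-heap property from the root.
      void SiftDown(std::vector<int>& h, size_t i) {
        const size_t n = h.size();
        while (true) {
          size_t l = 2 * i + 1, r = l + 1, best = i;
          if (l < n && h[l] > h[best]) best = l;
          if (r < n && h[r] > h[best]) best = r;
          if (best == i) break;
          std::swap(h[i], h[best]);
          i = best;
        }
      }

      // replace_top: overwrite the root, restore in one pass.
      void ReplaceTop(std::vector<int>& h, int v) {
        h[0] = v;
        SiftDown(h, 0);
      }

      // Baseline: pop the root, then push the replacement (two passes).
      void PopThenPush(std::vector<int>& h, int v) {
        std::pop_heap(h.begin(), h.end());
        h.back() = v;
        std::push_heap(h.begin(), h.end());
      }
      ```

      Both operations leave the heap with the same contents; `ReplaceTop` just avoids one of the two traversals.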
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10877
      
      Test Plan:
      - existing UT
      - ran all flavors of stress test through sandcastle
      - benchmark:
      ```
      # Set up: --writes_per_range_tombstone=1 means one point write and one delete range
      
      TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64
      
      Level Files Size(MB)
      --------------------
        0        8      152
        1        0        0
        2        0        0
        3        0        0
        4        0        0
        5        0        0
        6        0        0
      
      # Benchmark
      TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true
      
      # Pre PR
      readseq [AVG    5 runs] : 1432116 (± 59664) ops/sec;  224.0 (± 9.3) MB/sec
      readseq [MEDIAN 5 runs] : 1454886 ops/sec;  227.5 MB/sec
      
      # Post PR
      readseq [AVG    5 runs] : 1944425 (± 29521) ops/sec;  304.1 (± 4.6) MB/sec
      readseq [MEDIAN 5 runs] : 1959430 ops/sec;  306.5 MB/sec
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D40710936
      
      Pulled By: cbi42
      
      fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
  11. 26 Oct 2022 (1 commit)
    • Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's (#10777) · fc74abb4
      Committed by Hui Xiao
      
      Summary:
      **Context:**
      Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case
      Repro:
      ```
      COERCE_CONTEXT_SWICH=1 make -j56 db_stress
      
      ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 
--max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35
      
      put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049
      ```
      
      **Summary:**
      FIFO only does intra-L0 compaction in the following four cases. In other cases, FIFO drops data instead of compacting it, which is irrelevant to the overlapping-seqno issue we are solving.
      -  [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
         - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
         - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here.
      - [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
        - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
        - This path was not stress-tested at all. However, covering the `age_for_warm` option warrants a separate PR to deal with db_stress compatibility, so we manually tested this path for this PR.
      - [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
      - [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
          - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930  in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction.
      
      Some additional clean-up included in this PR:
      - Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
      - Added comment about `earliest_memtable_seqno` in related APIs
      - Made parameter `earliest_memtable_seqno` constant and required
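      The guard shared by the cases above can be sketched as follows (simplified, not the actual picker code): files whose largest seqno is not strictly below the earliest memtable seqno are excluded, since compacting them could produce an L0 file whose seqno range overlaps a future flush.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <vector>

      struct FileMeta {
        uint64_t smallest_seqno;
        uint64_t largest_seqno;
      };

      // Pick only L0 files whose seqnos end before the memtable begins.
      std::vector<FileMeta> PickIntraL0(const std::vector<FileMeta>& level0,
                                        uint64_t earliest_mem_seqno) {
        std::vector<FileMeta> picked;
        for (const FileMeta& f : level0) {
          if (f.largest_seqno >= earliest_mem_seqno) continue;  // overlap
          picked.push_back(f);
        }
        return picked;
      }
      ```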
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777
      
      Test Plan:
      - make check
      - New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
      - Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761  and on FIFO compaction only
      
      Reviewed By: ajkr
      
      Differential Revision: D40090485
      
      Pulled By: hx235
      
      fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
  12. 24 Oct 2022 (1 commit)
  13. 23 Oct 2022 (1 commit)
  14. 22 Oct 2022 (5 commits)
    • Allow penultimate level output for the last level only compaction (#10822) · f726d29a
      Committed by Jay Zhuang
      Summary:
      Allow a last-level-only compaction to output its result to the penultimate level if the penultimate level is empty, which will also block other compaction output to the penultimate level.
      (It includes the PR https://github.com/facebook/rocksdb/issues/10829.)
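      The decision can be sketched as a small claim-and-block check (fields hypothetical): a last-level-only compaction may write to the penultimate level only while that level is empty, and claiming it blocks other compactions' output there.

      ```cpp
      #include <cassert>

      struct TierState {
        bool penultimate_empty = true;
        bool penultimate_reserved = false;
      };

      bool TryClaimPenultimateOutput(TierState& s, bool last_level_only) {
        if (!last_level_only || !s.penultimate_empty ||
            s.penultimate_reserved) {
          return false;
        }
        s.penultimate_reserved = true;  // block other compaction output
        return true;
      }
      ```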
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10822
      
      Reviewed By: siying
      
      Differential Revision: D40389180
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 4e5dcdce307795b5e07b5dd1fa29dd75bb093bad
    • Use kXXH3 as default checksum (CPU efficiency) (#10778) · 27c9705a
      Committed by Peter Dillinger
      Summary:
      Since this has been supported for about a year, I think it's time to make it the default. This should improve CPU efficiency slightly on most hardware.
      
      A current DB performance comparison using buck+clang build:
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -checksum_type={1,4} -benchmarks=fillseq[-X1000] -num=3000000 -disable_wal
      ```
      kXXH3 (+0.2% DB write throughput):
      `fillseq [AVG    1000 runs] : 822149 (± 1004) ops/sec;   91.0 (± 0.1) MB/sec`
      kCRC32c:
      `fillseq [AVG    1000 runs] : 820484 (± 1203) ops/sec;   90.8 (± 0.1) MB/sec`
      
      Micro benchmark comparison:
      ```
      ./db_bench --benchmarks=xxh3[-X20],crc32c[-X20]
      ```
      Machine 1, buck+clang build:
      `xxh3 [AVG    20 runs] : 3358616 (± 19091) ops/sec; 13119.6 (± 74.6) MB/sec`
      `crc32c [AVG    20 runs] : 2578725 (± 7742) ops/sec; 10073.1 (± 30.2) MB/sec`
      
      Machine 2, make+gcc build, DEBUG_LEVEL=0 PORTABLE=0:
      `xxh3 [AVG    20 runs] : 6182084 (± 137223) ops/sec; 24148.8 (± 536.0) MB/sec`
      `crc32c [AVG    20 runs] : 5032465 (± 42454) ops/sec; 19658.1 (± 165.8) MB/sec`
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10778
      
      Test Plan: make check, unit tests updated
      
      Reviewed By: ajkr
      
      Differential Revision: D40112510
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e59a8d50a60346137732f8668ba7cfac93be2b37
    • Make UserComparatorWrapper not Customizable (#10837) · 5d17297b
      Committed by sdong
      Summary:
      Right now UserComparatorWrapper is a Customizable object, although it does not need to be, which introduces some initialization overhead for the object. In some benchmarks, it shows up in CPU profiling. Make it not Customizable by defining most functions needed by UserComparatorWrapper in an interface and implementing that interface.
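      The shape of the refactor can be sketched like this (heavily simplified; the real classes live in rocksdb and carry more methods): the wrapper implements only the narrow comparator interface the DB actually needs, forwarding to the user comparator with none of the Customizable registration machinery.

      ```cpp
      #include <cassert>
      #include <cstring>

      // Narrow interface: only what the DB needs from a comparator.
      class CompareInterfaceSketch {
       public:
        virtual ~CompareInterfaceSketch() = default;
        virtual int Compare(const char* a, const char* b) const = 0;
      };

      class BytewiseSketch : public CompareInterfaceSketch {
       public:
        int Compare(const char* a, const char* b) const override {
          return std::strcmp(a, b);
        }
      };

      class UserComparatorWrapperSketch : public CompareInterfaceSketch {
       public:
        explicit UserComparatorWrapperSketch(
            const CompareInterfaceSketch* user)
            : user_(user) {}
        int Compare(const char* a, const char* b) const override {
          return user_->Compare(a, b);  // plain forwarding, no setup
        }

       private:
        const CompareInterfaceSketch* user_;
      };
      ```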
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10837
      
      Test Plan: Make sure existing tests pass
      
      Reviewed By: pdillinger
      
      Differential Revision: D40528511
      
      fbshipit-source-id: 70eaac89ecd55401a26e8ed32abbc413a9617c62
    • Refactor block cache tracing APIs (#10811) · 0e7b27bf
      Committed by akankshamahajan
      Summary:
      Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter.
      
      The existing `DB::StartBlockTrace` will internally redirect to the changed `BlockCacheTrace::StartBlockCacheTrace`.
      A new `DB::StartBlockTrace` API is also added that directly takes a `BlockCacheTraceWriter` pointer.
      
      This same philosophy can be applied to KV and IO tracing as well.
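      The refactor's shape can be sketched as follows (record fields abbreviated, class names simplified): the tracer hands each access to an abstract writer as a structured record, so an implementation can format it however it likes instead of receiving pre-flattened bytes.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <string>

      struct BlockCacheTraceRecordSketch {
        uint64_t access_timestamp;
        std::string block_key;
        bool is_cache_hit;
      };

      class BlockCacheTraceWriterSketch {
       public:
        virtual ~BlockCacheTraceWriterSketch() = default;
        virtual void WriteBlockAccess(
            const BlockCacheTraceRecordSketch& r) = 0;
      };

      // A user-provided writer that only counts hits: easy with a
      // structured record, awkward with flat bytes short of re-parsing.
      class HitCountingWriter : public BlockCacheTraceWriterSketch {
       public:
        void WriteBlockAccess(
            const BlockCacheTraceRecordSketch& r) override {
          if (r.is_cache_hit) ++hits_;
        }
        int hits() const { return hits_; }

       private:
        int hits_ = 0;
      };
      ```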
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811
      
      Test Plan:
      existing unit tests
      Old API DB::StartBlockTrace checked with db_bench tool
      create database
      ```
      ./db_bench --benchmarks="fillseq" \
      --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
      --cache_index_and_filter_blocks --cache_size=1048576 \
      --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
      --min_level_to_compress=-1 --compression_ratio=1 --num=10000000
      ```
      
      To trace block cache accesses when running readrandom benchmark:
      ```
      ./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \
      --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
      --cache_index_and_filter_blocks --cache_size=1048576 \
      --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
      --min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \
      --threads=16 \
      -block_cache_trace_file="/tmp/binary_trace_test_example" \
      -block_cache_trace_max_trace_file_size_in_bytes=1073741824 \
      -block_cache_trace_sampling_frequency=1
      
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D40435289
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282
    • Ignore max_compaction_bytes for compaction input that are within output key-range (#10835) · 333abe9c
      Committed by Changyu Bi
      Summary:
      When picking compaction input files, we sometimes stop picking a file that is fully included in the output key-range due to hitting max_compaction_bytes. Including these input files can potentially reduce write amplification (WA) at the expense of larger compactions. Larger compactions should be fine, as files from the input level are usually 10X smaller than files from the output level. This PR adds a mutable CF option `ignore_max_compaction_bytes_for_input` that is enabled by default. We can remove this option once we are sure it is safe.
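      The picking change can be sketched as a loop over candidate inputs (simplified, not the real compaction picker): once a file's key range is fully contained in the output key range, include it even if that exceeds max_compaction_bytes.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <string>
      #include <vector>

      struct InputFile {
        std::string smallest_key, largest_key;
        uint64_t size;
      };

      uint64_t PickInputBytes(const std::vector<InputFile>& files,
                              const std::string& out_smallest,
                              const std::string& out_largest,
                              uint64_t max_compaction_bytes,
                              bool ignore_limit_for_contained) {
        uint64_t total = 0;
        for (const InputFile& f : files) {
          const bool contained = f.smallest_key >= out_smallest &&
                                 f.largest_key <= out_largest;
          if (total + f.size > max_compaction_bytes &&
              !(ignore_limit_for_contained && contained)) {
            break;  // old behavior: stop even for contained files
          }
          total += f.size;
        }
        return total;
      }
      ```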
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10835
      
      Test Plan:
      - CI, a unit test on max_compaction_bytes fails before turning this flag off.
      - Benchmark does not show much difference in WA: `./db_bench --benchmarks=fillrandom,waitforcompaction,stats,levelstats -max_background_jobs=12 -num=2000000000 -target_file_size_base=33554432 --write_buffer_size=33554432`
      ```
      main:
      ** Compaction Stats [default] **
      Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
      ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        L0      3/0   91.59 MB   0.8     70.9     0.0     70.9     200.8    129.9       0.0   1.5     25.2     71.2   2886.55           2463.45      9725    0.297   1093M   254K       0.0       0.0
        L1      9/0   248.03 MB   1.0    392.0   129.8    262.2     391.7    129.5       0.0   3.0     69.0     68.9   5821.71           5536.90       804    7.241   6029M  5814K       0.0       0.0
        L2     87/0    2.50 GB   1.0    537.0   128.5    408.5     533.8    125.2       0.7   4.2     69.5     69.1   7912.24           7323.70      4417    1.791   8299M    36M       0.0       0.0
        L3    836/0   24.99 GB   1.0    616.9   118.3    498.7     594.5     95.8       5.2   5.0     66.9     64.5   9442.38           8490.28      4204    2.246   9749M   306M       0.0       0.0
        L4   2355/0   62.95 GB   0.3     67.3    37.1     30.2      54.2     24.0      38.9   1.5     72.2     58.2    954.37            821.18       917    1.041   1076M   173M       0.0       0.0
       Sum   3290/0   90.77 GB   0.0   1684.2   413.7   1270.5    1775.0    504.5      44.9  13.7     63.8     67.3  27017.25          24635.52     20067    1.346     26G   522M       0.0       0.0
      
      Cumulative compaction: 1774.96 GB write, 154.29 MB/s write, 1684.19 GB read, 146.40 MB/s read, 27017.3 seconds
      
      This PR:
      ** Compaction Stats [default] **
      Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
      ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        L0      3/0   45.71 MB   0.8     72.9     0.0     72.9     202.8    129.9       0.0   1.6     25.4     70.7   2938.16           2510.36      9741    0.302   1124M   265K       0.0       0.0
        L1      8/0   234.54 MB   0.9    384.5   129.8    254.7     384.2    129.6       0.0   3.0     69.0     68.9   5708.08           5424.43       791    7.216   5913M  5753K       0.0       0.0
        L2     84/0    2.47 GB   1.0    543.1   128.6    414.5     539.9    125.4       0.7   4.2     69.6     69.2   7989.31           7403.13      4418    1.808   8393M    36M       0.0       0.0
        L3    839/0   24.96 GB   1.0    615.6   118.4    497.2     593.2     96.0       5.1   5.0     66.6     64.1   9471.23           8489.31      4193    2.259   9726M   306M       0.0       0.0
        L4   2360/0   63.04 GB   0.3     67.6    37.3     30.3      54.4     24.1      38.9   1.5     71.5     57.6    967.30            827.99       907    1.066   1080M   173M       0.0       0.0
       Sum   3294/0   90.75 GB   0.0   1683.8   414.2   1269.6    1774.5    504.9      44.8  13.7     63.7     67.1  27074.08          24655.22     20050    1.350     26G   522M       0.0       0.0
      
      Cumulative compaction: 1774.52 GB write, 157.09 MB/s write, 1683.77 GB read, 149.06 MB/s read, 27074.1 seconds
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D40518319
      
      Pulled By: cbi42
      
      fbshipit-source-id: f4ea614bc0ebefe007ffaf05bb9aec9a8ca25b60
      333abe9c
  15. 21 Oct, 2022 1 commit
    • A
      Add DB property for fast block cache stats collection (#10832) · 33ceea9b
      Andrew Kryczka committed
      Summary:
      This new property allows users to trigger the background block cache stats collection mode through the `GetProperty()` and `GetMapProperty()` APIs. The background mode has much lower overhead at the expense of returning stale values in more cases.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10832
      
      Test Plan: updated unit test
      
      Reviewed By: pdillinger
      
      Differential Revision: D40497883
      
      Pulled By: ajkr
      
      fbshipit-source-id: bdcc93402f426463abb2153756aad9e295447343
      33ceea9b
  16. 19 Oct, 2022 1 commit
    • Y
      Enable a multi-level db to smoothly migrate to FIFO via DB::Open (#10348) · e267909e
      Yueh-Hsuan Chiang committed
      Summary:
      FIFO compaction can theoretically open a DB with any compaction style.
      However, the current code only allows FIFO compaction to open a DB with
      a single level.
      
      This PR relaxes the limitation of FIFO compaction and allows it to open a
      DB with multiple levels.  Below is the read / write / compaction behavior:
      
      * The read behavior is untouched, and it works like a regular rocksdb instance.
      * The write behavior is untouched as well.  When a FIFO compacted DB
      is opened with multiple levels, all new files will still be in level 0, and no files
      will be moved to a different level.
      * Compaction logic is extended.  It will first identify the bottom-most non-empty level.
      Then, it will delete the oldest file in that level.
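      As a hedged sketch of the extended compaction logic above (the types and
      names here are illustrative stand-ins, not RocksDB's actual internals),
      the picking step can be modeled as:

      ```cpp
      #include <algorithm>
      #include <cstdint>
      #include <utility>
      #include <vector>

      // Illustrative model of FIFO compaction on a multi-level DB: find the
      // bottom-most non-empty level, then pick the oldest file in that level
      // for deletion. A lower file_number is treated as older here.
      struct FileMeta {
        uint64_t file_number;
      };

      // levels[i] holds level i's files. Returns {level, file_number} of the
      // victim, or {-1, 0} when the DB has no files at all.
      std::pair<int, uint64_t> PickFifoVictim(
          const std::vector<std::vector<FileMeta>>& levels) {
        for (int lvl = static_cast<int>(levels.size()) - 1; lvl >= 0; --lvl) {
          if (levels[lvl].empty()) continue;  // scan upward past empty levels
          uint64_t oldest = levels[lvl][0].file_number;
          for (const FileMeta& f : levels[lvl]) {
            oldest = std::min(oldest, f.file_number);
          }
          return {lvl, oldest};
        }
        return {-1, 0};
      }
      ```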
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348
      
      Test Plan:
      Added a new test to verify the migration from level to FIFO where the db has multiple levels.
      Extended existing test cases in db_test and db_basic_test to also verify
      all entries of a key after reopening the DB with FIFO compaction.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D40233744
      
      fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1
      e267909e
  17. 11 Oct, 2022 3 commits
    • J
      Allow the last level data moving up to penultimate level (#10782) · 5a5f21c4
      Jay Zhuang committed
      Summary:
      Lock the penultimate level for the whole compaction inputs range, so any
      key in that compaction is safe to move up from the last level to
      penultimate level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10782
      
      Reviewed By: siying
      
      Differential Revision: D40231540
      
      Pulled By: siying
      
      fbshipit-source-id: ca115cc8b4018b35d797329fa85a19b06cc8c13e
      5a5f21c4
    • P
      Allow manifest fix-up without requiring prior state (#10796) · 2d0380ad
      Peter Dillinger committed
      Summary:
      This change is motivated by ensuring that `ldb update_manifest` or `UpdateManifestForFilesState` can run without expecting files to open when the old temperature is provided (in case the FileSystem strictly interprets non-kUnknown). It also ended up fixing a problem in `OfflineManifestWriter` (used by `ldb unsafe_remove_sst_file`), which would open some SST files during recovery and expect them to match the prior manifest state, even if that was not required by the intended new state.
      
      Also update BackupEngine to retry with Temperature kUnknown when reading file with potentially "wrong" temperature.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10796
      
      Test Plan: tests added/updated, that fail before the change(s) and now pass
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D40232645
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: b5aa2688aecfe0c320b80a7da689b315414c20be
      2d0380ad
    • A
      Provide support for async_io with tailing iterators (#10781) · ebf8c454
      akankshamahajan committed
      Summary:
      Provide support for async_io if ReadOptions.tailing is set to true.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10781
      
      Test Plan:
      - Update unit tests
      - Ran db_bench: ./db_bench --benchmarks="readrandom" --use_existing_db --use_tailing_iterator=1 --async_io=1
      
      Reviewed By: anand1976
      
      Differential Revision: D40128882
      
      Pulled By: anand1976
      
      fbshipit-source-id: 55e17855536871a5c47e2de92d238ae005c32d01
      ebf8c454
  18. 08 Oct, 2022 1 commit
    • J
      Add option `preserve_internal_time_seconds` to preserve the time info (#10747) · c401f285
      Jay Zhuang committed
      Summary:
      Add option `preserve_internal_time_seconds` to preserve the internal
      time information.
      It's mostly for migrating existing data to tiered storage (
      `preclude_last_level_data_seconds`). When the tiering feature is first
      enabled, the existing data won't have the time information needed to
      decide whether it's hot or cold. Enabling this option starts collecting
      and preserving the time information for new data.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747
      
      Reviewed By: siying
      
      Differential Revision: D39910141
      
      Pulled By: siying
      
      fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138
      c401f285
  19. 07 Oct, 2022 1 commit
    • P
      Fix bug in HyperClockCache ApplyToEntries; cleanup (#10768) · b205c6d0
      Peter Dillinger committed
      Summary:
      We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.)
      
      This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats.
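      As an illustrative sketch only (a simplified stand-in, not the actual
      ClockHandleTable code), the invariant the fix restores is: an acquire
      increment is attempted with a CAS and may never take effect, and a
      thread must decrement only when its own increment actually succeeded:

      ```cpp
      #include <atomic>
      #include <cstdint>

      // Simplified model: the low bits hold an acquire counter, and one high
      // bit marks the entry as defunct. The increment is only "effective" if
      // the CAS lands while the entry is still live.
      constexpr uint64_t kDefunctBit = uint64_t{1} << 32;

      // Returns true iff the acquire counter was actually incremented. The
      // caller must decrement (release) only when this returned true;
      // decrementing after an ineffective increment would corrupt the count.
      bool TryAcquire(std::atomic<uint64_t>& meta) {
        uint64_t old_meta = meta.load(std::memory_order_relaxed);
        while (!(old_meta & kDefunctBit)) {
          if (meta.compare_exchange_weak(old_meta, old_meta + 1,
                                         std::memory_order_acquire)) {
            return true;  // effective increment
          }
          // CAS failed: old_meta was reloaded; re-check the defunct bit.
        }
        return false;  // no increment happened, so never decrement
      }
      ```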
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768
      
      Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test)
      
      Reviewed By: anand1976
      
      Differential Revision: D40034747
      
      Pulled By: anand1976
      
      fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68
      b205c6d0
  20. 06 Oct, 2022 2 commits
    • Y
      Sanitize min_write_buffer_number_to_merge to 1 with atomic_flush (#10773) · 4d82b948
      Yanqin Jin committed
      Summary:
      With the current implementation, when atomic flush is enabled, all column families with non-empty memtables in the same RocksDB instance will be scheduled for flush whenever RocksDB determines that any column family needs to be flushed (e.g. memtable full, write buffer manager, etc.). Not doing so can lead to data loss and inconsistency when the WAL is disabled, which is a common setting when atomic flush is enabled. Therefore, setting the per-column-family knob min_write_buffer_number_to_merge to a value greater than 1 is incompatible with atomic flush, and it should be sanitized during column family creation and DB open.
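      A minimal sketch of the sanitization rule (the function name and shape
      are illustrative for this summary, not RocksDB's actual SanitizeOptions
      API):

      ```cpp
      // If atomic flush is enabled, force min_write_buffer_number_to_merge
      // back to 1 so that every column family's memtable is flushable
      // immediately, keeping the atomic flush group consistent.
      int SanitizedMinWriteBufferNumberToMerge(bool atomic_flush,
                                               int min_to_merge) {
        if (atomic_flush && min_to_merge > 1) {
          return 1;  // a value > 1 can hold back a CF and break atomicity
        }
        return min_to_merge;
      }
      ```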
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10773
      
      Test Plan:
      Reproduce: D39993203 has detailed steps.
      Run the test with and without the fix.
      
      Reviewed By: cbi42
      
      Differential Revision: D40077955
      
      Pulled By: cbi42
      
      fbshipit-source-id: 451a9179eb531ac42eaccf40b451b9dec4085240
      4d82b948
    • C
      Ignore kBottommostFiles compaction logic when allow_ingest_behind (#10767) · eca47fb6
      Changyu Bi committed
      Summary:
      Fix for https://github.com/facebook/rocksdb/issues/10752, where RocksDB could enter an infinite compaction loop (with compaction reason kBottommostFiles) if allow_ingest_behind is enabled and the bottommost level is unfilled.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10767
      
      Test Plan: Added a unit test to reproduce the compaction loop.
      
      Reviewed By: ajkr
      
      Differential Revision: D40031861
      
      Pulled By: ajkr
      
      fbshipit-source-id: 71c4b02931fbe507a847632905404c9b8fa8c96b
      eca47fb6
  21. 05 Oct, 2022 3 commits
  22. 04 Oct, 2022 1 commit
    • A
      Add new property in IOOptions to skip recursing through directories and list... · ae0f9c33
      akankshamahajan committed
      Add new property in IOOptions to skip recursing through directories and list only files during GetChildren. (#10668)
      
      Summary:
      Add new property "do_not_recurse" in  IOOptions for underlying file system to skip iteration of directories during DB::Open if there are no sub directories and list only files.
      By default this property is set to false. This property is set true currently in the code where RocksDB is sure only files are needed during DB::Open.
      
      Provided support in PosixFileSystem to use "do_not_recurse".
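      One possible reading of the option, as a self-contained sketch using
      std::filesystem (the function and its shape are assumptions for
      illustration, not the actual PosixFileSystem code): with do_not_recurse
      set, the listing skips directory entries and returns only regular files.

      ```cpp
      #include <filesystem>
      #include <string>
      #include <vector>

      namespace fs = std::filesystem;

      // Sketch of a GetChildren-style listing. With do_not_recurse set, only
      // regular files in the directory itself are returned; otherwise the
      // walk also descends into subdirectories.
      std::vector<std::string> ListChildren(const std::string& dir,
                                            bool do_not_recurse) {
        std::vector<std::string> names;
        if (do_not_recurse) {
          for (const auto& entry : fs::directory_iterator(dir)) {
            if (entry.is_regular_file()) {
              names.push_back(entry.path().filename().string());
            }
          }
        } else {
          for (const auto& entry : fs::recursive_directory_iterator(dir)) {
            names.push_back(entry.path().filename().string());
          }
        }
        return names;
      }
      ```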
      
      Test Plan:
      - Existing tests
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10668
      
      Reviewed By: anand1976
      
      Differential Revision: D39471683
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 90e32f0b86d5346d53bc2714d3a0e7002590527f
      ae0f9c33
  23. 01 Oct, 2022 1 commit
    • C
      User-defined timestamp support for `DeleteRange()` (#10661) · 9f2363f4
      Changyu Bi committed
      Summary:
      Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
      - internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
      - Garbage collection of range tombstones and range tombstone covered keys during compaction.
      - Get()/MultiGet() to return the timestamp of a range tombstone when needed.
      - Get/Iterator with range tombstones bounded by readoptions.timestamp.
      - timestamp crash test now issues DeleteRange by default.
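      As a hedged illustration of the "compare without timestamp" idea above
      (kTsSize and the function here are assumptions for this sketch, not
      RocksDB's actual comparator API): with user-defined timestamps, the user
      key carries a fixed-size timestamp suffix, and range-deletion logic
      often needs to compare just the key part while ignoring that suffix.

      ```cpp
      #include <cstddef>
      #include <string>

      // Assumed fixed-width timestamp suffix appended to every user key.
      constexpr size_t kTsSize = 8;

      // Compare two timestamped user keys while ignoring their trailing
      // timestamps, so a range tombstone's [begin, end) bounds match all
      // versions of a key regardless of timestamp.
      int CompareWithoutTimestamp(const std::string& a, const std::string& b) {
        return a.substr(0, a.size() - kTsSize)
            .compare(b.substr(0, b.size() - kTsSize));
      }
      ```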
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661
      
      Test Plan:
      - Added unit test: `make check`
      - Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
      - Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench --benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see a consistent regression in the no-timestamp case.
      
      | micros/op | fillrandom | seekrandom |
      | --- | --- | --- |
      |main| 2.58 |10.96|
      |PR 10661| 2.68 |10.63|
      
      Reviewed By: riversand963
      
      Differential Revision: D39441192
      
      Pulled By: cbi42
      
      fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
      9f2363f4
  24. 30 Sep, 2022 1 commit
    • J
      Align compaction output file boundaries to the next level ones (#10655) · f3cc6663
      Jay Zhuang committed
      Summary:
      Try to align the compaction output file boundaries to the next level ones
      (grandparent level), to reduce the level compaction write-amplification.
      
      In level compaction, there are "wasted" data at the beginning and end of the
      output level files. Align the file boundary can avoid such "wasted" compaction.
      With this PR, it tries to align the non-bottommost level file boundaries to
      their next-level ones. It may cut a file when the file size is large enough
      (at least 50% of target_file_size) and not too large (at most 2x
      target_file_size).
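      The cutting rule above can be sketched as a small predicate (names are
      illustrative; the real logic in the PR also weighs overlap with
      grandparent files):

      ```cpp
      #include <cstdint>

      // Cut the current output file when it reaches the hard 2x cap, or when
      // a grandparent (next-level) boundary is reached and the file is
      // already at least half of target_file_size.
      bool ShouldCutOutputFile(uint64_t current_size, uint64_t target_file_size,
                               bool at_grandparent_boundary) {
        if (current_size >= 2 * target_file_size) {
          return true;  // never exceed 2x target_file_size
        }
        return at_grandparent_boundary && current_size >= target_file_size / 2;
      }
      ```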
      
      db_bench shows about 12.56% compaction reduction:
      ```
      TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432
      
      # baseline:
      Flush(GB): cumulative 25.882, interval 7.216
      Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds
      
      # with this change:
      Flush(GB): cumulative 25.882, interval 7.753
      Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
      ```
      
      The compaction simulator shows a similar result (14% with 100G random data).
      As a side effect, with this PR the SST file size can exceed
      target_file_size, but it is capped at 2x target_file_size, and there will
      also be some smaller files. Here are file size statistics from loading
      100GB with a target file size of 32MB:
      ```
                baseline      this_PR
      count  1.656000e+03  1.705000e+03
      mean   3.116062e+07  3.028076e+07
      std    7.145242e+06  8.046139e+06
      ```
      
      The feature is enabled by default; to revert to the old behavior, disable it
      with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`.
      
      Also includes https://github.com/facebook/rocksdb/issues/1963 to cut a file
      before a skippable grandparent file. This helps use cases where the user
      adds two or more non-overlapping data ranges at the same time; it can
      reduce the overlap of the two datasets in the lower levels.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655
      
      Reviewed By: cbi42
      
      Differential Revision: D39552321
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
      f3cc6663