1. 16 Mar 2023, 1 commit
    • Add new stat rocksdb.table.open.prefetch.tail.read.bytes, rocksdb.table.open.prefetch.tail.{miss|hit} (#11265) · bab5f9a6
      Committed by Hui Xiao
      
      Summary:
      **Context/Summary:**
      We are adding new stats to measure behavior of prefetched tail size and look up into this buffer
      
      The stat collection is done in FilePrefetchBuffer, but for now only for the prefetched tail buffer during table open, identified via a FilePrefetchBuffer enum. This is cleaner than the alternative of implementing the collection at the upper-level call sites of FilePrefetchBuffer for table open. It also has the benefit of being extensible to other types of FilePrefetchBuffer if needed. See the db_bench results below for the perf regression concern.
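      For orientation, here is a minimal C++ sketch, not part of the PR, showing how the new tickers and histogram can be read through the `Statistics` API (the DB path and the elided write workload are illustrative):
      ```
      // Sketch: read the new table-open prefetch-tail stats via Statistics.
      // Assumes a build containing this PR; path and workload are illustrative.
      #include <cassert>
      #include <cstdio>
      #include "rocksdb/db.h"
      #include "rocksdb/statistics.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.statistics = rocksdb::CreateDBStatistics();
        rocksdb::DB* db = nullptr;
        assert(rocksdb::DB::Open(options, "/tmp/prefetch_tail_demo", &db).ok());
        // ... write data and flush so that tables are created and reopened ...
        uint64_t hit = options.statistics->getTickerCount(
            rocksdb::TABLE_OPEN_PREFETCH_TAIL_HIT);
        uint64_t miss = options.statistics->getTickerCount(
            rocksdb::TABLE_OPEN_PREFETCH_TAIL_MISS);
        rocksdb::HistogramData read_bytes;
        options.statistics->histogramData(
            rocksdb::TABLE_OPEN_PREFETCH_TAIL_READ_BYTES, &read_bytes);
        std::printf("tail hit=%llu miss=%llu read-bytes p99=%.0f\n",
                    (unsigned long long)hit, (unsigned long long)miss,
                    read_bytes.percentile99);
        delete db;
        return 0;
      }
      ```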
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11265
      
      Test Plan:
      **- Piggyback on existing tests**
      **- rocksdb.table.open.prefetch.tail.miss is harder to unit test, so I manually set the prefetch tail read bytes to be small and ran db_bench.**
      ```
      ./db_bench -db=/tmp/testdb -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3  -use_direct_reads=true
      ```
      ```
      rocksdb.table.open.prefetch.tail.read.bytes P50 : 4096.000000 P95 : 4096.000000 P99 : 4096.000000 P100 : 4096.000000 COUNT : 225 SUM : 921600
      rocksdb.table.open.prefetch.tail.miss COUNT : 91
      rocksdb.table.open.prefetch.tail.hit COUNT : 1034
      ```
      **- No perf regression observed in db_bench**
      
      SETUP command: create the same db with ~900 files for both the pre-change and post-change binaries.
      ```
      ./db_bench -db=/tmp/testdb -benchmarks="fillseq" -key_size=32 -value_size=512 -num=500000 -write_buffer_size=655360  -disable_auto_compactions=true -target_file_size_base=16777216 -compression_type=none
      ```
      TEST command, 60 runs or until convergence: as suggested by anand1976 and akankshamahajan15, vary `seek_nexts` and `async_io` in testing.
      ```
      ./db_bench -use_existing_db=true -db=/tmp/testdb -statistics=false -cache_size=0 -cache_index_and_filter_blocks=false -benchmarks=seekrandom[-X60] -num=50000 -seek_nexts={10, 500, 1000} -async_io={0|1} -use_direct_reads=true
      ```
      async io = 0, direct io read = true
      
        | seek_nexts = 10, 30 runs | seek_nexts = 500, 12 runs | seek_nexts = 1000, 6 runs
      -- | -- | -- | --
      pre-change | 4776 (± 28) ops/sec;   24.8 (± 0.1) MB/sec | 288 (± 1) ops/sec;   74.8 (± 0.4) MB/sec | 145 (± 4) ops/sec;   75.6 (± 2.2) MB/sec
      post-change | 4790 (± 32) ops/sec;   24.9 (± 0.2) MB/sec | 288 (± 3) ops/sec;   74.7 (± 0.8) MB/sec | 143 (± 3) ops/sec;   74.5 (± 1.6) MB/sec
      
      async io = 1, direct io read = true
        | seek_nexts = 10, 54 runs | seek_nexts = 500, 6 runs | seek_nexts = 1000, 4 runs
      -- | -- | -- | --
      pre-change | 3350 (± 36) ops/sec;   17.4 (± 0.2) MB/sec | 264 (± 0) ops/sec;   68.7 (± 0.2) MB/sec | 138 (± 1) ops/sec;   71.8 (± 1.0) MB/sec
      post-change | 3358 (± 27) ops/sec;   17.4 (± 0.1) MB/sec | 263 (± 2) ops/sec;   68.3 (± 0.8) MB/sec | 139 (± 1) ops/sec;   72.6 (± 0.6) MB/sec
      
      Reviewed By: ajkr
      
      Differential Revision: D43781467
      
      Pulled By: hx235
      
      fbshipit-source-id: a706a18472a8edb2b952bac3af40eec803537f2a
  2. 15 Mar 2023, 1 commit
    • Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148) · 11cb6af6
      Committed by Hui Xiao
      
      Summary:
      **Context:**
      Atomic flush should guarantee the recoverability of all data with seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time the atomic flush finishes, through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case, where a CF excluded from an atomic flush contains unflushed data with seqno less than the max seqno of that atomic flush, and loses that data with `WriteOptions::DisableWAL=true` in the face of a crash right after the atomic flush finishes.
      ```
      ./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 
--verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
          pid=$!
          sleep 0.2
          sleep 10
          kill $pid
          sleep 0.2
      ./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 
--verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
          pid=$!
          sleep 0.2
          sleep 40
          kill $pid
          sleep 0.2
      
      Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
      Crash-recovery verification failed :(
      No writes or ops?
      Verification failed :(
      ```
      
      The bug is due to the following:
      - When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
      - But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
      - Thus, data@seqno N in the already-excluded CF is left out of this atomic flush even though seqno N is less than seqno M.
      
      **Summary:**
      - Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()`, so that no write affecting the recoverability of this atomic flush job (i.e., a change to the max seqno of this atomic flush, or an insertion of data with a smaller seqno than that max seqno into an excluded CF) can happen after `SelectColumnFamiliesForAtomicFlush()` is called. A usage sketch of the protected pattern follows this list.
      - Along with the above, refactored and clarified the comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of the CFDs passed to atomic flush.
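      As context for the guarantee, here is a hedged sketch of the usage pattern the fix protects, namely atomic flush with the WAL disabled (the path, CF names, and keys are illustrative):
      ```
      // Sketch: atomic flush with disableWAL, where recoverability depends on
      // flushing all CFs together. Path/CF names/keys are illustrative.
      #include <cassert>
      #include <vector>
      #include "rocksdb/db.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.create_missing_column_families = true;
        options.atomic_flush = true;  // all-or-nothing flush across CFs
        std::vector<rocksdb::ColumnFamilyDescriptor> cf_descs = {
            {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
            {"cf1", rocksdb::ColumnFamilyOptions()}};
        std::vector<rocksdb::ColumnFamilyHandle*> handles;
        rocksdb::DB* db = nullptr;
        assert(rocksdb::DB::Open(options, "/tmp/atomic_flush_demo", cf_descs,
                                 &handles, &db).ok());
        rocksdb::WriteOptions wo;
        wo.disableWAL = true;  // recoverability now rests on the flush alone
        assert(db->Put(wo, handles[0], "k0", "v0").ok());
        assert(db->Put(wo, handles[1], "k1", "v1").ok());
        // With the fix, CF selection happens after any mutex-releasing wait,
        // so no CF holding data below the flush's max seqno is excluded.
        assert(db->Flush(rocksdb::FlushOptions(), handles).ok());
        for (auto* h : handles) db->DestroyColumnFamilyHandle(h);
        delete db;
        return 0;
      }
      ```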
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
      
      Test Plan:
      - New unit test fails before the fix and passes after it
      - Make check
      - Rehearsal stress test
      
      Reviewed By: ajkr
      
      Differential Revision: D42799871
      
      Pulled By: hx235
      
      fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
  3. 14 Mar 2023, 2 commits
    • Rename a recently added PerfContext counter (#11294) · 49881921
      Committed by Levi Tamasi
      Summary:
      The patch renames the counter added in https://github.com/facebook/rocksdb/issues/11284 for better consistency with the existing naming scheme.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11294
      
      Test Plan: `make check`
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D44035964
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 8b1a2a03ee728148365367e0ecc1fcf462f62191
    • Support range deletion tombstones in `CreateColumnFamilyWithImport` (#11252) · 9aa3b6f9
      Committed by Changyu Bi
      Summary:
      CreateColumnFamilyWithImport() did not support range tombstones, for two reasons:
      1. It uses the point keys of an input file to determine the file's boundaries (smallest and largest internal key), which means range tombstones outside of the point-key range would be effectively dropped.
      2. It does not handle files with no point keys.
      
      Also included a fix in external_sst_file_ingestion_job.cc so that the blocks read in `GetIngestedFileInfo()` can now be added to the block cache (this issue was fixed in https://github.com/facebook/rocksdb/pull/6429).
      
      This PR adds support for exporting and importing column family with range tombstones. The main change is to add smallest internal key and largest internal key to `SstFileMetaData` that will be part of the output of `ExportColumnFamily()`. Then during `CreateColumnFamilyWithImport(...,const ExportImportFilesMetaData& metadata,...)`, file boundaries can be set from `metadata` directly. This is needed since when file boundaries are extended by range tombstones, sometimes they cannot be deduced from a file's content alone.
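      A hedged sketch of the export/import round trip this enables (the export directory and CF name are illustrative):
      ```
      // Sketch: export a CF and re-import it, range tombstones included.
      // The smallest/largest internal keys now recorded in SstFileMetaData
      // travel inside ExportImportFilesMetaData.
      #include <cassert>
      #include "rocksdb/db.h"
      #include "rocksdb/utilities/checkpoint.h"

      void ExportAndImport(rocksdb::DB* src, rocksdb::ColumnFamilyHandle* src_cf,
                           rocksdb::DB* dst) {
        rocksdb::Checkpoint* checkpoint = nullptr;
        assert(rocksdb::Checkpoint::Create(src, &checkpoint).ok());
        rocksdb::ExportImportFilesMetaData* metadata = nullptr;
        // File boundaries (possibly extended by range tombstones) are captured
        // in `metadata` at export time.
        assert(checkpoint->ExportColumnFamily(src_cf, "/tmp/cf_export",
                                              &metadata).ok());
        rocksdb::ColumnFamilyHandle* dst_cf = nullptr;
        rocksdb::ImportColumnFamilyOptions import_opts;
        // Boundaries come from `metadata` instead of being re-derived from
        // each file's point keys.
        assert(dst->CreateColumnFamilyWithImport(rocksdb::ColumnFamilyOptions(),
                                                 "imported_cf", import_opts,
                                                 *metadata, &dst_cf).ok());
        delete metadata;
        delete checkpoint;
      }
      ```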
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11252
      
      Test Plan:
      - added unit tests that fail before this change
      
      Closes https://github.com/facebook/rocksdb/issues/11245
      
      Reviewed By: ajkr
      
      Differential Revision: D43577443
      
      Pulled By: cbi42
      
      fbshipit-source-id: 6bff78e583cc50c44854994dea0a8dd519398f2f
  4. 09 Mar 2023, 1 commit
    • Add a PerfContext counter for merge operands applied in point lookups (#11284) · 1d524385
      Committed by Levi Tamasi
      Summary:
      The existing PerfContext counter `internal_merge_count` only tracks the
      Merge operands applied during range scans. The patch adds a new counter
      called `internal_merge_count_point_lookups` to track the same metric
      for point lookups (`Get` / `MultiGet` / `GetEntity` / `MultiGetEntity`), and
      also fixes a couple of cases in the iterator where the existing counter wasn't
      updated.
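      For illustration, a small sketch of reading the counter around a point lookup (the key is hypothetical; the counter name shown is the post-rename spelling from #11294 above):
      ```
      // Sketch: count merge operands applied while serving a single Get().
      #include <cstdio>
      #include <string>
      #include "rocksdb/db.h"
      #include "rocksdb/perf_context.h"
      #include "rocksdb/perf_level.h"

      void CountMergeOperands(rocksdb::DB* db) {
        rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableCount);
        rocksdb::get_perf_context()->Reset();
        std::string value;
        db->Get(rocksdb::ReadOptions(), "merged_key", &value)
            .PermitUncheckedError();
        std::printf("point-lookup merges: %llu\n",
                    (unsigned long long)rocksdb::get_perf_context()
                        ->internal_merge_point_lookup_count);
      }
      ```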
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11284
      
      Test Plan: `make check`
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D43926082
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 321566d8b4cf0a3b6c9b73b7a5c984fb9bb492e9
  5. 02 Mar 2023, 1 commit
    • Fix backward iteration issue when user defined timestamp is enabled in BlobDB (#11258) · 8dfcfd4e
      Committed by Yu Zhang
      Summary:
      During backward iteration, blob verification would fail because the user key (ts included) in `saved_key_` doesn't match the blob. This happens because, during `FindValueForCurrentKey`, `saved_key_` is not updated when the user key (ts not included) stays the same, in all cases except when `timestamp_lb_` is specified. This breaks the blob verification logic when user-defined timestamps are enabled and `timestamp_lb_` is not specified. Fix this by always updating `saved_key_` when a smaller user key (ts included) is seen.
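      To make the affected scenario concrete, a hedged sketch of backward iteration over blobs with user-defined timestamps enabled (the path, key, and timestamp values are illustrative):
      ```
      // Sketch: reverse iteration in a BlobDB with user-defined timestamps,
      // the path exercised by this fix.
      #include <cassert>
      #include <memory>
      #include <string>
      #include "rocksdb/comparator.h"
      #include "rocksdb/db.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.enable_blob_files = true;
        options.comparator = rocksdb::BytewiseComparatorWithU64Ts();
        rocksdb::DB* db = nullptr;
        assert(rocksdb::DB::Open(options, "/tmp/blob_udt_demo", &db).ok());
        std::string ts(8, '\0');  // a u64 timestamp; zero for simplicity
        assert(db->Put(rocksdb::WriteOptions(), "key1", ts, "blob_value").ok());
        assert(db->Flush(rocksdb::FlushOptions()).ok());
        std::string read_ts(8, '\xff');  // read at the maximum timestamp
        rocksdb::Slice read_ts_slice(read_ts);
        rocksdb::ReadOptions ro;
        ro.timestamp = &read_ts_slice;  // note: no iter_start_ts equivalent set
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
        // Backward iteration: blob verification uses saved_key_, which is
        // now updated whenever a smaller user key (ts included) is seen.
        for (it->SeekToLast(); it->Valid(); it->Prev()) {
          assert(it->status().ok());
        }
        it.reset();
        delete db;
        return 0;
      }
      ```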
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11258
      
      Test Plan:
      `make check`
      `./db_blob_basic_test --gtest_filter=DBBlobWithTimestampTest.IterateBlobs`
      
      Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced with:
      
      `./db_bench -user_timestamp_size=8  -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5`
      
      Baseline:
      
      - seekrandom [AVG    6 runs] : 72188 (± 1481) ops/sec;   37.2 (± 0.8) MB/sec
      
      With this PR:
      
      - seekrandom [AVG    6 runs] : 74171 (± 1427) ops/sec;   38.2 (± 0.7) MB/sec
      
      Reviewed By: ltamasi
      
      Differential Revision: D43675642
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 8022ae8522d1f66548821855e6eed63640c14e04
  6. 01 Mar 2023, 1 commit
  7. 23 Feb 2023, 2 commits
    • Support iter_start_ts in integrated BlobDB (#11244) · f007b8fd
      Committed by Yu Zhang
      Summary:
      Fixed an issue during backward iteration when `iter_start_ts` is set in an integrated BlobDB.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11244
      
      Test Plan:
      ```
      make check
      ./db_blob_basic_test --gtest_filter="DBBlobWithTimestampTest.IterateBlobs"
      tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887 --enable_blob_files=1
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D43506726
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 2cdc19ebf8da909d8d43d621353905784949a9f0
    • Refactor AddRangeDels() + consider range tombstone during compaction file cutting (#11113) · 229297d1
      Committed by Changyu Bi
      Summary:
      A second attempt after https://github.com/facebook/rocksdb/issues/10802, with bug fixes and refactoring. This PR updates the compaction logic to take range tombstones into account when determining whether to cut the current compaction output file (https://github.com/facebook/rocksdb/issues/4811). Before this change, only point keys were considered, and range tombstones could cause large compactions. For example, if the current compaction output is a range tombstone [a, b) and 2 point keys y, z, they would be added to the same file and might overlap with too many files in the next level, causing a large compaction in the future. This PR also includes ajkr's effort to simplify the logic of adding range tombstones to compaction output files in `AddRangeDels()` ([https://github.com/facebook/rocksdb/issues/11078](https://github.com/facebook/rocksdb/pull/11078#issuecomment-1386078861)).
      
      The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new class, `CompactionMergingIterator`, is introduced to replace `MergingIterator` under `CompactionIterator` to enable emitting range tombstone start keys. Further improvements after this PR include cutting compaction output at some grandparent boundary key (instead of the next output key) when cutting within a range tombstone, to reduce overlap with grandparents.
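      For intuition, a small sketch of the write pattern that motivates cutting output files within a range tombstone (the keys are illustrative):
      ```
      // Sketch: a wide range tombstone plus a few point keys. Before this PR,
      // output-file cutting considered only point keys, so such a tombstone
      // could force one over-wide output file and large future compactions.
      #include <cassert>
      #include "rocksdb/db.h"

      void WriteWideTombstone(rocksdb::DB* db) {
        rocksdb::WriteOptions wo;
        // Range tombstone [a, y) spans most of the key space.
        assert(db->DeleteRange(wo, db->DefaultColumnFamily(), "a", "y").ok());
        // A couple of point keys beyond it.
        assert(db->Put(wo, "y1", "v").ok());
        assert(db->Put(wo, "z1", "v").ok());
        assert(db->Flush(rocksdb::FlushOptions()).ok());
        // With this PR, a compaction including this data may cut its output
        // within [a, y) rather than emitting everything into one file.
      }
      ```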
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11113
      
      Test Plan:
      * added unit test in db_range_del_test
      * crash test with a small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10`
      
      Reviewed By: ajkr
      
      Differential Revision: D42655709
      
      Pulled By: cbi42
      
      fbshipit-source-id: 8367e36ef5640e8f21c14a3855d4a8d6e360a34c
  8. 22 Feb 2023, 1 commit
  9. 18 Feb 2023, 2 commits
    • Remove FactoryFunc from LoadXXXObject (#11203) · b6640c31
      Committed by mrambacher
      Summary:
      The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available.  With the removal of LITE mode, the function was no longer required.
      
      Note that the MergeOperator had some private classes defined in header files.  To gain access to their constructors (and name methods), the class definitions were moved into header files.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203
      
      Reviewed By: cbi42
      
      Differential Revision: D43160255
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c
    • Merge operator failed subcode (#11231) · 25e13652
      Committed by Andrew Kryczka
      Summary:
      From HISTORY.md: Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions.
      
      This is a followup to https://github.com/facebook/rocksdb/issues/11092, where we gave users the ability to keep a DB running despite merge operator failures. Now that the DB keeps running despite such failures, users want to be able to distinguish them from real corruptions.
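      A small sketch of how a caller can now tell the two kinds of corruption apart (the helper is ours, not a RocksDB API):
      ```
      // Sketch: distinguish a merge-operator-induced failure from a genuine
      // data corruption using the new subcode.
      #include "rocksdb/status.h"

      bool IsMergeOperatorFailure(const rocksdb::Status& s) {
        return s.IsCorruption() &&
               s.subcode() == rocksdb::Status::SubCode::kMergeOperatorFailed;
      }
      ```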
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11231
      
      Test Plan: updated unit test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D43396607
      
      Pulled By: ajkr
      
      fbshipit-source-id: 17fbcc779ad724dafada8abd73efd38e1c5208b9
  10. 17 Feb 2023, 2 commits
  11. 16 Feb 2023, 1 commit
  12. 10 Feb 2023, 1 commit
    • Put Cache and CacheWrapper in new public header (#11192) · 3cacd4b4
      Committed by Peter Dillinger
      Summary:
      The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment.
      
      In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544
      
      Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it.
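      A minimal sketch of the now-public extension point (the wrapper class is illustrative); note that derived classes must supply `Name()`:
      ```
      // Sketch: customize a cache by wrapping an existing implementation.
      #include <memory>
      #include "rocksdb/advanced_cache.h"
      #include "rocksdb/cache.h"

      class LoggingCache : public rocksdb::CacheWrapper {
       public:
        explicit LoggingCache(std::shared_ptr<rocksdb::Cache> target)
            : CacheWrapper(std::move(target)) {}
        // Name() has no default implementation in CacheWrapper anymore.
        const char* Name() const override { return "LoggingCache"; }
      };

      // Usage: std::make_shared<LoggingCache>(rocksdb::NewLRUCache(64 << 20));
      ```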
      
      Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192
      
      Test Plan: `make check`
      
      Reviewed By: anand1976
      
      Differential Revision: D43055487
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110
  13. 09 Feb 2023, 1 commit
    • Fix bug in WAL streaming uncompression (#11198) · 77b61abc
      Committed by anand76
      Summary:
      Fix a bug in the calculation of the input buffer address/offset in log_reader.cc. The bug occurs when consecutive fragments of a compressed record land at the same offset in the log reader buffer: the second fragment's input buffer is treated as a leftover from the previous input buffer, and as a result the offset in the `ZSTD_inBuffer` is not reset.
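      To illustrate the invariant with the plain ZSTD streaming API (the helper below is ours, not RocksDB code): every fresh fragment needs its own `ZSTD_inBuffer` with `pos` starting at zero, even when it lands at the same address as the previous fragment:
      ```
      // Sketch: feed one compressed fragment to a ZSTD decompression stream.
      #include <zstd.h>

      void FeedFragment(ZSTD_DStream* dstream, const char* frag, size_t len,
                        char* out, size_t out_cap) {
        ZSTD_inBuffer in = {frag, len, /*pos=*/0};  // pos reset per fragment
        ZSTD_outBuffer outbuf = {out, out_cap, 0};
        while (in.pos < in.size) {
          size_t ret = ZSTD_decompressStream(dstream, &outbuf, &in);
          if (ZSTD_isError(ret)) break;  // handle the error in real code
        }
      }
      ```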
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11198
      
      Test Plan: Add a unit test in log_test.cc that fails without the fix and passes with it.
      
      Reviewed By: ajkr, cbi42
      
      Differential Revision: D43102692
      
      Pulled By: anand1976
      
      fbshipit-source-id: aa2648f4802c33991b76a3233c5a58d4cc9e77fd
  14. 08 Feb 2023, 2 commits
    • Add compaction filter support for wide-column entities (#11196) · 876d2815
      Committed by Levi Tamasi
      Summary:
      The patch adds compaction filter support for wide-column entities by introducing
      a new `CompactionFilter` API called `FilterV3`. This API is called for regular
      key-values, merge operands, and wide-column entities as well. It is passed the
      existing value/operand or wide-column structure and it can update the value or
      columns or keep/delete/etc. the key-value as usual. For compatibility, the default
      implementation of `FilterV3` keeps all wide-column entities and falls back to calling
      `FilterV2` for plain old key-values and merge operands.
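      A hedged sketch of an override (the "expired" column and the filtering policy are invented for illustration):
      ```
      // Sketch: a FilterV3 implementation that drops wide-column entities
      // carrying a hypothetical "expired" column and keeps everything else.
      #include <string>
      #include <utility>
      #include <vector>
      #include "rocksdb/compaction_filter.h"
      #include "rocksdb/wide_columns.h"

      class ExpiryFilter : public rocksdb::CompactionFilter {
       public:
        const char* Name() const override { return "ExpiryFilter"; }

        Decision FilterV3(
            int level, const rocksdb::Slice& key, ValueType value_type,
            const rocksdb::Slice* existing_value,
            const rocksdb::WideColumns* existing_columns,
            std::string* new_value,
            std::vector<std::pair<std::string, std::string>>* new_columns,
            std::string* skip_until) const override {
          if (value_type == ValueType::kWideColumnEntity && existing_columns) {
            for (const auto& column : *existing_columns) {
              if (column.name() == "expired") return Decision::kRemove;
            }
            return Decision::kKeep;
          }
          // Plain key-values and merge operands: defer to the compatibility
          // path, which routes them through FilterV2.
          return CompactionFilter::FilterV3(level, key, value_type,
                                            existing_value, existing_columns,
                                            new_value, new_columns, skip_until);
        }
      };
      ```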
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11196
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D43094147
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 75acabe9a35254f7f404ba6173ee9c2774382ebd
    • Remove a couple deprecated convenience.h APIs (#11120) · 6650ca24
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      As instructed by the comments in convenience.h, a few deprecated APIs are removed.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11120
      
      Test Plan:
      - make check & CI
      - eyeball check on test semantics.
      
      Reviewed By: pdillinger
      
      Differential Revision: D42937507
      
      Pulled By: hx235
      
      fbshipit-source-id: a9e4709387da01b1d0e9148c2e210f02e9746ee1
  15. 04 Feb 2023, 2 commits
    • Use LIB_MODE=shared build by default with make (#11168) · cf756ed9
      Committed by Peter Dillinger
      Summary:
      With https://github.com/facebook/rocksdb/issues/11150 this becomes a practical change that I think is overall good for developer efficiency.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11168
      
      Test Plan:
      More efficient build of all unit tests and tools:
      
      ```
      $ git clean -fdx
      $ du -sh .
      522M    .
      $ /usr/bin/time make -j32 LIB_MODE=static
      ...
      14270.63user 1043.33system 11:19.85elapsed 2252%CPU (0avgtext+0avgdata 1929944maxresident)k
      ...
      $ du -sh .
      62G     .
      $
      ```
      Vs.
      ```
      $ git clean -fdx
      $ du -sh .
      522M    .
      $ /usr/bin/time make -j32 LIB_MODE=shared
      ...
      9479.87user 478.26system 7:20.82elapsed 2258%CPU (0avgtext+0avgdata 1929272maxresident)k
      ...
      $ du -sh .
      5.4G    .
      $
      ```
      
      So 1/3 less build time and >90% less space usage.
      
      Individual unit test edit-compile-run is not too different. Modifying an average unit test source file:
      ```
      $ touch db/version_builder_test.cc
      $ /usr/bin/time make -j32 LIB_MODE=static version_builder_test
      ...
      34.74user 3.37system 0:38.29elapsed 99%CPU (0avgtext+0avgdata 945520maxresident)k
      ```
      Vs.
      ```
      $ touch db/version_builder_test.cc
      $ /usr/bin/time make -j32 LIB_MODE=shared version_builder_test
      ...
      116.26user 43.91system 0:28.65elapsed 559%CPU (0avgtext+0avgdata 675160maxresident)k
      ```
      A little faster with shared.
      
      However, modifying an average DB implementation file has an extra linking step with shared lib:
      ```
      $ touch db/db_impl/db_impl_files.cc
      $ /usr/bin/time make -j32 LIB_MODE=static version_builder_test
      ...
      33.17user 5.13system 0:39.70elapsed 96%CPU (0avgtext+0avgdata 945544maxresident)k
      ```
      Vs.
      ```
      $ touch db/db_impl/db_impl_files.cc
      $ /usr/bin/time make -j32 LIB_MODE=shared version_builder_test
      ...
      40.80user 4.66system 0:45.54elapsed 99%CPU (0avgtext+0avgdata 1056340maxresident)k
      ```
      A little slower with shared.
      
      On the whole, it should be faster and lighter weight, because the many-unit-test-files case dominates.
      
      Reviewed By: cbi42
      
      Differential Revision: D42894004
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 9e827e52ace79b86f849b6a24466e318b4b605a7
    • Deprecate write_global_seqno and default to false (#11179) · 0cf1008f
      Committed by Peter Dillinger
      Summary:
      This option has long been intended to be set to false by default and deprecated. It might never be practical to completely remove the feature, so that we can continue to test for backward compatibility by keeping the ability to generate DBs in the old way.
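      A minimal ingestion sketch reflecting the new default (the helper and path are illustrative):
      ```
      // Sketch: ingest an externally built SST file; write_global_seqno now
      // defaults to false, so setting it is only needed for compatibility
      // with very old RocksDB versions.
      #include <cassert>
      #include <string>
      #include "rocksdb/db.h"

      void IngestFile(rocksdb::DB* db, const std::string& sst_path) {
        rocksdb::IngestExternalFileOptions opts;
        opts.write_global_seqno = false;  // explicit here; now the default
        assert(db->IngestExternalFile({sst_path}, opts).ok());
      }
      ```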
      
      Also improved API comments.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11179
      
      Test Plan: existing tests (with one tiny update)
      
      Reviewed By: hx235
      
      Differential Revision: D42973927
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e9bc161cb933266e094aea2dff8cc03753c39dab
  16. 03 Feb 2023, 1 commit
    • Return any errors returned by ReadAsync to the MultiGet caller (#11171) · 63da9cfa
      Committed by anand76
      Summary:
      Currently, we incorrectly return a Status::Corruption to the MultiGet caller if the file system ReadAsync cannot issue a read and returns an error for some reason, such as IOStatus::NotSupported(). In this PR, we copy the ReadAsync error to the request status so it can be returned to the user.
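      A sketch of where the propagated error now surfaces (the keys are illustrative):
      ```
      // Sketch: per-key statuses from MultiGet can now carry the underlying
      // ReadAsync error (e.g. NotSupported) instead of a misleading
      // Status::Corruption.
      #include <string>
      #include <vector>
      #include "rocksdb/db.h"

      void CheckMultiGet(rocksdb::DB* db) {
        std::vector<rocksdb::Slice> keys = {"k1", "k2"};
        std::vector<std::string> values;
        rocksdb::ReadOptions ro;
        ro.async_io = true;  // exercise the asynchronous read path
        std::vector<rocksdb::Status> statuses = db->MultiGet(ro, keys, &values);
        for (const rocksdb::Status& s : statuses) {
          if (!s.ok() && !s.IsNotFound()) {
            // Inspect s: it now reflects the file system's actual error.
          }
        }
      }
      ```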
      
      Tests:
      Update existing unit tests and add a new one for this scenario
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11171
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D42950057
      
      Pulled By: anand1976
      
      fbshipit-source-id: 85ffcb015fa6c064c311f8a28488fec78c487869
  17. 02 Feb 2023, 1 commit
  18. 01 Feb 2023, 2 commits
  19. 31 Jan 2023, 2 commits
    • Cleanup, improve, stress test LockWAL() (#11143) · 94e3beec
      Committed by Peter Dillinger
      Summary:
      The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change:
      
      * Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use (see the sketch after this list).
      * Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions)
      * Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability)
      * Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other.
      * Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads.
      * Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change.
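      A small sketch of the counting semantics described in the first bullet:
      ```
      // Sketch: nested LockWAL() calls are balanced by the same number of
      // UnlockWAL() calls; an unmatched UnlockWAL() returns non-OK.
      #include <cassert>
      #include "rocksdb/db.h"

      void QuiesceWal(rocksdb::DB* db) {
        assert(db->LockWAL().ok());    // outer lock
        assert(db->LockWAL().ok());    // recursive lock is safe now
        // ... observe a stable latest sequence number, copy the WAL, etc. ...
        assert(db->UnlockWAL().ok());
        assert(db->UnlockWAL().ok());  // WAL writes resume here
        assert(!db->UnlockWAL().ok()); // unmatched unlock is reliably non-OK
      }
      ```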
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143
      
      Test Plan: unit tests added / updated, added to stress/crash test
      
      Reviewed By: ajkr
      
      Differential Revision: D42848627
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4
    • Use user key on sst file for blob verification for Get and MultiGet (#11105) · 24ac53d8
      Committed by Yu Zhang
      Summary:
      Use the user key on sst file for blob verification for `Get` and `MultiGet` instead of the user key passed from caller.
      
      Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105
      
      Test Plan:
      ```
      make V=1 db_blob_basic_test
      ./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*"
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D42716487
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a
  20. 28 Jan 2023, 2 commits
    • Remove RocksDB LITE (#11147) · 4720ba43
      Committed by sdong
      Summary:
      We haven't been actively maintaining RocksDB LITE recently, and its size must have gone up significantly. We are removing the support.
      
      Most of the changes were done with the following command:
      
      ```
      unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
      ```
      
      by Peter Dillinger. Other changes were manually applied to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and the file db_stress_test_base.cc.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
      
      Test Plan: See CI
      
      Reviewed By: pdillinger
      
      Differential Revision: D42796341
      
      fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
    • Remove deprecated util functions in options_util.h (#11126) · 6943ff6e
      Committed by Yu Zhang
      Summary:
      Remove the util functions in options_util.h that have previously been marked deprecated.
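      For reference, a minimal sketch of the surviving `ConfigOptions`-based form of the API (the path is illustrative):
      ```
      // Sketch: load the most recent OPTIONS file with the non-deprecated
      // overload that takes a ConfigOptions.
      #include <cassert>
      #include <string>
      #include <vector>
      #include "rocksdb/utilities/options_util.h"

      void LoadOptions(const std::string& db_path) {
        rocksdb::ConfigOptions config;
        rocksdb::DBOptions db_options;
        std::vector<rocksdb::ColumnFamilyDescriptor> cf_descs;
        assert(rocksdb::LoadLatestOptions(config, db_path, &db_options,
                                          &cf_descs).ok());
      }
      ```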
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11126
      
      Test Plan: `make check`
      
      Reviewed By: ltamasi
      
      Differential Revision: D42757496
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 2a138a3c207d0e0e0bbb4d99548cf2cadb44bcfb
  21. 27 Jan 2023, 3 commits
  22. 26 Jan 2023, 2 commits
  23. 25 Jan 2023, 3 commits
    • Remove some deprecated/obsolete statistics from the API (#11123) · 99e55953
      Committed by Levi Tamasi
      Summary:
      These tickers/histograms have been obsolete (and not populated) for a long time.
      The patch removes them from the API completely. Note that this means that the
      numeric values of the remaining tickers change in the C++ code as they get shifted up.
      This should be OK: the values of some existing tickers have changed many times
      over the years as items have been added in the middle. (In contrast, the convention
      in the Java bindings is to keep the ids, which are not guaranteed to be the same
      as the ids on the C++ side, the same across releases.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11123
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D42727793
      
      Pulled By: ltamasi
      
      fbshipit-source-id: e058a155a20b05b45f53e67ee380aece1b43b6c5
    • Remove compressed block cache (#11117) · 2800aa06
      Committed by sdong
      Summary:
      Compressed block cache is replaced by compressed secondary cache. Remove the feature.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117
      
      Test Plan: See CI passes
      
      Reviewed By: pdillinger
      
      Differential Revision: D42700164
      
      fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d
    • Fix data race on `ColumnFamilyData::flush_reason` by letting FlushRequest/Job own flush_reason instead of CFD (#11111) · 86fa2592
      Committed by Hui Xiao
      
      Summary:
      **Context:**
      Concurrent flushes on the same CF can set `ColumnFamilyData::flush_reason` before each other's flush finishes. A symptom is that one CF ends up with a different flush_reason from the others even though all of them belong to the same atomic flush: `db_stress: db/db_impl/db_impl_compaction_flush.cc:423: rocksdb::Status rocksdb::DBImpl::AtomicFlushMemTablesToOutputFiles(const rocksdb::autovector<rocksdb::DBImpl::BGFlushArg>&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority): Assertion cfd->GetFlushReason() == cfds[0]->GetFlushReason() failed.`
      
      **Summary:**
      As suggested by ltamasi, we now refactor and let FlushRequest/Job own the flush_reason, as there is no good way to define `ColumnFamilyData::flush_reason` in the face of concurrent flushes on the same CF (which wasn't the case long ago, when `ColumnFamilyData::flush_reason` was first introduced).
      
      **Tests:**
      - new unit test
      - make check
      - aggressive crash test rehearsal
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11111
      
      Reviewed By: ajkr
      
      Differential Revision: D42644600
      
      Pulled By: hx235
      
      fbshipit-source-id: 8589c8184869d3415e5b780c887f877818a5ebaf
  24. 24 Jan 2023, 1 commit
  25. 21 Jan 2023, 1 commit
    • Add API to limit blast radius of merge operator failure (#11092) · b7fbcefd
      Committed by Andrew Kryczka
      Summary:
      Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage.
      
      To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands.
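      A hedged sketch of a merge operator opting into the narrower failure scope (the operand-validity check is invented for illustration):
      ```
      // Sketch: on a malformed operand, fail only operations that must merge;
      // flushes/compactions can then pass the unmerged operands through.
      #include "rocksdb/merge_operator.h"

      class SafeAppendOperator : public rocksdb::MergeOperator {
       public:
        const char* Name() const override { return "SafeAppendOperator"; }

        bool FullMergeV2(const MergeOperationInput& merge_in,
                         MergeOperationOutput* merge_out) const override {
          merge_out->new_value.clear();
          if (merge_in.existing_value != nullptr) {
            merge_out->new_value.assign(merge_in.existing_value->data(),
                                        merge_in.existing_value->size());
          }
          for (const rocksdb::Slice& operand : merge_in.operand_list) {
            if (operand.empty()) {  // treat as malformed (illustrative)
              merge_out->op_failure_scope =
                  rocksdb::MergeOperator::OpFailureScope::kMustMerge;
              return false;  // reads fail; compactions keep raw operands
            }
            merge_out->new_value.append(operand.data(), operand.size());
          }
          return true;
        }
      };
      ```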
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092
      
      Reviewed By: siying
      
      Differential Revision: D42525673
      
      Pulled By: ajkr
      
      fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee
  26. 20 Jan 2023, 1 commit
    • Upgrade xxhash.h to latest dev (#11098) · fd911f96
      Committed by Peter Dillinger
      Summary:
      Upgrading xxhash.h to the latest dev version as of 1/17/2023, which is d7197ddea81364a539051f116ca77926100fc77f. This should improve performance on some ARM machines.
      
      I allowed some of our RocksDB-specific changes to be made obsolete where it seemed appropriate, for example
      * xxhash.h has its own fallthrough marker (which I hope works for us)
      * As in https://github.com/Cyan4973/xxHash/pull/549
      
      Merging and resolving conflicts one way or the other was all that went into this diff, except that I had to mix the two sides around `defined(__loongarch64)`.
      
      How I did the upgrade (for future reference), so that I could use usual merge conflict resolution:
      ```
      # New branch to help with merging
      git checkout -b xxh_merge_base
      # Check out RocksDB revision before last xxhash.h upgrade
      git reset --hard 22161b75^
      # Create a commit with the raw base version from xxHash repo (from xxHash repo)
      git show 2c611a76f914828bed675f0f342d6c4199ffee1e:xxhash.h > ../rocksdb/util/xxhash.h
      # In RocksDB repo
      git commit -a
      # Merge in the last xxhash.h upgrade
      git merge 22161b75
      # Resolve conflict using committed version
      git show 22161b75:util/xxhash.h > util/xxhash.h
      git commit -a
      # Catch up to upstream
      git merge upstream/main
      
      # Create a different branch for applying raw upgrade
      git checkout -b xxh_upgrade_2023
      # Find the RocksDB commit we made for the raw base version from xxHash
      git log main..HEAD
      # Rewind to it
      git reset --hard 2428b727
      # Copy in latest raw version (from xxHash repo)
      cat xxhash.h > ../rocksdb/util/xxhash.h
      # Merge in RocksDB changes, use typical tools for conflict resolution
      git merge xxh_merge_base
      ```
      
      Branch https://github.com/facebook/rocksdb/tree/xxhash_merge_base can be used as a base for future xxhash merges.
      
      Fixes https://github.com/facebook/rocksdb/issues/11073
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11098
      
      Test Plan:
      existing tests (e.g. Bloom filter schema stability tests)
      
      Also seems to include a small performance boost on my Intel dev machine, using `./db_bench --benchmarks=xxh3[-X50] 2>&1 | egrep -o 'operations;.*' | sort`
      
      Fastest out of 50 runs, before: 15477.3 MB/s
      Fastest out of 50 runs, after: 15850.7 MB/s, and 11 more runs faster than the "before" number
      
      Slowest out of 50 runs, before: 12267.5 MB/s
      Slowest out of 50 runs, after: 13897.1 MB/s
      
      More repetitions show the distinction is repeatable
      
      Reviewed By: hx235
      
      Differential Revision: D42560010
      
      Pulled By: pdillinger
      
      fbshipit-source-id: c43ee52f1c5fe0ba3d6d6e4eebb22ded5f5492ea