1. 18 Mar, 2023 1 commit
    • A
      Ignore async_io ReadOption if FileSystem doesn't support it (#11296) · eac6b6d0
      anand76 committed
      Summary:
      In PosixFileSystem, IO uring support is opt-in. If the support is not enabled by the user, then ignore the async_io ReadOption in MultiGet and iteration at the top, rather than follow the async_io codepath and transparently switch to sync IO at the FileSystem layer.
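
The idea of resolving `async_io` once at the top of the read path can be sketched as follows. This is a minimal model, not RocksDB code: `FileSystemCaps`, `ReadOpts`, and `SanitizeReadOptions` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; these are not RocksDB types.
struct FileSystemCaps {
  bool async_io_supported = false;  // e.g. io_uring enabled by the user
};

struct ReadOpts {
  bool async_io = false;
};

// Resolve the async_io request once, at the top of the read path.
inline ReadOpts SanitizeReadOptions(ReadOpts opts, const FileSystemCaps& fs) {
  if (opts.async_io && !fs.async_io_supported) {
    opts.async_io = false;  // ignore the option instead of downgrading later
  }
  // Postcondition: async IO is only requested when the file system can serve it.
  assert(!opts.async_io || fs.async_io_supported);
  return opts;
}
```

With this shape, the lower layers never see an async request they would have to convert into sync IO.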
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11296
      
      Test Plan: Add new unit tests
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D44045776
      
      Pulled By: anand1976
      
      fbshipit-source-id: a0881bf763ca2fde50b84063d0068bb521edd8b9
  2. 17 Mar, 2023 1 commit
  3. 16 Mar, 2023 2 commits
    • P
      Simplify tracking entries already in SecondaryCache (#11299) · ccaa3225
      Peter Dillinger committed
      Summary:
      In preparation for factoring secondary cache support out of individual Cache implementations, we can get rid of the "in secondary cache" flag on entries through a workable hack: when an entry is promoted from secondary, it is inserted in primary using a helper that lacks secondary cache support, thus preventing re-insertion into secondary cache through existing logic.
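
A rough sketch of the helper trick described above, under simplifying assumptions: `ItemHelper` and its fields are hypothetical and far smaller than RocksDB's real `CacheItemHelper`. The point is only that each helper can reach an equivalent helper without secondary cache support.

```cpp
#include <cassert>

// Hypothetical, simplified helper; not the real CacheItemHelper.
struct ItemHelper {
  bool secondary_compatible;
  // Equivalent helper that lacks secondary cache support (may be itself).
  const ItemHelper* without_secondary_compat;
};

// A pair of helpers for one item type.
const ItemHelper kBasicHelper{false, &kBasicHelper};
const ItemHelper kFullHelper{true, &kBasicHelper};

// Helper to use when inserting into the primary cache an entry that was
// just promoted from the secondary cache.
inline const ItemHelper* HelperForPromotion(const ItemHelper* h) {
  assert(h->without_secondary_compat != nullptr);
  return h->without_secondary_compat;
}

// Eviction-time check: only entries whose helper supports the secondary
// cache are candidates for insertion into it.
inline bool ShouldDemoteToSecondary(const ItemHelper* h) {
  return h->secondary_compatible;
}
```

Because a promoted entry carries the basic helper, the ordinary eviction logic skips it without needing a per-entry flag.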
      
      This adds to the complexity of building CacheItemHelpers, because you always have to be able to get to an equivalent helper without secondary cache support, but that complexity is reasonably isolated within RocksDB typed_cache.h and test code.
      
      gcc-7 seems to have problems with constexpr constructor referencing `this` so removed constexpr support on CacheItemHelper.
      
      Also refactored some related test code to share common code / functionality.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11299
      
      Test Plan: existing tests
      
      Reviewed By: anand1976
      
      Differential Revision: D44101453
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7a59d0a3938ee40159c90c3e65d7004f6a272345
    • P
      Misc cleanup of block cache code (#11291) · 601efe3c
      Peter Dillinger committed
      Summary:
      ... ahead of a larger change.
      * Rename confusingly named `is_in_sec_cache` to `kept_in_sec_cache`
      * Unify naming of "standalone" block cache entries (was "detached" in clock_cache)
      * Remove some unused definitions in clock_cache.h (leftover from a previous revision)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11291
      
      Test Plan: usual tests and CI, no behavior changes
      
      Reviewed By: anand1976
      
      Differential Revision: D43984642
      
      Pulled By: pdillinger
      
      fbshipit-source-id: b8bf0c5b90a932a88bcbdb413b2f256834aedf97
  4. 15 Mar, 2023 1 commit
    • H
      Fix bug of prematurely excluded CF in atomic flush contains unflushed data... · 11cb6af6
      Hui Xiao committed
      Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
      
      Summary:
      **Context:**
      Atomic flush should guarantee recoverability of all data with seqno up to the max seqno of the flush. It achieves this through `SelectColumnFamiliesForAtomicFlush()`, which ensures all such data are flushed by the time the atomic flush finishes. However, our crash test exposed the following case: a CF excluded from an atomic flush contains unflushed data with seqno less than the max seqno of that atomic flush, and loses that data under `WriteOptions::DisableWAL=true` if a crash happens right after the atomic flush finishes.
      ```
      ./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 
--max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
          pid=$!
          sleep 0.2
          sleep 10
          kill $pid
          sleep 0.2
      ./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 
--max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
          pid=$!
          sleep 0.2
          sleep 40
          kill $pid
          sleep 0.2
      
      Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
      Crash-recovery verification failed :(
      No writes or ops?
      Verification failed :(
      ```
      
      The bug is due to the following:
      - When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
      - But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
      - However, data@seqno N in the already-excluded CF is thus left out of this atomic flush even though seqno N is less than seqno M.
      
      **Summary:**
      - Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so that no write affecting the recoverability of this atomic job (i.e., a change to the max seqno of this atomic flush, or insertion of data with a seqno lower than that max seqno into an excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
      - For the above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of the CFDs passed in to atomic flush.
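
The ordering problem can be illustrated with a small model. This is a sketch under stated assumptions, not the DBImpl code: `ColumnFamily`, `SelectForAtomicFlush`, and `LosesData` are hypothetical, and each memtable is reduced to a list of seqnos.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: a column family is just its unflushed seqnos.
struct ColumnFamily {
  std::vector<uint64_t> unflushed_seqnos;
};

// Pick every column family that has unflushed data; empty ones are skipped.
inline std::vector<size_t> SelectForAtomicFlush(
    const std::vector<ColumnFamily>& cfs) {
  std::vector<size_t> picked;
  for (size_t i = 0; i < cfs.size(); ++i) {
    if (!cfs[i].unflushed_seqnos.empty()) picked.push_back(i);
  }
  return picked;
}

// True if a CF outside `picked` holds data with a seqno below the max seqno
// among the picked CFs, i.e. data the atomic flush should have covered.
inline bool LosesData(const std::vector<ColumnFamily>& cfs,
                      const std::vector<size_t>& picked) {
  uint64_t max_seqno = 0;
  std::vector<bool> is_picked(cfs.size(), false);
  for (size_t i : picked) {
    assert(i < cfs.size());
    is_picked[i] = true;
    for (uint64_t s : cfs[i].unflushed_seqnos) {
      if (s > max_seqno) max_seqno = s;
    }
  }
  for (size_t i = 0; i < cfs.size(); ++i) {
    if (is_picked[i]) continue;
    for (uint64_t s : cfs[i].unflushed_seqnos) {
      if (s < max_seqno) return true;
    }
  }
  return false;
}
```

Selecting before the wait, then letting writes with seqno N (excluded CF) and M > N (included CF) arrive, makes `LosesData` return true. Selecting after the wait picks both CFs and it returns false.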
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
      
      Test Plan:
      - New unit test failed before the fix and passes after
      - Make check
      - Rehearsal stress test
      
      Reviewed By: ajkr
      
      Differential Revision: D42799871
      
      Pulled By: hx235
      
      fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
  5. 14 Mar, 2023 3 commits
    • L
      Rename a recently added PerfContext counter (#11294) · 49881921
      Levi Tamasi committed
      Summary:
      The patch renames the counter added in https://github.com/facebook/rocksdb/issues/11284 for better consistency with the existing naming scheme.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11294
      
      Test Plan: `make check`
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D44035964
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 8b1a2a03ee728148365367e0ecc1fcf462f62191
    • P
      Document DB::Resume(), fix LockWALInEffect test (#11290) · 648e972f
      Peter Dillinger committed
      Summary:
      In rare cases seeing failures like this
      
      ```
      [ RUN      ] DBWriteTestInstance/DBWriteTest.LockWALInEffect/2
      db/db_write_test.cc:653: Failure
      Put("key3", "value")
      Corruption: Not active
      ```
      
      in a test with no explicit threading. This is likely because of the unpredictability of background auto-resume. I didn't really know this feature, in part because DB::Resume() was undocumented. So I believe I have fixed the test and documented the API function.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11290
      
      Test Plan: 1000s of stress runs of the test with gtest-parallel
      
      Reviewed By: anand1976
      
      Differential Revision: D43984583
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d30dec120b4864e193751b2e33ff16834d313db3
    • C
      Support range deletion tombstones in `CreateColumnFamilyWithImport` (#11252) · 9aa3b6f9
      Changyu Bi committed
      Summary:
      CreateColumnFamilyWithImport() did not support range tombstones for two reasons:
      1. it uses point keys of an input file to determine its boundary (smallest and largest internal key), which means range tombstones outside of the point key range will be effectively dropped.
      2. it does not handle files with no point keys.
      
      Also included a fix in external_sst_file_ingestion_job.cc where the blocks read in `GetIngestedFileInfo()` can be added to block cache now (issue fixed in https://github.com/facebook/rocksdb/pull/6429).
      
      This PR adds support for exporting and importing column family with range tombstones. The main change is to add smallest internal key and largest internal key to `SstFileMetaData` that will be part of the output of `ExportColumnFamily()`. Then during `CreateColumnFamilyWithImport(...,const ExportImportFilesMetaData& metadata,...)`, file boundaries can be set from `metadata` directly. This is needed since when file boundaries are extended by range tombstones, sometimes they cannot be deduced from a file's content alone.
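
To see why point keys alone give the wrong boundary, here is a simplified sketch. It uses plain string user keys rather than internal keys, and `SstContents`, `Bounds`, `BoundsFromPointKeys`, and `BoundsWithTombstones` are hypothetical names, not RocksDB APIs. It only illustrates the first limitation listed above; the PR itself carries the boundaries in the exported metadata.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified view of a file's contents.
struct SstContents {
  std::vector<std::string> point_keys;
  // Each tombstone covers [start, end).
  std::vector<std::pair<std::string, std::string>> range_tombstones;
};

struct Bounds {
  bool valid = false;
  std::string smallest;
  std::string largest;
};

inline void Extend(Bounds& b, const std::string& key) {
  if (!b.valid) {
    b.smallest = key;
    b.largest = key;
    b.valid = true;
    return;
  }
  b.smallest = std::min(b.smallest, key);
  b.largest = std::max(b.largest, key);
}

// Boundary derived from point keys only: misses tombstones entirely.
inline Bounds BoundsFromPointKeys(const SstContents& f) {
  Bounds b;
  for (const auto& k : f.point_keys) Extend(b, k);
  return b;
}

// Boundary that also accounts for range tombstone start and end keys.
inline Bounds BoundsWithTombstones(const SstContents& f) {
  Bounds b = BoundsFromPointKeys(f);
  for (const auto& t : f.range_tombstones) {
    assert(t.first < t.second);
    Extend(b, t.first);
    Extend(b, t.second);
  }
  return b;
}
```

A file with no point keys yields an invalid boundary from `BoundsFromPointKeys`, which mirrors the second limitation.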
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11252
      
      Test Plan:
      - added unit tests that fail before this change
      
      Closes https://github.com/facebook/rocksdb/issues/11245
      
      Reviewed By: ajkr
      
      Differential Revision: D43577443
      
      Pulled By: cbi42
      
      fbshipit-source-id: 6bff78e583cc50c44854994dea0a8dd519398f2f
  6. 10 Mar, 2023 1 commit
    • J
      Fix compile errors in Clang due to unused variables depending on the build configuration (#11234) · 969d4e1d
      Jaepil Jeong committed
      Summary:
      This PR fixes compilation errors in Clang due to unused variables like the below:
      ```
      [109/329] Building CXX object CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o
      FAILED: CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o
      ccache /opt/homebrew/opt/llvm/bin/clang++ -DGFLAGS=1 -DGFLAGS_IS_A_DLL=0 -DHAVE_FULLFSYNC -DJEMALLOC_NO_DEMANGLE -DLZ4 -DOS_MACOSX -DROCKSDB_JEMALLOC -DROCKSDB_LIB_IO_POSIX -DROCKSDB_NO_DYNAMIC_EXTENSION -DROCKSDB_PLATFORM_POSIX -DSNAPPY -DTBB -DZLIB -DZSTD -I/Users/jaepil/work/deepsearch/deps/cpp/rocksdb -I/Users/jaepil/work/deepsearch/deps/cpp/rocksdb/include -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk -I/Users/jaepil/app/include -I/opt/homebrew/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include/c++/v1 -W -Wextra -Wall -pthread -Wsign-compare -Wshadow -Wno-unused-parameter -Wno-unused-variable -Woverloaded-virtual -Wnon-virtual-dtor -Wno-missing-field-initializers -Wno-strict-aliasing -Wno-invalid-offsetof -fno-omit-frame-pointer -momit-leaf-frame-pointer -march=armv8-a+crc+crypto -Wno-unused-function -Werror -O2 -g -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.1.sdk -std=gnu++20 -MD -MT CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o -MF CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o.d -o CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o -c /Users/jaepil/work/deepsearch/deps/cpp/rocksdb/db/version_edit_handler.cc
      /Users/jaepil/work/deepsearch/deps/cpp/rocksdb/db/version_edit_handler.cc:30:10: error: variable 'recovered_edits' set but not used [-Werror,-Wunused-but-set-variable]
        size_t recovered_edits = 0;
               ^
      1 error generated.
      ```
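
One common way this class of error arises is a variable that is only read inside `assert()`, which compiles to nothing under `-DNDEBUG`. The sketch below shows one possible remedy and is not necessarily how this PR fixed it; `CountApplied` is a hypothetical function, and only the variable name is borrowed from the error message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical function for illustration only.
inline size_t CountApplied(const std::vector<int>& edits) {
  // Only read inside assert(), which is compiled out under -DNDEBUG.
  // [[maybe_unused]] (C++17) keeps -Wunused-but-set-variable quiet in that
  // configuration; a (void) cast is an alternative for older standards.
  [[maybe_unused]] size_t recovered_edits = 0;
  size_t applied = 0;
  for (int e : edits) {
    ++recovered_edits;
    if (e >= 0) ++applied;
  }
  assert(recovered_edits == edits.size());
  return applied;
}
```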
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11234
      
      Reviewed By: cbi42
      
      Differential Revision: D43458604
      
      Pulled By: ajkr
      
      fbshipit-source-id: d8c50e1a108887b037a120cd9f19374ddaeee817
  7. 09 Mar, 2023 1 commit
    • L
      Add a PerfContext counter for merge operands applied in point lookups (#11284) · 1d524385
      Levi Tamasi committed
      Summary:
      The existing PerfContext counter `internal_merge_count` only tracks the
      Merge operands applied during range scans. The patch adds a new counter
      called `internal_merge_count_point_lookups` to track the same metric
      for point lookups (`Get` / `MultiGet` / `GetEntity` / `MultiGetEntity`), and
      also fixes a couple of cases in the iterator where the existing counter wasn't
      updated.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11284
      
      Test Plan: `make check`
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D43926082
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 321566d8b4cf0a3b6c9b73b7a5c984fb9bb492e9
  8. 07 Mar, 2023 1 commit
  9. 06 Mar, 2023 1 commit
  10. 04 Mar, 2023 2 commits
    • I
      Avoid ColumnFamilyDescriptor copy (#10978) · ddde1e6a
      Igor Canadi committed
      Summary:
      Hi. :) Noticed we are copying ColumnFamilyDescriptor here because my process crashed during copy constructor (cause unrelated)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10978
      
      Reviewed By: cbi42
      
      Differential Revision: D41473924
      
      Pulled By: ajkr
      
      fbshipit-source-id: 58a3473f2d7b24918f79d4b2726c20081c5e95b4
    • C
      Improve documentation for MergingIterator (#11161) · d053926f
      Changyu Bi committed
      Summary:
      Add some comments to try to explain how/why MergingIterator works. Made some small refactoring, mostly in MergingIterator::SkipNextDeleted() and MergingIterator::SeekImpl().
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11161
      
      Test Plan:
      crash test with small key range:
      ```
      python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=6000 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D42860994
      
      Pulled By: cbi42
      
      fbshipit-source-id: 3f0c1c9c6481a7f468bf79d823998907a8116e9e
  11. 02 Mar, 2023 1 commit
    • Y
      Fix backward iteration issue when user defined timestamp is enabled in BlobDB (#11258) · 8dfcfd4e
      Yu Zhang committed
      Summary:
      During backward iteration, blob verification would fail because the user key (ts included) in `saved_key_` doesn't match the blob. This happens because during `FindValueForCurrentKey`, `saved_key_` is not updated when the user key (ts not included) is the same, except when `timestamp_lb_` is specified. This breaks the blob verification logic when user-defined timestamp is enabled and `timestamp_lb_` is not specified. Fix this by always updating `saved_key_` when a smaller user key (ts included) is seen.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11258
      
      Test Plan:
      `make check`
      `./db_blob_basic_test --gtest_filter=DBBlobWithTimestampTest.IterateBlobs`
      
      Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced with:
      
      `./db_bench -user_timestamp_size=8  -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5`
      
      Baseline:
      
      - seekrandom [AVG    6 runs] : 72188 (± 1481) ops/sec;   37.2 (± 0.8) MB/sec
      
      With this PR:
      
      - seekrandom [AVG    6 runs] : 74171 (± 1427) ops/sec;   38.2 (± 0.7) MB/sec
      
      Reviewed By: ltamasi
      
      Differential Revision: D43675642
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 8022ae8522d1f66548821855e6eed63640c14e04
  12. 25 Feb, 2023 1 commit
  13. 23 Feb, 2023 2 commits
    • Y
      Support iter_start_ts in integrated BlobDB (#11244) · f007b8fd
      Yu Zhang committed
      Summary:
      Fixed an issue during backward iteration when `iter_start_ts` is set in an integrated BlobDB.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11244
      
      Test Plan:
      ```
      make check
      ./db_blob_basic_test --gtest_filter="DBBlobWithTimestampTest.IterateBlobs"
      tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887 --enable_blob_files=1
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D43506726
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 2cdc19ebf8da909d8d43d621353905784949a9f0
    • C
      Refactor AddRangeDels() + consider range tombstone during compaction file cutting (#11113) · 229297d1
      Changyu Bi committed
      Summary:
      A second attempt after https://github.com/facebook/rocksdb/issues/10802, with bug fixes and refactoring. This PR updates compaction logic to take range tombstones into account when determining whether to cut the current compaction output file (https://github.com/facebook/rocksdb/issues/4811). Before this change, only point keys were considered, and range tombstones could cause large compactions. For example, if the current compaction output is a range tombstone [a, b) and 2 point keys y, z, they would be added to the same file, which may overlap with too many files in the next level and cause a large compaction in the future. This PR also includes ajkr's effort to simplify the logic to add range tombstones to compaction output files in `AddRangeDels()` ([https://github.com/facebook/rocksdb/issues/11078](https://github.com/facebook/rocksdb/pull/11078#issuecomment-1386078861)).
      
      The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new class `CompactionMergingIterator` is introduced to replace `MergingIterator` under `CompactionIterator` to enable emitting of range tombstone start keys. Further improvements after this PR include cutting compaction output at some grandparent boundary key (instead of the next output key) when cutting within a range tombstone to reduce overlap with grandparents.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11113
      
      Test Plan:
      * added unit test in db_range_del_test
      * crash test with a small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10`
      
      Reviewed By: ajkr
      
      Differential Revision: D42655709
      
      Pulled By: cbi42
      
      fbshipit-source-id: 8367e36ef5640e8f21c14a3855d4a8d6e360a34c
  14. 22 Feb, 2023 6 commits
    • Y
      fix -Wrange-loop-analysis in Apple clang version 12.0.0 (clang-1200.0.32.29) (#11240) · 9fa9becf
      ywave committed
      Summary:
      Fix the following complaint:
      ```
      db/db_impl/db_impl_compaction_flush.cc:417:19: error: loop variable 'bg_flush_arg' of type 'const rocksdb::DBImpl::BGFlushArg' creates a copy from type
            'const rocksdb::DBImpl::BGFlushArg' [-Werror,-Wrange-loop-analysis]
        for (const auto bg_flush_arg : bg_flush_args) {
                        ^
      db/db_impl/db_impl_compaction_flush.cc:417:8: note: use reference type 'const rocksdb::DBImpl::BGFlushArg &' to prevent copying
        for (const auto bg_flush_arg : bg_flush_args) {
             ^~~~~~~~~~~~~~~~~~~~~~~~~
                        &
      db/db_impl/db_impl_compaction_flush.cc:2911:21: error: loop variable 'bg_flush_arg' of type 'const rocksdb::DBImpl::BGFlushArg' creates a copy from type
            'const rocksdb::DBImpl::BGFlushArg' [-Werror,-Wrange-loop-analysis]
          for (const auto bg_flush_arg : bg_flush_args) {
                          ^
      db/db_impl/db_impl_compaction_flush.cc:2911:10: note: use reference type 'const rocksdb::DBImpl::BGFlushArg &' to prevent copying
          for (const auto bg_flush_arg : bg_flush_args) {
               ^~~~~~~~~~~~~~~~~~~~~~~~~
                          &
      ```
      from
      
      ```sh
      xxx@MacBook-Pro / % g++ -v
      Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
      Apple clang version 12.0.0 (clang-1200.0.32.29)
      Target: x86_64-apple-darwin21.6.0
      Thread model: posix
      ```
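
The diagnostic asks for the loop variable to bind by reference. A minimal sketch of that pattern follows; `FlushArg`, `SumIds`, and `CheckSumIds` are hypothetical stand-ins, not the RocksDB `BGFlushArg` code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a non-trivially-copyable loop element.
struct FlushArg {
  std::string name;
  int id;
};

inline int SumIds(const std::vector<FlushArg>& args) {
  int total = 0;
  // `const auto arg : args` would copy each element, including its
  // std::string. Binding by const reference avoids the copy and the
  // -Wrange-loop-analysis diagnostic.
  for (const auto& arg : args) {
    total += arg.id;
  }
  return total;
}

// Usage: returns normally when the sum matches.
inline void CheckSumIds() {
  const std::vector<FlushArg> args = {{"default", 1}, {"cf1", 2}};
  assert(SumIds(args) == 3);
}
```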
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11240
      
      Reviewed By: cbi42
      
      Differential Revision: D43458729
      
      Pulled By: ajkr
      
      fbshipit-source-id: 26e110f83451509463a1bc308f737ccb693c9f45
    • C
      Fix an assertion failure in DBIter::SeekToLast() when user-defined timestamp is enabled (#11223) · 1b48ecc2
      Changyu Bi committed
      Summary:
      In DBIter::SeekToLast(), key() can be called when the iterator is invalid, which fails the following assertion:
      ```
      ./db/db_iter.h:153: virtual rocksdb::Slice rocksdb::DBIter::key() const: Assertion `valid_' failed.
      ```
      This happens when `iterate_upper_bound` and timestamp_lb_ are set. SeekForPrev(*iterate_upper_bound_) positions the iterator on the same user key as *iterate_upper_bound_. A subsequent PrevInternal() call makes the iterator invalid just before the call to key().
      
      This PR fixes this issue by updating the seek key to have max sequence number AND max timestamp when the seek key has the same user key as *iterate_upper_bound_.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11223
      
      Test Plan: - Added a unit test that would fail the above assertion before this fix.
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D43283600
      
      Pulled By: cbi42
      
      fbshipit-source-id: 0dd3999845b722584679bbc95be2664b266005ba
    • L
      DBIter::FindNextUserEntryInternal: do not PrepareValue for `Delete` (#11211) · ea85148b
      leipeng committed
      Summary:
      `kTypeDeletion/kTypeDeletionWithTimestamp/kTypeSingleDeletion` do not need to access the iterator value, so omit `PrepareValue`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11211
      
      Reviewed By: ajkr
      
      Differential Revision: D43253068
      
      Pulled By: cbi42
      
      fbshipit-source-id: 1945c7f8a90b6909128a0553b62d9fd1078b0a08
    • H
      add c api to set option fail_if_not_bottommost_level (#11158) · 83bc03a9
      HuangYi committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11158
      
      Reviewed By: cbi42
      
      Differential Revision: D42870647
      
      Pulled By: ajkr
      
      fbshipit-source-id: 1b71a1dd415c34c332cecf60c68ce37fe4393e2a
    • H
      add c api for HyperClockCache (#11110) · cfe50f7e
      HuangYi committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11110
      
      Reviewed By: cbi42
      
      Differential Revision: D42660941
      
      Pulled By: ajkr
      
      fbshipit-source-id: e977d9b76dfd5d8c62335f961c275f3b810503d7
    • M
      C-API: Support multi-CF flush (#11112) · 142b18d0
      Matt Jurik committed
      Summary:
      This PR adds support to the c-api bindings for calling `Flush()` with multiple column families, which is useful for performing atomic flushes (assuming also that the db has been opened with `atomic_flush = true`).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11112
      
      Reviewed By: cbi42
      
      Differential Revision: D42666382
      
      Pulled By: ajkr
      
      fbshipit-source-id: 82f05bf32d28452d85c79ea42411c8fea961fd87
  15. 18 Feb, 2023 2 commits
    • M
      Remove FactoryFunc from LoadXXXObject (#11203) · b6640c31
      mrambacher committed
      Summary:
      The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available.  With the removal of LITE mode, the function was no longer required.
      
      Note that the MergeOperator had some private classes defined in header files.  To gain access to their constructors (and name methods), the class definitions were moved into header files.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203
      
      Reviewed By: cbi42
      
      Differential Revision: D43160255
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c
      b6640c31
    • A
      Merge operator failed subcode (#11231) · 25e13652
      Andrew Kryczka committed
      Summary:
      From HISTORY.md: Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions.
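      
      A minimal sketch of how a caller might tell the two kinds of corruption apart. `SketchStatus`, `Action`, and `Classify` are hypothetical stand-ins written for this note so the example compiles without RocksDB; with the real library the analogous check would presumably compare the status subcode against `Status::SubCode::kMergeOperatorFailed`. The policy chosen for each case is an assumption, not something this change prescribes.
      
      ```cpp
      #include <cassert>
      
      // Hypothetical stand-in for a status object; not the RocksDB class.
      struct SketchStatus {
        enum Code { kOk, kCorruption };
        enum SubCode { kNone, kMergeOperatorFailed };
        Code code;
        SubCode subcode;
      };
      
      // Hypothetical caller-side policy for this sketch.
      enum class Action { kProceed, kSkipKey, kStopAndAlert };
      
      Action Classify(const SketchStatus& s) {
        if (s.code != SketchStatus::kCorruption) {
          return Action::kProceed;
        }
        // Corruption reported by the merge operator is an application-level
        // failure, so this sketch only skips the affected key.
        if (s.subcode == SketchStatus::kMergeOperatorFailed) {
          return Action::kSkipKey;
        }
        // Any other corruption is treated as a real data corruption.
        return Action::kStopAndAlert;
      }
      ```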
      
      This is a followup to https://github.com/facebook/rocksdb/issues/11092, where we gave users the ability to keep running a DB despite merge operator failing. Now that the DB keeps running despite such failures, they want to be able to distinguish such failures from real corruptions.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11231
      
      Test Plan: updated unit test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D43396607
      
      Pulled By: ajkr
      
      fbshipit-source-id: 17fbcc779ad724dafada8abd73efd38e1c5208b9
      25e13652
  16. 16 Feb, 2023 1 commit
    • L
      Add a new MultiGetEntity API (#11222) · 9794acb5
      Levi Tamasi committed
      Summary:
      The new `MultiGetEntity` API can be used to get a consistent view of
      a batch of keys, with the results presented as wide-column entities.
      Similarly to `GetEntity` and the iterator's `columns` API, if the entry
      corresponding to the key is a wide-column entity to start with, it is
      returned as-is, and if it is a plain key-value, it is wrapped into an entity
      with a single default column.
      
      Implementation-wise, the new API shares the logic of the batched `MultiGet`
      API (via the `MultiGetCommon` methods). Both single-CF and multi-CF
      `MultiGetEntity` APIs are provided, and blobs are also supported.
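      
      The wrapping rule described above can be sketched as follows. All names here (`Column`, `Entity`, `StoredEntry`, `ToEntity`, `ToEntities`, `kDefaultColumnName`) are hypothetical and exist only for illustration; the empty string used as the default column name is an assumption of this sketch, and the code does not reflect the actual `MultiGetCommon` implementation.
      
      ```cpp
      #include <cassert>
      #include <string>
      #include <utility>
      #include <vector>
      
      // Hypothetical stand-ins; not RocksDB types.
      using Column = std::pair<std::string, std::string>;  // (name, value)
      using Entity = std::vector<Column>;
      
      struct StoredEntry {
        bool is_entity;           // true if stored as a wide-column entity
        std::string plain_value;  // used when is_entity is false
        Entity columns;           // used when is_entity is true
      };
      
      // Assumed name of the default column in this sketch.
      const std::string kDefaultColumnName = "";
      
      // An entity is returned as-is; a plain value is wrapped into an entity
      // with a single default column.
      Entity ToEntity(const StoredEntry& entry) {
        if (entry.is_entity) {
          return entry.columns;
        }
        Entity wrapped;
        wrapped.push_back(Column(kDefaultColumnName, entry.plain_value));
        return wrapped;
      }
      
      // Batched variant: one result per input entry, in the same order.
      std::vector<Entity> ToEntities(const std::vector<StoredEntry>& entries) {
        std::vector<Entity> results;
        results.reserve(entries.size());
        for (const StoredEntry& entry : entries) {
          results.push_back(ToEntity(entry));
        }
        return results;
      }
      ```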
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11222
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D43256950
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 47fb2cb7e2d0470e3580f43fdb2fe9e51f0e7005
      9794acb5
  17. 10 Feb, 2023 2 commits
    • P
      Put Cache and CacheWrapper in new public header (#11192) · 3cacd4b4
      Peter Dillinger committed
      Summary:
      The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment.
      
      In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544
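      
      A small single-file sketch of why a forward declaration is enough for holders of `shared_ptr<Cache>`. The `Cache` class below is a toy stand-in, and `TableOptionsSketch`, `SharesCache`, and `NewSketchCache` are hypothetical names; only the code after the marked divider needs the full class definition.
      
      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <memory>
      
      class Cache;  // forward declaration only; the type is incomplete here
      
      // Holding, copying, and comparing shared_ptr<Cache> compiles with the
      // incomplete type, because the deleter is captured where the pointer
      // is first created.
      struct TableOptionsSketch {
        std::shared_ptr<Cache> block_cache;
      };
      
      bool SharesCache(const TableOptionsSketch& a, const TableOptionsSketch& b) {
        return a.block_cache == b.block_cache;
      }
      
      // Factory declared here; it is defined where Cache is complete.
      std::shared_ptr<Cache> NewSketchCache(std::size_t capacity);
      
      // ---- Everything below would live behind the "advanced" header ----
      
      class Cache {
       public:
        explicit Cache(std::size_t capacity) : capacity_(capacity) {}
        std::size_t capacity() const { return capacity_; }
      
       private:
        std::size_t capacity_;
      };
      
      std::shared_ptr<Cache> NewSketchCache(std::size_t capacity) {
        return std::make_shared<Cache>(capacity);
      }
      ```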
      
      Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it.
      
      Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192
      
      Test Plan: `make check`
      
      Reviewed By: anand1976
      
      Differential Revision: D43055487
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110
      3cacd4b4
    • P
      Attempt fix flaky DBWriteTest.LockWALInEffect (#11209) · b7747bbc
      Peter Dillinger committed
      Summary:
      Example failure:
      ```
      [ RUN      ] DBWriteTestInstance/DBWriteTest.LockWALInEffect/1
      db/db_write_test.cc:646: Failure
      Put("key3", "value")
      Corruption: Not active
      ```
      Presumably from a background compaction prior to Put.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11209
      
      Test Plan: watch CI
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D43147727
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a1c34ac5ab124bfe2f23205a30777990056e9082
      b7747bbc
  18. 09 Feb, 2023 1 commit
    • A
      Fix bug in WAL streaming uncompression (#11198) · 77b61abc
      anand76 committed
      Summary:
      Fix a bug in the calculation of the input buffer address/offset in log_reader.cc. The bug occurs when consecutive fragments of a compressed record are located at the same offset in the log reader buffer: the second fragment's input buffer is treated as a leftover from the previous input buffer, so the offset in the `ZSTD_inBuffer` is not reset.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11198
      
      Test Plan: Add a unit test in log_test.cc that fails without the fix and passes with it.
      
      Reviewed By: ajkr, cbi42
      
      Differential Revision: D43102692
      
      Pulled By: anand1976
      
      fbshipit-source-id: aa2648f4802c33991b76a3233c5a58d4cc9e77fd
      77b61abc
  19. 08 Feb, 2023 1 commit
    • L
      Add compaction filter support for wide-column entities (#11196) · 876d2815
      Levi Tamasi committed
      Summary:
      The patch adds compaction filter support for wide-column entities by introducing
      a new `CompactionFilter` API called `FilterV3`. This API is called for regular
      key-values, merge operands, and wide-column entities as well. It is passed the
      existing value/operand or wide-column structure and it can update the value or
      columns or keep/delete/etc. the key-value as usual. For compatibility, the default
      implementation of `FilterV3` keeps all wide-column entities and falls back to calling
      `FilterV2` for plain old key-values and merge operands.
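      
      The compatibility fallback can be sketched with a deliberately simplified interface. The class and enum names below (`SketchCompactionFilter`, `DropTmpKeys`, `ValueType`, `Decision`) and the reduced parameter lists are assumptions made for this illustration; the real `CompactionFilter` methods take more arguments and support more decisions.
      
      ```cpp
      #include <cassert>
      #include <string>
      #include <utility>
      #include <vector>
      
      // Simplified, hypothetical types for this sketch only.
      enum class ValueType { kValue, kMergeOperand, kWideColumnEntity };
      enum class Decision { kKeep, kRemove };
      using Columns = std::vector<std::pair<std::string, std::string>>;
      
      class SketchCompactionFilter {
       public:
        virtual ~SketchCompactionFilter() = default;
      
        // Older-style hook: only sees plain values and merge operands.
        virtual Decision FilterV2(ValueType /*type*/, const std::string& /*key*/,
                                  const std::string& /*value*/) const {
          return Decision::kKeep;
        }
      
        // Newer hook: the default keeps every wide-column entity and forwards
        // everything else to FilterV2, so existing filters keep working.
        virtual Decision FilterV3(ValueType type, const std::string& key,
                                  const std::string* value,
                                  const Columns* /*columns*/) const {
          if (type == ValueType::kWideColumnEntity) {
            return Decision::kKeep;
          }
          assert(value != nullptr);
          return FilterV2(type, key, *value);
        }
      };
      
      // A legacy filter that only overrides FilterV2.
      class DropTmpKeys : public SketchCompactionFilter {
       public:
        Decision FilterV2(ValueType /*type*/, const std::string& key,
                          const std::string& /*value*/) const override {
          return key.compare(0, 4, "tmp_") == 0 ? Decision::kRemove
                                                : Decision::kKeep;
        }
      };
      ```
      
      With this default, a legacy filter such as `DropTmpKeys` still removes plain `tmp_` keys when invoked through `FilterV3`, while entities under the same prefix are kept until the filter opts in by overriding `FilterV3`.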
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11196
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D43094147
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 75acabe9a35254f7f404ba6173ee9c2774382ebd
      876d2815
  20. 07 Feb, 2023 1 commit
  21. 04 Feb, 2023 2 commits
    • P
      Deprecate write_global_seqno and default to false (#11179) · 0cf1008f
      Peter Dillinger committed
      Summary:
      This option has long been intended to be set to false by default and deprecated. It might never be practical to completely remove the feature, because keeping the ability to generate DBs in the old way lets us continue to test for backward compatibility.
      
      Also improved API comments.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11179
      
      Test Plan: existing tests (with one tiny update)
      
      Reviewed By: hx235
      
      Differential Revision: D42973927
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e9bc161cb933266e094aea2dff8cc03753c39dab
      0cf1008f
    • P
      Ensure LockWAL() stall cleared for UnlockWAL() return (#11172) · 390cc0b1
      Peter Dillinger committed
      Summary:
      Fixes https://github.com/facebook/rocksdb/issues/11160
      
      By counting the number of stalls placed on a write queue, we can check in UnlockWAL() whether the stall present at the start of UnlockWAL() has been cleared by the end, or wait until it's cleared.
      
      More details in code comments and new unit test.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11172
      
      Test Plan: unit test added. Yes, it uses sleep to amplify failure on buggy behavior if present, but using a sync point to only allow new behavior would fail with the old code only because it doesn't contain the new sync point. Basically, using a sync point in UnlockWAL() could easily mask a regression by artificially limiting key behaviors. The test would only check that UnlockWAL() invokes code that *should* do the right thing, without checking that it *does* the right thing.
      
      Reviewed By: ajkr
      
      Differential Revision: D42894341
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 15c9da0ca383e6aec845b29f5447d76cecbf46c3
      390cc0b1
  22. 03 Feb, 2023 1 commit
    • A
      Return any errors returned by ReadAsync to the MultiGet caller (#11171) · 63da9cfa
      anand76 committed
      Summary:
      Currently, we incorrectly return a Status::Corruption to the MultiGet caller if the file system ReadAsync cannot issue a read and returns an error for some reason, such as IOStatus::NotSupported(). In this PR, we copy the ReadAsync error to the request status so it can be returned to the user.
      
      Tests:
      Update existing unit tests and add a new one for this scenario
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11171
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D42950057
      
      Pulled By: anand1976
      
      fbshipit-source-id: 85ffcb015fa6c064c311f8a28488fec78c487869
      63da9cfa
  23. 02 Feb, 2023 1 commit
    • L
      Clean up InvokeFilterIfNeeded a bit (#11174) · df680b24
      Levi Tamasi committed
      Summary:
      The patch makes some code quality enhancements in `CompactionIterator::InvokeFilterIfNeeded`
      including the renaming of `filter` (which is most likely a remnant of the days before the `FilterV2`
      API when the compaction filter used to return a boolean) to `decision`, the removal of some
      outdated comments, the elimination of an `error` flag which was only used in one failure case
      out of many, as well as some small stylistic improvements. (Some of the above will also come in
      handy when adding compaction filter support for wide-column entities.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11174
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D42901408
      
      Pulled By: ltamasi
      
      fbshipit-source-id: ab382d59a4990c5dfe1cee219d49e1d80902b666
      df680b24
  24. 01 Feb, 2023 2 commits
  25. 31 Jan, 2023 2 commits
    • P
      Cleanup, improve, stress test LockWAL() (#11143) · 94e3beec
      Peter Dillinger committed
      Summary:
      The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change:
      
      * Implement counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use.
      * Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions)
      * Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability)
      * Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other.
      * Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads.
      * Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change.
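      
      The counting semantics in the first three bullets can be sketched as below. `CountingLockSketch` is a hypothetical class that only models the lock count and the non-OK result for an unmatched unlock (shown as `false`); it does not stall writes or touch a WAL, and it is not how the DB implements this.
      
      ```cpp
      #include <cassert>
      #include <mutex>
      
      // Hypothetical model of counted lock/unlock calls.
      class CountingLockSketch {
       public:
        // Each Lock() must later be paired with exactly one Unlock().
        void Lock() {
          std::lock_guard<std::mutex> guard(mu_);
          ++lock_count_;
        }
      
        // Returns false when there is no matching Lock(), which makes a
        // stray Unlock() visible to the caller instead of silently ignored.
        bool Unlock() {
          std::lock_guard<std::mutex> guard(mu_);
          if (lock_count_ == 0) {
            return false;
          }
          --lock_count_;
          return true;
        }
      
        // The resource stays locked until every Lock() has been undone.
        bool IsLocked() const {
          std::lock_guard<std::mutex> guard(mu_);
          return lock_count_ > 0;
        }
      
       private:
        mutable std::mutex mu_;
        int lock_count_ = 0;
      };
      ```
      
      Because the state is a count rather than a flag, two threads (or two nested callers in one thread) can each lock and unlock without one of them releasing the lock out from under the other.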
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143
      
      Test Plan: unit tests added / updated, added to stress/crash test
      
      Reviewed By: ajkr
      
      Differential Revision: D42848627
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4
      94e3beec
    • Y
      Use user key on sst file for blob verification for Get and MultiGet (#11105) · 24ac53d8
      Yu Zhang 提交于
      Summary:
      Use the user key on sst file for blob verification for `Get` and `MultiGet` instead of the user key passed from caller.
      
      Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105
      
      Test Plan:
      make V=1 db_blob_basic_test
      ./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*"
      
      Reviewed By: ltamasi
      
      Differential Revision: D42716487
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a
      24ac53d8