1. 02 May 2023 (1 commit)
    • Shard JemallocNodumpAllocator (#11400) · 925d8252
      Andrew Kryczka committed
      Summary:
      RocksDB's jemalloc no-dump allocator (`NewJemallocNodumpAllocator()`) was using a single manual arena. This arena's lock contention could be very high when thread caching is disabled for RocksDB blocks (e.g., when using `MALLOC_CONF='tcache_max:4096'` and `rocksdb_block_size=16384`).
      
      This PR changes the jemalloc no-dump allocator to use a configurable number of manual arenas. That number is required to be a power of two so we can avoid division. The allocator shards allocation requests randomly across those manual arenas.
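      The power-of-two requirement lets the shard be selected with a bitwise AND instead of a modulo. A minimal sketch of the idea (hypothetical names, not RocksDB's actual implementation):

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <random>

      // Hypothetical sketch: pick a manual arena shard for an allocation.
      // Because the arena count is a power of two, the modulo reduces to a
      // cheap bitwise AND, avoiding integer division on the hot path.
      uint32_t PickArenaShard(uint32_t num_arenas, uint32_t rand_bits) {
        // num_arenas must be a nonzero power of two.
        assert(num_arenas != 0 && (num_arenas & (num_arenas - 1)) == 0);
        return rand_bits & (num_arenas - 1);
      }

      int main() {
        std::mt19937 rng(42);
        for (int i = 0; i < 1000; ++i) {
          // Random sharding spreads lock contention across arenas.
          assert(PickArenaShard(8, static_cast<uint32_t>(rng())) < 8);
        }
        return 0;
      }
      ```

      With `num_arenas = 1` this degenerates to the old single-arena behavior, which matches the near-identical baseline numbers in the table below.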
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11400
      
      Test Plan:
      - mysqld setup
        - Branch: fb-mysql-8.0.28 (https://github.com/facebook/mysql-5.6/commit/653eba2e56cfba4eac0c851ac9a70b2da9607527)
        - Build: `mysqlbuild.sh --clean --release`
        - Set env var `MALLOC_CONF='tcache_max:$tcache_max'`
        - Added CLI args `--rocksdb_cache_dump=false --rocksdb_block_cache_size=4294967296 --rocksdb_block_size=16384`
        - Ran under /usr/bin/time
      - Large database scenario
        - Setup command: `mysqlslap -h 127.0.0.1 -P 13020 --auto-generate-sql=1 --auto-generate-sql-load-type=write --auto-generate-sql-guid-primary=1 --number-char-cols=8 --auto-generate-sql-execute-number=262144 --concurrency=32 --no-drop`
        - Benchmark command: `mysqlslap -h 127.0.0.1 -P 13020 --query='select count(*) from mysqlslap.t1;' --number-of-queries=320 --concurrency=32`
        - Results:
      
      | tcache_max | num_arenas | Peak RSS MB (% change) | Query latency seconds (% change) |
      |---|---|---|---|
      | 4096 | **(baseline)** | 4541 | 37.1 |
      | 4096 | 1 | 4535 (-0.1%) | 36.7 (-1%) |
      | 4096 | 8 | 4687 (+3%) | 10.2 (-73%) |
      | 16384 | **(baseline)** | 4514 | 8.4 |
      | 16384 | 1 | 4526 (+0.3%) | 8.5 (+1%) |
      | 16384 | 8 | 4580 (+1%) | 8.5 (+1%) |
      
      Reviewed By: pdillinger
      
      Differential Revision: D45220794
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9a50c9872bdef5d299e52b115a65ee8a5557d58d
  2. 29 April 2023 (1 commit)
    • Deflake some old BlobDB test cases (#11417) · d3ed7968
      Levi Tamasi committed
      Summary:
      The old `StackableDB` based BlobDB implementation relies on a DB listener to track the total size of the SST files in the database and to trigger FIFO eviction. Some test cases in `BlobDBTest` assume that the listener is notified by the time `DB::Flush` returns, which is not guaranteed (side note: `TEST_WaitForFlushMemTable` would not guarantee this either). The patch fixes these tests by using `SyncPoint`s to make sure the listener is actually called before verifying the FIFO behavior.
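      The fix replaces timing assumptions with explicit synchronization. The underlying pattern can be sketched with a plain condition variable (an illustrative stand-in, not RocksDB's actual `SyncPoint` API):

      ```cpp
      #include <cassert>
      #include <condition_variable>
      #include <mutex>
      #include <thread>

      // Illustrative sketch only: the test blocks until the "listener" has
      // actually run, instead of assuming it already ran by the time a call
      // like DB::Flush() returned.
      struct ListenerGate {
        std::mutex mu;
        std::condition_variable cv;
        bool listener_called = false;

        void OnListenerCalled() {  // invoked from the background thread
          std::lock_guard<std::mutex> lk(mu);
          listener_called = true;
          cv.notify_all();
        }
        void WaitForListener() {  // invoked from the test thread
          std::unique_lock<std::mutex> lk(mu);
          cv.wait(lk, [&] { return listener_called; });
        }
      };

      int main() {
        ListenerGate gate;
        std::thread background([&] { gate.OnListenerCalled(); });
        gate.WaitForListener();  // deterministic: no sleeps, no flakiness
        assert(gate.listener_called);
        background.join();
        return 0;
      }
      ```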
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11417
      
      Test Plan:
      ```
      make -j56 COERCE_CONTEXT_SWITCH=1 blob_db_test
      ./blob_db_test --gtest_filter=BlobDBTest.FIFOEviction_TriggerOnSSTSizeChange
      ./blob_db_test --gtest_filter=BlobDBTest.FilterForFIFOEviction
      ./blob_db_test --gtest_filter=BlobDBTest.FIFOEviction_NoEnoughBlobFilesToEvict
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D45407135
      
      Pulled By: ltamasi
      
      fbshipit-source-id: fcd63d76937d2c975f569a6635ce8730772a3d75
  3. 26 April 2023 (2 commits)
    • Block per key-value checksum (#11287) · 62fc15f0
      Changyu Bi committed
      Summary:
      add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are
      1. checksum construction and verification in block.cc/h
      2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
      3. unit tests/crash test updates
      
      Tests:
      * Added unit tests
      * Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`
      
      Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled.
      
      Performance:
      Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory.
      For memory, each block includes an additional 24 bytes of checksum-related state besides the checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32-byte keys and 200-byte values) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache sizes (to mimic various cache hit rates):
      
      ```
      SETUP
      make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
      ./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none
      
      BENCHMARK
      ./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE
      
      The readrandom ops/sec looks like the following:
      Block cache size:  2GB        1.2GB * 0.9    1.2GB * 0.8     1.2GB * 0.5   8MB
      Main              240805     223604         198176           161653       139040
      PR prot_bytes=0   238691     226693         200127           161082       141153
      PR prot_bytes=1   214983     193199         178532           137013       108211
      prot_bytes=1 vs 0  -10%        -15%          -10.8%          -15%        -23%
      ```
      
      The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.
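      A toy sketch of the general idea behind per key-value protection (the hash function and layout here are illustrative, not RocksDB's actual checksum scheme):

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <string>

      // Illustrative sketch: store a tiny per-entry checksum, sized by the
      // block_protection_bytes_per_key option, alongside each key-value pair
      // and verify it on read to catch corruption of cached blocks.
      uint32_t TinyChecksum(const std::string& key, const std::string& value,
                            int protection_bytes) {
        uint32_t h = 2166136261u;  // FNV-1a-style fold, purely for illustration
        for (char c : key) h = (h ^ static_cast<uint8_t>(c)) * 16777619u;
        for (char c : value) h = (h ^ static_cast<uint8_t>(c)) * 16777619u;
        // Truncate to the configured width; widths of 4 or more bytes are
        // capped to 4 here because this toy hash is only 32 bits wide.
        uint32_t mask = protection_bytes >= 4
                            ? 0xffffffffu
                            : ((1u << (8 * protection_bytes)) - 1);
        return h & mask;
      }

      int main() {
        // "Write" path: compute and store the checksum next to the entry.
        uint32_t stored = TinyChecksum("key1", "value1", /*protection_bytes=*/1);
        assert(stored <= 0xffu);  // fits in the configured 1 byte

        // "Read" path: recompute and compare; a mismatch flags corruption.
        assert(stored == TinyChecksum("key1", "value1", 1));
        return 0;
      }
      ```

      Recomputing a checksum on every read is exactly where the read regression above comes from; a 1-byte checksum trades detection strength for space.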
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287
      
      Reviewed By: ajkr
      
      Differential Revision: D43970708
      
      Pulled By: cbi42
      
      fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
    • DBImpl::MultiGet: delete unused var `superversions_to_delete` (#11395) · 40d69b59
      leipeng committed
      Summary:
      In db_impl.cc DBImpl::MultiGet: delete unused var `superversions_to_delete`
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11395
      
      Reviewed By: ajkr
      
      Differential Revision: D45240896
      
      Pulled By: cbi42
      
      fbshipit-source-id: 0fff99b0d794b6f6d4aadee6036bddd6cb19eb31
  4. 25 April 2023 (3 commits)
    • Add back io_uring stress test hack with DbStressFSWrapper for FS not supporting read async (#11404) · 3622cfa3
      Hui Xiao committed
      Summary:
      **Context/Summary:**
      To better utilize `DbStressFSWrapper` for some assertions, https://github.com/facebook/rocksdb/pull/11288 removed an io_uring stress test hack for POSIX FS not supporting read async that had been added in https://github.com/facebook/rocksdb/pull/11242. It was removed based on the assumption that a later PR, https://github.com/facebook/rocksdb/pull/11296, would be a sufficient alternative workaround.
      
      But recent stress tests have shown the opposite, mostly because the approach in #11296 can be incomplete as more `ReadOptions` are passed down, which is what https://github.com/facebook/rocksdb/pull/11288 does.
      
      As a short-term solution that both works around the POSIX FS constraint above and utilizes `DbStressFSWrapper` for the assertion from #11288, I proposed this PR.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11404
      
      Test Plan:
      - Stress test ensures 11288's assertion is still effective in `DbStressFSWrapper`
      ```
      ./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=16 --bottommost_compression_type=disable --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=32767 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4 --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --enable_thread_tracking=0 --expected_values_dir=$exp --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --kill_random_test=888887 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 
--max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=8388608 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --ribbon_starting_level=1 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=2 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=1 --writepercent=35
      ```
      - Monitor future stress tests to confirm that `MultiGet error: Not implemented: ReadAsync` is gone
      
      Reviewed By: ltamasi
      
      Differential Revision: D45242280
      
      Pulled By: hx235
      
      fbshipit-source-id: 9823e3fbd4e9672efdd31478a2f2cbd68a98bdf5
    • Start version 8.3 (#11405) · 46dbcfd7
      Peter Dillinger committed
      Summary:
      Update and clean up history. Update version number. Add to compatibility test.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11405
      
      Reviewed By: ltamasi
      
      Differential Revision: D45242779
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 860bd047584d051472ba9ccefae7ebc3f37b1d8f
    • Fix compression tests^2 (#11403) · a2c1f573
      Peter Dillinger committed
      Summary:
      This time a particular version of bzip2 is under-compressing vs. expectation in BlockBasedTableTest.CompressionRatioThreshold. We'll exempt that algorithm like I did for DBStatisticsTest.CompressionStatsTest.
      
      https://app.circleci.com/pipelines/github/facebook/rocksdb/26869/workflows/a46246db-73c7-4946-af82-10a78a7df6af/jobs/596124
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11403
      
      Test Plan: CI
      
      Reviewed By: ltamasi
      
      Differential Revision: D45233441
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 506c8dfe5e0397c78193359df6288397bf0667c9
  5. 23 April 2023 (1 commit)
  6. 22 April 2023 (3 commits)
    • Changes and enhancements to compression stats, thresholds (#11388) · d79be3dc
      Peter Dillinger committed
      Summary:
      ## Option API updates
      * Add new CompressionOptions::max_compressed_bytes_per_kb, which corresponds to 1024.0 / min allowable compression ratio. This avoids the hard-coded minimum ratio of 8/7.
      * Remove unnecessary constructor for CompressionOptions.
      * Document undocumented CompressionOptions. Use idiom for default values shown clearly in one place (not precariously repeated).
      
       ## Stat API updates
      * Deprecate the BYTES_COMPRESSED, BYTES_DECOMPRESSED histograms. Histograms incur substantial extra space & time costs compared to tickers, and the distribution of uncompressed data block sizes tends to be uninteresting. If we're interested in that distribution, I don't see why it should be limited to blocks stored as compressed.
      * Deprecate the NUMBER_BLOCK_NOT_COMPRESSED ticker, because the name is very confusing.
      * New or existing tickers relevant to compression:
        * BYTES_COMPRESSED_FROM
        * BYTES_COMPRESSED_TO
        * BYTES_COMPRESSION_BYPASSED
        * BYTES_COMPRESSION_REJECTED
        * COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES (both existing)
        * NUMBER_BLOCK_COMPRESSED (existing)
        * NUMBER_BLOCK_COMPRESSION_BYPASSED
        * NUMBER_BLOCK_COMPRESSION_REJECTED
        * BYTES_DECOMPRESSED_FROM
        * BYTES_DECOMPRESSED_TO
      
      We can compute a number of things with these stats:
      * "Successful" compression ratio: BYTES_COMPRESSED_FROM / BYTES_COMPRESSED_TO
      * Compression ratio of data on which compression was attempted: (BYTES_COMPRESSED_FROM + BYTES_COMPRESSION_REJECTED) / (BYTES_COMPRESSED_TO + BYTES_COMPRESSION_REJECTED)
      * Compression ratio of data that could be eligible for compression: (BYTES_COMPRESSED_FROM + X) / (BYTES_COMPRESSED_TO + X) where X = BYTES_COMPRESSION_BYPASSED + BYTES_COMPRESSION_REJECTED
      * Overall SST compression ratio (compression disabled vs. actual): (Y - BYTES_COMPRESSED_TO + BYTES_COMPRESSED_FROM) / Y where Y = COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES
      
      Keeping _REJECTED separate from _BYPASSED helps us to understand "wasted" CPU time in compression.
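      As a sketch, the derived ratios above can be computed from ticker values like this (the struct and the numbers are hypothetical; only the ticker names come from the commit):

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Hypothetical snapshot of ticker values; field names mirror the
      // ticker names listed above.
      struct CompressionTickers {
        uint64_t bytes_compressed_from;
        uint64_t bytes_compressed_to;
        uint64_t bytes_compression_bypassed;
        uint64_t bytes_compression_rejected;
        uint64_t compact_write_bytes;
        uint64_t flush_write_bytes;
      };

      // "Successful" compression ratio.
      double SuccessfulRatio(const CompressionTickers& t) {
        return static_cast<double>(t.bytes_compressed_from) /
               static_cast<double>(t.bytes_compressed_to);
      }

      // Ratio over data on which compression was attempted (rejected data
      // is stored uncompressed, so it counts equally on both sides).
      double AttemptedRatio(const CompressionTickers& t) {
        return static_cast<double>(t.bytes_compressed_from +
                                   t.bytes_compression_rejected) /
               static_cast<double>(t.bytes_compressed_to +
                                   t.bytes_compression_rejected);
      }

      // Overall SST ratio: compression disabled vs. actual bytes written.
      double OverallSstRatio(const CompressionTickers& t) {
        uint64_t y = t.compact_write_bytes + t.flush_write_bytes;
        return static_cast<double>(y - t.bytes_compressed_to +
                                   t.bytes_compressed_from) /
               static_cast<double>(y);
      }

      int main() {
        CompressionTickers t{4000, 1000, 500, 200, 3000, 1000};
        assert(SuccessfulRatio(t) == 4.0);   // 4000 / 1000
        assert(AttemptedRatio(t) == 3.5);    // 4200 / 1200
        assert(OverallSstRatio(t) == 1.75);  // (4000 - 1000 + 4000) / 4000
        return 0;
      }
      ```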
      
       ## BlockBasedTableBuilder
      Various small refactorings, optimizations, and name clean-ups.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11388
      
      Test Plan:
      unit tests added
      
      * `options_settable_test.cc`: use non-deprecated idiom for configuring CompressionOptions from string. The old idiom is tested elsewhere and does not need to be updated to support the new field.
      
      Reviewed By: ajkr
      
      Differential Revision: D45128202
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5a652bf5c022b7ec340cf79018cccf0686962803
    • Improve error message from `SanityCheckCFOptions()` for merge_operator (#11393) · adc9001f
      Changyu Bi committed
      Summary:
      When there is a merge_operator mismatch in `SanityCheckCFOptions()`, for example, going from merge operator "CustomMergeOp" to nullptr, an error message like the following was returned:
      
      "failed the verification on ColumnFamilyOptions::merge_operator--- The specified one is nullptr while the **persisted one is nullptr**."
      
      This happens when the persisted merge operator is not a RocksDB built-in one. This PR improves this error message to include the actual persisted merge operator name.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11393
      
      Test Plan: add unit test to check error message when going from merge op -> nullptr and going from merge op1 to merge op 2.
      
      Reviewed By: ajkr
      
      Differential Revision: D45190131
      
      Pulled By: cbi42
      
      fbshipit-source-id: 67712c2fec29c654c15166d1be985e710e6081e5
    • Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) · 151242ce
      Hui Xiao committed
      Summary:
      **Context:**
      The existing stat rocksdb.sst.read.micros does not distinguish between the compaction and flush cases but aggregates them, which makes it hard to understand the IO read behavior of each.
      
      **Summary**
      - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
         - Fixed the default histogram in `RandomAccessFileReader`
      - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
      - Use `thread_status_util` for an assertion in `DbStressFSWrapper` to continuously test that we are passing the correct `io_activity` under db open, flush and compaction
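      The recording logic can be sketched as follows (the types are stand-ins for the actual stats machinery; only the histogram name strings come from the commit):

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <map>
      #include <string>
      #include <vector>

      // Illustrative sketch: a read-timing sample is recorded both into the
      // aggregate histogram and into a per-IOActivity histogram.
      enum class IOActivity { kFlush, kCompaction, kUnknown };

      struct Stats {
        std::map<std::string, std::vector<uint64_t>> histograms;
        void Record(const std::string& name, uint64_t micros) {
          histograms[name].push_back(micros);
        }
      };

      void RecordSstRead(Stats& stats, IOActivity activity, uint64_t micros) {
        stats.Record("rocksdb.sst.read.micros", micros);  // aggregate, as before
        if (activity == IOActivity::kFlush) {
          stats.Record("rocksdb.file.read.flush.micros", micros);
        } else if (activity == IOActivity::kCompaction) {
          stats.Record("rocksdb.file.read.compaction.micros", micros);
        }
      }

      int main() {
        Stats stats;
        RecordSstRead(stats, IOActivity::kFlush, 3);
        RecordSstRead(stats, IOActivity::kCompaction, 5);
        RecordSstRead(stats, IOActivity::kCompaction, 7);
        // The aggregate count equals the sum of the per-activity counts,
        // which is the relationship the db_bench run below checks.
        assert(stats.histograms["rocksdb.sst.read.micros"].size() == 3);
        assert(stats.histograms["rocksdb.file.read.flush.micros"].size() == 1);
        assert(stats.histograms["rocksdb.file.read.compaction.micros"].size() == 2);
        return 0;
      }
      ```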
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288
      
      Test Plan:
      - **Stress test**
      - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.**  (without blob)
           - May not be exactly the same because `HistogramStat::Add` only guarantees atomicity, not accuracy, across threads.
      ```
      ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
      ```
      ```
      // BlockBasedTable
      rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
      rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
      rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689
      
      // PlainTable
      Does not apply
      ```
      - **Db bench 2: performance**
      
      **Read**
      
      SETUP: db with 900 files
      ```
      ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
      ```
      Run till convergence:
      ```
      ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
      ```
      Pre-change
      `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
      Post-change (no regression, -0.3%)
      `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`
      
      **Compaction/Flush** (run till convergence)
      ```
      ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none
      
      rocksdb.sst.read.micros  COUNT : 33820
      rocksdb.sst.read.flush.micros COUNT : 1800
      rocksdb.sst.read.compaction.micros COUNT : 32020
      ```
      Pre-change
      `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec;    0.7 (± 0.1) MB/sec`
      
      Post-change (no regression, ~-0.4%)
      `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec;    0.7 (± 0.1) MB/sec`
      
      Reviewed By: ajkr
      
      Differential Revision: D44007011
      
      Pulled By: hx235
      
      fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
  7. 21 April 2023 (3 commits)
    • Clarify `SstFileWriter::DeleteRange()` ordering requirements (#11390) · 0a774a10
      Andrew Kryczka committed
      Summary:
      As titled
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11390
      
      Reviewed By: cbi42
      
      Differential Revision: D45148830
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9a8dfd040514bae3d8ed9e97a79cae7683f2749a
    • Fix race condition in db_stress checkpoint cleanup (#11389) · 6cac4c79
      Andrew Kryczka committed
      Summary:
      The old cleanup code had a race condition:
      
      1. Test thread: DestroyDB() marked a file as trash
      2. DeleteScheduler thread: Got the file's size and decided to delete it in chunks
      3. Test thread: DestroyDir() deleted that trash file
      4. DeleteScheduler thread: Began deleting in chunks starting by calling ReopenWritableFile(). Unfortunately this recreates the deleted trash file
      5. Test thread: DestroyDir() fails to remove the parent directory because it contains the file created in 4.
      6. Test thread: Checkpoint::Create() fails due to the directory already existing
      
      It could be repro'd with the following patch/command.
      
      Patch:
      
      ```
       diff --git a/file/delete_scheduler.cc b/file/delete_scheduler.cc
      index 8a2d1615d..337d24a60 100644
       --- a/file/delete_scheduler.cc
      +++ b/file/delete_scheduler.cc
      @@ -317,6 +317,12 @@ Status DeleteScheduler::DeleteTrashFile(const std::string& path_in_trash,
                                                  &num_hard_links, nullptr);
             if (my_status.ok()) {
               if (num_hard_links == 1) {
      +          // Give some time for DestroyDir() to delete file entries. Then, the
      +          // below `ReopenWritableFile()` will recreate files, preventing the
      +          // parent directory from being deleted.
      +          if (rand() % 2 == 0) {
      +            usleep(1000);
      +          }
                 std::unique_ptr<FSWritableFile> wf;
                 my_status = fs_->ReopenWritableFile(path_in_trash, FileOptions(), &wf,
                                                     nullptr);
       diff --git a/file/file_util.cc b/file/file_util.cc
      index 43608fcdc..2cee1ad8e 100644
       --- a/file/file_util.cc
      +++ b/file/file_util.cc
      @@ -263,6 +263,13 @@ Status DestroyDir(Env* env, const std::string& dir) {
           }
         }
      
      +  // Give some time for the DeleteScheduler thread's ReopenWritableFile() to
      +  // recreate deleted files
      +  if (dir.find("checkpoint") != std::string::npos) {
      +    fprintf(stderr, "waiting to destroy %s\n", dir.c_str());
      +    usleep(10000);
      +  }
      +
         if (s.ok()) {
           s = env->DeleteDir(dir);
           // DeleteDir might or might not report NotFound
      ```
      
      Command:
      
      ```
      TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=131072 --target_file_size_base=131072 --max_bytes_for_level_base=524288 --checkpoint_one_in=100 --clear_column_family_one_in=0  --max_key=1000 --value_size_mult=33 --sst_file_manager_bytes_per_truncate=4096 --sst_file_manager_bytes_per_sec=1048576  --interval=3 --compression_type=none --sync_fault_injection=1
      ```
      
      Obviously we don't want to use scheduled deletion here as we need the checkpoint directory deleted immediately. I suspect the DestroyDir() was an attempt to fixup incomplete DestroyDB()s. Now that we expect DestroyDB() to be complete I removed that code.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11389
      
      Reviewed By: hx235
      
      Differential Revision: D45137142
      
      Pulled By: ajkr
      
      fbshipit-source-id: 2af743d342c77cc414fd25fc4c9d7c9c6079ad24
    • Always allow L0->L1 trivial move during manual compaction (#11375) · 43e9a60b
      Changyu Bi committed
      Summary:
      During manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with the compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce the kForce* contract). This can cause large memory usage due to compaction readahead when the number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do an L1 -> L1 intra-level compaction when needed (when `bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 files where the user calls CompactRange(kForce, nullptr, nullptr):
      - before this PR, RocksDB does an L0 -> L1 compaction (trivial move disallowed),
      - after this PR, RocksDB does an L0 -> L1 compaction (trivial move allowed), and an L1 -> L1 compaction.
      Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivially moved.
      
      This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
      
      Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
      
      Reviewed By: ajkr
      
      Differential Revision: D44943518
      
      Pulled By: cbi42
      
      fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
  8. 19 April 2023 (1 commit)
  9. 18 April 2023 (3 commits)
    • Update GeneralTableTest::ApproximateOffsetOfCompressed values (#11384) · 9b698cda
      Peter Dillinger committed
      Summary:
      Because of this failure with snappy 1.1.8, ROCKSDB_NO_FBCODE=1
      
      ```
      Value 3531 is not in range [2000, 3525]
      table/table_test.cc:4231: Failure
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11384
      
      Test Plan: run updated test in failing configuration
      
      Reviewed By: ajkr
      
      Differential Revision: D45057161
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 397054f08033315e2e2bd9410f1fa32ddbf3b9c8
    • Deflake DBWriteTest.LockWALInEffect (#11382) · f3818948
      Andrew Kryczka committed
      Summary:
      This test exhibited the following flaky failure:
      
      ```
      db/db_write_test.cc:653: Failure
      db_->Resume()
      Corruption: Not active
      ```
      
      I was able to repro it by applying the following patch to coerce a specific race condition:
      
      ```
       diff --git a/db/db_write_test.cc b/db/db_write_test.cc
      index d82c57376..775ba3cde 100644
       --- a/db/db_write_test.cc
      +++ b/db/db_write_test.cc
      @@ -636,6 +636,10 @@ TEST_P(DBWriteTest, LockWALInEffect) {
         ASSERT_TRUE(dbfull()->WALBufferIsEmpty());
         ASSERT_OK(db_->UnlockWAL());
      
      +  // Test thread: sleep interval: [0, 3)
      +  // In this interval, the file system is active
      +  sleep(3);
      +
         // Fail the WAL flush if applicable
         fault_fs->SetFilesystemActive(false);
         Status s = Put("key2", "value");
      @@ -649,6 +653,11 @@ TEST_P(DBWriteTest, LockWALInEffect) {
           ASSERT_OK(db_->LockWAL());
           ASSERT_OK(db_->UnlockWAL());
         }
      +
      +  // Test thread: sleep interval: [3, 6)
      +  // In this interval, the file system is inactive
      +  sleep(3);
      +
         fault_fs->SetFilesystemActive(true);
         ASSERT_OK(db_->Resume());
         // Writes should work again
       diff --git a/db/flush_job.cc b/db/flush_job.cc
      index 8193f594f..602ee2c9f 100644
       --- a/db/flush_job.cc
      +++ b/db/flush_job.cc
      @@ -979,6 +979,10 @@ Status FlushJob::WriteLevel0Table() {
                 DirFsyncOptions(DirFsyncOptions::FsyncReason::kNewFileSynced));
           }
           TEST_SYNC_POINT_CALLBACK("FlushJob::WriteLevel0Table", &mems_);
      +    // Flush thread: sleep interval: [0, 4)
      +    // Upon awakening, the file system will be inactive. Then the MANIFEST
      +    // update will fail.
      +    sleep(4);
           db_mutex_->Lock();
         }
         base_->Unref();
      ```
      
      The fix for this scenario is explained in the code change.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11382
      
      Reviewed By: cbi42
      
      Differential Revision: D45027632
      
      Pulled By: ajkr
      
      fbshipit-source-id: 6bfa35a5781c0c080fb74e13f2b2c9f871f7effb
    • Deflake DBBloomFilterTest.OptimizeFiltersForHits (#11383) · b8555ba4
      Andrew Kryczka committed
      Summary:
      In CircleCI build-linux-arm-test-full job (https://app.circleci.com/pipelines/github/facebook/rocksdb/26462/workflows/a9d39d2c-c970-4b0f-9c10-7743beb9771b/jobs/591722), this test exhibited the following flaky failure:
      
      ```
      db/db_bloom_filter_test.cc:2506: Failure
      Expected: (TestGetTickerCount(options, BLOOM_FILTER_USEFUL)) > (65000 * 2), actual: 120558 vs 130000
      ```
      
      I ssh'd to an instance and observed it cuts memtables at slightly different points across runs. Logging in `ConcurrentArena` pointed to `try_lock()` returning false at different points across runs.
      
      This PR changes the approach to allow a fixed number of keys per memtable flush. I verified the bloom filter useful count is deterministic now even on the CircleCI ARM instance.
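      The deflaking idea can be sketched with a toy DB stand-in (all names hypothetical): cut memtables at a fixed key count under test control instead of letting size thresholds decide, so every run produces the same files.

      ```cpp
      #include <cassert>
      #include <vector>

      // Toy stand-in for a DB: a memtable that the test flushes explicitly.
      struct FakeDb {
        std::vector<int> memtable;
        std::vector<std::vector<int>> flushed_files;
        void Put(int key) { memtable.push_back(key); }
        void Flush() {
          if (!memtable.empty()) {
            flushed_files.push_back(memtable);
            memtable.clear();
          }
        }
      };

      int main() {
        const int kKeysPerFlush = 100;  // hypothetical fixed key count
        FakeDb db;
        for (int key = 0; key < 500; ++key) {
          db.Put(key);
          // Cut at a fixed point instead of relying on a size threshold
          // whose trigger point varies across machines and runs.
          if ((key + 1) % kKeysPerFlush == 0) db.Flush();
        }
        assert(db.flushed_files.size() == 5);       // deterministic file count
        assert(db.flushed_files[0].size() == 100);  // deterministic cut points
        return 0;
      }
      ```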
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11383
      
      Reviewed By: cbi42
      
      Differential Revision: D45036829
      
      Pulled By: ajkr
      
      fbshipit-source-id: b602dacb63955f1af09bf0ed409cde0552805a08
  10. 16 April 2023 (2 commits)
  11. 15 April 2023 (2 commits)
    • Try to pick more files in `LevelCompactionBuilder::TryExtendNonL0TrivialMove()` (#11347) · ba16e8ee
      Changyu Bi committed
      Summary:
      Before this PR, in `LevelCompactionBuilder::TryExtendNonL0TrivialMove(index)`, we start from a file at the given index and expand the compaction input towards the right to find files to trivially move. This PR adds the logic to also expand towards the left.
      
      Another major change made in this PR is to not expand L0 files through `TryExtendNonL0TrivialMove()`. This currently happens when compacting L0 files to an empty output level. The condition for expanding files in `TryExtendNonL0TrivialMove()` is an atomic boundary check, which does not take into account that L0 files can overlap in key range and are not sorted in key order, so it may include more L0 files than needed and disallow a trivial move. This change is included in this PR so that we don't make things worse by expanding L0 in both directions.
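      The left-and-right extension can be sketched as a simple window expansion over a sorted, non-overlapping level (illustrative only, not the actual picker code):

      ```cpp
      #include <cassert>
      #include <utility>
      #include <vector>

      // Illustrative sketch: starting from a seed file index in a sorted,
      // non-overlapping level, extend the set of trivially movable files to
      // the right AND to the left, stopping at files that are not eligible
      // (e.g. already being compacted or crossing an atomic boundary).
      std::pair<int, int> ExtendTrivialMove(const std::vector<bool>& eligible,
                                            int seed) {
        int left = seed, right = seed;
        while (right + 1 < static_cast<int>(eligible.size()) &&
               eligible[right + 1]) {
          ++right;  // pre-existing rightward expansion
        }
        while (left - 1 >= 0 && eligible[left - 1]) {
          --left;  // leftward expansion added by this PR
        }
        return {left, right};
      }

      int main() {
        // Files 0..2 are eligible, file 3 is not, file 4 is.
        std::vector<bool> e = {true, true, true, false, true};
        auto window = ExtendTrivialMove(e, 1);
        assert(window.first == 0 && window.second == 2);  // grew both ways
        return 0;
      }
      ```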
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11347
      
      Test Plan:
      * new unit test
      * Benchmark does not show obvious improvement or regression:
      ```
      Write sequentially
      ./db_bench --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216
      
      Main:
      fillseq      :       4.726 micros/op 211592 ops/sec 472.607 seconds 100000000 operations;   23.4 MB/s
      This PR:
      fillseq      :       4.755 micros/op 210289 ops/sec 475.534 seconds 100000000 operations;   23.3 MB/s
      
      Write randomly
      ./db_bench --benchmarks=fillrandom --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216
      
      Main:
      fillrandom   :      16.351 micros/op 61159 ops/sec 1635.066 seconds 100000000 operations;    6.8 MB/s
      This PR:
      fillrandom   :      15.798 micros/op 63298 ops/sec 1579.817 seconds 100000000 operations;    7.0 MB/s
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D44645650
      
      Pulled By: cbi42
      
      fbshipit-source-id: 8631f3a6b3f01decbbf18c34f2b62833cb4f9733
      ba16e8ee
    • M
      Fix several bugs in ImportColumnFamilyTest (#11372) · 9500d90d
      Authored by mayue.fight
      Summary:
      **Context/Summary:**
      ASSERT_EQ only verifies the code of a Status; it does not check the Status's state message.
      
      - Assert by checking the Status state in `ImportColumnFamilyTest`
      - Set db_comparator_name, which was previously forgotten, when creating ExportImportFilesMetaData in `ImportColumnFamilyNegativeTest`
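      The first point can be illustrated with a toy model (not RocksDB's actual `Status` class; names are illustrative): when equality compares only the code, two statuses with different messages compare equal, which is why a test should also assert on the state message.

      ```cpp
      #include <string>

      // Toy Status: operator== compares only the code, mirroring how an
      // ASSERT_EQ on a status can pass despite a wrong error message.
      struct ToyStatus {
        enum Code { kOk, kInvalidArgument } code;
        std::string state;  // the human-readable message
        bool operator==(const ToyStatus& other) const { return code == other.code; }
      };
      ```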
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11372
      
      Reviewed By: ajkr
      
      Differential Revision: D45004343
      
      Pulled By: cbi42
      
      fbshipit-source-id: a13d45521df17ead3d6d4c1c1fe1e4c95397ce8b
      9500d90d
  12. 13 Apr, 2023 1 commit
    • J
      util/ribbon_test.cc: avoid ambiguous reversed operator error in c++20 (#11371) · 6b67b561
      Authored by Jeff Palm
      Summary:
      util/ribbon_test.cc: avoid ambiguous reversed operator error in c++20 (and enable checking for the error)
      
      The code would produce errors like the following when compiled with -Wambiguous-reversed-operator under C++20.
      ```
      util/ribbon_test.cc:695:20: error: ISO C++20 considers use of overloaded operator '!=' (with operand types 'KeyGen' (aka '(anonymous namespace)::StandardKeyGen') and 'KeyGen') to be ambiguou
      s despite there being a unique best viable function with non-reversed arguments [-Werror,-Wambiguous-reversed-operator]
              while (cur != batch_end) {
                     ~~~ ^  ~~~~~~~~~
      util/ribbon_test.cc:111:8: note: candidate function with non-reversed arguments
        bool operator!=(const StandardKeyGen& other) {
             ^
      util/ribbon_test.cc:107:8: note: ambiguous candidate function with reversed arguments
        bool operator==(const StandardKeyGen& other) {
             ^
      ```
      
      This will become a hard error in future standards.
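      A minimal sketch of the kind of fix (the PR's actual change may differ): const-qualifying the comparison operators makes the reversed candidate that C++20 synthesizes identical to the non-reversed one, so the ambiguity disappears.

      ```cpp
      // Sketch of a key generator with C++20-safe comparisons. With a
      // non-const operator==, C++20 also considers the reversed candidate
      // (other == *this), and the two candidates become ambiguous.
      // Const-qualifying both operators resolves it.
      struct KeyGen {
        unsigned long index;
        bool operator==(const KeyGen& other) const { return index == other.index; }
        bool operator!=(const KeyGen& other) const { return !(*this == other); }
      };
      ```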
      
      Confirmed that no errors were generated when building using clang and c++20:
      ```
      USE_CLANG=1 USE_COROUTINES=1 make
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11371
      
      Reviewed By: meyering
      
      Differential Revision: D44921027
      
      Pulled By: cbi42
      
      fbshipit-source-id: ef25b78260920a4d75a718310688d3a2487ffa87
      6b67b561
  13. 12 Apr, 2023 1 commit
    • Y
      Initial add UDT in memtable only option (#11362) · 647cd736
      Authored by Yu Zhang
      Summary:
      This option is immutable throughout the lifetime of the DB. For now, updating its value between different DB open sessions is also an incompatible change. Such updates will be supported later, when I work on support for updating the comparator.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11362
      
      Test Plan: `make check`
      
      Reviewed By: ltamasi
      
      Differential Revision: D44873870
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: aa02094754b58d99abf9af4c9a8108c1350254cb
      647cd736
  14. 11 Apr, 2023 2 commits
    • A
      fix optimization-disabled test builds with platform010 (#11361) · 760b773f
      Authored by Andrew Kryczka
      Summary:
      Fixed the following failure:
      
      ```
      third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc: In function ‘bool testing::internal::StackGrowsDown()’:
      third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:8681:24: error: ‘dummy’ may be used uninitialized [-Werror=maybe-uninitialized]
       8681 |   StackLowerThanAddress(&dummy, &result);
            |   ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
      third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:8671:13: note: by argument 1 of type ‘const void*’ to ‘void testing::internal::StackLowerThanAddress(const void*, bool*)’ declared here
       8671 | static void StackLowerThanAddress(const void* ptr, bool* result) {
            |             ^~~~~~~~~~~~~~~~~~~~~
      third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:8679:7: note: ‘dummy’ declared here
       8679 |   int dummy;
            |       ^~~~~
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11361
      
      Reviewed By: cbi42
      
      Differential Revision: D44838033
      
      Pulled By: ajkr
      
      fbshipit-source-id: 27d68b5a24a15723bbaaa7de45ccd70a60fe259e
      760b773f
    • N
      C-API: Constify cache functions where possible (#11243) · d5a9c0c9
      Authored by Niklas Fiekas
      Summary:
      Makes it easier to use generated Rust bindings. Constness of these is already part of the C++ API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11243
      
      Reviewed By: hx235
      
      Differential Revision: D44840394
      
      Pulled By: ajkr
      
      fbshipit-source-id: bcd1aeb8c959c304148d25b00043bb8c4cd3e0a4
      d5a9c0c9
  15. 08 Apr, 2023 7 commits
    • Z
      fix bad implementation of ShardedCache::GetOccupancyCount (#11325) · c8552d8c
      Authored by Zdenek Korcak
      Summary:
      Fix a copy-paste typo.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11325
      
      Reviewed By: hx235
      
      Differential Revision: D44378512
      
      Pulled By: ajkr
      
      fbshipit-source-id: 509ed2697c06eed975914359ece0459a0ea40312
      c8552d8c
    • N
      Add PaxosStore to USERS (#11357) · d30bb3d1
      Authored by nccx
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11357
      
      Reviewed By: hx235
      
      Differential Revision: D44774454
      
      Pulled By: ajkr
      
      fbshipit-source-id: f3912316b6cd4e0b41310590c93f914f1d943044
      d30bb3d1
    • L
      Makefile: fix a typo: PLATFORM_CFLAGS to PLATFORM_CCFLAGS (#11348) · b2c4bc5f
      Authored by leipeng
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11348
      
      Reviewed By: hx235
      
      Differential Revision: D44774863
      
      Pulled By: ajkr
      
      fbshipit-source-id: ba4bd959650228a71fca6bf62840ae9d7373d6f0
      b2c4bc5f
    • N
      Remove deprecated integration tests from README.md (#11354) · 140dd93b
      Authored by nccx
      Summary:
      The CI systems other than CircleCI are almost always in a failing state. Since CircleCI covers Linux, macOS, and Windows, we can remove the others.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11354
      
      Reviewed By: hx235
      
      Differential Revision: D44774627
      
      Pulled By: ajkr
      
      fbshipit-source-id: c83b298ec5afe4ea410744eda6cc98fc6a3365f1
      140dd93b
    • C
      Initialize `lowest_unnecessary_level_` in `VersionStorageInfo` constructor (#11359) · 64cead91
      Authored by Changyu Bi
      Summary:
      valgrind complains "Conditional jump or move depends on uninitialised value(s)". A sample error message:
      
      ```
      [ RUN      ] DBCompactionTest.DrainUnnecessaryLevelsAfterDBBecomesSmall
      ==3353864== Conditional jump or move depends on uninitialised value(s)
      ==3353864==    at 0x8647B4: rocksdb::VersionStorageInfo::ComputeCompactionScore(rocksdb::ImmutableOptions const&, rocksdb::MutableCFOptions const&) (version_set.cc:3414)
      ==3353864==    by 0x86B340: rocksdb::VersionSet::AppendVersion(rocksdb::ColumnFamilyData*, rocksdb::Version*) (version_set.cc:4946)
      ==3353864==    by 0x876B88: rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const*) (version_set.cc:6876)
      ==3353864==    by 0xBA66FE: rocksdb::VersionEditHandler::CreateCfAndInit(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const&) (version_edit_handler.cc:483)
      ==3353864==    by 0xBA4A81: rocksdb::VersionEditHandler::Initialize() (version_edit_handler.cc:187)
      ==3353864==    by 0xBA3927: rocksdb::VersionEditHandlerBase::Iterate(rocksdb::log::Reader&, rocksdb::Status*) (version_edit_handler.cc:31)
      ==3353864==    by 0x870173: rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool) (version_set.cc:5729)
      ==3353864==    by 0x7538FA: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*) (db_impl_open.cc:522)
      ==3353864==    by 0x75BA0F: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (db_impl_open.cc:1928)
      ==3353864==    by 0x75A735: rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (db_impl_open.cc:1743)
      ==3353864==    by 0x75A510: rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**) (db_impl_open.cc:1720)
      ==3353864==    by 0x5925FD: rocksdb::DBTestBase::TryReopen(rocksdb::Options const&) (db_test_util.cc:710)
      ==3353864==  Uninitialised value was created by a heap allocation
      ==3353864==    at 0x4842F0F: operator new(unsigned long) (vg_replace_malloc.c:422)
      ==3353864==    by 0x876AF4: rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const*) (version_set.cc:6870)
      ==3353864==    by 0xBA66FE: rocksdb::VersionEditHandler::CreateCfAndInit(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const&) (version_edit_handler.cc:483)
      ==3353864==    by 0xBA4A81: rocksdb::VersionEditHandler::Initialize() (version_edit_handler.cc:187)
      ==3353864==    by 0xBA3927: rocksdb::VersionEditHandlerBase::Iterate(rocksdb::log::Reader&, rocksdb::Status*) (version_edit_handler.cc:31)
      ==3353864==    by 0x870173: rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool) (version_set.cc:5729)
      ==3353864==    by 0x7538FA: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*) (db_impl_open.cc:522)
      ==3353864==    by 0x75BA0F: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (db_impl_open.cc:1928)
      ==3353864==    by 0x75A735: rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (db_impl_open.cc:1743)
      ==3353864==    by 0x75A510: rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**) (db_impl_open.cc:1720)
      ==3353864==    by 0x5925FD: rocksdb::DBTestBase::TryReopen(rocksdb::Options const&) (db_test_util.cc:710)
      ==3353864==    by 0x591F73: rocksdb::DBTestBase::Reopen(rocksdb::Options const&) (db_test_util.cc:662)
      ```
      
      This is likely about `lowest_unnecessary_level_`, even though it would be initialized in `CalculateBaseBytes()` before being used in `ComputeCompactionScore()`. Initialize it in the `VersionStorageInfo` constructor as well to prevent valgrind from complaining.
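      The fix can be sketched as a default member initializer (illustrative only; the actual PR may instead assign the value in the constructor's initializer list):

      ```cpp
      // Sketch: giving the member a well-defined default means no code path
      // can read it uninitialised, which silences valgrind's "uninitialised
      // value" report even on paths where CalculateBaseBytes() would have
      // assigned it first.
      class VersionStorageInfoSketch {
       public:
        int lowest_unnecessary_level() const { return lowest_unnecessary_level_; }

       private:
        int lowest_unnecessary_level_ = -1;  // -1 = "no unnecessary level"
      };
      ```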
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11359
      
      Test Plan: - ran a test with valgrind which gave the error message above before this PR: `valgrind --track-origins=yes ./db_compaction_test  --gtest_filter="*DrainUnnecessaryLevelsAfterDBBecomesSmall*"`
      
      Reviewed By: hx235
      
      Differential Revision: D44799112
      
      Pulled By: cbi42
      
      fbshipit-source-id: 557208a66f04a2163b418b2a651bdb7e777c4511
      64cead91
    • P
      Refactor block cache tracing w/improved MultiGet (#11339) · f9db0c6e
      Authored by Peter Dillinger
      Summary:
      After https://github.com/facebook/rocksdb/issues/11301, I wasn't sure whether I had regressed block cache tracing with MultiGet. Demo PR https://github.com/facebook/rocksdb/issues/11330 shows the flawed state of tracing MultiGet before my change, and based on the unit test, there was essentially no change in tracing behavior with https://github.com/facebook/rocksdb/issues/11301. This change is to leave that code and behavior better than I found it.
      
      This change is not intended to change any production behaviors except when block cache tracing is active, though might improve general read path efficiency by disabling some related tracking when such tracing is disabled.
      
      More detail on production code:
      * Refactoring to consolidate the construction of BlockCacheTraceRecord, and other related functionality, in block-based table reader, though it's somewhat awkward to preserve an optimization to avoid copying Slices into temporary strings in BlockCacheLookupContext.
      * Accurately track cache hits and misses (etc.) for each data block accessed by a MultiGet(). (Previously reported hits as misses.)
      * Reduced repeated checking of `block_cache_tracer_` state (by creating lookup_context only when active) for efficiency and to reduce the risk of corner case bugs where tracing is enabled or disabled for different parts of a read op. (See a TODO below)
      * Improved estimate calculation for num_keys_in_block (see code comment)
      
      Possible follow-up:
      * `XXX:` use_cache=true means double cache query? (possible double-query of block cache when allow_mmap_reads=true)
      * `TODO:` need more than one lookup_context here to track individual filter and index partition hits and misses
      * `TODO:` optimize more state checks of `block_cache_tracer_` down to `lookup_context != nullptr`
      * Pre-existing `XXX:` There appear to be 'break' statements above that bypass this writing of the block cache trace record
      * Expand test coverage (see below)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11339
      
      Test Plan:
      * Added a basic unit test for block cache tracing MultiGet, for now just covering one data block with two keys.
      * Added HitMissCountingCache to independently verify that the actual block cache trace and expected block cache trace also agree with the actual number of cache hits / misses (nothing missing or mislabeled). For now only used with MultiGet test.
      * Better testing of num_keys_in_block, for now just with MultiGet
      * Misc improvements to table_test to improve clarity, such as making it clear that certain keys are auto-inserted at the start of every test.
      
      Performance test:
      Testing multireadrandom as in https://github.com/facebook/rocksdb/issues/11301, except averaging over distinct runs rather than [-X30] which doesn't seem to sufficiently reset after each run to work as an independent test run.
      
      Base with revert of 11301: 3148926 ops/sec
      Base: 3019146 ops/sec
      New: 2999529 ops/sec
      
      Possibly a tiny MultiGet CPU regression with this change. We are now always allocating an additional vector for the LookupContexts. I'm still contemplating options to try to correct the regression in https://github.com/facebook/rocksdb/issues/11301.
      
      Testing readrandom:
      Base with revert of 11301: 2311988
      Base: 2281726
      New: 2299722
      
      Possibly a tiny Get CPU improvement with this change. We are now avoiding some unnecessary LookupContext population.
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D44557845
      
      Pulled By: pdillinger
      
      fbshipit-source-id: b841691799d2a48fb59cc8880dc7cbb1e107ae3d
      f9db0c6e
    • C
      Better support for merge operation with data block hash index (#11356) · f631138e
      Authored by Changyu Bi
      Summary:
      When the data block hash index finds a key of op_type `kTypeMerge`, do not redo the data block seek.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11356
      
      Test Plan:
      - added new unit test
      - crash test: `python3 tools/db_crashtest.py whitebox --simple --use_merge=1 --data_block_index_type=1`
      - benchmark see slight improvement in read throughput:
      ```
      TEST_TMPDIR=/dev/shm/hashindex ./db_bench -benchmarks=mergerandom -use_existing_db=false -num=10000000 -compression_type=none -level_compaction_dynamic_level_bytes=1 -merge_operator=PutOperator -write_buffer_size=1000000 --use_data_block_hash_index=1
      
      TEST_TMPDIR=/dev/shm/hashindex ./db_bench -benchmarks=readrandom[-X10] -use_existing_db=true -num=10000000 -merge_operator=PutOperator -readonly=1 -disable_auto_compactions=1 -reads=100000
      
      Main: readrandom [AVG 10 runs] : 29526 (± 1118) ops/sec;    2.1 (± 0.1) MB/sec
      Post-PR: readrandom [AVG 10 runs] : 31095 (± 662) ops/sec;    2.2 (± 0.0) MB/sec
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D44759895
      
      Pulled By: cbi42
      
      fbshipit-source-id: 387f0c35938c7e0e96b810ca3babf1967fc68191
      f631138e
  16. 07 Apr, 2023 2 commits
    • W
      Filter table files by timestamp: Get operator (#11332) · 0578d9f9
      Authored by Wentian Guo
      Summary:
      If RocksDB enables user-defined timestamps, then the RocksDB read path can filter table files by the min/max timestamps of each file. If the application wants to look up the most recent key version visible at a certain timestamp ts, we can compare ts with the min_ts of each file. If ts < min_ts, we know none of the keys in the file are visible at time ts, so we do not have to open the file. This also saves an I/O.
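      The filtering condition can be sketched like this (a toy check, not the PR's code; struct and field names are illustrative):

      ```cpp
      #include <cstdint>

      // Each table file tracks the min/max user-defined timestamps of its
      // keys. For a read at timestamp read_ts, a file whose smallest
      // timestamp is already newer than read_ts cannot contain any visible
      // version, so the read path can skip opening it entirely.
      struct FileTimestamps {
        uint64_t min_ts;
        uint64_t max_ts;
      };

      bool MayContainVisibleVersion(const FileTimestamps& f, uint64_t read_ts) {
        return read_ts >= f.min_ts;
      }
      ```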
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11332
      
      Reviewed By: pdillinger
      
      Differential Revision: D44763497
      
      Pulled By: guowentian
      
      fbshipit-source-id: abde346b9f18480fe03c04e4006e7d62aa9c22a8
      0578d9f9
    • C
      Drain unnecessary levels when `level_compaction_dynamic_level_bytes=true` (#11340) · b3c43a5b
      Authored by Changyu Bi
      Summary:
      When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, there is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amplification. This PR boosts the compaction score for these levels to allow RocksDB to automatically drain them. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic, without needing the user to do a one-time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340
      
      Test Plan:
      - New unit tests
      - `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run.
      
      Reviewed By: ajkr
      
      Differential Revision: D44563884
      
      Pulled By: cbi42
      
      fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8
      b3c43a5b
  17. 06 Apr, 2023 2 commits
    • A
      Ensure VerifyFileChecksums reads don't exceed readahead_size (#11328) · 0623c5b9
      Authored by anand76
      Summary:
      VerifyFileChecksums currently interprets readahead_size as a payload of readahead_size for calculating the checksum, plus a prefetch of an additional readahead_size, so each read is readahead_size * 2. This change treats the file as chunks of readahead_size for checksum calculation.
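      The changed read pattern can be sketched as follows (a toy checksum, not RocksDB's CRC32C; the point is the I/O sizing, not the hash):

      ```cpp
      #include <algorithm>
      #include <cstddef>
      #include <cstdint>
      #include <vector>

      // Consume the file in chunks of exactly readahead_size, feeding each
      // chunk to a running checksum state. No single read request exceeds
      // readahead_size, unlike the old payload + prefetch pattern (2x).
      uint64_t ChecksumInChunks(const std::vector<uint8_t>& file,
                                size_t readahead_size) {
        uint64_t state = 0;  // stand-in for a real checksum state
        for (size_t off = 0; off < file.size(); off += readahead_size) {
          const size_t n = std::min(readahead_size, file.size() - off);
          for (size_t i = 0; i < n; ++i) {
            state = state * 31 + file[off + i];  // toy rolling update
          }
        }
        return state;
      }
      ```

      Because the checksum state advances byte-by-byte, the result is independent of the chunk size; only the size of each read changes.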
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11328
      
      Test Plan: Add a unit test
      
      Reviewed By: pdillinger
      
      Differential Revision: D44718781
      
      Pulled By: anand1976
      
      fbshipit-source-id: 79bae1ebaa27de2a13bc86f5910bf09356936e63
      0623c5b9
    • H
      Fix initialization-order-fiasco in write_stall_stats.cc (#11355) · 7f5b9f40
      Authored by Hui Xiao
      Summary:
      **Context/Summary:**
      As title.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11355
      
      Test Plan:
      - Ran previously failed tests and they succeed
      - Perf
      `./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3`
      
      Reviewed By: ajkr
      
      Differential Revision: D44719333
      
      Pulled By: hx235
      
      fbshipit-source-id: 23d22f314144071d97f7106ff1241c31c0bdf08b
      7f5b9f40
  18. 05 Apr, 2023 3 commits
    • A
      Use user-provided ReadOptions for metadata block reads more often (#11208) · b4573862
      Authored by Andrew Kryczka
      Summary:
      This is mostly taken from https://github.com/facebook/rocksdb/issues/10427 with my own comments addressed. This PR plumbs the user’s `ReadOptions` down to `GetOrReadIndexBlock()`, `GetOrReadFilterBlock()`, and `GetFilterPartitionBlock()`. Now those functions no longer have to make up a `ReadOptions` with incomplete information.
      
      I also let `PartitionIndexReader::NewIterator()` pass through its caller's `ReadOptions::verify_checksums`, which was inexplicably dropped previously.
      
      Fixes https://github.com/facebook/rocksdb/issues/10463
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11208
      
      Test Plan:
      Functional:
      - Measured `-verify_checksum=false` applies to metadata blocks read outside of table open
        - setup command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56`
        - run command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=readrandom -use_existing_db=true -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56 -duration=10 -threads=32 -cache_size=131072 -statistics=true -verify_checksum=false -open_files=20 -cache_index_and_filter_blocks=true`
        - before: `rocksdb.block.checksum.compute.count COUNT : 384353`
        - after: `rocksdb.block.checksum.compute.count COUNT : 22`
      
      Performance:
      - Setup command (tmpfs, 128MB logical data size, cache indexes/filters without pinning so index/filter lookups go through table reader): `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=131072 -target_file_size_base=131072 -max_bytes_for_level_base=524288 -compression_type=none -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1`
      - Measured point lookup performance. Database is fully cached to emphasize any new callstack overheads
        - Command: `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=readrandom[-W1][-X20] -use_existing_db=true -cache_index_and_filter_blocks=true -disable_auto_compactions=true -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1 -duration=10 -cache_size=1048576000`
        - Before: `readrandom [AVG    20 runs] : 274848 (± 3717) ops/sec;    8.4 (± 0.1) MB/sec`
        - After: `readrandom [AVG    20 runs] : 277904 (± 4474) ops/sec;    8.5 (± 0.1) MB/sec`
      
      Reviewed By: hx235
      
      Differential Revision: D43145366
      
      Pulled By: ajkr
      
      fbshipit-source-id: 75ec062ece86a82cd788783de9de2c72df57f994
      b4573862
    • P
      Re-clarify SecondaryCache API (#11316) · 03ccb1cd
      Authored by Peter Dillinger
      Summary:
      I previously misread or misinterpreted API contracts for SecondaryCache and this should correct the record. (Follow-up item from https://github.com/facebook/rocksdb/issues/11301)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11316
      
      Test Plan: comments only
      
      Reviewed By: anand1976
      
      Differential Revision: D44245107
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 3f8ddec150674b75728f1730f99b963bbf7b76e7
      03ccb1cd
    • P
      Change default block cache from 8MB to 32MB (#11350) · 3c17930e
      Authored by Peter Dillinger
      Summary:
      ... which increases the default number of shards from 16 to 64. Although the default block cache size is only recommended for applications where RocksDB is not performance-critical, under stress conditions block cache mutex contention could become a performance bottleneck. This change of default should alleviate that.
      
      Note that reducing the size of cache shards (recommended minimum 512MB) could cause thrashing, e.g. on filter blocks, so capacity needs to increase to safely increase number of shards.
      
      The 8MB default dates back to 2011 or earlier (f779e7a5), when the most simultaneous threads you could get from a single CPU socket was around 20 (e.g. Intel Xeon E7-8870). Now more than 100 are available.
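      A sketch consistent with the numbers above (8MB -> 16 shards, 32MB -> 64 shards); the real default computation in RocksDB may differ in details such as the minimum shard size and the cap:

      ```cpp
      #include <cstddef>

      // Derive shard bits from capacity: roughly one shard per 512KB of
      // capacity, capped at 2^6 = 64 shards so that per-shard metadata and
      // minimum shard size stay reasonable.
      int DefaultShardBitsSketch(size_t capacity) {
        const size_t kMinShardSize = 512 * 1024;
        int bits = 0;
        for (size_t shards = capacity / kMinShardSize; shards > 1; shards >>= 1) {
          ++bits;
        }
        return bits < 6 ? bits : 6;
      }
      ```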
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11350
      
      Test Plan: unit tests updated
      
      Reviewed By: cbi42
      
      Differential Revision: D44674873
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 91ed3070789b42679283c7e6dc97c41a6a97bdf4
      3c17930e