1. Aug 30, 2022 (1 commit)
    • Verify Iterator/Get() against expected state in only `no_batched_ops_test` (#10590) · 5532b462
      Committed by Changyu Bi
      Summary:
      https://github.com/facebook/rocksdb/issues/10538 added `TestIterateAgainstExpected()` in `no_batched_ops_test` to verify iterator correctness against the in-memory expected state. It is not compatible with runs that follow certain other stress tests, e.g. `TestPut()` in `batched_op_stress`, which either do not set the expected state when writing to the DB or use keys that cannot be parsed by `GetIntVal()`. The assert [here](https://github.com/facebook/rocksdb/blob/d17be55aab80b856f96f4af89f8d18fef96646b4/db_stress_tool/db_stress_common.h#L520) could fail. This PR fixes the issue by setting the iterator upper bound to `max_key` when `destroy_db_initially=0`, to avoid the key space that `batched_op_stress` touches.
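
      As context for the fix, here is a minimal sketch of how an exclusive upper bound is attached to an iterator via `ReadOptions::iterate_upper_bound`; the helper function is hypothetical, and in db_stress the bound would be derived from `max_key`:
      ```cpp
      #include <memory>
      #include "rocksdb/db.h"

      // Hypothetical helper: `upper` (and the bytes it points to) must outlive
      // the returned iterator, because RocksDB stores only the pointer.
      std::unique_ptr<rocksdb::Iterator> NewBoundedIterator(
          rocksdb::DB* db, const rocksdb::Slice* upper) {
        rocksdb::ReadOptions ro;
        ro.iterate_upper_bound = upper;  // exclusive: iteration stops before *upper
        return std::unique_ptr<rocksdb::Iterator>(db->NewIterator(ro));
      }
      ```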
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10590
      
      Test Plan:
      ```
      # set up DB with batched_op_stress
      ./db_stress --test_batches_snapshots=1 --verify_iterator_with_expected_state_one_in=1 --max_key_len=3 --max_key=100000000 --skip_verifydb=1 --continuous_verification_interval=0 --writepercent=85 --delpercent=3 --delrangepercent=0 --iterpercent=10 --nooverwritepercent=1 --prefixpercent=0 --readpercent=2 --key_len_percent_dist=1,30,69
      
      # Before this PR, the following test fails the assertion with an error message like:
      # Assertion failed: (size_key <= key_gen_ctx.weights.size() * sizeof(uint64_t)), function GetIntVal, file db_stress_common.h, line 524.
      ./db_stress --verify_iterator_with_expected_state_one_in=1 --max_key_len=3 --max_key=100000000 --skip_verifydb=1 --continuous_verification_interval=0 --writepercent=0 --delpercent=3 --delrangepercent=0 --iterpercent=95 --nooverwritepercent=1 --prefixpercent=0 --readpercent=2 --key_len_percent_dist=1,30,69 --destroy_db_initially=0
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D39085243
      
      Pulled By: cbi42
      
      fbshipit-source-id: a7dfee2320c330773b623b442d730fd014ec7056
  2. Aug 25, 2022 (1 commit)
    • Add Iterator test against expected state to stress test (#10538) · d140fbfd
      Committed by Changyu Bi
      Summary:
      As mentioned in https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913,
      `db_stress` does not have much verification for iterator correctness.
      It has a `TestIterate()` function, but that is mainly for comparing results
      between two iterators: one with `total_order_seek`, and the other optionally
      setting `auto_prefix` and upper/lower bounds. Commit 49a0581ad2462e31aa3f768afa769e0d33390f33
      added a new `TestIterateAgainstExpected()` function that compares the iterator against
      the expected state. It locks a range of keys, creates an iterator, does
      a random sequence of `Next`/`Prev`, and compares against the expected state.
      This PR is based on that commit; the main changes include some logs
      (for easier debugging if a test fails), a forward and backward scan to
      cover the entire locked key range, and a flag for optionally turning on
      this version of iterator testing.
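
      A hedged sketch of the forward-scan half of such a check (illustrative only, not the actual `TestIterateAgainstExpected()` code; it assumes `expected` holds exactly the locked key range):
      ```cpp
      #include <map>
      #include <memory>
      #include <string>
      #include "rocksdb/db.h"

      // Walk the locked range forward and require an exact match, key by key,
      // between the iterator and the in-memory expected state.
      bool VerifyForwardScan(rocksdb::DB* db,
                             const std::map<std::string, std::string>& expected,
                             const rocksdb::Slice& lower_bound,
                             const rocksdb::Slice* upper_bound) {
        rocksdb::ReadOptions ro;
        ro.iterate_upper_bound = upper_bound;  // exclusive end of the locked range
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
        auto exp = expected.begin();
        for (it->Seek(lower_bound); it->Valid(); it->Next(), ++exp) {
          if (exp == expected.end() || it->key() != exp->first ||
              it->value() != exp->second) {
            return false;  // iterator and expected state diverged
          }
        }
        // The scan must end cleanly and consume the whole expected range.
        return it->status().ok() && exp == expected.end();
      }
      ```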
      
      Added a constraint that the checks against expected state in
      `TestIterateAgainstExpected()` and in `TestGet()` are only turned on
      when the `--skip_verifydb` flag is not set.
      Also removed the changelog entry introduced in https://github.com/facebook/rocksdb/issues/10553.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10538
      
      Test Plan:
      Run `db_stress` with `--verify_iterator_with_expected_state_one_in=1`,
      and a large `--iterpercent` and `--num_iterations`. Checked `op_logs`
      manually to ensure expected coverage. Tweaked part of the code in
      https://github.com/facebook/rocksdb/issues/10449 and the stress test was able to catch it.
      - internally run various flavor of crash test
      
      Reviewed By: ajkr
      
      Differential Revision: D38847269
      
      Pulled By: cbi42
      
      fbshipit-source-id: 8b4402a9bba9f6cfa08051943cd672579d489599
  3. Aug 24, 2022 (1 commit)
    • Update `TestGet()` to verify against expected state (#10553) · 198e5d8e
      Committed by Changyu Bi
      Summary:
      Updated `TestGet()` in `no_batched_op_stress` to check the result of `Get()` operations against the expected state (`expected_state_manager_`). More specifically, if `Get()` finds a key, the expected state should not have `DELETION_SENTINEL` for the same key, and if `Get()` returns NotFound for a key, the expected state should not have the key. One intention of this change is to verify the correctness of a code path change regarding range tombstones.
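
      The invariant being checked, as a minimal sketch (names are illustrative, not the actual db_stress code):
      ```cpp
      #include <cassert>
      #include "rocksdb/status.h"

      // `expected_has_key` is true iff the expected state holds a live value
      // (i.e. not DELETION_SENTINEL) for the key that was passed to Get().
      void CheckGetAgainstExpected(const rocksdb::Status& s, bool expected_has_key) {
        if (s.ok()) {
          assert(expected_has_key);   // found in DB => must be live in expected state
        } else if (s.IsNotFound()) {
          assert(!expected_has_key);  // absent in DB => must be absent in expected state
        }
        // Other statuses (e.g. I/O errors) are handled separately by the test.
      }
      ```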
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10553
      
      Test Plan: run db_stress with nonzero readpercent: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`. When I initially used wrong column family in `thread->shared->Get`, the test reported inconsistencies.
      
      Reviewed By: ajkr
      
      Differential Revision: D38927007
      
      Pulled By: cbi42
      
      fbshipit-source-id: f9f61b312ad0b4c21a799329609ba8526169b048
  4. Aug 13, 2022 (1 commit)
    • Add memtable per key-value checksum (#10281) · fd165c86
      Committed by Changyu Bi
      Summary:
      Appends a per key-value checksum to each internal key. These checksums are verified on read paths, including Get, Iterator, and Flush. Get and Iterator will return a `Corruption` status if there is a checksum verification failure. Flush will make the DB read-only upon a memtable entry checksum verification failure.
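
      Conceptually, each memtable entry carries a small checksum covering key, value, sequence number, and value type, which is recomputed and compared on reads. A toy sketch of the idea (FNV-1a is used here purely for illustration; this is not RocksDB's actual checksum scheme):
      ```cpp
      #include <cstddef>
      #include <cstdint>
      #include "rocksdb/slice.h"

      // Illustrative 2-byte per-KV checksum (cf. memtable_protection_bytes_per_key=2):
      // hash key, value, sequence number and type, then truncate.
      uint16_t KvChecksum(const rocksdb::Slice& key, const rocksdb::Slice& value,
                          uint64_t seq, uint8_t type) {
        uint64_t h = 1469598103934665603ull;  // FNV-1a offset basis
        auto mix = [&h](const char* p, size_t n) {
          for (size_t i = 0; i < n; ++i) {
            h = (h ^ static_cast<uint8_t>(p[i])) * 1099511628211ull;
          }
        };
        mix(key.data(), key.size());
        mix(value.data(), value.size());
        mix(reinterpret_cast<const char*>(&seq), sizeof(seq));
        mix(reinterpret_cast<const char*>(&type), sizeof(type));
        return static_cast<uint16_t>(h);  // verification recomputes and compares
      }
      ```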
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10281
      
      Test Plan:
      - Added new unit test cases: `make check`
      - Benchmark on memtable insert
      ```
      TEST_TMPDIR=/dev/shm/memtable_write ./db_bench -benchmarks=fillseq -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100
      
      # avg over 10 runs
      Baseline: 1166936 ops/sec
      memtable 2 bytes kv checksum : 1.11674e+06 ops/sec (-4%)
      memtable 2 bytes kv checksum + write batch 8 bytes kv checksum: 1.08579e+06 ops/sec (-6.95%)
      write batch 8 bytes kv checksum: 1.17979e+06 ops/sec (+1.1%)
      ```
      -  Benchmark on memtable-only reads: ops/sec dropped 31% for `readseq` due to the time spent verifying checksums;
      ops/sec for `readrandom` dropped ~6.8%.
      ```
      # Readseq
      sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillseq,readseq"[-X20]" -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100
      
      readseq [AVG    20 runs] : 7432840 (± 212005) ops/sec;  822.3 (± 23.5) MB/sec
      readseq [MEDIAN 20 runs] : 7573878 ops/sec;  837.9 MB/sec
      
      With -memtable_protection_bytes_per_key=2:
      
      readseq [AVG    20 runs] : 5134607 (± 119596) ops/sec;  568.0 (± 13.2) MB/sec
      readseq [MEDIAN 20 runs] : 5232946 ops/sec;  578.9 MB/sec
      
      # Readrandom
      sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillrandom,readrandom"[-X10]" -disable_wal=true -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=100
      readrandom [AVG    10 runs] : 140236 (± 3938) ops/sec;    9.8 (± 0.3) MB/sec
      readrandom [MEDIAN 10 runs] : 140545 ops/sec;    9.8 MB/sec
      
      With -memtable_protection_bytes_per_key=2:
      readrandom [AVG    10 runs] : 130632 (± 2738) ops/sec;    9.1 (± 0.2) MB/sec
      readrandom [MEDIAN 10 runs] : 130341 ops/sec;    9.1 MB/sec
      ```
      
      - Stress test: `python3 -u tools/db_crashtest.py whitebox --duration=1800`
      
      Reviewed By: ajkr
      
      Differential Revision: D37607896
      
      Pulled By: cbi42
      
      fbshipit-source-id: fdaefb475629d2471780d4a5f5bf81b44ee56113
  5. Aug 09, 2022 (1 commit)
  6. Jul 20, 2022 (1 commit)
    • Stop operating on DB in a stress test background thread (#10373) · b443d24f
      Committed by Yanqin Jin
      Summary:
      Stress test background threads do not coordinate with test worker
      threads for DB reopen in the middle of a test run, so accessing the DB
      object from a stress test background thread can race with test workers. Remove the
      TimestampedSnapshotThread.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10373
      
      Test Plan:
      ```
      ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=1 \
      --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 \
      --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 \
      --block_size=16384 --bloom_bits=7.580319535285394 --bottommost_compression_type=disable \
      --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache \
      --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 \
      --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \
      --compact_files_one_in=1000000 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=0 \
      --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 \
      --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 \
      --continuous_verification_interval=0 --create_timestamped_snapshot_one_in=20 --data_block_index_type=0 \
      --db=/dev/shm/rocksdb/ --db_write_buffer_size=0 --delpercent=5 --delrangepercent=0 --destroy_db_initially=1 \
      --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=1 --enable_pipelined_write=0 \
      --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=2 \
      --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \
      --get_sorted_wal_files_one_in=0 --index_block_restart_interval=11 --index_type=0 --ingest_external_file_one_in=0 \
      --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \
      --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 \
      --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 \
      --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 \
      --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.5 \
      --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True \
      --nooverwritepercent=1 --open_files=500000 --open_metadata_write_fault_one_in=0 \
      --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000 \
      --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=2 \
      --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=1 \
      --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=1000 \
      --readpercent=55 --recycle_log_file_num=0 --reopen=100 --ribbon_starting_level=8 \
      --secondary_cache_fault_one_in=0 --secondary_cache_uri= --snapshot_hold_ops=100000 \
      --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 \
      --subcompactions=3 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 \
      --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=1 \
      --txn_write_policy=0 --unordered_write=0 --unpartitioned_pinning=0 \
      --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=1 \
      --use_merge=1 --use_multiget=0 --use_txn=1 --user_timestamp_size=0 --value_size_mult=32 \
      --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 \
      --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none \
      --write_buffer_size=4194304 --write_dbid_to_manifest=0 --writepercent=35
      ```
      make crash_test_with_txn
      make crash_test_with_multiops_wc_txn
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37903189
      
      Pulled By: riversand963
      
      fbshipit-source-id: cd1728ad7ba4ce4cf47af23c4f65dda0956744f9
  7. Jul 19, 2022 (1 commit)
  8. Jul 17, 2022 (1 commit)
  9. Jul 14, 2022 (1 commit)
  10. Jul 12, 2022 (1 commit)
  11. Jul 06, 2022 (1 commit)
    • Expand stress test coverage for user-defined timestamp (#10280) · caced09e
      Committed by Yanqin Jin
      Summary:
      Before this PR, we call `now()` to get the wall time before performing point lookups and range
      scans when user-defined timestamp is enabled.
      
      With this PR, we expand the coverage to (see the sketch after this list):
      - read with an older timestamp, which is larger than the wall time when the process starts but potentially smaller than `now()`
      - add coverage for `ReadOptions::iter_start_ts != nullptr`
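
      A hedged sketch of the two read modes, assuming a column family whose comparator uses a fixed-width 64-bit timestamp (`EncodeTs()` is a hypothetical helper that encodes the integer in the comparator's expected format):
      ```cpp
      #include <cstdint>
      #include <memory>
      #include <string>
      #include "rocksdb/db.h"

      // Hypothetical: encode a 64-bit timestamp in the comparator's expected format.
      std::string EncodeTs(uint64_t ts);

      void ReadAtTimestamp(rocksdb::DB* db, uint64_t older_ts, uint64_t start_ts) {
        rocksdb::ReadOptions ro;
        std::string ts_buf = EncodeTs(older_ts);  // some ts in [start_ts, now)
        rocksdb::Slice ts(ts_buf);
        ro.timestamp = &ts;                       // point lookups/scans read as of `ts`

        std::string lb_buf = EncodeTs(start_ts);
        rocksdb::Slice ts_lb(lb_buf);
        ro.iter_start_ts = &ts_lb;  // iterator also surfaces versions with ts >= ts_lb
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // it->key(), it->timestamp(), it->value() ...
        }
      }
      ```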
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10280
      
      Test Plan:
      ```bash
      make check
      ```
      
      Also,
      ```bash
      TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
      ```
      
      So far, we have had four successful runs of the above.
      
      In addition,
      ```bash
      TEST_TMPDIR=/dev/shm/rocksdb make crash_test
      ```
      Succeeded twice, showing no regression.
      
      Reviewed By: ltamasi
      
      Differential Revision: D37539805
      
      Pulled By: riversand963
      
      fbshipit-source-id: f2d9887ad95245945ce17a014d55bb93f00e1cb5
  12. Jul 01, 2022 (1 commit)
  13. Jun 30, 2022 (1 commit)
    • Clock cache (#10273) · 57a0e2f3
      Committed by Guido Tagliavini Ponce
      Summary:
      This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
      - Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
      - Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
      - Middle insertions into the clock list.
      - A test that exercises the clock eviction policy.
      - Update the Java API of ClockCache and Java calls to C++.
      
      Along the way, we improved the code and comment quality of FastLRUCache. These changes are relatively minor.
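
      As background on the clock eviction algorithm mentioned above, here is a minimal second-chance sketch (illustrative only; not the ClockCache code, which also has to interact with the hash table and the shard lock):
      ```cpp
      #include <cstddef>
      #include <vector>

      struct Slot {
        bool occupied = false;
        bool referenced = false;  // set on lookup, cleared by the clock hand
      };

      // Advance the clock hand until a victim with referenced == false is found.
      // Each referenced entry gets one "second chance" before eviction.
      // Assumes at least one occupied slot exists (i.e. the cache is non-empty).
      size_t FindVictim(std::vector<Slot>& slots, size_t& hand) {
        for (;;) {
          size_t cur = hand;
          hand = (hand + 1) % slots.size();
          Slot& s = slots[cur];
          if (!s.occupied) continue;
          if (s.referenced) {
            s.referenced = false;  // spare it this round
            continue;
          }
          return cur;  // evict this slot
        }
      }
      ```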
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273
      
      Test Plan: ``make -j24 check``
      
      Reviewed By: pdillinger
      
      Differential Revision: D37522461
      
      Pulled By: guidotag
      
      fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
  14. Jun 29, 2022 (1 commit)
  15. Jun 28, 2022 (1 commit)
  16. Jun 25, 2022 (1 commit)
    • Temporarily disable mempurge in crash test (#10252) · f322f273
      Committed by Andrew Kryczka
      Summary:
      Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252
      
      Reviewed By: riversand963
      
      Differential Revision: D37432948
      
      Pulled By: ajkr
      
      fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a
  17. Jun 24, 2022 (1 commit)
    • Dynamically changeable `MemPurge` option (#10011) · 5879053f
      Committed by Baptiste Lemaire
      Summary:
      **Summary**
      Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled.
      
      **Motivation**
      RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not currently seem possible.
      Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement.
      
      **Content of this PR**
      This PR removes the `experimental_mempurge_threshold` flag as a DB option flag and re-introduces it as a `MutableCFOptions` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition of two mempurge threshold values (0.0 and 1.0) to the `db_stress` module, which can be randomly switched via the `set_option_one_in` flag. This is useful for stress testing the dynamic changes.
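
      A hedged usage sketch of the resulting dynamic toggle (option name as given in this summary), via the standard `DB::SetOptions()` API:
      ```cpp
      #include "rocksdb/db.h"

      // Toggle mempurge at runtime for one column family.
      rocksdb::Status ToggleMempurge(rocksdb::DB* db,
                                     rocksdb::ColumnFamilyHandle* cf, bool on) {
        return db->SetOptions(
            cf, {{"experimental_mempurge_threshold", on ? "1.0" : "0.0"}});
      }
      ```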
      
      **Benchmarking**
      I will add numbers to prove that there is no performance impact within the next 12 hours.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011
      
      Reviewed By: pdillinger
      
      Differential Revision: D36462357
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802
  18. Jun 23, 2022 (1 commit)
    • Add the blob cache to the stress tests and the benchmarking tool (#10202) · 2352e2df
      Committed by Gang Liao
      Summary:
      In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`.
      As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.
      
      This PR is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202
      
      Reviewed By: ltamasi
      
      Differential Revision: D37325739
      
      Pulled By: gangliao
      
      fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8
  19. Jun 22, 2022 (1 commit)
  20. Jun 18, 2022 (1 commit)
    • Add rate-limiting support to batched MultiGet() (#10159) · a5d773e0
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with `RandomAccessFileReader::Read()`) to implement the ideal rate-limiting, where we first call `RateLimiter::RequestToken()` for the allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.
      
      However, due to a recent urgent user request, we decided to pursue an elementary (but conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge, as we will consume a large amount of rate-limiter bandwidth at once, causing a burst on disk. This is not what we ultimately want with the rate limiter; therefore, follow-up work is noted in TODO comments.
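
      A hedged sketch of the accumulate-then-read approach described above (illustrative; not the actual `RandomAccessFileReader` code):
      ```cpp
      #include <algorithm>
      #include <cstdint>
      #include "rocksdb/env.h"
      #include "rocksdb/rate_limiter.h"

      // Acquire the full byte budget for one MultiRead() in burst-sized chunks,
      // then issue the batched read. This is what can cause a burst on disk when
      // total_bytes_needed is large.
      void AcquireThenMultiRead(rocksdb::RateLimiter* limiter,
                                int64_t total_bytes_needed) {
        int64_t granted = 0;
        while (granted < total_bytes_needed) {
          int64_t chunk = std::min(total_bytes_needed - granted,
                                   limiter->GetSingleBurstBytes());
          limiter->Request(chunk, rocksdb::Env::IO_USER, /*stats=*/nullptr,
                           rocksdb::RateLimiter::OpType::kRead);
          granted += chunk;
        }
        // ... now perform the single MultiRead() using the granted budget ...
      }
      ```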
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159
      
      Test Plan:
      - Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
      - Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of  `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
      - Stress test:
         - Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work
      
      Reviewed By: ajkr, anand1976
      
      Differential Revision: D37135172
      
      Pulled By: hx235
      
      fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
  21. Jun 17, 2022 (2 commits)
    • Add WriteOptions::protection_bytes_per_key (#10037) · 5d6005c7
      Committed by Andrew Kryczka
      Summary:
      Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.
      
      Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` being returned to the user.
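
      A minimal usage sketch:
      ```cpp
      #include "rocksdb/db.h"

      // Per-write integrity protection via WriteOptions (8 is currently the only
      // supported nonzero value; others yield Status::NotSupported).
      rocksdb::Status ProtectedPut(rocksdb::DB* db, const rocksdb::Slice& k,
                                   const rocksdb::Slice& v) {
        rocksdb::WriteOptions wo;
        wo.protection_bytes_per_key = 8;
        return db->Put(wo, k, v);
      }
      ```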
      
      There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037
      
      Test Plan:
      - Manual
        - Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
        - Enabled in MyShadow for 1+ week
      - Automated
        - Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
        - Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection
      
      Reviewed By: cbi42
      
      Differential Revision: D36614569
      
      Pulled By: ajkr
      
      fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
    • Remove deprecated block-based filter (#10184) · 126c2237
      Committed by Peter Dillinger
      Summary:
      In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
      the public API, because of its inefficiency. Although we normally maintain read compatibility
      on old DBs forever, filters are not required for reading a DB, only for optimizing read
      performance. Thus, it should be acceptable to remove this code and the substantial
      maintenance burden it carries as useful features are developed and validated (such
      as user timestamp).
      
      This change completely removes the code for reading and writing the old block-based
      filters, net removing about 1370 lines of code no longer needed. Options removed from
      testing / benchmarking tools. The prior existence is only evident in a couple of places:
      * `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
      a major release to minimize source code incompatibilities.
      * A warning is logged when an old table file is opened that used the old block-based
      filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
      should suffice. Unfortunately, sst_dump does not tell you whether a file uses
      block-based filter, and the structure of the code makes it very difficult to fix.
      * To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
      for metaindex is maintained (for now).
      
      Other notes:
      * In some cases where numbers are associated with filter configurations, we have had to
      update the assigned numbers so that they all correspond to something that exists.
      * Fixed potential stat counting bug by assuming `filter_checked = false` for cases
      like `filter == nullptr` rather than assuming `filter_checked = true`
      * Removed obsolete `block_offset` and `prefix_extractor` parameters from several
      functions.
      * Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
      because the caller guarantees the prefix extractor exists and is compatible
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184
      
      Test Plan:
      tests updated; manually tested the new warning in the LOG, using a base version to
      generate a DB
      
      Reviewed By: riversand963
      
      Differential Revision: D37212647
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
  22. Jun 16, 2022 (1 commit)
    • Allow db_bench and db_stress to set `allow_data_in_errors` (#10171) · ce419c0f
      Committed by Yanqin Jin
      Summary:
      There is `Options::allow_data_in_errors` that controls whether RocksDB
      is allowed to log data, e.g. keys and values, in LOG files. It is false
      by default. However, in db_bench and db_stress, it is often OK to log
      data because there is no concern about privacy.
      
      This PR allows db_stress and db_bench to set this option on the command
      line, while it remains false by default. Furthermore, the crash/recovery
      tests driven by db_crashtest.py now opt in.
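
      A minimal sketch of the option itself:
      ```cpp
      #include "rocksdb/options.h"

      rocksdb::Options BenchOptions() {
        rocksdb::Options options;
        options.allow_data_in_errors = true;  // fine for benchmarks; default remains false
        return options;
      }
      ```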
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10171
      
      Test Plan: Stress test and db_bench
      
      Reviewed By: hx235
      
      Differential Revision: D37163787
      
      Pulled By: riversand963
      
      fbshipit-source-id: 0242f24d292ba15b6faf8ff903963b85d3e011f8
  23. Jun 15, 2022 (1 commit)
    • Account memory of FileMetaData in global memory limit (#9924) · d665afdb
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      As revealed by heap profiling, allocation of `FileMetaData` for a [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR accounts that allocation toward our global memory limit based on block cache capacity.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924
      
      Test Plan:
      - Previous `make check` verified there are only 2 places where the memory of  the allocated `FileMetaData` can be released
      - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)`
      - db bench (CPU cost of `charge_file_metadata` in write and compact)
         - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
         - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'`
      
      table 1 - write
      
      #-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
      -- | -- | -- | -- | -- | --
      10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
      20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465
      40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078
      80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734**
      160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978**
      
      table 2 - compact
      
      #-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
      -- | -- | -- | -- | -- | --
      10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67
      20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96
      40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96**
      80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78**
      
      - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1  --cache_size=1` killed as normal
      
      Reviewed By: ajkr
      
      Differential Revision: D36055583
      
      Pulled By: hx235
      
      fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
  24. Jun 14, 2022 (2 commits)
    • Make the per-shard hash table fixed-size. (#10154) · f105e1a5
      Committed by Guido Tagliavini Ponce
      Summary:
      We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values.
      
      Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154
      
      Test Plan: `make -j24 check`
      
      Reviewed By: pdillinger
      
      Differential Revision: D37124451
      
      Pulled By: guidotag
      
      fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c
    • Fix a race condition in transaction stress test (#10157) · bfaf8291
      Committed by Yanqin Jin
      Summary:
      Before this PR, there can be a race condition between the thread calling
      `StressTest::Open()` and a background compaction thread calling
      `MultiOpsTxnsStressTest::VerifyPkSkFast()`.
      
      ```
      Time   thread1                             bg_compact_thr
       |     TransactionDB::Open(..., &txn_db_)
       |     db_ is still nullptr
       |                                         db_->GetSnapshot()  // segfault
       |     db_ = txn_db_
       V
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10157
      
      Test Plan: CI
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D37121653
      
      Pulled By: riversand963
      
      fbshipit-source-id: 6a53117f958e9ee86f77297fdeb843e5160a9331
  25. Jun 11, 2022 (1 commit)
    • Snapshots with user-specified timestamps (#9879) · 1777e5f7
      Committed by Yanqin Jin
      Summary:
      In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
      to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.
      
      It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
      do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
      and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.
      
      This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
      https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
      Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.
      
      In the past, snapshots were usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
      object is created with the last published sequence number of the super-version. You can see that the reader actually
      has no good idea of what timestamp to assign to this snapshot, because by the time `GetSnapshot()` is called,
      an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
      sequence number is written.
      
      This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
      exposed only at the `TransactionDB` layer. An application can tell RocksDB to create a snapshot when a transaction
      commits, effectively associating the last sequence number with a timestamp. It is also assumed that the application will
      ensure that any two snapshots with timestamps satisfy the following:
      ```
      snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
      ```
      
      If the application can guarantee that when a reader takes a timestamped snapshot, there are no active writes going on
      in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
      a snapshot with associated timestamp.
      
      Code example
      ```cpp
      // Create a timestamped snapshot when committing transaction.
      txn->SetCommitTimestamp(100);
      txn->SetSnapshotOnNextOperation();
      txn->Commit();
      
      // A wrapper API for convenience
      Status Transaction::CommitAndTryCreateSnapshot(
          std::shared_ptr<TransactionNotifier> notifier,
          TxnTimestamp ts,
          std::shared_ptr<const Snapshot>* ret);
      
      // Create a timestamped snapshot if caller guarantees no concurrent writes
      std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
      ```
      
      The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
      other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
      ```cpp
      // Return the timestamped snapshot corresponding to the given timestamp. If ts is
      // kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
      // Otherwise, we return the snapshot whose timestamp is equal to `ts`. If no
      // such snapshot exists, then we return null.
      std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
      // Return the latest timestamped snapshot if present.
      std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
      ```
      
      We also provide two additional APIs for stats collection and reporting purposes.
      
      ```cpp
      Status TransactionDB::GetAllTimestampedSnapshots(
          std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
      // Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
      Status TransactionDB::GetTimestampedSnapshots(
          TxnTimestamp ts_lb,
          TxnTimestamp ts_ub,
          std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
      ```
      
      To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
      timestamped snapshots whose timestamps are older than or equal to a given threshold.
      ```cpp
      void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
      ```
      
      Before shutdown, RocksDB will release all timestamped snapshots.
      
      Comparison with user-defined timestamp and how they can be combined:
      User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
      mapping between snapshots (sequence numbers) and timestamps.
      Different internal keys with the same user key but different timestamps will be treated as different by compaction,
      thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
      In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
      this snapshot from being dropped by compaction. Here, visible means (seq < snapshot and most recent).
      The timestamped snapshot supports the semantics of reading at an exact point in time.
      
      Timestamped snapshots can also be used with user-defined timestamp.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879
      
      Test Plan:
      ```
      make check
      TEST_TMPDIR=/dev/shm make crash_test_with_txn
      ```
      
      Reviewed By: siying
      
      Differential Revision: D35783919
      
      Pulled By: riversand963
      
      fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
  26. Jun 08, 2022 (2 commits)
    • Update test for secondary instance in stress test (#10121) · f890527b
      Committed by Yanqin Jin
      Summary:
      This PR updates secondary instance testing in the stress test.
      
      A background thread (disabled by default) runs a secondary instance that tails the logs of the primary.
      
      Periodically (every 1 sec), this thread calls `TryCatchUpWithPrimary()` and uses point lookups or range scans
      to read some random keys, with only very basic verification to make sure no assertion failure is triggered.
      
      Thanks to https://github.com/facebook/rocksdb/issues/10061 , we can enable the secondary instance when user-defined timestamp is enabled.
      
      Also removed a less useful test configuration, `secondary_catch_up_one_in`. This is very similar to the periodic
      catch-up.
      
      In the last commit, I decided not to enable it now and just update the tests, since the secondary instance does not
      work well when the underlying file is renamed by the primary, e.g. by SstFileManager.
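
      A hedged sketch of the secondary-instance pattern described above, using the public `DB::OpenAsSecondary()` / `TryCatchUpWithPrimary()` API (paths and the key are placeholders):
      ```cpp
      #include <string>
      #include "rocksdb/db.h"

      void TailPrimary() {
        rocksdb::DB* secondary = nullptr;
        rocksdb::Options options;
        options.max_open_files = -1;  // secondaries are commonly run with all files pre-opened
        rocksdb::Status s = rocksdb::DB::OpenAsSecondary(
            options, "/path/to/primary_db", "/path/to/secondary_dir", &secondary);
        if (!s.ok()) return;
        // Periodically (e.g. every second): tail the primary's MANIFEST/WALs, then read.
        s = secondary->TryCatchUpWithPrimary();
        std::string value;
        s = secondary->Get(rocksdb::ReadOptions(), "some_key", &value);
        delete secondary;
      }
      ```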
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10121
      
      Test Plan:
      ```
      TEST_TMPDIR=/dev/shm/rocksdb make crash_test
      TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
      TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_atomic_flush
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D36939458
      
      Pulled By: riversand963
      
      fbshipit-source-id: 1c065b7efc3690fc341569b9d369a5cbd8ef6b3e
    • Set db_stress defaults for TSAN deadlock detector (#10131) · ff323464
      Committed by Andrew Kryczka
      Summary:
      After https://github.com/facebook/rocksdb/issues/9357 we began seeing the following error attempting to acquire
      locks for file ingestion:
      
      ```
      FATAL: ThreadSanitizer CHECK failed: /home/engshare/third-party2/llvm-fb/12/src/llvm/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40)
      ```
      
      The command was using default values for `ingest_external_file_width`
      (1000) and `log2_keys_per_lock` (2). The expected number of locks needed
      to update those keys is then (1000 / 2^2) = 250, which is above the 0x40 (64)
      limit. This PR reduces the default value of `ingest_external_file_width`
      to 100 so the expected number of locks is 25, which is within the limit.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10131
      
      Reviewed By: ltamasi
      
      Differential Revision: D36986307
      
      Pulled By: ajkr
      
      fbshipit-source-id: e918cdb2fcc39517d585f1e5fd2539e185ada7c1
  27. Jun 03, 2022 (1 commit)
    • Make it possible to enable blob files starting from a certain LSM tree level (#10077) · e6432dfd
      Committed by Gang Liao
      Summary:
      Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.
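
      The end result can be sketched as a hedged configuration (field names per the option added by this PR):
      ```cpp
      #include "rocksdb/options.h"

      rocksdb::Options BlobOptions() {
        rocksdb::Options options;
        options.enable_blob_files = true;
        options.min_blob_size = 1024;          // values >= 1 KiB are eligible for blob files
        options.blob_file_starting_level = 2;  // L0/L1 outputs keep values inline
        return options;
      }
      ```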
      
      In order to achieve this, we would like to do the following:
      - Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic)
      - Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
      - Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
      - Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
      - Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
      - Ideally extend the C and Java bindings with the new option
      - Update the BlobDB wiki to document the new option.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077
      
      Reviewed By: ltamasi
      
      Differential Revision: D36884156
      
      Pulled By: gangliao
      
      fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
  28. Jun 02, 2022 (2 commits)
    • Support specifying blob garbage collection parameters when CompactRange() (#10073) · 3dc6ebaf
      Committed by Gang Liao
      Summary:
      Garbage collection is generally controlled by the BlobDB configuration options `enable_blob_garbage_collection` and `blob_garbage_collection_age_cutoff`. However, there might be use cases where we would want to temporarily override these options while performing a manual compaction. (One use case would be doing a full key-space manual compaction with full=100% garbage collection age cutoff in order to minimize the space occupied by the database.) Our goal here is to make it possible to override the configured GC parameters when using the `CompactRange` API to perform manual compactions. This PR would involve:
      
      - Extending the `CompactRangeOptions` structure so clients can both force-enable and force-disable GC, as well as use a different cutoff than what's currently configured
      - Storing whether blob GC should actually be enabled during a certain manual compaction and the cutoff to use in the `Compaction` object (considering the above overrides) and passing it to `CompactionIterator` via `CompactionProxy`
      - Updating the BlobDB wiki to document the new options.
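
      To make the override concrete, a hedged sketch of the first item above, i.e. the extended `CompactRangeOptions` (enum/field names per this PR):
      ```cpp
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      // Full key-space manual compaction with GC forced on and a 100% age cutoff,
      // to minimize the space occupied by the database.
      rocksdb::Status ForceBlobGC(rocksdb::DB* db) {
        rocksdb::CompactRangeOptions cro;
        cro.blob_garbage_collection_policy =
            rocksdb::BlobGarbageCollectionPolicy::kForce;
        cro.blob_garbage_collection_age_cutoff = 1.0;
        return db->CompactRange(cro, /*begin=*/nullptr, /*end=*/nullptr);
      }
      ```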
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10073
      
      Test Plan: Adding unit tests and adding the new options to the stress test tool.
      
      Reviewed By: ltamasi
      
      Differential Revision: D36848700
      
      Pulled By: gangliao
      
      fbshipit-source-id: c878ef101d1c612429999f513453c319f75d78e9
    • Add support for FastLRUCache in stress and crash tests. (#10081) · b4d0e041
      Committed by Guido Tagliavini Ponce
      Summary:
      Stress tests can run with the experimental FastLRUCache. Crash tests randomly choose between LRUCache and FastLRUCache.
      
      Since only LRUCache supports a secondary cache, we validate the `--secondary_cache_uri` and `--cache_type` flags---when `--secondary_cache_uri` is set, the `--cache_type` is set to `lru_cache`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10081
      
      Test Plan:
      - To test that the FastLRUCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=--duration=960 blackbox_crash_test_with_atomic_flush`. The cache type should sometimes be `fast_lru_cache`.
      - To test the flag validation, run `make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --secondary_cache_uri=x" blackbox_crash_test_with_atomic_flush` multiple times. The test will always be aborted (which is okay). Check that the cache type is always `lru_cache`.
      
      Reviewed By: anand1976
      
      Differential Revision: D36839908
      
      Pulled By: guidotag
      
      fbshipit-source-id: ebcdfdcd12ec04c96c09ae5b9c9d1e613bdd1725
  29. May 27, 2022 (1 commit)
  30. May 25, 2022 (1 commit)
  31. May 21, 2022 (1 commit)
    • Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857) · cc23b46d
      Committed by Changyu Bi
      Summary:
      An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API `ZDICT_finalizeDictionary()` can improve such a dictionary's effectiveness at low cost. This PR changes how the dictionary is created: instead of creating a raw content dictionary (when `max_dict_buffer_bytes > 0`), we call the ZSTD `ZDICT_finalizeDictionary()` API and pass in all buffered uncompressed data blocks as samples.
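
      For reference, a hedged sketch of the ZSTD call in question (see `zdict.h` for the full contract; on older zstd versions it may require defining `ZDICT_STATIC_LINKING_ONLY` before inclusion):
      ```cpp
      #include <zdict.h>

      // Turn concatenated raw samples into a finalized dictionary. Returns the
      // dictionary size, or an error code (check with ZDICT_isError()).
      size_t FinalizeDict(void* dst, size_t dst_capacity,
                          const void* dict_content, size_t dict_content_size,
                          const void* samples, const size_t* sample_sizes,
                          unsigned num_samples, int compression_level) {
        ZDICT_params_t params = {};
        params.compressionLevel = compression_level;
        return ZDICT_finalizeDictionary(dst, dst_capacity, dict_content,
                                        dict_content_size, samples, sample_sizes,
                                        num_samples, params);
      }
      ```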
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857
      
      Test Plan:
      #### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data:
      Set up: change the parameter [here](https://github.com/facebook/rocksdb/blob/fb9a167a55e0970b1ef6f67c1600c8d9c4c6114f/tools/db_bench_tool.cc#L1766) to 16384 to make synthetic data more compressible.
      ```
      # linked local ZSTD with version 1.5.2
      # DEBUG_LEVEL=0 ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_ZSTD=1  EXTRA_CXXFLAGS="-DZSTD_STATIC_LINKING_ONLY -DZSTD -I/data/users/changyubi/install/include/" EXTRA_LDFLAGS="-L/data/users/changyubi/install/lib/ -l:libzstd.a" make -j32 db_bench
      
      dict_bytes=16384
      train_bytes=1048576
      echo "========== No Dictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== Raw Content Dictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench_main -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== FinalizeDictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== TrainDictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      # Result: TrainDictionary is much better on space saving, but FinalizeDictionary seems to use less memory.
      # before compression data size: 1.2GB
      dict_bytes=16384
      max_dict_buffer_bytes =  1048576
                          space   cpu/memory
      No Dictionary       468M    14.93user 1.00system 0:15.92elapsed 100%CPU (0avgtext+0avgdata 23904maxresident)k
      Raw Dictionary      251M    15.81user 0.80system 0:16.56elapsed 100%CPU (0avgtext+0avgdata 156808maxresident)k
      FinalizeDictionary  236M    11.93user 0.64system 0:12.56elapsed 100%CPU (0avgtext+0avgdata 89548maxresident)k
      TrainDictionary     84M     7.29user 0.45system 0:07.75elapsed 100%CPU (0avgtext+0avgdata 97288maxresident)k
      ```
      
      #### Benchmark on 10 sample SST files for spacing saving and CPU time on compression:
      FinalizeDictionary is comparable to TrainDictionary in terms of space saving, and takes less time in compression.
      ```
      dict_bytes=16384
      train_bytes=1048576
      
      for sst_file in `ls ../temp/myrock-sst/`
      do
        echo "********** $sst_file **********"
        echo "========== No Dictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD
      
        echo "========== Raw Content Dictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes
      
        echo "========== FinalizeDictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes --compression_use_zstd_finalize_dict
      
        echo "========== TrainDictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes
      done
      
                               010240.sst (Size/Time) 011029.sst              013184.sst              021552.sst              185054.sst              185137.sst              191666.sst              7560381.sst             7604174.sst             7635312.sst
      No Dictionary           28165569 / 2614419      32899411 / 2976832      32977848 / 3055542      31966329 / 2004590      33614351 / 1755877      33429029 / 1717042      33611933 / 1776936      33634045 / 2771417      33789721 / 2205414      33592194 / 388254
      Raw Content Dictionary  28019950 / 2697961      33748665 / 3572422      33896373 / 3534701      26418431 / 2259658      28560825 / 1839168      28455030 / 1846039      28494319 / 1861349      32391599 / 3095649      33772142 / 2407843      33592230 / 474523
      FinalizeDictionary      27896012 / 2650029      33763886 / 3719427      33904283 / 3552793      26008225 / 2198033      28111872 / 1869530      28014374 / 1789771      28047706 / 1848300      32296254 / 3204027      33698698 / 2381468      33592344 / 517433
      TrainDictionary         28046089 / 2740037      33706480 / 3679019      33885741 / 3629351      25087123 / 2204558      27194353 / 1970207      27234229 / 1896811      27166710 / 1903119      32011041 / 3322315      32730692 / 2406146      33608631 / 570593
      ```
      
      #### Decompression/Read test:
      With FinalizeDictionary/TrainDictionary, some data structures used for decompression are stored in the dictionary, so decompression/reads are expected to be faster.
      ```
      dict_bytes=16384
      train_bytes=1048576
      echo "No Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=0 > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=0 2>&1 | grep MB/s
      
      echo "Raw Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd  -compression_max_dict_bytes=$dict_bytes 2>&1 | grep MB/s
      
      echo "FinalizeDict"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false  > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false 2>&1 | grep MB/s
      
      echo "Train Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes 2>&1 | grep MB/s
      
      No Dictionary
      readrandom   :      12.183 micros/op 82082 ops/sec 12.183 seconds 1000000 operations;    9.1 MB/s (1000000 of 1000000 found)
      Raw Dictionary
      readrandom   :      12.314 micros/op 81205 ops/sec 12.314 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
      FinalizeDict
      readrandom   :       9.787 micros/op 102180 ops/sec 9.787 seconds 1000000 operations;   11.3 MB/s (1000000 of 1000000 found)
      Train Dictionary
      readrandom   :       9.698 micros/op 103108 ops/sec 9.699 seconds 1000000 operations;   11.4 MB/s (1000000 of 1000000 found)
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D35720026
      
      Pulled By: cbi42
      
      fbshipit-source-id: 24d230fdff0fd28a1bb650658798f00dfcfb2a1f
  32. May 20, 2022 (1 commit)
    • Track SST unique id in MANIFEST and verify (#9990) · c6d326d3
      Committed by Jay Zhuang
      Summary:
      Start tracking the SST unique id in the MANIFEST, which is used to verify against
      SST properties to make sure an SST file is not overwritten or
      misplaced. A DB option `try_verify_sst_unique_id` is introduced to
      enable/disable the verification (default is false). If enabled, it opens all SST files
      during DB open to read the unique_id from table properties,
      so it's recommended to use it with `max_open_files = -1` to
      pre-open the files.
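
      A hedged configuration sketch (the option is named `try_verify_sst_unique_id` in this summary; note that the stress-test flag shown in an earlier entry of this log is `verify_sst_unique_id_in_manifest`, which appears to be the name it later settled on):
      ```cpp
      #include "rocksdb/options.h"

      rocksdb::Options VerifyIdsOptions() {
        rocksdb::Options options;
        options.try_verify_sst_unique_id = true;  // name as given in this summary
        options.max_open_files = -1;              // pre-open all SSTs so IDs are checked at open
        return options;
      }
      ```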
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990
      
      Test Plan: unit tests, format-compatible test, mini-crash test
      
      Reviewed By: anand1976
      
      Differential Revision: D36381863
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f
  33. May 19, 2022 (2 commits)
    • Remove ROCKSDB_SUPPORT_THREAD_LOCAL define because it's a part of C++11 (#10015) · 0a43061f
      Committed by Yaroslav Stepanchuk
      Summary:
      The ROCKSDB_SUPPORT_THREAD_LOCAL definition has been removed.
      `__thread` (a #define) has been replaced with `thread_local` (the C++ keyword) across the code base.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10015
      
      Reviewed By: siying
      
      Differential Revision: D36485491
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6522d212514ee190b90b4e2750c80c7e34013c78
    • Avoid overwriting options loaded from OPTIONS (#9943) · e3a3dbf2
      Committed by Yanqin Jin
      Summary:
      This is similar to https://github.com/facebook/rocksdb/issues/9862, including the following fixes/refactoring:
      
      1. If an OPTIONS file is specified via `-options_file`, the majority of options will be loaded from the file. We should not
      overwrite options that have been loaded from the file. Instead, we configure only fields of options which are
      shared objects and not set by the OPTIONS file. We also configure a few fields, e.g. `create_if_missing`, that are necessary
      for the stress test to run.
      
      2. Refactor options initialization into three functions, `InitializeOptionsFromFile()`, `InitializeOptionsFromFlags()`
      and `InitializeOptionsGeneral()` similar to db_bench. I hope they can be shared in the future. The high-level logic is
      as follows:
      ```cpp
      if (!InitializeOptionsFromFile(...)) {
        InitializeOptionsFromFlags(...);
      }
      InitializeOptionsGeneral(...);
      ```
      
      3. Currently, the setting for `block_cache_compressed` does not seem correct because it by default specifies a
      size of `numeric_limits<size_t>::max()` ((size_t)-1). According to code comments, `-1` indicates the default value,
      which should refer to the `num_shard_bits` argument.
      
      4. Clarify `fail_if_options_file_error`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9943
      
      Test Plan:
      1. make check
      2. Run stress tests, and manually check generated OPTIONS file and compare them with input OPTIONS files
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D36133769
      
      Pulled By: riversand963
      
      fbshipit-source-id: 35dacdc090a0a72c922907170cd132b9ecaa073e
  34. May 18, 2022 (2 commits)
    • Adjust public APIs to prefer 128-bit SST unique ID (#10009) · 0070680c
      Committed by Peter Dillinger
      Summary:
      128 bits should suffice almost always and for tracking in manifest.
      
      Note that this changes the output of sst_dump --show_properties to only show 128 bits.
      
      Also introduces InternalUniqueIdToHumanString for presenting internal IDs for debugging purposes.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10009
      
      Test Plan: unit tests updated
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D36458189
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 93ebc4a3b6f9c73ee154383a1f8b291a5d6bbef5
    • Rewrite memory-charging feature's option API (#9926) · 3573558e
      Committed by Hui Xiao
      Summary:
      **Context:**
      Previous PRs https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, and https://github.com/facebook/rocksdb/pull/8428 added a separate flag for each charged memory area. Such an API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache-usage-related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`.
      
      Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags.
      
      **Summary:**
      - Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that
      - Added missing db bench/stress test for some memory charging features
      - Renamed related test suite to indicate they are under the same theme of memory charging
      - Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication
      - Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature.
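
      A hedged sketch of the consolidated API (field and enum names per `BlockBasedTableOptions::cache_usage_options`; treat the exact shape as an assumption):
      ```cpp
      #include "rocksdb/cache.h"
      #include "rocksdb/table.h"

      rocksdb::BlockBasedTableOptions MakeTableOptions() {
        rocksdb::BlockBasedTableOptions table_options;
        // After this PR the compression dictionary building buffer is charged by
        // default; opting out is an explicit override.
        table_options.cache_usage_options.options_overrides.insert(
            {rocksdb::CacheEntryRole::kCompressionDictionaryBuildingBuffer,
             {/*charged=*/rocksdb::CacheEntryRoleOptions::Decision::kDisabled}});
        return table_options;
      }
      ```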
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926
      
      Test Plan:
      - New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)`
      - New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)`
      - CI
      - db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR  -charge_compression_dictionary_building_buffer=1(remove this for comparison)  -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
      
      #-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
      -- | -- | -- | -- | -- | --
      10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
      20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465**
      40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078**
      
      - db_stress: `python3 tools/db_crashtest.py blackbox  -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal
      
      Reviewed By: ajkr
      
      Differential Revision: D36054712
      
      Pulled By: hx235
      
      fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af