1. 31 May, 2023 (2 commits)
    • Better management of unreleased HISTORY (#11481) · 8848ec92
      Committed by Peter Dillinger
      Summary:
      See new NOTE in HISTORY.md and unreleased_history/README.txt
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11481
      
      Test Plan: some manual testing on my CentOS 8 system
      
      Reviewed By: jaykorean
      
      Differential Revision: D46233342
      
      Pulled By: pdillinger
      
      fbshipit-source-id: daf59cf3dc907f450b469090dcc481a30a7d7c0d
    • Integrate CacheReservationManager with compressed secondary cache (#11449) · fcc358ba
      Committed by anand76
      Summary:
      This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e., null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows.
      
      When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter``` ensures that the reservation is distributed proportionally across the primary/secondary caches.
      
      The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows -
        |---------    Primary Cache Configured Capacity  -----------|
        |---Secondary Cache Budget----|----Primary Cache Budget-----|
      
      A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired.
      
        |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---|
      
      When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation.
      
      For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion -
      
        |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-|
      
      Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount.
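      
      For illustration, here is a minimal sketch of the proportional split described above; `SplitReservation` and the fixed ratio are hypothetical helpers for this explanation, not the actual adapter code:
      ```
      #include <cstdint>
      
      // Given a placeholder reservation that was charged entirely to the primary
      // cache, compute how much to Deflate() the secondary cache and how much to
      // release from pri_cache_res_, assuming a fixed secondary share of the
      // total budget (e.g. 0.5 for a 50/50 split).
      struct ReservationSplit {
        uint64_t deflate_secondary_by;    // shrink the secondary cache by this much
        uint64_t release_primary_res_by;  // release the same amount from pri_cache_res_
      };
      
      ReservationSplit SplitReservation(uint64_t reservation, double sec_share) {
        const auto sec_part = static_cast<uint64_t>(reservation * sec_share);
        return {sec_part, sec_part};
      }
      ```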
      
      Other alternatives -
      1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead.
      2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them.
      
      The PR contains the following changes -
      1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```.
      2. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementation returns ```NotSupported```, with overrides in ```CompressedSecondaryCache```.
      3. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```.
      4. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to the secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted.
      
      Benchmarks -
      Baseline
      ```
      time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true
      ```
      ```
      readseq      :       3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s
      readwhilewriting :      22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found)
      
      real    6m31.052s
      user    152m5.660s
      sys     26m18.738s
      ```
      With TieredVolatileCache
      ```
      time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true
      ```
      ```
      readseq      :       4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s
      readwhilewriting :      20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found)
      
      real    6m42.743s
      user    157m58.972s
      sys     33m16.671
      ```
      ```
      readseq      :       3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s
      readwhilewriting :      21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found)
      
      real    6m31.469s
      user    155m16.570s
      sys     27m47.834s
      ```
      
      ToDo -
      1. Add to db_stress
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449
      
      Reviewed By: pdillinger
      
      Differential Revision: D46197388
      
      Pulled By: anand1976
      
      fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
  2. 27 May, 2023 (1 commit)
    • add WriteBatch::Release() (#11482) · 3e7fc881
      Committed by Andrew Kryczka
      Summary:
      Together with the existing constructor,
      `explicit WriteBatch(std::string&& rep)`, this enables transferring `WriteBatch` via its `std::string` representation. Associated info like KV checksums are dropped but the caller can use `WriteBatch::VerifyChecksum()` before taking ownership if needed.
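      
      A rough sketch of the round trip, assuming `Release()` moves out the underlying `std::string` representation as described above:
      ```
      #include <cassert>
      #include <string>
      #include <utility>
      #include "rocksdb/write_batch.h"
      
      // Sender side: optionally verify KV checksums, then take the raw representation.
      std::string ToRep(rocksdb::WriteBatch&& wb) {
        assert(wb.VerifyChecksum().ok());  // checksums are dropped by Release()
        return wb.Release();               // assumed to move out the string rep
      }
      
      // Receiver side: rebuild a WriteBatch from the representation via the
      // existing constructor mentioned in the summary.
      rocksdb::WriteBatch FromRep(std::string&& rep) {
        return rocksdb::WriteBatch(std::move(rep));
      }
      ```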
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11482
      
      Reviewed By: cbi42
      
      Differential Revision: D46233884
      
      Pulled By: ajkr
      
      fbshipit-source-id: 6bc64a6e75fb7bbf61d08c09520fc3705a7b44d8
  3. 26 May, 2023 (3 commits)
    • Move WaitForCompact() change entry to Unreleased in History file (#11479) · 23f4e9ad
      Committed by Jay Huh
      Summary:
      Context:
      
      Because of the branch cut, the HISTORY change landed in the previous release's section. Moving the entry to Unreleased.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11479
      
      Test Plan: History change. Not needed.
      
      Reviewed By: pdillinger
      
      Differential Revision: D46226237
      
      Pulled By: jaykorean
      
      fbshipit-source-id: 33e7d84a05db254d227f05d76038fc6d225dbabf
    • Add WaitForCompact with WaitForCompactOptions to public API (#11436) · 81aeb159
      Committed by Jay Huh
      Summary:
      Context:
      
      This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions`
      
      1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB.
      2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing.
      
      Changes in this PR:
      1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct.
      2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h`
      3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option.
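      
      A minimal usage sketch of the new API described above (only `abort_on_pause` is taken from this summary; no other option fields are assumed):
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      
      rocksdb::Status WaitForAllCompactions(rocksdb::DB* db) {
        rocksdb::WaitForCompactOptions wait_opts;
        wait_opts.abort_on_pause = false;  // moved from the internal API into the option struct
        return db->WaitForCompact(wait_opts);
      }
      ```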
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436
      
      Test Plan:
      Following tests added
      - `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish`
      - `DBCompactionTest::WaitForCompactAbortOnPauseAborted`
      - `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted`
      - `DBCompactionTest::WaitForCompactShutdownWhileWaiting`
      - `TransactionTest::WaitForCompactAbortOnPause`
      
      NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place.
      
      Reviewed By: pdillinger
      
      Differential Revision: D45799659
      
      Pulled By: jaykorean
      
      fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b
    • Fix StopWatch bug; Remove setting `record_read_stats` (#11474) · dcc6fc99
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      - StopWatch enables stats even when `StatsLevel::kExceptTimers` is set. It's a harmless bug, though, since `reportTimeToHistogram()` will not report it anyway, per https://github.com/facebook/rocksdb/blob/main/include/rocksdb/statistics.h#L705
      -  https://github.com/facebook/rocksdb/pull/11288 should have removed the logic of setting `record_read_stats = !for_compaction`, as we don't differentiate `RandomAccessFileReader`'s stats behavior based on whether the read is for compaction (instead we now report stats of different IO activities, including compaction, to different stats). Fixing this should report more compaction-related file read micros that weren't previously reported due to `for_compaction==true`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11474
      
      Test Plan:
      - DB bench pre vs post fix with small max_open_files
      
      Setup command
      `./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks=fillseq -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=true -compression_type=none -bloom_bits=3`
      
      Run command
      `./db_bench --open_files=1 -use_existing_db=true -db=/dev/shm/testdb2/ -statistics=true -benchmarks=compactall -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=true -compression_type=none -bloom_bits=3`
      
      Pre-fix
      ```
      rocksdb.sst.read.micros P50 : 2.056175 P95 : 4.647739 P99 : 8.948475 P100 : 25.000000 COUNT : 4451 SUM : 12827
      rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
      rocksdb.file.read.compaction.micros P50 : 2.057397 P95 : 4.625253 P99 : 8.749474 P100 : 25.000000 COUNT : 4382 SUM : 12608
      rocksdb.file.read.db.open.micros P50 : 1.985294 P95 : 9.100000 P99 : 13.000000 P100 : 13.000000 COUNT : 69 SUM : 219
      ```
      
      Post-fix (with a higher `rocksdb.file.read.compaction.micros` count)
      ```
      rocksdb.sst.read.micros P50 : 1.858968 P95 : 3.653086 P99 : 5.968000 P100 : 21.000000 COUNT : 3548 SUM : 9119
      rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
      rocksdb.file.read.compaction.micros P50 : 1.857027 P95 : 3.627614 P99 : 5.738621 P100 : 21.000000 COUNT : 3479 SUM : 8904
      rocksdb.file.read.db.open.micros P50 : 2.000000 P95 : 6.733333 P99 : 11.000000 P100 : 11.000000 COUNT : 69 SUM : 215
      ```
      - CI
      
      Reviewed By: ajkr
      
      Differential Revision: D46137221
      
      Pulled By: hx235
      
      fbshipit-source-id: e5b4ee7001af26f2ee0377bc6334f43b2a527388
  4. 25 May, 2023 (1 commit)
    • Improve memory efficiency of many OptimisticTransactionDBs (#11439) · 17bc2774
      Committed by Peter Dillinger
      Summary:
      Currently it's easy to use a ton of memory with many small OptimisticTransactionDB instances, because each one by default allocates a million mutexes (40 bytes each on my compiler) for validating transactions. It even puts a lot of pressure on the allocator by allocating each one individually!
      
      In this change:
      * Create a new object and option that enables sharing these buckets of mutexes between instances. This is generally good for load balancing potential contention as various DBs become hotter or colder with txn writes. About the only cases where this sharing wouldn't make sense (e.g. each DB usually written by one thread) are cases that would be better off with OccValidationPolicy::kValidateSerial which doesn't use the buckets anyway.
      * Allocate the mutexes in a contiguous array, for efficiency
      * Add an option to ensure the mutexes are cache-aligned. In several other places we use cache-aligned mutexes but OptimisticTransactionDB historically does not. It should be a space-time trade-off the user can choose.
      * Provide some visibility into the memory used by the mutex buckets with an ApproximateMemoryUsage() function (also used in unit testing)
      * Share code with other users of "striped" mutexes, appropriate refactoring for customization & efficiency (e.g. using FastRange instead of modulus)
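      
      A sketch of how the sharing might be configured. `MakeSharedOccLockBuckets()`, `OccLockBuckets`, and the `shared_lock_buckets` field are names taken from the PR and should be treated as assumptions here:
      ```
      #include <memory>
      #include "rocksdb/utilities/optimistic_transaction_db.h"
      
      // Build per-instance options that all point at one shared bucket array.
      rocksdb::OptimisticTransactionDBOptions SharedOccOptions(
          const std::shared_ptr<rocksdb::OccLockBuckets>& buckets) {
        rocksdb::OptimisticTransactionDBOptions occ_opts;
        occ_opts.shared_lock_buckets = buckets;  // assumed field name
        return occ_opts;
      }
      // Usage sketch: auto buckets = rocksdb::MakeSharedOccLockBuckets(1 << 16);
      // then pass SharedOccOptions(buckets) to each OptimisticTransactionDB::Open().
      ```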
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11439
      
      Test Plan: unit tests added. Ran sized-up versions of stress test in unit test, including a before-and-after performance test showing no consistent difference. (NOTE: OptimisticTransactionDB not currently covered by db_stress!)
      
      Reviewed By: ltamasi
      
      Differential Revision: D45796393
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ae2b3a26ad91ceeec15debcdc63ff48df6736a54
  5. 20 May, 2023 (2 commits)
    • Update HISTORY.md/version.h/format compatibility test for 8.3 release (#11464) · 509116c5
      Committed by Yu Zhang
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11464
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D46041333
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 7d83cf9e611451fcc7f7e4a837681ed0d4271df4
    • Much better stats for seeks and prefix filtering (#11460) · 39f5846e
      Committed by Peter Dillinger
      Summary:
      We want to know more about opportunities for better range filters, and the effectiveness of our own range filters. Currently the stats are very limited, essentially logging just hits and misses against prefix filters for range scans in BLOOM_FILTER_PREFIX_* without tracking the false positive rate. Perhaps confusingly, when prefix filters are used for point queries, the stats are currently going into the non-PREFIX tickers.
      
      This change does several things:
      * Introduce new stat tickers for seeks and related filtering, \*LEVEL_SEEK\*
        * Most importantly, allows us to see opportunities for range filtering. Specifically, we can count how many times a seek in an SST file accesses at least one data block, and how many times at least one value() is then accessed. If a data block was accessed but no value(), we can generally assume that the key(s) seen was(were) not of interest so could have been filtered with the right kind of filter, avoiding the data block access.
        * We can get the same level of detail when a filter (for now, prefix Bloom/ribbon) is used, or not. Specifically, we can infer a false positive rate for prefix filters (not available before) from the seek "false positive" rate: when a data block is accessed but no value() is called. (There can be other explanations for a seek false positive, but in typical iterator usage it would indicate a filter false positive.)
        * For efficiency, I wanted to avoid making additional calls to the prefix extractor (or key comparisons, etc.), which would be required if we wanted to more precisely detect filter false positives. I believe that instrumenting value() is the best balance of efficiency vs. accurately measuring what we are often interested in.
        * The stats are divided between last level and non-last levels, to help understand potential tiered storage use cases.
      * The old BLOOM_FILTER_PREFIX_* stats have a different meaning: no longer referring to iterators but to point queries using prefix filters. BLOOM_FILTER_PREFIX_TRUE_POSITIVE is added for computing the prefix false positive rate on point queries, which can be due to filter false positives as well as different keys with the same prefix.
      * Similarly, the non-PREFIX BLOOM_FILTER stats are now for whole key filtering only.
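      
      One plausible way to derive the point-query prefix false-positive rate from these tickers. This is a sketch that assumes BLOOM_FILTER_PREFIX_CHECKED counts all prefix filter probes and BLOOM_FILTER_PREFIX_USEFUL counts negatives, not a documented identity:
      ```
      #include <cstdint>
      #include "rocksdb/statistics.h"
      
      double PrefixPointQueryFpRate(rocksdb::Statistics& stats) {
        const uint64_t checked = stats.getTickerCount(rocksdb::BLOOM_FILTER_PREFIX_CHECKED);
        const uint64_t useful = stats.getTickerCount(rocksdb::BLOOM_FILTER_PREFIX_USEFUL);
        const uint64_t true_pos =
            stats.getTickerCount(rocksdb::BLOOM_FILTER_PREFIX_TRUE_POSITIVE);
        const uint64_t positives = checked - useful;  // filter said "maybe present"
        if (positives == 0) return 0.0;
        return static_cast<double>(positives - true_pos) / static_cast<double>(positives);
      }
      ```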
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11460
      
      Test Plan:
      unit tests updated, including updating many to pop the stat value since last read to improve test
      readability and maintainability.
      
      Performance test shows a consistent small improvement with these changes, both with clang and with gcc. CPU profile indicates that RecordTick is using less CPU, and this makes sense at least for a high filter miss rate. Before, we were recording two ticks per filter miss in iterators (CHECKED & USEFUL) and now recording just one (FILTERED).
      
      Create DB with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
      ```
      And run simultaneous before&after with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -readonly -benchmarks=seekrandom[-X1000] -num=10000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -seek_nexts=1 -duration=20 -seed=43 -threads=8 -cache_size=1000000000 -statistics
      ```
      Before: seekrandom [AVG 275 runs] : 189680 (± 222) ops/sec;   18.4 (± 0.0) MB/sec
      After: seekrandom [AVG 275 runs] : 197110 (± 208) ops/sec;   19.1 (± 0.0) MB/sec
      
      Reviewed By: ajkr
      
      Differential Revision: D46029177
      
      Pulled By: pdillinger
      
      fbshipit-source-id: cdace79a2ea548d46c5900b068c5b7c3a02e5822
  6. 19 May, 2023 (2 commits)
    • Support Clip DB to KeyRange (#11379) · 8d8eb0e7
      Committed by mayue.fight
      Summary:
      This PR is part of the request https://github.com/facebook/rocksdb/issues/11317.
      (Another part is https://github.com/facebook/rocksdb/pull/11378)
      
      ClipDB() will clip the entries in the CF according to the range [begin_key, end_key). All the entries outside this range will be completely deleted (including tombstones).
      This feature is mainly used to ensure that there are no overlapping keys when calling CreateColumnFamilyWithImports() to import multiple CFs.
      
      When calling ClipDB [begin, end), the following steps are performed:
      
      1.  Quickly and directly delete files without overlap
       DeleteFilesInRanges(nullptr, begin) + DeleteFilesInRanges(end, nullptr)
      2. Delete the Key outside the range
      Delete[smallest_key, begin) + Delete[end, largest_key]
      3. Delete the tombstones through manual compaction
      CompactRange(option, nullptr, nullptr)
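      
      A usage sketch, assuming the public entry point added for this is `DB::ClipColumnFamily()` taking the half-open range [begin, end):
      ```
      #include "rocksdb/db.h"
      
      // Drop everything outside [begin, end) in the given column family so it can
      // be imported alongside other non-overlapping CFs.
      rocksdb::Status ClipToRange(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                                  const rocksdb::Slice& begin, const rocksdb::Slice& end) {
        return db->ClipColumnFamily(cf, begin, end);  // assumed API name
      }
      ```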
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11379
      
      Reviewed By: ajkr
      
      Differential Revision: D45840358
      
      Pulled By: cbi42
      
      fbshipit-source-id: 54152e8a45fd8ede137f99787eb252f0b51440a4
    • Add `rocksdb.file.read.db.open.micros` (#11455) · 50046869
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      `rocksdb.file.read.db.open.micros` was left out of https://github.com/facebook/rocksdb/pull/11288
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11455
      
      Test Plan:
      - db bench
      Setup: `./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3`
      Run:
      `./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/  -benchmarks=readrandom  -key_size=3200 -value_size=512 -num=0 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none -file_checksum=1 -cache_size=1`
      
      ```
      rocksdb.sst.read.micros P50 : 3.979798 P95 : 9.738420 P99 : 19.566667 P100 : 39.000000 COUNT : 2360 SUM : 12148
      rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
      rocksdb.file.read.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
      rocksdb.file.read.db.open.micros P50 : 3.979798 P95 : 9.738420 P99 : 19.566667 P100 : 39.000000 COUNT : 2360 SUM : 12148
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D45951934
      
      Pulled By: hx235
      
      fbshipit-source-id: 6c88639dc1b10d98ecccc963ce32a8800495f55b
  7. 13 May, 2023 (2 commits)
    • Delete temp OPTIONS file on failure to write it (#11423) · 2084cdf2
      Committed by anand76
      Summary:
      When the DB is opened, RocksDB creates a temp OPTIONS file, writes the current options to it, and renames it. In case of a failure, the temp file is left behind, and is not deleted by PurgeObsoleteFiles(). Fix this by explicitly deleting the temp file if writing to it or renaming it fails.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11423
      
      Test Plan: Add a unit test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D45540454
      
      Pulled By: anand1976
      
      fbshipit-source-id: 47facdc30d8cc5667036312d04b21d3fc253c92e
    • Add block checksum mismatch ticker stat (#11438) · 113f3250
      Committed by Andrew Kryczka
      Summary:
      Added a ticker stat, `BLOCK_CHECKSUM_MISMATCH_COUNT`, to count how many block checksum verifications detected a mismatch.
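      
      Reading the new ticker follows the usual statistics pattern, e.g.:
      ```
      #include <cstdint>
      #include "rocksdb/options.h"
      #include "rocksdb/statistics.h"
      
      uint64_t ChecksumMismatches(const rocksdb::Options& options) {
        // options.statistics must have been set (e.g. via CreateDBStatistics()).
        return options.statistics->getTickerCount(rocksdb::BLOCK_CHECKSUM_MISMATCH_COUNT);
      }
      ```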
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11438
      
      Test Plan: new unit test
      
      Reviewed By: pdillinger
      
      Differential Revision: D45788179
      
      Pulled By: ajkr
      
      fbshipit-source-id: e2b44eba7c23b3e110ebe69eaa78a710dec2590f
  8. 12 May, 2023 (1 commit)
    • Support compacting files to different temperatures in FIFO compaction (#11428) · 8827cd06
      Committed by Changyu Bi
      Summary:
      - Add a new option `CompactionOptionsFIFO::file_temperature_age_thresholds` that allows the user to specify age thresholds for compacting files to different temperatures. File temperature can be used to store files on different storage media. The new option allows specifying multiple temperature-age pairs. The option uses a struct for each temperature-age pair so that the existing parsing functionality can make the option dynamically settable.
      - Deprecate the old option `age_for_warm` that was added for a similar purpose.
      - Compaction score calculation logic is updated to check if a file needs to be compacted to change its temperature.
      - Some refactoring is done in `FIFOCompactionPicker::PickTemperatureChangeCompaction`.
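      
      A configuration sketch; the per-pair struct field names and order (temperature, then age in seconds) are assumptions based on the description above:
      ```
      #include "rocksdb/options.h"
      
      rocksdb::Options MakeFifoTieredOptions() {
        rocksdb::Options options;
        options.compaction_style = rocksdb::kCompactionStyleFIFO;
        // Files older than 1 hour go to warm storage, older than 1 day to cold.
        options.compaction_options_fifo.file_temperature_age_thresholds = {
            {rocksdb::Temperature::kWarm, 60 * 60},
            {rocksdb::Temperature::kCold, 24 * 60 * 60}};
        return options;
      }
      ```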
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11428
      
      Test Plan: adapted unit tests that were for `age_for_warm` to this new option.
      
      Reviewed By: ajkr
      
      Differential Revision: D45611412
      
      Pulled By: cbi42
      
      fbshipit-source-id: 2dc384841f61cc04abb9681e31aa2de0f0b06106
  9. 10 May, 2023 (2 commits)
    • Simplify detection of x86 CPU features (#11419) · 459969e9
      Committed by Peter Dillinger
      Summary:
      **Background** - runtime detection of certain x86 CPU features was added for optimizing CRC32c checksums, where performance is dramatically affected by the availability of certain CPU instructions and code using intrinsics for those instructions. And Java builds with the native library try to be broadly compatible yet performant.
      
      What has changed is that CRC32c is no longer the most efficient checksum on contemporary x86_64 hardware, nor the default checksum. XXH3 is generally faster and not as dramatically impacted by the availability of certain CPU instructions. For example, on my Skylake system using db_bench (similar on an older Skylake system without AVX512):
      
      PORTABLE=1 empty USE_SSE  : xxh3->8 GB/s   crc32c->0.8 GB/s  (no SSE4.2 nor AVX2 instructions)
      PORTABLE=1 USE_SSE=1      : xxh3->19 GB/s  crc32c->16 GB/s  (with SSE4.2 and AVX2)
      PORTABLE=0 USE_SSE ignored: xxh3->28 GB/s  crc32c->16 GB/s  (also some AVX512)
      
      Testing a ~10 year old system, with SSE4.2 but without AVX2, crc32c is a similar speed to the new systems but xxh3 is only about half that speed, also 8GB/s like the non-AVX2 compile above. Given that xxh3 has specific optimization for AVX2, I think we can infer that crc32c is only fastest for that ~2008-2013 period when SSE4.2 was included but not AVX2. And given that xxh3 is only about 2x slower on these systems (not like >10x slower for unoptimized crc32c), I don't think we need to invest too much in optimally adapting to these old cases.
      
      x86 hardware that doesn't support fast CRC32c is now extremely rare, so requiring a custom build to support such hardware is fine IMHO.
      
      **This change** does two related things:
      * Remove runtime CPU detection for optimizing CRC32c on x86. Maintaining this code is non-zero work, and compiling special code that doesn't work on the configured target instruction set for code generation is always dubious. (On the one hand we have to ensure the CRC32c code uses SSE4.2 but on the other hand we have to ensure nothing else does.)
      * Detect CPU features in source code, not in build scripts. Although there are some hypothetical advantages to detecting in build scripts (compiler generality), RocksDB supports at least three build systems: make, cmake, and buck. It's not practical to support feature detection on all three, and we have suffered from missed optimization opportunities by relying on missing or incomplete detection in cmake and buck. We also depend on some components like xxhash that do source code detection anyway.
      
      **In more detail:**
      * `HAVE_SSE42`, `HAVE_AVX2`, and `HAVE_PCLMUL` replaced by standard macros `__SSE4_2__`, `__AVX2__`, and `__PCLMUL__`.
      * MSVC does not provide high fidelity defines for SSE, PCLMUL, or POPCNT, but we can infer those from `__AVX__` or `__AVX2__` in a compatibility header. In rare cases of false negative or false positive feature detection, a build engineer should be able to set defines to work around the issue.
      * `__POPCNT__` is another standard define, but we happen to only need it on MSVC, where it is set by that compatibility header, or can be set by the build engineer.
      * `PORTABLE` can be set to a CPU type, e.g. "haswell", to compile for that CPU type.
      * `USE_SSE` is deprecated, now equivalent to PORTABLE=haswell, which roughly approximates its old behavior.
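      
      A sketch of what such a compatibility shim can look like; illustrative only, the actual header and exact macros in the PR may differ:
      ```
      // Infer SSE4.2/PCLMUL/POPCNT availability on MSVC, which only defines
      // __AVX__/__AVX2__, since AVX implies the older extensions.
      #if defined(_MSC_VER) && defined(__AVX__)
      #ifndef __SSE4_2__
      #define __SSE4_2__ 1
      #endif
      #ifndef __PCLMUL__
      #define __PCLMUL__ 1
      #endif
      #ifndef __POPCNT__
      #define __POPCNT__ 1
      #endif
      #endif
      ```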
      
      Notably, this change should enable more builds to use the AVX2-optimized Bloom filter implementation.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11419
      
      Test Plan:
      existing tests, CI
      
      Manual performance tests after the change match the before above (none expected with make build).
      
      We also see AVX2 optimized Bloom filter code enabled when expected, by injecting a compiler error. (Performance difference is not big on my current CPU.)
      
      Reviewed By: ajkr
      
      Differential Revision: D45489041
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 60ceb0dd2aa3b365c99ed08a8b2a087a9abb6a70
    • Add hash_seed to Caches (#11391) · f4a02f2c
      Committed by Peter Dillinger
      Summary:
      See motivation and description in new ShardedCacheOptions::hash_seed option.
      
      Updated db_bench so that its seed param is used for the cache hash seed.
      Made its code more safe to ensure seed is set before use.
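      
      A configuration sketch, assuming the new ShardedCacheOptions::hash_seed field is picked up by the concrete cache options structs and that the field-based MakeSharedCache() factory from the cache options refactor is used:
      ```
      #include <memory>
      #include "rocksdb/cache.h"
      
      std::shared_ptr<rocksdb::Cache> MakeSeededCache() {
        rocksdb::LRUCacheOptions opts;
        opts.capacity = 1 << 30;
        opts.hash_seed = 42;            // new ShardedCacheOptions::hash_seed field
        return opts.MakeSharedCache();  // assumed field-based factory
      }
      ```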
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11391
      
      Test Plan:
      unit tests added / updated
      
      **Performance** - no discernible difference seen running cache_bench repeatedly before & after. With lru_cache and hyper_clock_cache.
      
      Reviewed By: hx235
      
      Differential Revision: D45557797
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 40bf4da6d66f9d41a8a0eb8e5cf4246a4aa07934
  10. 09 May, 2023 (1 commit)
    • Record and use the tail size to prefetch table tail (#11406) · 8f763bde
      Committed by Hui Xiao
      Summary:
      **Context:**
      We prefetch the tail part of an SST file (i.e., the blocks after the data blocks till the end of the file) during each SST file open, in the hope of prefetching all of that data at once ahead of later reads, e.g., footer, meta index, filter/index etc. The existing approach to estimate the tail size to prefetch is through the `TailPrefetchStats` heuristics introduced in https://github.com/facebook/rocksdb/pull/4156, which has caused small reads in unlucky cases (e.g., a small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetching use a small size that it shouldn't) and is hard to debug. Therefore we decide to record the exact tail size and use it directly to prefetch the tail of the SST instead of relying on heuristics.
      
      **Summary:**
      - Obtain and record in manifest the tail size in `BlockBasedTableBuilder::Finish()`
         - For backward compatibility, we fall back to TailPrefetchStats and last to simple heuristics that the tail size is a linear portion of the file size - see PR conversation for more.
      - Make `tail_start_offset` part of the table properties and derive the tail size to record in the manifest for external files (e.g., file ingestion, import CF) and db repair (with no access to the manifest).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11406
      
      Test Plan:
      1. New UT
      2. db bench
      Note: db bench on /tmp/ where direct read is supported is too slow to finish, and the default pinning setting in db bench is not helpful for profiling the number of SST reads per Get. Therefore I hacked the following to obtain the comparison below.
      ```
       diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc
      index bd5669f0f..791484c1f 100644
       --- a/table/block_based/block_based_table_reader.cc
      +++ b/table/block_based/block_based_table_reader.cc
      @@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail(
                                  &tail_prefetch_size);
      
         // Try file system prefetch
      -  if (!file->use_direct_io() && !force_direct_prefetch) {
      +  if (false && !file->use_direct_io() && !force_direct_prefetch) {
           if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority)
                    .IsNotSupported()) {
             prefetch_buffer->reset(new FilePrefetchBuffer(
       diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc
      index ea40f5fa0..39a0ac385 100644
       --- a/tools/db_bench_tool.cc
      +++ b/tools/db_bench_tool.cc
      @@ -4191,6 +4191,8 @@ class Benchmark {
                 std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options));
           } else {
             BlockBasedTableOptions block_based_options;
      +      block_based_options.metadata_cache_options.partition_pinning =
      +      PinningTier::kAll;
             block_based_options.checksum =
                 static_cast<ChecksumType>(FLAGS_checksum_type);
             if (FLAGS_use_hash_search) {
      ```
      Create DB
      ```
      ./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
      ```
      ReadRandom
      ```
      ./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
      ```
      (a) Existing (Use TailPrefetchStats for tail size + use a separate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies())
      ```
      rocksdb.table.open.prefetch.tail.hit COUNT : 3395
      rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614
      ```
      
      (b) This PR (Record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies())
      ```
      rocksdb.table.open.prefetch.tail.hit COUNT : 14257
      rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540
      ```
      
      As we can see, we increase the prefetch tail hit count and decrease SST read count with this PR
      
      3. Test backward compatibility by stepping through reading with post-PR code on a db generated pre-PR.
      
      Reviewed By: pdillinger
      
      Differential Revision: D45413346
      
      Pulled By: hx235
      
      fbshipit-source-id: 7d5e36a60a72477218f79905168d688452a4c064
  11. 03 May, 2023 (1 commit)
    • Delete empty WAL files on reopen (#11409) · 03a892a9
      Committed by anand76
      Summary:
      When a DB is opened, RocksDB creates an empty WAL file. When the DB is reopened and the WAL is empty, the min log number to keep is not advanced until a memtable flush happens. If a process crashes soon after reopening the DB, it's likely that no memtable flush would have happened, which means the empty WAL file is not deleted. In a crash loop scenario, this leads to empty WAL files accumulating. Fix this by ensuring the min log number is advanced if the WAL is empty.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11409
      
      Test Plan: Add a unit test
      
      Reviewed By: ajkr
      
      Differential Revision: D45281685
      
      Pulled By: anand1976
      
      fbshipit-source-id: 0225877c613e65ffb30972a0051db2830105423e
  12. 02 May, 2023 (2 commits)
    • Avoid long parameter lists configuring Caches (#11386) · 41a7fbf7
      Committed by Peter Dillinger
      Summary:
      For better clarity, encouraging more options explicitly specified using fields rather than positionally via constructor parameter lists. Simplifies code maintenance as new fields are added. Deprecate some cases of the confusing pattern of NewWhatever() functions returning shared_ptr.
      
      Net reduction of about 70 source code lines (including comments).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11386
      
      Test Plan: existing tests
      
      Reviewed By: ajkr
      
      Differential Revision: D45059075
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d53fa09b268024f9c55254bb973b6c69feebf41a
    • Shard JemallocNodumpAllocator (#11400) · 925d8252
      Committed by Andrew Kryczka
      Summary:
      RocksDB's jemalloc no-dump allocator (`NewJemallocNodumpAllocator()`) was using a single manual arena. This arena's lock contention could be very high when thread caching is disabled for RocksDB blocks (e.g., when using `MALLOC_CONF='tcache_max:4096'` and `rocksdb_block_size=16384`).
      
      This PR changes the jemalloc no-dump allocator to use a configurable number of manual arenas. That number is required to be a power of two so we can avoid division. The allocator shards allocation requests randomly across those manual arenas.
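      
      A sketch of configuring the sharded allocator and attaching it to a block cache; the new arena-count field name is an assumption based on this description:
      ```
      #include <memory>
      #include "rocksdb/cache.h"
      #include "rocksdb/memory_allocator.h"
      
      rocksdb::Status MakeNodumpBlockCache(std::shared_ptr<rocksdb::Cache>* cache) {
        rocksdb::JemallocAllocatorOptions jopts;
        jopts.num_arenas = 8;  // assumed field name; must be a power of two per the PR
        std::shared_ptr<rocksdb::MemoryAllocator> allocator;
        rocksdb::Status s = rocksdb::NewJemallocNodumpAllocator(jopts, &allocator);
        if (!s.ok()) return s;
        rocksdb::LRUCacheOptions copts;
        copts.capacity = 4ull << 30;
        copts.memory_allocator = allocator;  // blocks allocated via the no-dump arenas
        *cache = rocksdb::NewLRUCache(copts);
        return rocksdb::Status::OK();
      }
      ```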
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11400
      
      Test Plan:
      - mysqld setup
        - Branch: fb-mysql-8.0.28 (https://github.com/facebook/mysql-5.6/commit/653eba2e56cfba4eac0c851ac9a70b2da9607527)
        - Build: `mysqlbuild.sh --clean --release`
        - Set env var `MALLOC_CONF='tcache_max:$tcache_max'`
        - Added CLI args `--rocksdb_cache_dump=false --rocksdb_block_cache_size=4294967296 --rocksdb_block_size=16384`
        - Ran under /usr/bin/time
      - Large database scenario
        - Setup command: `mysqlslap -h 127.0.0.1 -P 13020 --auto-generate-sql=1 --auto-generate-sql-load-type=write --auto-generate-sql-guid-primary=1 --number-char-cols=8 --auto-generate-sql-execute-number=262144 --concurrency=32 --no-drop`
        - Benchmark command: `mysqlslap -h 127.0.0.1 -P 13020 --query='select count(*) from mysqlslap.t1;' --number-of-queries=320 --concurrency=32`
        - Results:
      
      | tcache_max | num_arenas | Peak RSS MB (% change) | Query latency seconds (% change) |
      |---|---|---|---|
      | 4096 | **(baseline)** | 4541 | 37.1 |
      | 4096 | 1 | 4535 (-0.1%) | 36.7 (-1%) |
      | 4096 | 8 | 4687 (+3%) | 10.2 (-73%) |
      | 16384 | **(baseline)** | 4514 | 8.4 |
      | 16384 | 1 | 4526 (+0.3%) | 8.5 (+1%) |
      | 16384 | 8 | 4580 (+1%) | 8.5 (+1%) |
      
      Reviewed By: pdillinger
      
      Differential Revision: D45220794
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9a50c9872bdef5d299e52b115a65ee8a5557d58d
  13. 26 Apr, 2023 (1 commit)
    • Block per key-value checksum (#11287) · 62fc15f0
      Committed by Changyu Bi
      Summary:
      Add option `block_protection_bytes_per_key` and an implementation of block per key-value checksums. The main changes are
      1. checksum construction and verification in block.cc/h
      2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
      3. unit tests/crash test updates
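      
      Enabling the option looks like any other column family option, e.g. (a sketch; the note on supported values reflects the crash test usage above and is an assumption otherwise):
      ```
      #include "rocksdb/options.h"
      
      rocksdb::Options MakeProtectedOptions() {
        rocksdb::Options options;
        // Bytes of protection per key-value in cached blocks; 0 (default) disables
        // the feature, and the crash test above uses 1.
        options.block_protection_bytes_per_key = 1;
        return options;
      }
      ```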
      
      Tests:
      * Added unit tests
      * Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`
      
      Follow up (maybe as a separate PR): make sure the corruption status returned from BlockIters is correctly handled.
      
      Performance:
      Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory.
      For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates):
      
      ```
      SETUP
      make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
      ./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none
      
      BENCHMARK
      ./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE
      
      The readrandom ops/sec looks like the following:
      Block cache size:  2GB        1.2GB * 0.9    1.2GB * 0.8     1.2GB * 0.5   8MB
      Main              240805     223604         198176           161653       139040
      PR prot_bytes=0   238691     226693         200127           161082       141153
      PR prot_bytes=1   214983     193199         178532           137013       108211
      prot_bytes=1 vs    -10%        -15%          -10.8%          -15%        -23%
      prot_bytes=0
      ```
      
      The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287
      
      Reviewed By: ajkr
      
      Differential Revision: D43970708
      
      Pulled By: cbi42
      
      fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
  14. 25 Apr, 2023 (1 commit)
  15. 22 Apr, 2023 (2 commits)
    • Changes and enhancements to compression stats, thresholds (#11388) · d79be3dc
      Committed by Peter Dillinger
      Summary:
      ## Option API updates
      * Add new CompressionOptions::max_compressed_bytes_per_kb, which corresponds to 1024.0 / min allowable compression ratio. This avoids the hard-coded minimum ratio of 8/7.
      * Remove unnecessary constructor for CompressionOptions.
      * Document undocumented CompressionOptions. Use idiom for default values shown clearly in one place (not precariously repeated).
      
       ## Stat API updates
      * Deprecate the BYTES_COMPRESSED, BYTES_DECOMPRESSED histograms. Histograms incur substantial extra space & time costs compared to tickers, and the distribution of uncompressed data block sizes tends to be uninteresting. If we're interested in that distribution, I don't see why it should be limited to blocks stored as compressed.
      * Deprecate the NUMBER_BLOCK_NOT_COMPRESSED ticker, because the name is very confusing.
      * New or existing tickers relevant to compression:
        * BYTES_COMPRESSED_FROM
        * BYTES_COMPRESSED_TO
        * BYTES_COMPRESSION_BYPASSED
        * BYTES_COMPRESSION_REJECTED
        * COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES (both existing)
        * NUMBER_BLOCK_COMPRESSED (existing)
        * NUMBER_BLOCK_COMPRESSION_BYPASSED
        * NUMBER_BLOCK_COMPRESSION_REJECTED
        * BYTES_DECOMPRESSED_FROM
        * BYTES_DECOMPRESSED_TO
      
      We can compute a number of things with these stats:
      * "Successful" compression ratio: BYTES_COMPRESSED_FROM / BYTES_COMPRESSED_TO
      * Compression ratio of data on which compression was attempted: (BYTES_COMPRESSED_FROM + BYTES_COMPRESSION_REJECTED) / (BYTES_COMPRESSED_TO + BYTES_COMPRESSION_REJECTED)
      * Compression ratio of data that could be eligible for compression: (BYTES_COMPRESSED_FROM + X) / (BYTES_COMPRESSED_TO + X) where X = BYTES_COMPRESSION_REJECTED + NUMBER_BLOCK_COMPRESSION_REJECTED
      * Overall SST compression ratio (compression disabled vs. actual): (Y - BYTES_COMPRESSED_TO + BYTES_COMPRESSED_FROM) / Y where Y = COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES
      
      Keeping _REJECTED separate from _BYPASSED helps us to understand "wasted" CPU time in compression.
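      
      For example, the "successful" compression ratio above can be pulled straight from the tickers (enum names assumed to match the ticker names listed):
      ```
      #include "rocksdb/statistics.h"
      
      double SuccessfulCompressionRatio(rocksdb::Statistics& stats) {
        const double from =
            static_cast<double>(stats.getTickerCount(rocksdb::BYTES_COMPRESSED_FROM));
        const double to =
            static_cast<double>(stats.getTickerCount(rocksdb::BYTES_COMPRESSED_TO));
        return to > 0 ? from / to : 0.0;
      }
      ```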
      
       ## BlockBasedTableBuilder
      Various small refactorings, optimizations, and name clean-ups.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11388
      
      Test Plan:
      unit tests added
      
      * `options_settable_test.cc`: use non-deprecated idiom for configuring CompressionOptions from string. The old idiom is tested elsewhere and does not need to be updated to support the new field.
      
      Reviewed By: ajkr
      
      Differential Revision: D45128202
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5a652bf5c022b7ec340cf79018cccf0686962803
    • Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) · 151242ce
      Committed by Hui Xiao
      Summary:
      **Context:**
      The existing stat rocksdb.sst.read.micros does not reflect each of the compaction and flush cases but aggregates them, which is not so helpful for us to understand the IO read behavior of each of them.
      
      **Summary**
      - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
         - Fixed the default histogram in `RandomAccessFileReader`
      - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
      - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288
      
      Test Plan:
      - **Stress test**
      - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.**  (without blob)
           - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads.
      ```
      ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
      ```
      ```
      // BlockBasedTable
      rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
      rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
      rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689
      
      // PlainTable
      Does not apply
      ```
      - **Db bench 2: performance**
      
      **Read**
      
      SETUP: db with 900 files
      ```
      ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
      ```
      run till convergence
      ```
      ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
      ```
      Pre-change
      `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
      Post-change (no regression, -0.3%)
      `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`
      
      **Compaction/Flush**
      run till convergence
      ```
      ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none
      
      rocksdb.sst.read.micros  COUNT : 33820
      rocksdb.sst.read.flush.micros COUNT : 1800
      rocksdb.sst.read.compaction.micros COUNT : 32020
      ```
      Pre-change
      `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec;    0.7 (± 0.1) MB/sec`
      
      Post-change (no regression, ~-0.4%)
      `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec;    0.7 (± 0.1) MB/sec`
      
      Reviewed By: ajkr
      
      Differential Revision: D44007011
      
      Pulled By: hx235
      
      fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
  16. 21 Apr, 2023 (1 commit)
    • Always allow L0->L1 trivial move during manual compaction (#11375) · 43e9a60b
      Committed by Changyu Bi
      Summary:
      During manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with the compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce the kForce* contract). This can cause large memory usage due to compaction readahead when the number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do an L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 files, where the user calls CompactRange(kForce, nullptr, nullptr):
      - before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
      - after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
      Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
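      
      For reference, selecting that behavior looks like:
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      
      rocksdb::Status ForceFullCompaction(rocksdb::DB* db) {
        rocksdb::CompactRangeOptions cro;
        // Avoid the extra L1 -> L1 rewrite when L0 files overlap and cannot be
        // trivially moved.
        cro.bottommost_level_compaction =
            rocksdb::BottommostLevelCompaction::kForceOptimized;
        return db->CompactRange(cro, nullptr, nullptr);
      }
      ```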
      
      This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause files to be moved to an incorrect level when CompactRangeOptions::change_level is specified.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
      
      Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
      
      Reviewed By: ajkr
      
      Differential Revision: D44943518
      
      Pulled By: cbi42
      
      fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
  17. 08 Apr, 2023 (1 commit)
    • Refactor block cache tracing w/improved MultiGet (#11339) · f9db0c6e
      Committed by Peter Dillinger
      Summary:
      After https://github.com/facebook/rocksdb/issues/11301, I wasn't sure whether I had regressed block cache tracing with MultiGet. Demo PR https://github.com/facebook/rocksdb/issues/11330 shows the flawed state of tracing MultiGet before my change, and based on the unit test, there was essentially no change in tracing behavior with https://github.com/facebook/rocksdb/issues/11301. This change is to leave that code and behavior better than I found it.
      
      This change is not intended to change any production behaviors except when block cache tracing is active, though might improve general read path efficiency by disabling some related tracking when such tracing is disabled.
      
      More detail on production code:
      * Refactoring to consolidate the construction of BlockCacheTraceRecord, and other related functionality, in block-based table reader, though it's somewhat awkward to preserve an optimization to avoid copying Slices into temporary strings in BlockCacheLookupContext.
      * Accurately track cache hits and misses (etc.) for each data block accessed by a MultiGet(). (Previously reported hits as misses.)
      * Reduced repeated checking of `block_cache_tracer_` state (by creating lookup_context only when active) for efficiency and to reduce the risk of corner case bugs where tracing is enabled or disabled for different parts of a read op. (See a TODO below)
      * Improved estimate calculation for num_keys_in_block (see code comment)
      
      Possible follow-up:
      * `XXX:` use_cache=true means double cache query? (possible double-query of block cache when allow_mmap_reads=true)
      * `TODO:` need more than one lookup_context here to track individual filter and index partition hits and misses
      * `TODO:` optimize more state checks of `block_cache_tracer_` down to `lookup_context != nullptr`
      * Pre-existing `XXX:` There appear to be 'break' statements above that bypass this writing of the block cache trace record
      * Expand test coverage (see below)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11339
      
      Test Plan:
      * Added a basic unit test for block cache tracing MultiGet, for now just covering one data block with two keys.
      * Added HitMissCountingCache to independently verify that the actual block cache trace and expected block cache trace also agree with the actual number of cache hits / misses (nothing missing or mislabeled). For now only used with MultiGet test.
      * Better testing of num_keys_in_block, for now just with MultiGet
      * Misc improvements to table_test to improve clarity, such as making it clear that certain keys are auto-inserted at the start of every test.
      
      Performance test:
      Testing multireadrandom as in https://github.com/facebook/rocksdb/issues/11301, except averaging over distinct runs rather than [-X30] which doesn't seem to sufficiently reset after each run to work as an independent test run.
      
      Base with revert of 11301: 3148926 ops/sec
      Base: 3019146 ops/sec
      New: 2999529 ops/sec
      
      Possibly a tiny MultiGet CPU regression with this change. We are now always allocating an additional vector for the LookupContexts. I'm still contemplating options to try to correct the regression in https://github.com/facebook/rocksdb/issues/11301.
      
      Testing readrandom:
      Base with revert of 11301: 2311988
      Base: 2281726
      New: 2299722
      
      Possibly a tiny Get CPU improvement with this change. We are now avoiding some unnecessary LookupContext population.
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D44557845
      
      Pulled By: pdillinger
      
      fbshipit-source-id: b841691799d2a48fb59cc8880dc7cbb1e107ae3d
  18. 07 Apr, 2023 (1 commit)
    • Drain unnecessary levels when `level_compaction_dynamic_level_bytes=true` (#11340) · b3c43a5b
      Committed by Changyu Bi
      Summary:
      When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, there is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amp. This PR boosts the compaction score for these levels to allow RocksDB to automatically drain them. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic without needing the user to do a one time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340
      
      Test Plan:
      - New unit tests
      - `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run.
      
      Reviewed By: ajkr
      
      Differential Revision: D44563884
      
      Pulled By: cbi42
      
      fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8
      b3c43a5b
  19. 06 April 2023, 1 commit
    • A
      Ensure VerifyFileChecksums reads don't exceed readahead_size (#11328) · 0623c5b9
      Committed by anand76
      Summary:
      VerifyFileChecksums currently interprets readahead_size as a payload of readahead_size bytes for calculating the checksum, plus a prefetch of an additional readahead_size bytes, so each read ends up being readahead_size * 2. This change treats readahead_size as the chunk size for checksum calculation.
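      As a hedged sketch (not part of the PR) of how a caller exercises this path, assuming the public `DB::VerifyFileChecksums(const ReadOptions&)` API and a hypothetical DB path:
      
      ```
      #include <cassert>
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      
      int main() {
        rocksdb::DB* db = nullptr;
        rocksdb::Options options;
        options.create_if_missing = true;
        // Note: file checksums must have been enabled (e.g. via
        // options.file_checksum_gen_factory) when the SST files were written
        // for there to be anything to verify.
        assert(rocksdb::DB::Open(options, "/tmp/checksum_db", &db).ok());
      
        rocksdb::ReadOptions ro;
        // After this change, readahead_size is the chunk size used for the
        // checksum reads, rather than chunk size plus an equal-sized prefetch.
        ro.readahead_size = 1 << 20;  // 1MB chunks
        rocksdb::Status s = db->VerifyFileChecksums(ro);
        assert(s.ok() || s.IsNotSupported());
        delete db;
        return 0;
      }
      ```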
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11328
      
      Test Plan: Add a unit test
      
      Reviewed By: pdillinger
      
      Differential Revision: D44718781
      
      Pulled By: anand1976
      
      fbshipit-source-id: 79bae1ebaa27de2a13bc86f5910bf09356936e63
      0623c5b9
  20. 05 April 2023, 2 commits
    • A
      Use user-provided ReadOptions for metadata block reads more often (#11208) · b4573862
      Committed by Andrew Kryczka
      Summary:
      This is mostly taken from https://github.com/facebook/rocksdb/issues/10427 with my own comments addressed. This PR plumbs the user’s `ReadOptions` down to `GetOrReadIndexBlock()`, `GetOrReadFilterBlock()`, and `GetFilterPartitionBlock()`. Now those functions no longer have to make up a `ReadOptions` with incomplete information.
      
      I also let `PartitionIndexReader::NewIterator()` pass through its caller's `ReadOptions::verify_checksums`, which was inexplicably dropped previously.
      
      Fixes https://github.com/facebook/rocksdb/issues/10463
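      A minimal sketch (not from the PR) of the user-facing effect, assuming the public API and a hypothetical DB path and key:
      
      ```
      #include <cassert>
      #include <string>
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      
      int main() {
        rocksdb::DB* db = nullptr;
        rocksdb::Options options;
        options.create_if_missing = true;
        assert(rocksdb::DB::Open(options, "/tmp/ro_db", &db).ok());
      
        rocksdb::ReadOptions ro;
        // With this PR, the user's ReadOptions (e.g. verify_checksums) also
        // apply to index and filter partition blocks read through the table
        // reader, not just to data blocks.
        ro.verify_checksums = false;
        std::string value;
        rocksdb::Status s = db->Get(ro, "some_key", &value);
        assert(s.ok() || s.IsNotFound());
        delete db;
        return 0;
      }
      ```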
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11208
      
      Test Plan:
      Functional:
      - Measured `-verify_checksum=false` applies to metadata blocks read outside of table open
        - setup command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56`
        - run command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=readrandom -use_existing_db=true -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56 -duration=10 -threads=32 -cache_size=131072 -statistics=true -verify_checksum=false -open_files=20 -cache_index_and_filter_blocks=true`
        - before: `rocksdb.block.checksum.compute.count COUNT : 384353`
        - after: `rocksdb.block.checksum.compute.count COUNT : 22`
      
      Performance:
      - Setup command (tmpfs, 128MB logical data size, cache indexes/filters without pinning so index/filter lookups go through table reader): `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=131072 -target_file_size_base=131072 -max_bytes_for_level_base=524288 -compression_type=none -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1`
      - Measured point lookup performance. Database is fully cached to emphasize any new callstack overheads
        - Command: `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=readrandom[-W1][-X20] -use_existing_db=true -cache_index_and_filter_blocks=true -disable_auto_compactions=true -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1 -duration=10 -cache_size=1048576000`
        - Before: `readrandom [AVG    20 runs] : 274848 (± 3717) ops/sec;    8.4 (± 0.1) MB/sec`
        - After: `readrandom [AVG    20 runs] : 277904 (± 4474) ops/sec;    8.5 (± 0.1) MB/sec`
      
      Reviewed By: hx235
      
      Differential Revision: D43145366
      
      Pulled By: ajkr
      
      fbshipit-source-id: 75ec062ece86a82cd788783de9de2c72df57f994
      b4573862
    • P
      Change default block cache from 8MB to 32MB (#11350) · 3c17930e
      Committed by Peter Dillinger
      Summary:
      ... which increases the default number of shards from 16 to 64. Although the default block cache size is only recommended for applications where RocksDB is not performance-critical, block cache mutex contention could become a performance bottleneck under stress conditions. This change of default should alleviate that.
      
      Note that reducing the size of cache shards (recommended minimum 512MB) could cause thrashing, e.g. on filter blocks, so the capacity needs to increase in order to safely increase the number of shards.
      
      The 8MB default dates back to 2011 or earlier (f779e7a5), when the most simultaneous threads you could get from a single CPU socket was about 20 (e.g. Intel Xeon E7-8870). Now more than 100 are available.
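      For reference, a sketch (not from the PR) of explicitly configuring a block cache roughly matching the new default, assuming the public API; the capacity and shard count shown are illustrative:
      
      ```
      #include <cassert>
      #include "rocksdb/cache.h"
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"
      
      int main() {
        rocksdb::BlockBasedTableOptions table_options;
        // Roughly the new default: 32MB capacity with num_shard_bits = 6,
        // i.e. 64 shards. Performance-critical applications should configure
        // a much larger capacity so each shard stays comfortably sized.
        table_options.block_cache =
            rocksdb::NewLRUCache(32 << 20 /* capacity */, 6 /* num_shard_bits */);
      
        rocksdb::Options options;
        options.create_if_missing = true;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
      
        rocksdb::DB* db = nullptr;
        assert(rocksdb::DB::Open(options, "/tmp/cache_db", &db).ok());
        delete db;
        return 0;
      }
      ```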
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11350
      
      Test Plan: unit tests updated
      
      Reviewed By: cbi42
      
      Differential Revision: D44674873
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 91ed3070789b42679283c7e6dc97c41a6a97bdf4
      3c17930e
  21. 31 March 2023, 1 commit
    • H
      Add `SetAllowStall()` (#11335) · 39c29372
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      - Allow runtime changes to whether `WriteBufferManager` allows stall or not by calling `SetAllowStall()` (see the sketch after this list)
      - Misc: some clean up - see PR conversation
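      A minimal sketch (not from the PR) of toggling the stall behavior at runtime, assuming the public `WriteBufferManager` API and hypothetical sizes/paths:
      
      ```
      #include <cassert>
      #include <memory>
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      #include "rocksdb/write_buffer_manager.h"
      
      int main() {
        // Cap total memtable memory at 64MB; start with stalling enabled.
        auto wbm = std::make_shared<rocksdb::WriteBufferManager>(
            64 << 20 /* buffer_size */, nullptr /* cache */, true /* allow_stall */);
      
        rocksdb::Options options;
        options.create_if_missing = true;
        options.write_buffer_manager = wbm;
      
        rocksdb::DB* db = nullptr;
        assert(rocksdb::DB::Open(options, "/tmp/wbm_db", &db).ok());
      
        // With this PR, the stall behavior can be changed at runtime, e.g. to
        // stop stalling writers during a latency-sensitive phase.
        wbm->SetAllowStall(false);
      
        delete db;
        return 0;
      }
      ```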
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11335
      
      Test Plan: - New UT
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D44502555
      
      Pulled By: hx235
      
      fbshipit-source-id: 24b5cc57df7734b11d42e4870c06c87b95312b5e
      39c29372
  22. 29 March 2023, 1 commit
    • H
      Add experimental PerfContext counters for db iterator Prev/Next/Seek* APIs (#11320) · c14eb134
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      Motivated by the user need to investigate db iterator behavior over an arbitrary time interval within a particular thread, we decided to collect and expose related counters in `PerfContext` as an experimental feature, in addition to the existing db-scope ones (i.e., tickers)
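      A minimal sketch (not from the PR) of collecting the per-thread counters over an interval, assuming the public PerfContext API; the new iterator counters show up in the `ToString()` dump:
      
      ```
      #include <cassert>
      #include <iostream>
      #include <memory>
      #include "rocksdb/db.h"
      #include "rocksdb/iterator.h"
      #include "rocksdb/options.h"
      #include "rocksdb/perf_context.h"
      #include "rocksdb/perf_level.h"
      
      int main() {
        rocksdb::DB* db = nullptr;
        rocksdb::Options options;
        options.create_if_missing = true;
        assert(rocksdb::DB::Open(options, "/tmp/perf_db", &db).ok());
      
        // Enable per-thread perf counters and reset them before the interval
        // of interest.
        rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableTimeExceptForMutex);
        rocksdb::get_perf_context()->Reset();
      
        std::unique_ptr<rocksdb::Iterator> it(
            db->NewIterator(rocksdb::ReadOptions()));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // Iterate; the DB iterator Seek*/Next/Prev counters accumulate here.
        }
      
        // Dump all counters for this thread, including the new experimental
        // iterator counters.
        std::cout << rocksdb::get_perf_context()->ToString() << std::endl;
      
        rocksdb::SetPerfLevel(rocksdb::PerfLevel::kDisable);
        delete db;
        return 0;
      }
      ```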
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11320
      
      Test Plan:
      - new UT
      - db bench
      
      Setup
      ```
      ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -compression_type=none -bloom_bits=3
      ```
      Test till converges
      ```
      ./db_bench -seed=1679526311157283 -use_existing_db=1 -perf_level=2 -db=/dev/shm/testdb/ -benchmarks="seekrandom[-X60]"
      ```
      pre-change
      `seekrandom [AVG 33 runs] : 7545 (± 100) ops/sec`
      post-change (no regression)
      `seekrandom [AVG 33 runs] : 7688 (± 67) ops/sec`
      
      Reviewed By: cbi42
      
      Differential Revision: D44321931
      
      Pulled By: hx235
      
      fbshipit-source-id: f98a254ba3e3ced95eb5928884e33f1b99dca401
      c14eb134
  23. 28 March 2023, 2 commits
  24. 23 March 2023, 1 commit
  25. 19 March 2023, 2 commits
    • L
      Updates for the 8.1 release (HISTORY, version.h, compatibility tests) (#11307) · 87de4fee
      Committed by Levi Tamasi
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11307
      
      Reviewed By: hx235
      
      Differential Revision: D44196571
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 52489d6f8bd3c79cd33c87e9e1f719ea5e8bd382
      87de4fee
    • H
      New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) · cb584771
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      Users are interested in figuring out what has caused write stall.
      - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains counts of the different combinations of (CF-scope `WriteStallCause`) + (`WriteStallCondition`)
      - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager
      - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped`
      
      - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as maps (see the sketch after this list)
      - Misc: rename/comment some write stall InternalStats for clarity
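      A minimal sketch (not from the PR) of reading the two new properties as maps, assuming the public `DB::GetMapProperty()` API; typed key accessors via `WriteStallStatsMapKeys` are not shown:
      
      ```
      #include <cassert>
      #include <iostream>
      #include <map>
      #include <string>
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      
      int main() {
        rocksdb::DB* db = nullptr;
        rocksdb::Options options;
        options.create_if_missing = true;
        assert(rocksdb::DB::Open(options, "/tmp/stall_db", &db).ok());
      
        std::map<std::string, std::string> cf_stats, db_stats;
        // Per-CF stats: counts keyed by (WriteStallCause, WriteStallCondition).
        db->GetMapProperty("rocksdb.cf-write-stall-stats", &cf_stats);
        // DB-scope stats, e.g. stalls caused by the write buffer manager limit.
        db->GetMapProperty("rocksdb.db-write-stall-stats", &db_stats);
      
        for (const auto& kv : cf_stats) {
          std::cout << kv.first << " = " << kv.second << std::endl;
        }
        delete db;
        return 0;
      }
      ```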
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300
      
      Test Plan:
      - New UT
      - Stress test
      `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1`
      - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time.
      ```
      ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3
      ```
      pre-change:
      ```
      fillseq [AVG 15 runs] : 1176 (± 732) ops/sec;    0.6 (± 0.4) MB/sec
      fillseq      :    1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations;    0.5 MB/s
      fillseq [AVG 16 runs] : 1162 (± 685) ops/sec;    0.6 (± 0.4) MB/sec
      fillseq      :    1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 17 runs] : 1136 (± 646) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 18 runs] : 1118 (± 610) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 19 runs] : 1100 (± 578) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations;    0.3 MB/s
      fillseq [AVG 20 runs] : 1071 (± 551) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 21 runs] : 1059 (± 525) ops/sec;    0.5 (± 0.3) MB/sec
      ```
      post-change:
      ```
      fillseq [AVG 15 runs] : 1226 (± 732) ops/sec;    0.6 (± 0.4) MB/sec
      fillseq      :    1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 16 runs] : 1196 (± 687) ops/sec;    0.6 (± 0.4) MB/sec
      fillseq      :    1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 17 runs] : 1174 (± 647) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 18 runs] : 1156 (± 611) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 19 runs] : 1134 (± 579) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 20 runs] : 1119 (± 550) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 21 runs] : 1106 (± 524) ops/sec;    0.6 (± 0.3) MB/sec
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D44159541
      
      Pulled By: hx235
      
      fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
      cb584771
  26. 18 March 2023, 2 commits
    • P
      HyperClockCache support for SecondaryCache, with refactoring (#11301) · 204fcff7
      Committed by Peter Dillinger
      Summary:
      Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.
      
      This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.
      
      It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.
      
      ## cache.h (public API)
      Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.
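      A hedged sketch (not from the PR) of what this enables, assuming the public `HyperClockCacheOptions` / `NewCompressedSecondaryCache` APIs; the capacities and entry charge are illustrative:
      
      ```
      #include <cassert>
      #include <memory>
      #include "rocksdb/cache.h"
      #include "rocksdb/secondary_cache.h"
      
      int main() {
        // An in-memory compressed secondary cache behind the primary cache.
        rocksdb::CompressedSecondaryCacheOptions sec_opts;
        sec_opts.capacity = 64 << 20;
        std::shared_ptr<rocksdb::SecondaryCache> sec_cache =
            rocksdb::NewCompressedSecondaryCache(sec_opts);
      
        // With secondary_cache now in ShardedCacheOptions, it can be set on
        // HyperClockCacheOptions as well as LRUCacheOptions.
        rocksdb::HyperClockCacheOptions hcc_opts(
            1 << 30 /* capacity */, 8 * 1024 /* estimated_entry_charge */);
        hcc_opts.secondary_cache = sec_cache;
        std::shared_ptr<rocksdb::Cache> cache = hcc_opts.MakeSharedCache();
        assert(cache != nullptr);
        return 0;
      }
      ```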
      
      ## advanced_cache.h (advanced public API)
      * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
      * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.
      
      These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.
      
      * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)
      
      I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.
      
      ## cacheable_entry.h
      A couple of functions are obsolete because Cache::Handle can no longer be pending.
      
      ## cache.cc
      Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.
      
      ## secondary_cache_adapter.{h,cc}
      The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.
      
      ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
      Simply updated for Cache API changes.
      
      ## lru_cache.{h,cc}
      Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.
      
      ## clock_cache.{h,cc}
      Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.
      
      ## block_based_table_reader*
      Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.
      
      Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).
      
      Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.
      
      ## Intended follow-up work
      * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
      * Stacked secondary caches (see above discussion)
      * See if we can make up for the small MultiGet performance regression.
      * Study more performance with SecondaryCache
      * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
      * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
      * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301
      
      Test Plan:
      ## Unit tests
      Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.
      
      ## Crash/stress test
      Updated to use the new combination.
      
      ## Performance
      First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.
      
      ```
      (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
      ```
      
      **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
      HyperClockCache: 3168662 (average parallel ops/sec)
      LRUCache: 2940127
      
      **After** this and #11299, running for about an hour:
      HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
      LRUCache: 2940928 (0.03% faster)
      
      This is an acceptable difference IMHO.
      
      Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.
      
      Create DB and test (before and after tests running simultaneously):
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
      TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
      ```
      
      **Before**:
      multireadrandom [AVG    30 runs] : 3444202 (± 57049) ops/sec;  240.9 (± 4.0) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec;  245.8 MB/sec
      **After**:
      multireadrandom [AVG    30 runs] : 3291022 (± 58851) ops/sec;  230.2 (± 4.1) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec;  235.4 MB/sec
      
      So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:
      
      **Before**:
      multireadrandom [AVG    30 runs] : 3933777 (± 41840) ops/sec;  275.1 (± 2.9) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec;  277.7 MB/sec
      **After**:
      multireadrandom [AVG    30 runs] : 3755338 (± 30391) ops/sec;  262.6 (± 2.1) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec;  264.8 MB/sec
      
      Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.
      
      Let's also look at Get() in db_bench:
      
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
      ```
      
      **Before**:
      readrandom [AVG    30 runs] : 2198685 (± 13412) ops/sec;  153.8 (± 0.9) MB/sec
      readrandom [MEDIAN 30 runs] : 2209498 ops/sec;  154.5 MB/sec
      **After**:
      readrandom [AVG    30 runs] : 2292814 (± 43508) ops/sec;  160.3 (± 3.0) MB/sec
      readrandom [MEDIAN 30 runs] : 2365181 ops/sec;  165.4 MB/sec
      
      That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:
      
      **Before**:
      readrandom [AVG    30 runs] : 2272333 (± 9992) ops/sec;  158.9 (± 0.7) MB/sec
      readrandom [MEDIAN 30 runs] : 2273239 ops/sec;  159.0 MB/sec
      **After**:
      readrandom [AVG    30 runs] : 2332407 (± 11252) ops/sec;  163.1 (± 0.8) MB/sec
      readrandom [MEDIAN 30 runs] : 2335329 ops/sec;  163.3 MB/sec
      
      Reviewed By: ltamasi
      
      Differential Revision: D44177044
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
      204fcff7
    • A
      Ignore async_io ReadOption if FileSystem doesn't support it (#11296) · eac6b6d0
      Committed by anand76
      Summary:
      In PosixFileSystem, IO uring support is opt-in. If the support is not enabled by the user, then ignore the async_io ReadOption in MultiGet and iteration at the top level, rather than following the async_io codepath and transparently switching to sync IO at the FileSystem layer.
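      A minimal sketch (not from the PR) of an iterator scan requesting async IO, assuming the public API and a hypothetical DB path:
      
      ```
      #include <cassert>
      #include <memory>
      #include "rocksdb/db.h"
      #include "rocksdb/iterator.h"
      #include "rocksdb/options.h"
      
      int main() {
        rocksdb::DB* db = nullptr;
        rocksdb::Options options;
        options.create_if_missing = true;
        assert(rocksdb::DB::Open(options, "/tmp/async_db", &db).ok());
      
        rocksdb::ReadOptions ro;
        // Request async IO for scans/MultiGet. With this PR, if the FileSystem
        // (e.g. PosixFileSystem without io_uring support enabled) cannot honor
        // it, the option is ignored up front rather than transparently falling
        // back to sync IO deep inside the FileSystem layer.
        ro.async_io = true;
      
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // scan...
        }
        delete db;
        return 0;
      }
      ```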
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11296
      
      Test Plan: Add new unit tests
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D44045776
      
      Pulled By: anand1976
      
      fbshipit-source-id: a0881bf763ca2fde50b84063d0068bb521edd8b9
      eac6b6d0
  27. 16 March 2023, 1 commit
    • H
      Add new stat rocksdb.table.open.prefetch.tail.read.bytes, rocksdb.table.open.prefetch.tail.{miss|hit} (#11265) · bab5f9a6
      Committed by Hui Xiao
      
      Summary:
      **Context/Summary:**
      We are adding new stats to measure the size of the prefetched tail and the hit/miss behavior of lookups into this buffer
      
      The stat collection is done in FilePrefetchBuffer, but for now only for the prefetched tail buffer during table open, using a FilePrefetchBuffer enum. This is cleaner than the alternative of implementing it in the upper-level call sites of FilePrefetchBuffer for table open. It also has the benefit of being extensible to other types of FilePrefetchBuffer if needed. See the db_bench results below for the perf regression concern.
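      A minimal sketch (not from the PR) of surfacing the new stats in application code, assuming the public Statistics API and a hypothetical DB path:
      
      ```
      #include <cassert>
      #include <iostream>
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      #include "rocksdb/statistics.h"
      
      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Enable statistics so the new table-open prefetch tail stats are
        // collected.
        options.statistics = rocksdb::CreateDBStatistics();
      
        rocksdb::DB* db = nullptr;
        assert(rocksdb::DB::Open(options, "/tmp/stats_db", &db).ok());
      
        // After opening/reading SST files, dump all stats; the new entries
        // appear as rocksdb.table.open.prefetch.tail.read.bytes (histogram)
        // and rocksdb.table.open.prefetch.tail.{miss|hit} (tickers).
        std::cout << options.statistics->ToString() << std::endl;
        delete db;
        return 0;
      }
      ```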
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11265
      
      Test Plan:
      **- Piggyback on existing test**
      **- rocksdb.table.open.prefetch.tail.miss is harder to unit test, so I manually set the prefetch tail read size to be small and ran db_bench.**
      ```
      ./db_bench -db=/tmp/testdb -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3  -use_direct_reads=true
      ```
      ```
      rocksdb.table.open.prefetch.tail.read.bytes P50 : 4096.000000 P95 : 4096.000000 P99 : 4096.000000 P100 : 4096.000000 COUNT : 225 SUM : 921600
      rocksdb.table.open.prefetch.tail.miss COUNT : 91
      rocksdb.table.open.prefetch.tail.hit COUNT : 1034
      ```
      **- No perf regression observed in db_bench**
      
      SETUP command: create same db with ~900 files for pre-change/post-change.
      ```
      ./db_bench -db=/tmp/testdb -benchmarks="fillseq" -key_size=32 -value_size=512 -num=500000 -write_buffer_size=655360  -disable_auto_compactions=true -target_file_size_base=16777216 -compression_type=none
      ```
      TEST command (60 runs or until convergence): as suggested by anand1976 and akankshamahajan15, vary `seek_nexts` and `async_io` in testing.
      ```
      ./db_bench -use_existing_db=true -db=/tmp/testdb -statistics=false -cache_size=0 -cache_index_and_filter_blocks=false -benchmarks=seekrandom[-X60] -num=50000 -seek_nexts={10, 500, 1000} -async_io={0|1} -use_direct_reads=true
      ```
      async io = 0, direct io read = true
      
        | seek_nexts = 10, 30 runs | seek_nexts = 500, 12 runs | seek_nexts = 1000, 6 runs
      -- | -- | -- | --
      pre-change | 4776 (± 28) ops/sec;   24.8 (± 0.1) MB/sec | 288 (± 1) ops/sec;   74.8 (± 0.4) MB/sec | 145 (± 4) ops/sec;   75.6 (± 2.2) MB/sec
      post-change | 4790 (± 32) ops/sec;   24.9 (± 0.2) MB/sec | 288 (± 3) ops/sec;   74.7 (± 0.8) MB/sec | 143 (± 3) ops/sec;   74.5 (± 1.6) MB/sec
      
      async io = 1, direct io read = true
        | seek_nexts = 10, 54 runs | seek_nexts = 500, 6 runs | seek_nexts = 1000, 4 runs
      -- | -- | -- | --
      pre-change | 3350 (± 36) ops/sec;   17.4 (± 0.2) MB/sec | 264 (± 0) ops/sec;   68.7 (± 0.2) MB/sec | 138 (± 1) ops/sec;   71.8 (± 1.0) MB/sec
      post-change | 3358 (± 27) ops/sec;   17.4 (± 0.1) MB/sec  | 263 (± 2) ops/sec;   68.3 (± 0.8) MB/sec | 139 (± 1) ops/sec;   72.6 (± 0.6) MB/sec
      
      Reviewed By: ajkr
      
      Differential Revision: D43781467
      
      Pulled By: hx235
      
      fbshipit-source-id: a706a18472a8edb2b952bac3af40eec803537f2a
      bab5f9a6