1. 06 Apr 2023, 1 commit
    • Ensure VerifyFileChecksums reads don't exceed readahead_size (#11328) · 0623c5b9
      Committed by anand76
      Summary:
      VerifyFileChecksums currently reads a payload of readahead_size for the checksum calculation plus a prefetch of an additional readahead_size, so each read is effectively readahead_size * 2. This change treats readahead_size as the chunk size for checksum calculation, so no read exceeds it.
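      The fixed read pattern can be sketched as follows (a minimal Python illustration with hypothetical names, not RocksDB's actual implementation): the file is consumed in chunks of `readahead_size`, so no single read exceeds `readahead_size`.

      ```python
      import zlib

      def verify_file_checksum(data: bytes, readahead_size: int) -> int:
          """Compute a running CRC32 over `data` in readahead_size chunks.

          Old behavior (conceptually): read readahead_size of payload for the
          checksum PLUS prefetch another readahead_size, i.e. ~2x per read.
          Fixed behavior: each read is at most one readahead_size chunk.
          """
          checksum = 0
          offset = 0
          while offset < len(data):
              chunk = data[offset:offset + readahead_size]  # one bounded read
              assert len(chunk) <= readahead_size
              checksum = zlib.crc32(chunk, checksum)
              offset += len(chunk)
          return checksum
      ```

      Chunked computation yields the same checksum as a single whole-file read, since CRC32 composes over concatenated inputs.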
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11328
      
      Test Plan: Add a unit test
      
      Reviewed By: pdillinger
      
      Differential Revision: D44718781
      
      Pulled By: anand1976
      
      fbshipit-source-id: 79bae1ebaa27de2a13bc86f5910bf09356936e63
  2. 05 Apr 2023, 2 commits
    • Use user-provided ReadOptions for metadata block reads more often (#11208) · b4573862
      Committed by Andrew Kryczka
      Summary:
      This is mostly taken from https://github.com/facebook/rocksdb/issues/10427 with my own comments addressed. This PR plumbs the user’s `ReadOptions` down to `GetOrReadIndexBlock()`, `GetOrReadFilterBlock()`, and `GetFilterPartitionBlock()`. Now those functions no longer have to make up a `ReadOptions` with incomplete information.
      
      I also let `PartitionIndexReader::NewIterator()` pass through its caller's `ReadOptions::verify_checksums`, which was inexplicably dropped previously.
      
      Fixes https://github.com/facebook/rocksdb/issues/10463
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11208
      
      Test Plan:
      Functional:
      - Verified that `-verify_checksum=false` applies to metadata blocks read outside of table open
        - setup command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56`
        - run command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=readrandom -use_existing_db=true -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56 -duration=10 -threads=32 -cache_size=131072 -statistics=true -verify_checksum=false -open_files=20 -cache_index_and_filter_blocks=true`
        - before: `rocksdb.block.checksum.compute.count COUNT : 384353`
        - after: `rocksdb.block.checksum.compute.count COUNT : 22`
      
      Performance:
      - Setup command (tmpfs, 128MB logical data size, cache indexes/filters without pinning so index/filter lookups go through table reader): `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=131072 -target_file_size_base=131072 -max_bytes_for_level_base=524288 -compression_type=none -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1`
      - Measured point lookup performance. Database is fully cached to emphasize any new callstack overheads
        - Command: `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=readrandom[-W1][-X20] -use_existing_db=true -cache_index_and_filter_blocks=true -disable_auto_compactions=true -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1 -duration=10 -cache_size=1048576000`
        - Before: `readrandom [AVG    20 runs] : 274848 (± 3717) ops/sec;    8.4 (± 0.1) MB/sec`
        - After: `readrandom [AVG    20 runs] : 277904 (± 4474) ops/sec;    8.5 (± 0.1) MB/sec`
      
      Reviewed By: hx235
      
      Differential Revision: D43145366
      
      Pulled By: ajkr
      
      fbshipit-source-id: 75ec062ece86a82cd788783de9de2c72df57f994
    • Change default block cache from 8MB to 32MB (#11350) · 3c17930e
      Committed by Peter Dillinger
      Summary:
      ... which increases the default number of shards from 16 to 64. Although the default block cache size is only recommended for applications where RocksDB is not performance-critical, under stress conditions block cache mutex contention could become a performance bottleneck. This change of default should alleviate that.
      
      Note that reducing the size of cache shards (recommended minimum 512MB) could cause thrashing, e.g. on filter blocks, so capacity needs to increase to safely increase number of shards.
      
      The 8MB default dates back to 2011 or earlier (f779e7a5), when the most simultaneous threads you could get from a single CPU socket was 20 (e.g. Intel Xeon E7-8870). Now more than 100 is available.
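      For intuition on why more shards reduce mutex contention, here is a minimal sketch of a sharded cache (Python, hypothetical names; RocksDB's actual implementation differs): the key's hash selects one of 2^num_shard_bits shards, each with its own lock, so concurrent threads only contend when they land on the same shard.

      ```python
      import threading

      class ShardedCache:
          """Illustrative sharded cache, not RocksDB's code.

          num_shard_bits=4 -> 16 shards (old default), 6 -> 64 shards (new
          default). Each shard has its own mutex, so threads hitting
          different shards never contend on the same lock.
          """
          def __init__(self, capacity: int, num_shard_bits: int):
              self.num_shards = 1 << num_shard_bits
              self.per_shard_capacity = capacity // self.num_shards
              self.shards = [{} for _ in range(self.num_shards)]
              self.locks = [threading.Lock() for _ in range(self.num_shards)]

          def _shard(self, key: str) -> int:
              return hash(key) & (self.num_shards - 1)  # low bits pick the shard

          def insert(self, key, value):
              i = self._shard(key)
              with self.locks[i]:
                  self.shards[i][key] = value

          def lookup(self, key):
              i = self._shard(key)
              with self.locks[i]:
                  return self.shards[i].get(key)
      ```

      Note how the per-shard capacity shrinks as the shard count grows, which is why the total capacity had to increase to safely increase the number of shards.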
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11350
      
      Test Plan: unit tests updated
      
      Reviewed By: cbi42
      
      Differential Revision: D44674873
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 91ed3070789b42679283c7e6dc97c41a6a97bdf4
  3. 31 Mar 2023, 1 commit
    • Add `SetAllowStall()` (#11335) · 39c29372
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      - Allow runtime changes to whether `WriteBufferManager` allows stall or not by calling `SetAllowStall()`
      - Misc: some clean up - see PR conversation
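      Conceptually, the runtime toggle amounts to a thread-safe flag consulted on the write path. A minimal sketch (Python, hypothetical names; not the actual `WriteBufferManager` code):

      ```python
      import threading

      class WriteBufferManagerSketch:
          """Illustrative runtime-togglable stall flag (not RocksDB's code)."""
          def __init__(self, buffer_size: int, allow_stall: bool = False):
              self.buffer_size = buffer_size
              self.memory_used = 0
              self._allow_stall = allow_stall
              self._lock = threading.Lock()  # C++ would likely use an atomic

          def set_allow_stall(self, allow: bool) -> None:
              with self._lock:
                  self._allow_stall = allow
              # if stalls were just disabled, a real implementation would
              # also need to wake any currently stalled writers

          def should_stall(self) -> bool:
              with self._lock:
                  return self._allow_stall and self.memory_used >= self.buffer_size
      ```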
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11335
      
      Test Plan: - New UT
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D44502555
      
      Pulled By: hx235
      
      fbshipit-source-id: 24b5cc57df7734b11d42e4870c06c87b95312b5e
  4. 29 Mar 2023, 1 commit
    • Add experimental PerfContext counters for db iterator Prev/Next/Seek* APIs (#11320) · c14eb134
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      Motivated by users' need to investigate db iterator behavior during an interval of any length in a given thread, we decided to collect and expose related counters in `PerfContext` as an experimental feature, in addition to the existing db-scope ones (i.e., tickers)
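      The distinction between thread-local `PerfContext` counters and db-scope tickers can be sketched as follows (Python, hypothetical names, purely illustrative):

      ```python
      import threading

      class PerfContextSketch(threading.local):
          """Per-thread counters, mirroring the idea of PerfContext
          (thread-local) as opposed to tickers (db-scope)."""
          def __init__(self):
              self.iter_next_count = 0
              self.iter_prev_count = 0
              self.iter_seek_count = 0

      perf_context = PerfContextSketch()  # each thread sees its own counters

      class CountingIterator:
          """Toy iterator that bumps the calling thread's counters."""
          def __init__(self, keys):
              self.keys = sorted(keys)
              self.pos = 0

          def seek(self, key):
              perf_context.iter_seek_count += 1
              self.pos = next(
                  (i for i, k in enumerate(self.keys) if k >= key),
                  len(self.keys))

          def next(self):
              perf_context.iter_next_count += 1
              self.pos += 1

          def prev(self):
              perf_context.iter_prev_count += 1
              self.pos -= 1
      ```

      Because the counters live in thread-local storage, one thread can measure its own iterator activity over any interval without interference from other threads.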
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11320
      
      Test Plan:
      - new UT
      - db bench
      
      Setup
      ```
      ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -compression_type=none -bloom_bits=3
      ```
      Test until convergence
      ```
      ./db_bench -seed=1679526311157283 -use_existing_db=1 -perf_level=2 -db=/dev/shm/testdb/ -benchmarks="seekrandom[-X60]"
      ```
      pre-change
      `seekrandom [AVG 33 runs] : 7545 (± 100) ops/sec`
      post-change (no regression)
      `seekrandom [AVG 33 runs] : 7688 (± 67) ops/sec`
      
      Reviewed By: cbi42
      
      Differential Revision: D44321931
      
      Pulled By: hx235
      
      fbshipit-source-id: f98a254ba3e3ced95eb5928884e33f1b99dca401
  5. 28 Mar 2023, 2 commits
  6. 23 Mar 2023, 1 commit
  7. 19 Mar 2023, 2 commits
    • Updates for the 8.1 release (HISTORY, version.h, compatibility tests) (#11307) · 87de4fee
      Committed by Levi Tamasi
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11307
      
      Reviewed By: hx235
      
      Differential Revision: D44196571
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 52489d6f8bd3c79cd33c87e9e1f719ea5e8bd382
    • New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) · cb584771
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      Users are interested in figuring out what has caused a write stall.
      - Refactor write stall related stats from property `kCFStats` into their own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains counts of the different combinations of (CF-scope `WriteStallCause`) + (`WriteStallCondition`)
      - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager
      - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped`
      
      - Expose functions in the new class `WriteStallStatsMapKeys` for examining the above two properties when returned as a map
      - Misc: rename/comment some write stall InternalStats for clarity
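      The shape of the new stats, a count per (cause, condition) combination exposed under map keys, can be sketched like this (Python; the enum values and key format are assumptions, not the exact RocksDB strings):

      ```python
      from collections import Counter
      from enum import Enum

      class WriteStallCause(Enum):  # illustrative subset of causes
          MEMTABLE_LIMIT = "memtable-limit"
          L0_FILE_COUNT_LIMIT = "l0-file-count-limit"
          WRITE_BUFFER_MANAGER_LIMIT = "write-buffer-manager-limit"

      class WriteStallCondition(Enum):
          DELAYED = "delays"
          STOPPED = "stops"

      stall_counts = Counter()

      def record_stall(cause: WriteStallCause, condition: WriteStallCondition):
          stall_counts[(cause, condition)] += 1

      def write_stall_stats_map():
          """Return stats keyed like 'cause.condition.count' (key format is
          assumed here; real keys come from WriteStallStatsMapKeys)."""
          return {f"{c.value}.{cond.value}.count": n
                  for (c, cond), n in stall_counts.items()}
      ```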
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300
      
      Test Plan:
      - New UT
      - Stress test
      `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1`
      - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time.
      ```
      ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3
      ```
      pre-change:
      ```
      fillseq [AVG 15 runs] : 1176 (± 732) ops/sec;    0.6 (± 0.4) MB/sec
      fillseq      :    1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations;    0.5 MB/s
      fillseq [AVG 16 runs] : 1162 (± 685) ops/sec;    0.6 (± 0.4) MB/sec
      fillseq      :    1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 17 runs] : 1136 (± 646) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 18 runs] : 1118 (± 610) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 19 runs] : 1100 (± 578) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations;    0.3 MB/s
      fillseq [AVG 20 runs] : 1071 (± 551) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 21 runs] : 1059 (± 525) ops/sec;    0.5 (± 0.3) MB/sec
      ```
      post-change:
      ```
      fillseq [AVG 15 runs] : 1226 (± 732) ops/sec;    0.6 (± 0.4) MB/sec
      fillseq      :    1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 16 runs] : 1196 (± 687) ops/sec;    0.6 (± 0.4) MB/sec
      fillseq      :    1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 17 runs] : 1174 (± 647) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 18 runs] : 1156 (± 611) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 19 runs] : 1134 (± 579) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 20 runs] : 1119 (± 550) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 21 runs] : 1106 (± 524) ops/sec;    0.6 (± 0.3) MB/sec
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D44159541
      
      Pulled By: hx235
      
      fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
  8. 18 Mar 2023, 2 commits
    • HyperClockCache support for SecondaryCache, with refactoring (#11301) · 204fcff7
      Committed by Peter Dillinger
      Summary:
      Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.
      
      This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.
      
      It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.
      
      ## cache.h (public API)
      Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.
      
      ## advanced_cache.h (advanced public API)
      * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
      * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.
      
      These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.
      
      * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)
      
      I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.
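      The lookup-handle flow described above can be sketched as follows (Python, hypothetical names; RocksDB's real `AsyncLookupHandle` is a fixed C++ struct): default-constructed handles look like completed cache misses, and `WaitAll()` ignores them.

      ```python
      from dataclasses import dataclass

      @dataclass
      class AsyncLookupHandle:
          """Sketch: a default-constructed handle is a completed miss."""
          key: str = ""
          pending: bool = False
          result: object = None

      class CacheSketch:
          def __init__(self):
              self.table = {}

          def start_async_lookup(self, handle: AsyncLookupHandle):
              handle.pending = True  # a real cache might hand off to a
                                     # secondary cache here

          def wait_all(self, handles):
              # accepts any contiguous sequence of handles; non-pending
              # handles (including default-constructed ones) are no-ops
              for h in handles:
                  if h.pending:
                      h.result = self.table.get(h.key)
                      h.pending = False
      ```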
      
      ## cacheable_entry.h
      A couple of functions are obsolete because Cache::Handle can no longer be pending.
      
      ## cache.cc
      Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.
      
      ## secondary_cache_adapter.{h,cc}
      The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.
      
      ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
      Simply updated for Cache API changes.
      
      ## lru_cache.{h,cc}
      Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.
      
      ## clock_cache.{h,cc}
      Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.
      
      ## block_based_table_reader*
      Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.
      
      Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).
      
      Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.
      
      ## Intended follow-up work
      * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
      * Stacked secondary caches (see above discussion)
      * See if we can make up for the small MultiGet performance regression.
      * Study more performance with SecondaryCache
      * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
      * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
      * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301
      
      Test Plan:
      ## Unit tests
      Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.
      
      ## Crash/stress test
      Updated to use the new combination.
      
      ## Performance
      First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.
      
      ```
      (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
      ```
      
      **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
      HyperClockCache: 3168662 (average parallel ops/sec)
      LRUCache: 2940127
      
      **After** this and #11299, running for about an hour:
      HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
      LRUCache: 2940928 (0.03% faster)
      
      This is an acceptable difference IMHO.
      
      Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.
      
      Create DB and test (before and after tests running simultaneously):
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
      TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
      ```
      
      **Before**:
      multireadrandom [AVG    30 runs] : 3444202 (± 57049) ops/sec;  240.9 (± 4.0) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec;  245.8 MB/sec
      **After**:
      multireadrandom [AVG    30 runs] : 3291022 (± 58851) ops/sec;  230.2 (± 4.1) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec;  235.4 MB/sec
      
      So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:
      
      **Before**:
      multireadrandom [AVG    30 runs] : 3933777 (± 41840) ops/sec;  275.1 (± 2.9) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec;  277.7 MB/sec
      **After**:
      multireadrandom [AVG    30 runs] : 3755338 (± 30391) ops/sec;  262.6 (± 2.1) MB/sec
      multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec;  264.8 MB/sec
      
      Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.
      
      Let's also look at Get() in db_bench:
      
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
      ```
      
      **Before**:
      readrandom [AVG    30 runs] : 2198685 (± 13412) ops/sec;  153.8 (± 0.9) MB/sec
      readrandom [MEDIAN 30 runs] : 2209498 ops/sec;  154.5 MB/sec
      **After**:
      readrandom [AVG    30 runs] : 2292814 (± 43508) ops/sec;  160.3 (± 3.0) MB/sec
      readrandom [MEDIAN 30 runs] : 2365181 ops/sec;  165.4 MB/sec
      
      That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:
      
      **Before**:
      readrandom [AVG    30 runs] : 2272333 (± 9992) ops/sec;  158.9 (± 0.7) MB/sec
      readrandom [MEDIAN 30 runs] : 2273239 ops/sec;  159.0 MB/sec
      **After**:
      readrandom [AVG    30 runs] : 2332407 (± 11252) ops/sec;  163.1 (± 0.8) MB/sec
      readrandom [MEDIAN 30 runs] : 2335329 ops/sec;  163.3 MB/sec
      
      Reviewed By: ltamasi
      
      Differential Revision: D44177044
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
    • Ignore async_io ReadOption if FileSystem doesn't support it (#11296) · eac6b6d0
      Committed by anand76
      Summary:
      In PosixFileSystem, IO uring support is opt-in. If the support is not enabled by the user, then ignore the async_io ReadOption in MultiGet and iteration at the top, rather than follow the async_io codepath and transparently switch to sync IO at the FileSystem layer.
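      The decision can be sketched in a few lines (Python, hypothetical names): the async path is taken only when both the `ReadOptions` ask for it and the `FileSystem` reports support.

      ```python
      def multi_get(keys, read_options_async_io: bool, fs_supports_async_io: bool):
          """Sketch: honor async_io only if the FileSystem supports it (e.g.
          io_uring enabled); otherwise choose the sync codepath at the top
          instead of transparently falling back at the FileSystem layer."""
          use_async = read_options_async_io and fs_supports_async_io
          path = "async" if use_async else "sync"
          # hypothetical in-memory lookup standing in for the real reads
          values = [f"value-for-{k}" for k in keys]
          return path, values
      ```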
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11296
      
      Test Plan: Add new unit tests
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D44045776
      
      Pulled By: anand1976
      
      fbshipit-source-id: a0881bf763ca2fde50b84063d0068bb521edd8b9
  9. 16 Mar 2023, 1 commit
    • Add new stat rocksdb.table.open.prefetch.tail.read.bytes, rocksdb.table.open.prefetch.tail.{miss|hit} (#11265) · bab5f9a6
      Committed by Hui Xiao
      
      Summary:
      **Context/Summary:**
      We are adding new stats to measure the prefetched tail size and lookups into this buffer.
      
      The stat collection is done in FilePrefetchBuffer, but for now only for the tail buffer prefetched during table open, distinguished via a FilePrefetchBuffer enum. This is cleaner than the alternative of implementing it in the upper-level call sites of FilePrefetchBuffer for table open, and has the benefit of being extensible to other types of FilePrefetchBuffer if needed. See db bench for the perf regression concern.
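      A minimal sketch of the three stats (Python, hypothetical names; the real collection happens inside `FilePrefetchBuffer`):

      ```python
      class TailPrefetchBufferSketch:
          """Illustrative prefetch buffer that records the new tail stats."""
          def __init__(self, file_data: bytes, tail_size: int):
              self.data = file_data
              self.tail_start = max(0, len(file_data) - tail_size)
              self.buffer = file_data[self.tail_start:]
              # rocksdb.table.open.prefetch.tail.read.bytes
              self.tail_read_bytes = len(self.buffer)
              self.tail_hit = 0    # rocksdb.table.open.prefetch.tail.hit
              self.tail_miss = 0   # rocksdb.table.open.prefetch.tail.miss

          def read(self, offset: int, n: int) -> bytes:
              if offset >= self.tail_start and offset + n <= len(self.data):
                  self.tail_hit += 1  # served from the prefetched tail
                  rel = offset - self.tail_start
                  return self.buffer[rel:rel + n]
              self.tail_miss += 1     # would fall back to a real file read
              return self.data[offset:offset + n]
      ```

      A small prefetched tail makes misses easy to provoke, which mirrors the manual test described below.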
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11265
      
      Test Plan:
      **- Piggyback on existing test**
      **- rocksdb.table.open.prefetch.tail.miss is harder to UT so I manually set prefetch tail read bytes to be small and run db bench.**
      ```
      ./db_bench -db=/tmp/testdb -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3  -use_direct_reads=true
      ```
      ```
      rocksdb.table.open.prefetch.tail.read.bytes P50 : 4096.000000 P95 : 4096.000000 P99 : 4096.000000 P100 : 4096.000000 COUNT : 225 SUM : 921600
      rocksdb.table.open.prefetch.tail.miss COUNT : 91
      rocksdb.table.open.prefetch.tail.hit COUNT : 1034
      ```
      **- No perf regression observed in db_bench**
      
      SETUP command: create the same db with ~900 files for pre-change/post-change.
      ```
      ./db_bench -db=/tmp/testdb -benchmarks="fillseq" -key_size=32 -value_size=512 -num=500000 -write_buffer_size=655360  -disable_auto_compactions=true -target_file_size_base=16777216 -compression_type=none
      ```
      TEST command, 60 runs or until convergence: as suggested by anand1976 and akankshamahajan15, vary `seek_nexts` and `async_io` in testing.
      ```
      ./db_bench -use_existing_db=true -db=/tmp/testdb -statistics=false -cache_size=0 -cache_index_and_filter_blocks=false -benchmarks=seekrandom[-X60] -num=50000 -seek_nexts={10, 500, 1000} -async_io={0|1} -use_direct_reads=true
      ```
      async io = 0, direct io read = true
      
        | seek_nexts = 10, 30 runs | seek_nexts = 500, 12 runs | seek_nexts = 1000, 6 runs
      -- | -- | -- | --
      pre-change | 4776 (± 28) ops/sec;   24.8 (± 0.1) MB/sec | 288 (± 1) ops/sec;   74.8 (± 0.4) MB/sec | 145 (± 4) ops/sec;   75.6 (± 2.2) MB/sec
      post-change | 4790 (± 32) ops/sec;   24.9 (± 0.2) MB/sec | 288 (± 3) ops/sec;   74.7 (± 0.8) MB/sec | 143 (± 3) ops/sec;   74.5 (± 1.6) MB/sec
      
      async io = 1, direct io read = true
        | seek_nexts = 10, 54 runs | seek_nexts = 500, 6 runs | seek_nexts = 1000, 4 runs
      -- | -- | -- | --
      pre-change | 3350 (± 36) ops/sec;   17.4 (± 0.2) MB/sec | 264 (± 0) ops/sec;   68.7 (± 0.2) MB/sec | 138 (± 1) ops/sec;   71.8 (± 1.0) MB/sec
      post-change | 3358 (± 27) ops/sec;   17.4 (± 0.1) MB/sec  | 263 (± 2) ops/sec;   68.3 (± 0.8) MB/sec | 139 (± 1) ops/sec;   72.6 (± 0.6) MB/sec
      
      Reviewed By: ajkr
      
      Differential Revision: D43781467
      
      Pulled By: hx235
      
      fbshipit-source-id: a706a18472a8edb2b952bac3af40eec803537f2a
  10. 15 Mar 2023, 1 commit
    • Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148) · 11cb6af6
      Committed by Hui Xiao
      
      Summary:
      **Context:**
      Atomic flush should guarantee recoverability of all data with seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time the atomic flush finishes, through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case: a CF excluded from an atomic flush contains unflushed data with seqno less than the max seqno of that atomic flush, and loses that data with `WriteOptions::DisableWAL=true` in the face of a crash right after the atomic flush finishes.
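      The invariant being fixed can be sketched as a selection rule (Python, hypothetical names; the real logic lives in `SelectColumnFamiliesForAtomicFlush()`):

      ```python
      def select_cfs_for_atomic_flush(cfs: dict, max_seqno: int) -> set:
          """Sketch: a CF must be included in the atomic flush if it has any
          unflushed data with seqno <= max_seqno of the flush; excluding it
          risks losing that data on a crash when the WAL is disabled.
          `cfs` maps CF name -> min unflushed seqno (None if fully flushed)."""
          return {name for name, min_unflushed in cfs.items()
                  if min_unflushed is not None and min_unflushed <= max_seqno}
      ```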
      ```
      ./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 
--max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
          pid=$!
          sleep 0.2
          sleep 10
          kill $pid
          sleep 0.2
      ./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 
--max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
          pid=$!
          sleep 0.2
          sleep 40
          kill $pid
          sleep 0.2
      
      Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
      Crash-recovery verification failed :(
      No writes or ops?
      Verification failed :(
      ```
      
      The bug is due to the following:
      - When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
      - But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
      - However, data@seqno N in the already-excluded CF is thus excluded from this atomic flush even though seqno N is less than seqno M.
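      The race described above can be illustrated with a minimal C++ toy model (hypothetical function `AtomicFlushIsConsistent`; not RocksDB code). It tags writes with a global sequence number and compares selecting the CF set before vs. after the mutex-released wait:
      
      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstdint>
      #include <map>
      #include <set>
      #include <string>
      #include <vector>
      
      // Returns true iff the flush leaves the DB consistent: no unflushed entry
      // has a seqno <= the max seqno covered by the atomic flush.
      bool AtomicFlushIsConsistent(bool select_before_wait) {
        std::map<std::string, std::vector<uint64_t>> unflushed = {{"cf_a", {}},
                                                                  {"cf_b", {}}};
        uint64_t seqno = 0;
        unflushed["cf_a"].push_back(++seqno);  // pre-existing data in cf_a
      
        auto select = [&] {
          std::set<std::string> s;
          for (auto& [name, entries] : unflushed) {
            if (!entries.empty()) s.insert(name);  // empty CFs are excluded
          }
          return s;
        };
      
        // Buggy order: pick the CF set before waiting; cf_b (empty) is excluded.
        std::set<std::string> selected;
        if (select_before_wait) selected = select();
      
        // The db mutex is released during the wait; writes sneak in meanwhile:
        unflushed["cf_b"].push_back(++seqno);  // data@N into the excluded CF
        unflushed["cf_a"].push_back(++seqno);  // data@M, M > N, into an included CF
      
        // Fixed order: select only after the wait, so cf_b is now included.
        if (!select_before_wait) selected = select();
      
        // The atomic flush persists everything in the selected CFs.
        uint64_t flush_max_seqno = 0;
        for (const auto& name : selected) {
          for (uint64_t s : unflushed[name]) {
            flush_max_seqno = std::max<uint64_t>(flush_max_seqno, s);
          }
          unflushed[name].clear();
        }
      
        // Any leftover entry with seqno <= flush_max_seqno would be lost by a
        // recovery to flush_max_seqno.
        for (auto& kv : unflushed) {
          for (uint64_t s : kv.second) {
            if (s <= flush_max_seqno) return false;
          }
        }
        return true;
      }
      
      int main() {
        assert(!AtomicFlushIsConsistent(/*select_before_wait=*/true));  // the bug
        assert(AtomicFlushIsConsistent(/*select_before_wait=*/false));  // the fix
        return 0;
      }
      ```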
      
      **Summary:**
      - Replace the `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` ordering with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()`, so that no write affecting the recoverability of this atomic flush job (i.e., a change to the max seqno of this atomic flush, or an insertion into an excluded CF of data with a smaller seqno than that max seqno) can happen after `SelectColumnFamiliesForAtomicFlush()` is called.
      - Accordingly, refactored and clarified the comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` to make the semantics of the CFDs passed to atomic flush clearer.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
      
      Test Plan:
      - New unit test failed before the fix and passes after
      - Make check
      - Rehearsal stress test
      
      Reviewed By: ajkr
      
      Differential Revision: D42799871
      
      Pulled By: hx235
      
      fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
      11cb6af6
  11. 14 3月, 2023 2 次提交
    • L
      Rename a recently added PerfContext counter (#11294) · 49881921
      Levi Tamasi 提交于
      Summary:
      The patch renames the counter added in https://github.com/facebook/rocksdb/issues/11284 for better consistency with the existing naming scheme.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11294
      
      Test Plan: `make check`
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D44035964
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 8b1a2a03ee728148365367e0ecc1fcf462f62191
      49881921
    • C
      Support range deletion tombstones in `CreateColumnFamilyWithImport` (#11252) · 9aa3b6f9
      Changyu Bi 提交于
      Summary:
      CreateColumnFamilyWithImport() did not support range tombstones for two reasons:
      1. it uses the point keys of an input file to determine its boundary (smallest and largest internal key), which means range tombstones outside of the point key range will be effectively dropped.
      2. it does not handle files with no point keys.
      
      Also included a fix in external_sst_file_ingestion_job.cc where the blocks read in `GetIngestedFileInfo()` can be added to block cache now (issue fixed in https://github.com/facebook/rocksdb/pull/6429).
      
      This PR adds support for exporting and importing column family with range tombstones. The main change is to add smallest internal key and largest internal key to `SstFileMetaData` that will be part of the output of `ExportColumnFamily()`. Then during `CreateColumnFamilyWithImport(...,const ExportImportFilesMetaData& metadata,...)`, file boundaries can be set from `metadata` directly. This is needed since when file boundaries are extended by range tombstones, sometimes they cannot be deduced from a file's content alone.
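      As a minimal illustration of why point keys alone are insufficient, here is a hedged C++ sketch (hypothetical `ToyFile` and helpers, ignoring internal-key encoding and tombstone end-key sentinel semantics):
      
      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <string>
      #include <utility>
      #include <vector>
      
      struct ToyFile {
        std::vector<std::string> point_keys;  // sorted user keys
        std::vector<std::pair<std::string, std::string>> range_dels;  // [start, end)
      };
      
      // Boundary derived from point keys only (the old behavior): undefined for
      // tombstone-only files, and too narrow when tombstones extend past them.
      bool PointKeyBounds(const ToyFile& f, std::string* lo, std::string* hi) {
        if (f.point_keys.empty()) return false;
        *lo = f.point_keys.front();
        *hi = f.point_keys.back();
        return true;
      }
      
      // Boundary including range tombstones (what the exported metadata records).
      bool FullBounds(const ToyFile& f, std::string* lo, std::string* hi) {
        bool any = PointKeyBounds(f, lo, hi);
        for (const auto& [s, e] : f.range_dels) {
          if (!any) {
            *lo = s;
            *hi = e;
            any = true;
          }
          *lo = std::min(*lo, s);
          *hi = std::max(*hi, e);
        }
        return any;
      }
      
      int main() {
        ToyFile f{{"c", "m"}, {{"a", "z"}}};  // tombstone [a, z) extends both ends
        std::string lo, hi;
        assert(PointKeyBounds(f, &lo, &hi) && lo == "c" && hi == "m");
        assert(FullBounds(f, &lo, &hi) && lo == "a" && hi == "z");
      
        ToyFile only_dels{{}, {{"d", "g"}}};  // no point keys at all
        assert(!PointKeyBounds(only_dels, &lo, &hi));
        assert(FullBounds(only_dels, &lo, &hi) && lo == "d" && hi == "g");
        return 0;
      }
      ```
      
      Recording the full bounds in the exported metadata sidesteps both failure modes during import.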
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11252
      
      Test Plan:
      - added unit tests that fail before this change
      
      Closes https://github.com/facebook/rocksdb/issues/11245
      
      Reviewed By: ajkr
      
      Differential Revision: D43577443
      
      Pulled By: cbi42
      
      fbshipit-source-id: 6bff78e583cc50c44854994dea0a8dd519398f2f
      9aa3b6f9
  12. 09 3月, 2023 1 次提交
    • L
      Add a PerfContext counter for merge operands applied in point lookups (#11284) · 1d524385
      Levi Tamasi 提交于
      Summary:
      The existing PerfContext counter `internal_merge_count` only tracks the
      Merge operands applied during range scans. The patch adds a new counter
      called `internal_merge_count_point_lookups` to track the same metric
      for point lookups (`Get` / `MultiGet` / `GetEntity` / `MultiGetEntity`), and
      also fixes a couple of cases in the iterator where the existing counter wasn't
      updated.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11284
      
      Test Plan: `make check`
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D43926082
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 321566d8b4cf0a3b6c9b73b7a5c984fb9bb492e9
      1d524385
  13. 02 3月, 2023 1 次提交
    • Y
      Fix backward iteration issue when user defined timestamp is enabled in BlobDB (#11258) · 8dfcfd4e
      Yu Zhang 提交于
      Summary:
      During backward iteration, blob verification would fail because the user key (ts included) in `saved_key_` doesn't match the blob. This happens because during `FindValueForCurrentKey`, `saved_key_` is not updated when the user key (ts not included) stays the same, in all cases except when `timestamp_lb_` is specified. This breaks the blob verification logic when user defined timestamp is enabled and `timestamp_lb_` is not specified. Fix this by always updating `saved_key_` when a smaller user key (ts included) is seen.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11258
      
      Test Plan:
      `make check`
      `./db_blob_basic_test --gtest_filter=DBBlobWithTimestampTest.IterateBlobs`
      
      Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced with:
      
      `./db_bench -user_timestamp_size=8  -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5`
      
      Baseline:
      
      - seekrandom [AVG    6 runs] : 72188 (± 1481) ops/sec;   37.2 (± 0.8) MB/sec
      
      With this PR:
      
      - seekrandom [AVG    6 runs] : 74171 (± 1427) ops/sec;   38.2 (± 0.7) MB/sec
      
      Reviewed By: ltamasi
      
      Differential Revision: D43675642
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 8022ae8522d1f66548821855e6eed63640c14e04
      8dfcfd4e
  14. 01 3月, 2023 1 次提交
  15. 23 2月, 2023 2 次提交
    • Y
      Support iter_start_ts in integrated BlobDB (#11244) · f007b8fd
      Yu Zhang 提交于
      Summary:
      Fixed an issue during backward iteration when `iter_start_ts` is set in an integrated BlobDB.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11244
      
      Test Plan:
      ```
      make check
      ./db_blob_basic_test --gtest_filter="DBBlobWithTimestampTest.IterateBlobs"
      tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887 --enable_blob_files=1
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D43506726
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 2cdc19ebf8da909d8d43d621353905784949a9f0
      f007b8fd
    • C
      Refactor AddRangeDels() + consider range tombstone during compaction file cutting (#11113) · 229297d1
      Changyu Bi 提交于
      Summary:
      A second attempt after https://github.com/facebook/rocksdb/issues/10802, with bug fixes and refactoring. This PR updates compaction logic to take range tombstones into account when determining whether to cut the current compaction output file (https://github.com/facebook/rocksdb/issues/4811). Before this change, only point keys were considered, and range tombstones could cause large compactions. For example, if the current compaction output is a range tombstone [a, b) and 2 point keys y, z, they would be added to the same file, and may overlap with too many files in the next level and cause a large compaction in the future. This PR also includes ajkr's effort to simplify the logic to add range tombstones to compaction output files in `AddRangeDels()` ([https://github.com/facebook/rocksdb/issues/11078](https://github.com/facebook/rocksdb/pull/11078#issuecomment-1386078861)).
      
      The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new class `CompactionMergingIterator` is introduced to replace `MergingIterator` under `CompactionIterator` to enable emitting of range tombstone start keys. Further improvement after this PR include cutting compaction output at some grandparent boundary key (instead of the next output key) when cutting within a range tombstone to reduce overlap with grandparents.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11113
      
      Test Plan:
      * added unit test in db_range_del_test
      * crash test with a small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10`
      
      Reviewed By: ajkr
      
      Differential Revision: D42655709
      
      Pulled By: cbi42
      
      fbshipit-source-id: 8367e36ef5640e8f21c14a3855d4a8d6e360a34c
      229297d1
  16. 22 2月, 2023 1 次提交
  17. 18 2月, 2023 2 次提交
    • M
      Remove FactoryFunc from LoadXXXObject (#11203) · b6640c31
      mrambacher 提交于
      Summary:
      The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available.  With the removal of LITE mode, the function was no longer required.
      
      Note that the MergeOperator had some private classes defined in source files. To gain access to their constructors (and name methods), those class definitions were moved into header files.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203
      
      Reviewed By: cbi42
      
      Differential Revision: D43160255
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c
      b6640c31
    • A
      Merge operator failed subcode (#11231) · 25e13652
      Andrew Kryczka 提交于
      Summary:
      From HISTORY.md: Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions.
      
      This is a followup to https://github.com/facebook/rocksdb/issues/11092, where we gave users the ability to keep running a DB despite merge operator failing. Now that the DB keeps running despite such failures, they want to be able to distinguish such failures from real corruptions.
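      A simplified sketch of the idea (a toy stand-in for `rocksdb::Status`; not the real class):
      
      ```cpp
      #include <cassert>
      
      // Both merge-operator failures and real data corruption surface as a
      // corruption status, but a dedicated subcode lets the caller tell them apart.
      class ToyStatus {
       public:
        enum Code { kOk, kCorruption };
        enum SubCode { kNone, kMergeOperatorFailed };
        static ToyStatus OK() { return ToyStatus(kOk, kNone); }
        static ToyStatus Corruption(SubCode sub = kNone) {
          return ToyStatus(kCorruption, sub);
        }
        bool IsCorruption() const { return code_ == kCorruption; }
        SubCode subcode() const { return subcode_; }
      
       private:
        ToyStatus(Code c, SubCode s) : code_(c), subcode_(s) {}
        Code code_;
        SubCode subcode_;
      };
      
      int main() {
        ToyStatus merge_fail = ToyStatus::Corruption(ToyStatus::kMergeOperatorFailed);
        ToyStatus data_corruption = ToyStatus::Corruption();
        // Both look like corruption at the top level...
        assert(merge_fail.IsCorruption() && data_corruption.IsCorruption());
        // ...but only the merge-operator failure carries the subcode, so a caller
        // can keep the DB running and handle that case differently.
        assert(merge_fail.subcode() == ToyStatus::kMergeOperatorFailed);
        assert(data_corruption.subcode() == ToyStatus::kNone);
        return 0;
      }
      ```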
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11231
      
      Test Plan: updated unit test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D43396607
      
      Pulled By: ajkr
      
      fbshipit-source-id: 17fbcc779ad724dafada8abd73efd38e1c5208b9
      25e13652
  18. 17 2月, 2023 2 次提交
  19. 16 2月, 2023 1 次提交
  20. 10 2月, 2023 1 次提交
    • P
      Put Cache and CacheWrapper in new public header (#11192) · 3cacd4b4
      Peter Dillinger 提交于
      Summary:
      The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment.
      
      In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544
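      The point about forward declarations can be demonstrated with a small, self-contained sketch (a stand-in `Cache` struct, not the RocksDB class):
      
      ```cpp
      #include <cassert>
      #include <memory>
      
      struct Cache;  // forward declaration only; full definition not needed below
      
      // This function compiles against the incomplete type: copying a
      // shared_ptr<Cache> and querying its use count never require Cache's
      // definition, because the deleter was captured at construction time.
      long use_count_after_copy(const std::shared_ptr<Cache>& c) {
        std::shared_ptr<Cache> copy = c;  // ref count goes up
        return copy.use_count();
      }
      
      // Only the code that creates the object needs the full definition --
      // analogous to a factory such as NewLRUCache() in the library.
      struct Cache {
        int dummy = 0;
      };
      
      int main() {
        auto c = std::make_shared<Cache>();
        assert(use_count_after_copy(c) == 2);  // original + local copy
        assert(c.use_count() == 1);            // copy destroyed on return
        return 0;
      }
      ```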
      
      Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it.
      
      Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192
      
      Test Plan: `make check`
      
      Reviewed By: anand1976
      
      Differential Revision: D43055487
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110
      3cacd4b4
  21. 09 2月, 2023 1 次提交
    • A
      Fix bug in WAL streaming uncompression (#11198) · 77b61abc
      anand76 提交于
      Summary:
      Fix a bug in the calculation of the input buffer address/offset in log_reader.cc. The bug occurs when consecutive fragments of a compressed record are located at the same offset in the log reader buffer: the second fragment's input buffer is then treated as a leftover from the previous input buffer, so the offset in the `ZSTD_inBuffer` is not reset.
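      The failure mode can be sketched with a toy model (hypothetical `InBuffer`/`feed`, standing in for the `ZSTD_inBuffer` handling; not the actual log_reader.cc code):
      
      ```cpp
      #include <cassert>
      #include <cstddef>
      
      // A streaming decompressor tracks how much of the current input it consumed.
      struct InBuffer {
        const char* src = nullptr;
        size_t size = 0;
        size_t pos = 0;  // bytes already consumed
      };
      
      // Buggy heuristic: "same address as last time" was taken to mean "leftover
      // from the previous fragment", so pos was not reset for a new fragment.
      size_t feed(InBuffer& in, const char* fragment, size_t len, bool buggy) {
        if (!buggy || fragment != in.src) {
          in.src = fragment;
          in.size = len;
          in.pos = 0;  // a new fragment must restart from offset 0
        }
        return in.size - in.pos;  // bytes available to decompress
      }
      
      int main() {
        char buf[16];  // reader buffer: consecutive fragments can reuse the address
        InBuffer in;
        feed(in, buf, 8, /*buggy=*/true);
        in.pos = 8;  // first fragment fully consumed
        // Second fragment of the record lands at the same offset in the buffer:
        assert(feed(in, buf, 8, /*buggy=*/true) == 0);  // bug: fragment skipped
      
        InBuffer in2;
        feed(in2, buf, 8, /*buggy=*/false);
        in2.pos = 8;
        assert(feed(in2, buf, 8, /*buggy=*/false) == 8);  // fix: pos reset
        return 0;
      }
      ```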
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11198
      
      Test Plan: Add a unit test in log_test.cc that fails without the fix and passes with it.
      
      Reviewed By: ajkr, cbi42
      
      Differential Revision: D43102692
      
      Pulled By: anand1976
      
      fbshipit-source-id: aa2648f4802c33991b76a3233c5a58d4cc9e77fd
      77b61abc
  22. 08 2月, 2023 2 次提交
    • L
      Add compaction filter support for wide-column entities (#11196) · 876d2815
      Levi Tamasi 提交于
      Summary:
      The patch adds compaction filter support for wide-column entities by introducing
      a new `CompactionFilter` API called `FilterV3`. This API is called for regular
      key-values, merge operands, and wide-column entities as well. It is passed the
      existing value/operand or wide-column structure and it can update the value or
      columns or keep/delete/etc. the key-value as usual. For compatibility, the default
      implementation of `FilterV3` keeps all wide-column entities and falls back to calling
      `FilterV2` for plain old key-values and merge operands.
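      A minimal sketch of the dispatch/fallback idea (hypothetical `ToyFilter`, not the actual `CompactionFilter` API):
      
      ```cpp
      #include <cassert>
      #include <string>
      #include <vector>
      
      enum class Decision { kKeep, kRemove };
      
      struct ToyFilter {
        // FilterV2-style hook for plain key-values.
        virtual Decision FilterV2(const std::string& /*key*/,
                                  const std::string& /*value*/) const {
          return Decision::kKeep;
        }
        // FilterV3-style hook: also sees wide-column entities (modeled here as a
        // column list). The default keeps all entities and falls back to
        // FilterV2 for plain key-values, preserving compatibility.
        virtual Decision FilterV3(const std::string& key, const std::string* value,
                                  const std::vector<std::string>* columns) const {
          if (columns != nullptr) return Decision::kKeep;  // wide-column entity
          return FilterV2(key, *value);
        }
        virtual ~ToyFilter() = default;
      };
      
      // A legacy filter that only overrides the old hook.
      struct DropAll : ToyFilter {
        Decision FilterV2(const std::string&, const std::string&) const override {
          return Decision::kRemove;
        }
      };
      
      int main() {
        DropAll f;
        std::string v = "v";
        std::vector<std::string> cols = {"c1"};
        // Plain key-value: default FilterV3 falls back to the overridden FilterV2.
        assert(f.FilterV3("k", &v, nullptr) == Decision::kRemove);
        // Wide-column entity: default FilterV3 keeps it for compatibility.
        assert(f.FilterV3("k", nullptr, &cols) == Decision::kKeep);
        return 0;
      }
      ```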
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11196
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D43094147
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 75acabe9a35254f7f404ba6173ee9c2774382ebd
      876d2815
    • H
      Remove a couple deprecated convenience.h APIs (#11120) · 6650ca24
      Hui Xiao 提交于
      Summary:
      **Context/Summary:**
      As instructed by convenience.h comments, a few deprecated APIs are removed.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11120
      
      Test Plan:
      - make check & CI
      - eyeball check on test semantics.
      
      Reviewed By: pdillinger
      
      Differential Revision: D42937507
      
      Pulled By: hx235
      
      fbshipit-source-id: a9e4709387da01b1d0e9148c2e210f02e9746ee1
      6650ca24
  23. 04 2月, 2023 2 次提交
    • P
      Use LIB_MODE=shared build by default with make (#11168) · cf756ed9
      Peter Dillinger 提交于
      Summary:
      With https://github.com/facebook/rocksdb/issues/11150 this becomes a practical change that I think is overall good for developer efficiency.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11168
      
      Test Plan:
      More efficient build of all unit tests and tools:
      
      ```
      $ git clean -fdx
      $ du -sh .
      522M    .
      $ /usr/bin/time make -j32 LIB_MODE=static
      ...
      14270.63user 1043.33system 11:19.85elapsed 2252%CPU (0avgtext+0avgdata 1929944maxresident)k
      ...
      $ du -sh .
      62G     .
      $
      ```
      Vs.
      ```
      $ git clean -fdx
      $ du -sh .
      522M    .
      $ /usr/bin/time make -j32 LIB_MODE=shared
      ...
      9479.87user 478.26system 7:20.82elapsed 2258%CPU (0avgtext+0avgdata 1929272maxresident)k
      ...
      $ du -sh .
      5.4G    .
      $
      ```
      
      So 1/3 less build time and >90% less space usage.
      
      Individual unit test edit-compile-run is not too different. Modifying an average unit test source file:
      ```
      $ touch db/version_builder_test.cc
      $ /usr/bin/time make -j32 LIB_MODE=static version_builder_test
      ...
      34.74user 3.37system 0:38.29elapsed 99%CPU (0avgtext+0avgdata 945520maxresident)k
      ```
      Vs.
      ```
      $ touch db/version_builder_test.cc
      $ /usr/bin/time make -j32 LIB_MODE=shared version_builder_test
      ...
      116.26user 43.91system 0:28.65elapsed 559%CPU (0avgtext+0avgdata 675160maxresident)k
      ```
      A little faster with shared.
      
      However, modifying an average DB implementation file has an extra linking step with shared lib:
      ```
      $ touch db/db_impl/db_impl_files.cc
      $ /usr/bin/time make -j32 LIB_MODE=static version_builder_test
      ...
      33.17user 5.13system 0:39.70elapsed 96%CPU (0avgtext+0avgdata 945544maxresident)k
      ```
      Vs.
      ```
      $ touch db/db_impl/db_impl_files.cc
      $ /usr/bin/time make -j32 LIB_MODE=shared version_builder_test
      ...
      40.80user 4.66system 0:45.54elapsed 99%CPU (0avgtext+0avgdata 1056340maxresident)k
      ```
      A little slower with shared.
      
      On the whole, the build should be faster and lighter weight because unit test files far outnumber DB implementation files.
      
      Reviewed By: cbi42
      
      Differential Revision: D42894004
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 9e827e52ace79b86f849b6a24466e318b4b605a7
      cf756ed9
    • P
      Deprecate write_global_seqno and default to false (#11179) · 0cf1008f
      Peter Dillinger 提交于
      Summary:
      This option has long been intended to be set to false by default and deprecated. It might never be practical to completely remove the feature, so that we can continue to test for backward compatibility by keeping the ability to generate DBs in the old way.
      
      Also improved API comments.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11179
      
      Test Plan: existing tests (with one tiny update)
      
      Reviewed By: hx235
      
      Differential Revision: D42973927
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e9bc161cb933266e094aea2dff8cc03753c39dab
      0cf1008f
  24. 03 2月, 2023 1 次提交
    • A
      Return any errors returned by ReadAsync to the MultiGet caller (#11171) · 63da9cfa
      anand76 提交于
      Summary:
      Currently, we incorrectly return a Status::Corruption to the MultiGet caller if the file system ReadAsync cannot issue a read and returns an error for some reason, such as IOStatus::NotSupported(). In this PR, we copy the ReadAsync error to the request status so it can be returned to the user.
      
      Tests:
      Update existing unit tests and add a new one for this scenario
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11171
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D42950057
      
      Pulled By: anand1976
      
      fbshipit-source-id: 85ffcb015fa6c064c311f8a28488fec78c487869
      63da9cfa
  25. 02 2月, 2023 1 次提交
  26. 01 2月, 2023 2 次提交
  27. 31 1月, 2023 2 次提交
    • P
      Cleanup, improve, stress test LockWAL() (#11143) · 94e3beec
      Peter Dillinger 提交于
      Summary:
      The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change:
      
      * Implement counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use.
      * Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions)
      * Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability)
      * Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other.
      * Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads.
      * Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change.
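      The counting semantics in the first bullet can be sketched as follows (hypothetical `ToyDB`, not the actual `DB` interface):
      
      ```cpp
      #include <cassert>
      
      // LockWAL()/UnlockWAL() as a counter: nested or concurrent lockers simply
      // increment, and the WAL stays locked until every locker has unlocked.
      class ToyDB {
       public:
        bool LockWAL() {
          ++lock_count_;
          return true;
        }
        bool UnlockWAL() {
          if (lock_count_ == 0) return false;  // no matching LockWAL(): report it
          --lock_count_;
          return true;
        }
        bool WalLocked() const { return lock_count_ > 0; }
      
       private:
        int lock_count_ = 0;
      };
      
      int main() {
        ToyDB db;
        assert(db.LockWAL());
        assert(db.LockWAL());    // nested lock is fine
        assert(db.UnlockWAL());
        assert(db.WalLocked());  // still locked until the counts balance
        assert(db.UnlockWAL());
        assert(!db.WalLocked());
        assert(!db.UnlockWAL());  // unmatched UnlockWAL() reliably fails
        return 0;
      }
      ```
      
      (The real implementation must of course synchronize the counter; the point here is only the pairing contract.)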
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143
      
      Test Plan: unit tests added / updated, added to stress/crash test
      
      Reviewed By: ajkr
      
      Differential Revision: D42848627
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4
      94e3beec
    • Y
      Use user key on sst file for blob verification for Get and MultiGet (#11105) · 24ac53d8
      Yu Zhang 提交于
      Summary:
      Use the user key on sst file for blob verification for `Get` and `MultiGet` instead of the user key passed from caller.
      
      Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105
      
      Test Plan:
      make V=1 db_blob_basic_test
      ./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*"
      
      Reviewed By: ltamasi
      
      Differential Revision: D42716487
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a
      24ac53d8
  28. 28 1月, 2023 1 次提交
    • S
      Remove RocksDB LITE (#11147) · 4720ba43
      sdong 提交于
      Summary:
      We haven't been actively maintaining RocksDB LITE recently, and its size must have gone up significantly. We are removing the support.
      
      Most of the changes were done with the following command:
      
      unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
      
      by Peter Dillinger. Other changes were manually applied to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and the file db_stress_test_base.cc.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
      
      Test Plan: See CI
      
      Reviewed By: pdillinger
      
      Differential Revision: D42796341
      
      fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
      4720ba43