1. Mar 16, 2023 (1 commit)
    • Add new stat rocksdb.table.open.prefetch.tail.read.bytes,... · bab5f9a6
      Committed by Hui Xiao
      Add new stat rocksdb.table.open.prefetch.tail.read.bytes, rocksdb.table.open.prefetch.tail.{miss|hit} (#11265)
      
      Summary:
      **Context/Summary:**
      We are adding new stats to measure the size of the prefetched tail and the lookups into this buffer.
      
      The stat collection is done in FilePrefetchBuffer, but for now only for the tail buffer prefetched during table open, distinguished via a FilePrefetchBuffer enum. This is cleaner than the alternative of implementing it at the upper-level call sites of FilePrefetchBuffer for table open, and it has the benefit of being extensible to other types of FilePrefetchBuffer if needed. See the db_bench results for the perf regression concern.
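As a rough illustration of the accounting, here is a hypothetical Python sketch (not RocksDB's C++ implementation; the real logic lives in FilePrefetchBuffer): a tail read counts as a hit only when the whole request falls inside the prefetched tail buffer, and the prefetched byte count feeds the new histogram.

```python
# Hypothetical sketch (Python, simplified) of the new tail-prefetch
# accounting; the real logic is in RocksDB's C++ FilePrefetchBuffer,
# keyed by a buffer-usage enum.
class TailPrefetchBuffer:
    def __init__(self, file_size, tail_bytes):
        self.file_size = file_size
        self.tail_start = file_size - tail_bytes
        # rocksdb.table.open.prefetch.tail.read.bytes records how much
        # tail was prefetched; hit/miss count lookups into that buffer.
        self.stats = {"tail.read.bytes": tail_bytes,
                      "tail.hit": 0, "tail.miss": 0}

    def try_read_from_cache(self, offset, n):
        # A hit requires the whole request to lie inside the tail buffer.
        if offset >= self.tail_start and offset + n <= self.file_size:
            self.stats["tail.hit"] += 1
            return True
        self.stats["tail.miss"] += 1
        return False

buf = TailPrefetchBuffer(file_size=10_000, tail_bytes=4_096)
buf.try_read_from_cache(9_000, 512)   # inside the prefetched tail: hit
buf.try_read_from_cache(1_000, 512)   # outside: miss, falls back to I/O
print(buf.stats)
```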
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11265
      
      Test Plan:
      **- Piggyback on existing test**
      **- rocksdb.table.open.prefetch.tail.miss is harder to unit test, so I manually set the prefetch tail read bytes to be small and ran db_bench.**
      ```
      ./db_bench -db=/tmp/testdb -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3  -use_direct_reads=true
      ```
      ```
      rocksdb.table.open.prefetch.tail.read.bytes P50 : 4096.000000 P95 : 4096.000000 P99 : 4096.000000 P100 : 4096.000000 COUNT : 225 SUM : 921600
      rocksdb.table.open.prefetch.tail.miss COUNT : 91
      rocksdb.table.open.prefetch.tail.hit COUNT : 1034
      ```
      **- No perf regression observed in db_bench**
      
      SETUP command: create the same DB with ~900 files, used for both the pre-change and post-change runs.
      ```
      ./db_bench -db=/tmp/testdb -benchmarks="fillseq" -key_size=32 -value_size=512 -num=500000 -write_buffer_size=655360  -disable_auto_compactions=true -target_file_size_base=16777216 -compression_type=none
      ```
      TEST command, 60 runs or until convergence; as suggested by anand1976 and akankshamahajan15, `seek_nexts` and `async_io` are varied in testing.
      ```
      ./db_bench -use_existing_db=true -db=/tmp/testdb -statistics=false -cache_size=0 -cache_index_and_filter_blocks=false -benchmarks=seekrandom[-X60] -num=50000 -seek_nexts={10, 500, 1000} -async_io={0|1} -use_direct_reads=true
      ```
      async io = 0, direct io read = true
      
        | seek_nexts = 10, 30 runs | seek_nexts = 500, 12 runs | seek_nexts = 1000, 6 runs
      -- | -- | -- | --
      pre-change | 4776 (± 28) ops/sec;   24.8 (± 0.1) MB/sec | 288 (± 1) ops/sec;   74.8 (± 0.4) MB/sec | 145 (± 4) ops/sec;   75.6 (± 2.2) MB/sec
      post-change | 4790 (± 32) ops/sec;   24.9 (± 0.2) MB/sec | 288 (± 3) ops/sec;   74.7 (± 0.8) MB/sec | 143 (± 3) ops/sec;   74.5 (± 1.6) MB/sec
      
      async io = 1, direct io read = true
        | seek_nexts = 10, 54 runs | seek_nexts = 500, 6 runs | seek_nexts = 1000, 4 runs
      -- | -- | -- | --
      pre-change | 3350 (± 36) ops/sec;   17.4 (± 0.2) MB/sec | 264 (± 0) ops/sec;   68.7 (± 0.2) MB/sec | 138 (± 1) ops/sec;   71.8 (± 1.0) MB/sec
      post-change | 3358 (± 27) ops/sec;   17.4 (± 0.1) MB/sec  | 263 (± 2) ops/sec;   68.3 (± 0.8) MB/sec | 139 (± 1) ops/sec;   72.6 (± 0.6) MB/sec
      
      Reviewed By: ajkr
      
      Differential Revision: D43781467
      
      Pulled By: hx235
      
      fbshipit-source-id: a706a18472a8edb2b952bac3af40eec803537f2a
  2. Mar 14, 2023 (1 commit)
  3. Mar 9, 2023 (1 commit)
    • Add a PerfContext counter for merge operands applied in point lookups (#11284) · 1d524385
      Committed by Levi Tamasi
      Summary:
      The existing PerfContext counter `internal_merge_count` only tracks the
      Merge operands applied during range scans. The patch adds a new counter
      called `internal_merge_count_point_lookups` to track the same metric
      for point lookups (`Get` / `MultiGet` / `GetEntity` / `MultiGetEntity`), and
      also fixes a couple of cases in the iterator where the existing counter wasn't
      updated.
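The semantics of the new counter can be sketched like this (a hypothetical Python model; `internal_merge_count_point_lookups` matches the counter name in the commit, the rest is illustrative):

```python
# Hypothetical sketch: a point lookup resolves a merge chain and bumps
# the new PerfContext-style counter once per merge operand applied.
def get_with_merge(base_value, merge_operands, merge_op, perf_context):
    value = base_value
    for operand in merge_operands:
        value = merge_op(value, operand)
        perf_context["internal_merge_count_point_lookups"] += 1
    return value

perf = {"internal_merge_count_point_lookups": 0}
# A counter-style merge operator: each operand adds to the base value.
result = get_with_merge(10, [1, 2, 3], lambda v, op: v + op, perf)
print(result, perf["internal_merge_count_point_lookups"])  # -> 16 3
```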
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11284
      
      Test Plan: `make check`
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D43926082
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 321566d8b4cf0a3b6c9b73b7a5c984fb9bb492e9
  4. Mar 7, 2023 (1 commit)
    • Tests verifying non-zero checksums of zero bytes (#11260) · e0107325
      Committed by Peter Dillinger
      Summary:
      Adds unit tests verifying that a block payload and checksum of all zeros is not falsely considered valid data. The test exhaustively checks that blocks of all zeros, up to some length (default 20K; more exhaustively, 10M), do not produce a block checksum of all zeros.
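The property under test can be illustrated with standard CRC-32 as a stand-in (RocksDB's default checksum is CRC-32C with masking, and its test uses its own checksum functions; this sketch is only an analogy):

```python
import zlib

# Scan all-zero blocks of length 1..max_len and report the first length
# whose CRC-32 is zero, if any (analogous to the RocksDB test, which
# checks up to 20K by default and 10M in thorough mode).
def min_zero_checksum_len(max_len):
    crc = 0
    for n in range(1, max_len + 1):
        # Extend the running CRC by one more zero byte.
        crc = zlib.crc32(b"\x00", crc)
        if crc == 0:
            return n
    return None

print(min_zero_checksum_len(4096))  # no all-zero block checksums to zero
```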
      
      Also a small refactoring of an existing checksum test to use a parameterized test. (Suggest hiding whitespace changes for review.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11260
      
      Test Plan:
      this is the test, manual run with
      `ROCKSDB_THOROUGH_CHECKSUM_TEST=1` to verify up to 10M.
      
      Reviewed By: hx235
      
      Differential Revision: D43706192
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 95e721c320ca928e7fa2400c2570fb359cc30b1f
  5. Mar 4, 2023 (1 commit)
    • Improve documentation for MergingIterator (#11161) · d053926f
      Committed by Changyu Bi
      Summary:
      Add some comments to try to explain how and why MergingIterator works, along with some small refactoring, mostly in MergingIterator::SkipNextDeleted() and MergingIterator::SeekImpl().
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11161
      
      Test Plan:
      crash test with small key range:
      ```
      python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=6000 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D42860994
      
      Pulled By: cbi42
      
      fbshipit-source-id: 3f0c1c9c6481a7f468bf79d823998907a8116e9e
  6. Feb 25, 2023 (1 commit)
  7. Feb 23, 2023 (1 commit)
    • Refactor AddRangeDels() + consider range tombstone during compaction file cutting (#11113) · 229297d1
      Committed by Changyu Bi
      Summary:
      A second attempt after https://github.com/facebook/rocksdb/issues/10802, with bug fixes and refactoring. This PR updates the compaction logic to take range tombstones into account when determining whether to cut the current compaction output file (https://github.com/facebook/rocksdb/issues/4811). Before this change, only point keys were considered, and range tombstones could cause large compactions. For example, if the current compaction output is a range tombstone [a, b) and 2 point keys y, z, they would all be added to the same file and may overlap with too many files in the next level, causing a large compaction in the future. This PR also includes ajkr's effort to simplify the logic to add range tombstones to compaction output files in `AddRangeDels()` ([https://github.com/facebook/rocksdb/issues/11078](https://github.com/facebook/rocksdb/pull/11078#issuecomment-1386078861)).
      
      The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new class `CompactionMergingIterator` is introduced to replace `MergingIterator` under `CompactionIterator` to enable emitting range tombstone start keys. Further improvements after this PR include cutting compaction output at some grandparent boundary key (instead of the next output key) when cutting within a range tombstone, to reduce overlap with grandparents.
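The merged key stream can be pictured with a toy model (hypothetical Python; real internal keys carry sequence numbers and value types, and the real class is the C++ CompactionMergingIterator): point keys and range tombstone start keys are interleaved in sorted order so the file-cutting logic can observe tombstone boundaries.

```python
import heapq

def compaction_merge(point_keys, range_tombstones):
    # Tie-break 0 sorts a tombstone start before a point key at the
    # same user key (a simplification of seqno/type ordering).
    starts = sorted((start, 0, ("tombstone_start", (start, end)))
                    for start, end in range_tombstones)
    points = sorted((k, 1, ("point", k)) for k in point_keys)
    for _key, _tie, item in heapq.merge(starts, points):
        yield item

stream = list(compaction_merge(["a", "m", "z"], [("b", "k")]))
print(stream)
# The tombstone start "b" is emitted between "a" and "m", so the
# output-cutting logic gets a chance to cut a file at that boundary.
```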
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11113
      
      Test Plan:
      * added unit test in db_range_del_test
      * crash test with a small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10`
      
      Reviewed By: ajkr
      
      Differential Revision: D42655709
      
      Pulled By: cbi42
      
      fbshipit-source-id: 8367e36ef5640e8f21c14a3855d4a8d6e360a34c
  8. Feb 18, 2023 (3 commits)
    • Remove FactoryFunc from LoadXXXObject (#11203) · b6640c31
      Committed by mrambacher
      Summary:
      The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available.  With the removal of LITE mode, the function was no longer required.
      
      Note that the MergeOperator had some private classes defined in source files. To gain access to their constructors (and name methods), the class definitions were moved into header files.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203
      
      Reviewed By: cbi42
      
      Differential Revision: D43160255
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c
    • Merge operator failed subcode (#11231) · 25e13652
      Committed by Andrew Kryczka
      Summary:
      From HISTORY.md: Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions.
      
      This is a followup to https://github.com/facebook/rocksdb/issues/11092, where we gave users the ability to keep running a DB despite merge operator failing. Now that the DB keeps running despite such failures, they want to be able to distinguish such failures from real corruptions.
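The distinction can be sketched with a toy status model (a hypothetical Python mirror of the `Status` idea; the code and subcode names follow the commit, but the class itself is illustrative):

```python
class Status:
    K_NONE = "kNone"
    K_MERGE_OPERATOR_FAILED = "kMergeOperatorFailed"

    def __init__(self, code="Ok", subcode=K_NONE):
        self.code = code
        self.subcode = subcode

    def is_merge_operator_failure(self):
        # Distinguish merge-operator failures from data corruption,
        # even though both carry code "Corruption".
        return (self.code == "Corruption"
                and self.subcode == self.K_MERGE_OPERATOR_FAILED)

merge_failure = Status("Corruption", Status.K_MERGE_OPERATOR_FAILED)
data_corruption = Status("Corruption")  # subcode stays kNone
print(merge_failure.is_merge_operator_failure())    # True
print(data_corruption.is_merge_operator_failure())  # False
```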
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11231
      
      Test Plan: updated unit test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D43396607
      
      Pulled By: ajkr
      
      fbshipit-source-id: 17fbcc779ad724dafada8abd73efd38e1c5208b9
    • Use CacheDependencies() at start of ApproximateKeyAnchors() (#11230) · 6aef1a05
      Committed by Andrew Kryczka
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11230
      
      Test Plan:
      - setup command: `$ ./db_bench -benchmarks=fillrandom,compact -compression_type=none -num=1000000 -write_buffer_size=4194304 -target_file_size_base=4194304 -use_direct_io_for_flush_and_compaction=true -partition_index_and_filters=true -bloom_bits=10 -metadata_block_size=1024`
      - measure small read count bucketed by size: `$ strace -fye pread64 ./db_bench.ctrl -use_existing_db=true -benchmarks=compact -compaction_readahead_size=4194304 -compression_type=none -num=1000000 -write_buffer_size=4194304 -target_file_size_base=4194304 -use_direct_io_for_flush_and_compaction=true -partition_index_and_filters=true -bloom_bits=10 -metadata_block_size=1024  -subcompactions=4 -cache_size=1048576000  2>&1 >/dev/null | awk '/= [0-9]+$/{print "[", int($NF / 1024), "KB,", int(1 + $NF / 1024), "KB)"}' | sort -n -k 2 | uniq -c | head -3`
      - before:
      ```
         1119 [ 0 KB, 1 KB)
            1 [ 6 KB, 7 KB)
            2 [ 7 KB, 8 KB)
      ```
      - after:
      ```
          242 [ 0 KB, 1 KB)
            1 [ 6 KB, 7 KB)
            2 [ 7 KB, 8 KB)
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D43388507
      
      Pulled By: ajkr
      
      fbshipit-source-id: a02413c9f615b00784700646825a9870ee10f3a7
  9. Feb 16, 2023 (1 commit)
    • Add a new MultiGetEntity API (#11222) · 9794acb5
      Committed by Levi Tamasi
      Summary:
      The new `MultiGetEntity` API can be used to get a consistent view of
      a batch of keys, with the results presented as wide-column entities.
      Similarly to `GetEntity` and the iterator's `columns` API, if the entry
      corresponding to the key is a wide-column entity to start with, it is
      returned as-is, and if it is a plain key-value, it is wrapped into an entity
      with a single default column.
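The wrapping rule can be sketched as follows (hypothetical Python shapes; the dict stands in for a DB, and the sketch assumes the default column is named with the empty string, as in RocksDB's wide-column support):

```python
# Toy sketch of the wrapping rule: a stored wide-column entity is
# returned as-is, while a plain key-value comes back as an entity with
# a single default (anonymous) column.
DEFAULT_COLUMN = ""  # assumption: default column name is empty string

def multi_get_entity(db, keys):
    results = {}
    for key in keys:
        value = db.get(key)
        if value is None:
            continue                    # key not found
        if isinstance(value, dict):     # already a wide-column entity
            results[key] = value
        else:                           # plain KV: wrap in default column
            results[key] = {DEFAULT_COLUMN: value}
    return results

db = {"user:1": {"name": "ada", "city": "london"},  # wide-column entity
      "plain": "hello"}                              # plain key-value
out = multi_get_entity(db, ["user:1", "plain", "missing"])
print(out)
```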
      
      Implementation-wise, the new API shares the logic of the batched `MultiGet`
      API (via the `MultiGetCommon` methods). Both single-CF and multi-CF
      `MultiGetEntity` APIs are provided, and blobs are also supported.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11222
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D43256950
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 47fb2cb7e2d0470e3580f43fdb2fe9e51f0e7005
  10. Feb 13, 2023 (1 commit)
    • remove dependency on options.h for port_posix.h and port_win.h (#11214) · 42d6652b
      Committed by Wentian Guo
      Summary:
      The files in `port/`, such as `port_posix.h`, are a layer over the system libraries, so they shouldn't include DB-specific files like `options.h`. This PR removes this dependency.
      
      # How
      The reason that `port_posix.h` (or `port_win.h`) includes `options.h` is to use `CpuPriority`: there is a method `SetCpuPriority()` in `port_posix.h` that uses `CpuPriority`.
      - I think `SetCpuPriority()` makes sense to exist in `port_posix.h`, as it has a platform-dependent implementation.
      - The `CpuPriority` enum is defined in `env.h` but used in both `rocksdb/include` and `port/`.
      
      Hence, we define the `CpuPriority` enum in a common file, say `port_defs.h`, such that both `rocksdb/include` and `port/` can include it.
      
      When this dependency was removed, some other files had compile errors because they could not find definitions, so header includes were added to resolve them.
      
      # Test
      `make all check -j`
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11214
      
      Reviewed By: pdillinger
      
      Differential Revision: D43196910
      
      Pulled By: guowentian
      
      fbshipit-source-id: 70deccb72844cfb08fcc994f76c6ef6df5d55ab9
  11. Feb 10, 2023 (1 commit)
    • Put Cache and CacheWrapper in new public header (#11192) · 3cacd4b4
      Committed by Peter Dillinger
      Summary:
      The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment.
      
      In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544
      
      Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it.
      
      Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192
      
      Test Plan: `make check`
      
      Reviewed By: anand1976
      
      Differential Revision: D43055487
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110
  12. Feb 8, 2023 (1 commit)
  13. Jan 31, 2023 (1 commit)
    • Use user key on sst file for blob verification for Get and MultiGet (#11105) · 24ac53d8
      Committed by Yu Zhang
      Summary:
      Use the user key on sst file for blob verification for `Get` and `MultiGet` instead of the user key passed from caller.
      
      Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105
      
      Test Plan:
      make V=1 db_blob_basic_test
      ./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*"
      
      Reviewed By: ltamasi
      
      Differential Revision: D42716487
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a
  14. Jan 28, 2023 (1 commit)
    • Remove RocksDB LITE (#11147) · 4720ba43
      Committed by sdong
      Summary:
      We haven't been actively maintaining RocksDB LITE recently, and its size must have gone up significantly. We are removing the support.
      
      Most of the changes were done with the following command:
      
      unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
      
      by Peter Dillinger. Other changes were manually applied to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and the file db_stress_test_base.cc.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
      
      Test Plan: See CI
      
      Reviewed By: pdillinger
      
      Differential Revision: D42796341
      
      fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
  15. Jan 27, 2023 (1 commit)
  16. Jan 26, 2023 (1 commit)
  17. Jan 25, 2023 (1 commit)
  18. Jan 21, 2023 (1 commit)
    • Add API to limit blast radius of merge operator failure (#11092) · b7fbcefd
      Committed by Andrew Kryczka
      Summary:
      Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage.
      
      To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands.
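The `kMustMerge` behavior during compaction can be sketched as follows (a hypothetical Python model with illustrative names; the real API is `MergeOperationOutput::op_failure_scope` in C++):

```python
# Toy sketch of the kMustMerge failure scope: when the merge operator
# fails during compaction, the unmerged operands are written out
# instead of failing the whole compaction.
def compact_merge_chain(base, operands, merge_op, failure_scope):
    try:
        value = base
        for op in operands:
            value = merge_op(value, op)
        return ("merged", value)
    except Exception:
        if failure_scope == "kMustMerge":
            # Survive: emit the operands unmerged. A later Get() that
            # must merge them will surface the failure instead.
            return ("unmerged", [base] + list(operands))
        raise  # kDefault / kTryMerge: the compaction fails

def flaky_merge(value, operand):
    raise RuntimeError("merge operator failed")

result = compact_merge_chain(1, [2, 3], flaky_merge, "kMustMerge")
print(result)  # ('unmerged', [1, 2, 3])
```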
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092
      
      Reviewed By: siying
      
      Differential Revision: D42525673
      
      Pulled By: ajkr
      
      fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee
  19. Jan 20, 2023 (1 commit)
  20. Jan 14, 2023 (1 commit)
  21. Jan 12, 2023 (1 commit)
    • Major Cache refactoring, CPU efficiency improvement (#10975) · 9f7801c5
      Committed by Peter Dillinger
      Summary:
      This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to the Cache class, and it becomes more of a low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because it rather effectively decouples the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
      
      The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
      
      * static_cast lines of code +29 -35 (net removed 6)
      * reinterpret_cast lines of code +6 -32 (net removed 26)
      
      ## cache.h and secondary_cache.h
      * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
        * Simpler for implementations to deal with just one Insert and one Lookup.
        * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
        * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
        * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
        * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
        * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
      * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
      * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
      * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
      * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
      * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
      * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
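The entry/helper design above can be sketched with a toy cache (hypothetical Python shapes; the real CacheItemHelper is a C++ struct that also carries SaveToCallback, CreateCallback, etc.):

```python
from dataclasses import dataclass
from typing import Any, Callable

# Toy sketch of the "always use CacheItemHelper" design: every entry
# carries one helper bundling its role and its deleter, so the cache
# never has to classify entries by which deleter function they use.
@dataclass(frozen=True)
class CacheItemHelper:
    role: str                      # e.g. "DataBlock", "FilterBlock"
    deleter: Callable[[Any], None]

class Cache:
    def __init__(self):
        self._table = {}

    def insert(self, key, obj, helper):
        self._table[key] = (obj, helper)

    def lookup(self, key):
        entry = self._table.get(key)
        return entry[0] if entry else None

    def erase(self, key):
        obj, helper = self._table.pop(key)
        helper.deleter(obj)        # role-aware cleanup, no classification

freed = []
block_helper = CacheItemHelper("DataBlock", lambda obj: freed.append(obj))
cache = Cache()
cache.insert("blk1", b"block bytes", block_helper)
print(cache.lookup("blk1"))
cache.erase("blk1")
print(freed)
```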
      
      ## typed_cache.h
      Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
      
      The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
      * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
      * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
      * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
      * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
      
      These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
      
      Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
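A rough analogue of the typed wrapper idea (hypothetical Python; the real BasicTypedCacheInterface is a C++ template over an untyped Cache):

```python
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

# Toy sketch of BasicTypedCacheInterface<TValue>: a thin typed wrapper
# over an untyped cache (a dict here), so call sites get values back
# as T without casting or explicit deleters.
class BasicTypedCacheInterface(Generic[T]):
    def __init__(self, cache: dict, role: str):
        self._cache = cache
        self._role = role  # plays the part of TValue::kCacheEntryRole

    def insert(self, key: str, obj: T) -> None:
        self._cache[key] = (obj, self._role)

    def lookup(self, key: str) -> Optional[T]:
        entry = self._cache.get(key)
        return entry[0] if entry is not None else None

raw_cache: dict = {}
block_cache = BasicTypedCacheInterface[bytes](raw_cache, role="DataBlock")
block_cache.insert("b1", b"payload")
print(block_cache.lookup("b1"))
```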
      
      ## block_cache.h
      This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
      
      ## block_based_table_reader.cc
      Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
      
      The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
      
      ## block_based_table_builder.cc, cache_dump_load_impl.cc
      Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
      
      ## Everything else
      Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
      
      Test Plan:
      tests updated
      
      Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
      
      34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
      34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
      34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
      34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
      34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
      34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
      34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
      34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
      233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
      233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
      233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
      233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
      233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
      233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
      233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
      233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
      1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
      1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
      1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
      1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
      1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
      1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
      1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
      1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
      
      Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
      
      Reviewed By: anand1976
      
      Differential Revision: D42417818
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
  22. Dec 22, 2022 (1 commit)
    • Avoid mixing sync and async prefetch (#11050) · bec42648
      Committed by anand76
      Summary:
      Reading the uncompression dict block always uses sync reads, while data blocks may use async reads and prefetching. This causes problems in FilePrefetchBuffer, so we avoid mixing the two by reading the uncompression dict straight from the file.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11050
      
      Test Plan: Crash test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D42194682
      
      Pulled By: anand1976
      
      fbshipit-source-id: aaa8b396fdfe966b157e210f5ef8501c45b7b69e
  23. 16 Dec, 2022 1 commit
    • C
      Consider range tombstone in compaction output file cutting (#10802) · f02c708a
      Committed by Changyu Bi
      Summary:
      This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that do not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR addresses the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and used only under `CompactionIterator` for this purpose. Further improvements after this PR include 1) cutting compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key, and 2) cutting a compaction output file within a large range tombstone (it may be easier and more reasonable to only do this for range tombstones at the end of a compaction output).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802
      
      Test Plan:
      - added unit tests in db_range_del_test.
      - stress test: `python3 tools/db_crashtest.py whitebox --[simple|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercent=21 --duration=36000 --range_deletion_width=1000000`
      
      Reviewed By: ajkr, jay-zhuang
      
      Differential Revision: D40308827
      
      Pulled By: cbi42
      
      fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df
  24. 13 Dec, 2022 1 commit
  25. 10 Dec, 2022 1 commit
    • P
      Improve error messages for SST footer and size errors (#11009) · 433d7e45
      Committed by Peter Dillinger
      Summary:
      Previously, you could get a format_version error if the SST file size recorded in the manifest was too small, or a weird "too short" error if it was too big. Now we ensure:
      * Magic number error is reported first if we attempt to open an SST file and the footer is completely bad.
      * Footer errors are reported with affected file.
      * If manifest file size doesn't match actual, then the error includes expected and actual sizes (if an error is reported; in some cases we allow the file to be too big)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11009
      
      Test Plan:
      unit tests added, some manual
      
      Previously, the code for "file too short" in footer processing was only covered by some tests attempting to verify SST checksums on non-SST files (fixed).
      
      Reviewed By: siying
      
      Differential Revision: D41656272
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 3da32702eb5aaedbea0e5e74742ad57edd7ad3df
  26. 29 Nov, 2022 1 commit
    • C
      Remove copying of range tombstones keys in iterator (#10878) · 6cdb7af9
      Committed by Changyu Bi
      Summary:
      In MergingIterator, if a range tombstone's start or end key is added to minHeap/maxHeap, the key is copied. This PR removes the copying of range tombstone keys by adding an InternalKey comparator that compares a `Slice` holding an internal key against a `ParsedInternalKey` directly.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10878
      
      Test Plan:
      - existing UT
      - ran all flavors of stress test through sandcastle
      - benchmarks: I did not see an improvement when compiling with DEBUG_LEVEL=0, and the results were noisy. With `OPTIMIZE_LEVEL="-O3" USE_LTO=1` I do see an improvement.
      ```
      # Favorable set up: half of the writes are DeleteRange.
      TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=1000000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=50
      
      # benchmark command
      TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472  --disable_auto_compactions=true --avoid_flush_during_recovery=true --seek_nexts=100 --reads=1000000 --num=1000000 --threads=25
      
      # main
      readseq [AVG    5 runs] : 26017977 (± 371077) ops/sec; 3721.9 (± 53.1) MB/sec
      readseq [MEDIAN 5 runs] : 26096905 ops/sec; 3733.2 MB/sec
      
      # this PR
      readseq [AVG    5 runs] : 27481724 (± 568758) ops/sec; 3931.3 (± 81.4) MB/sec
      readseq [MEDIAN 5 runs] : 27323957 ops/sec; 3908.7 MB/sec
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D40711170
      
      Pulled By: cbi42
      
      fbshipit-source-id: 708cb584e2bd085a9ce0d2ef6a420489f721717f
  27. 24 Nov, 2022 1 commit
    • C
      Prevent iterating over range tombstones beyond `iterate_upper_bound` (#10966) · 534fb06d
      Committed by Changyu Bi
      Summary:
      Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys.
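      A sketch of the idea (all names and structure here are hypothetical simplifications, not MergingIterator's actual code): apply the `iterate_upper_bound` check to range tombstone start keys as well as point keys, so iteration can stop early instead of walking a long run of tombstones past the bound.

      ```cpp
      #include <cassert>
      #include <string>
      #include <vector>

      // Illustrative only: an iterator over a merged, sorted key stream where
      // range tombstone start keys are interleaved with point keys. The fix is
      // to check the upper bound for BOTH kinds of keys; the original bug
      // checked it only for point keys.
      struct KeyEntry {
        std::string key;
        bool is_range_tombstone_start;
      };

      std::vector<std::string> ScanWithUpperBound(
          const std::vector<KeyEntry>& sorted_entries,
          const std::string& iterate_upper_bound) {
        std::vector<std::string> result;
        for (const KeyEntry& e : sorted_entries) {
          // Stop as soon as any key (point or tombstone start) reaches the
          // bound, so we never process tombstones beyond it.
          if (e.key >= iterate_upper_bound) break;
          if (!e.is_range_tombstone_start) result.push_back(e.key);
        }
        return result;
      }

      int main() {
        std::vector<KeyEntry> entries = {
            {"a", false}, {"b", false},
            {"c", true},  {"d", true},  // tombstone starts at/after the bound
            {"e", false}};
        std::vector<std::string> out = ScanWithUpperBound(entries, "c");
        assert(out.size() == 2 && out[0] == "a" && out[1] == "b");
        return 0;
      }
      ```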
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966
      
      Test Plan:
      - added unit test
      - stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercent=15 --duration=36000 --range_deletion_width=100`
      - ran different stress tests over sandcastle
      - Falcon team ran some test traffic and saw reduced CPU usage on processing range tombstones.
      
      Reviewed By: ajkr
      
      Differential Revision: D41414172
      
      Pulled By: cbi42
      
      fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981
  28. 12 Nov, 2022 1 commit
    • P
      Don't attempt to use SecondaryCache on block_cache_compressed (#10944) · f321e8fc
      Committed by Peter Dillinger
      Summary:
      Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom!
      
      This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed.
      
      Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache.
      
      This problem was discovered during refactoring, but I didn't want to combine a bug fix with that refactoring.
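      The failure mode can be sketched abstractly (hypothetical layout; the real block format has additional trailer bytes such as a checksum): persisting only a payload-sized prefix drops the trailing compression marker, so the reloaded block cannot be interpreted.

      ```cpp
      #include <cassert>
      #include <string>

      // Illustrative only: an on-disk "block" is the payload followed by
      // trailing metadata (here a one-byte compression marker). Saving only
      // `payload.size()` bytes to a secondary cache loses the marker.
      std::string MakeBlock(const std::string& payload, char compression_marker) {
        return payload + compression_marker;  // marker lives past the payload
      }

      int main() {
        const std::string payload = "blockdata";
        std::string block = MakeBlock(payload, /*compression_marker=*/'\x01');

        // Buggy save path: only the payload-sized prefix is persisted,
        // so the marker byte is gone -> boom on reload.
        std::string saved_buggy = block.substr(0, payload.size());
        assert(saved_buggy.size() == block.size() - 1);

        // Correct behavior: persist the full block, marker included.
        std::string saved_fixed = block;
        assert(saved_fixed.back() == '\x01');
        return 0;
      }
      ```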
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944
      
      Test Plan: test added that fails on base revision (at least with ASAN)
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D41205578
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f
  29. 10 Nov, 2022 1 commit
    • L
      Revisit the interface of MergeHelper::TimedFullMerge(WithEntity) (#10932) · 2ea10952
      Committed by Levi Tamasi
      Summary:
      The patch refines/reworks `MergeHelper::TimedFullMerge(WithEntity)`
      a bit in two ways. First, it eliminates the recently introduced `TimedFullMerge`
      overload, which makes the responsibilities clearer by making sure the query
      result (`value` for `Get`, `columns` for `GetEntity`) is set uniformly in
      `SaveValue` and `GetContext`. Second, it changes the interface of
      `TimedFullMergeWithEntity` so it exposes its result in a serialized form; this
      is a more decoupled design which will come in handy when adding support
      for `Merge` with wide-column entities to `DBIter`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10932
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D41129399
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 69d8da358c77d4fc7e8c40f4dafc2c129a710677
  30. 08 Nov, 2022 1 commit
    • L
      Fix a bug where GetContext does not update READ_NUM_MERGE_OPERANDS (#10925) · fbd9077d
      Committed by Levi Tamasi
      Summary:
      The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods.
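      The pitfall behind this fix can be sketched with a hypothetical simplification (names mirror the description above but are not RocksDB's actual signatures): a defaulted `update_num_ops_stats=false` parameter lets call sites silently skip the ticker update, whereas removing the default forces every caller to make an explicit choice.

      ```cpp
      #include <cassert>

      // Hypothetical stand-in for a stats ticker.
      struct Stats {
        int read_num_merge_operands = 0;
      };

      // Before: a caller that forgets the argument silently gets
      // update_num_ops_stats=false, and the ticker is never updated.
      void TimedFullMergeWithDefault(Stats* stats,
                                     bool update_num_ops_stats = false) {
        if (update_num_ops_stats) stats->read_num_merge_operands++;
      }

      // After: no default value, so the choice is explicit at every call site.
      void TimedFullMerge(Stats* stats, bool update_num_ops_stats) {
        if (update_num_ops_stats) stats->read_num_merge_operands++;
      }

      int main() {
        Stats s;
        TimedFullMergeWithDefault(&s);  // the bug: ticker silently skipped
        assert(s.read_num_merge_operands == 0);
        TimedFullMerge(&s, /*update_num_ops_stats=*/true);  // explicit, correct
        assert(s.read_num_merge_operands == 1);
        return 0;
      }
      ```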
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D41096453
      
      Pulled By: ltamasi
      
      fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8
  31. 03 Nov, 2022 1 commit
    • L
      Support Merge for wide-column entities during point lookups (#10916) · 941d8347
      Committed by Levi Tamasi
      Summary:
      The patch adds `Merge` support for wide-column entities to the point lookup
      APIs, i.e. `Get`, `MultiGet`, `GetEntity`, and `GetMergeOperands`. (I plan to
      update the iterator and compaction logic in separate PRs.) In terms of semantics,
      the `Merge` operation is applied to the default (anonymous) column; any other
      columns in the entity are unaffected.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10916
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D40962311
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 244bc9d172be1af2f204796b2f89104e4d2fa373
  32. 01 Nov, 2022 1 commit
    • Y
      Basic Support for Merge with user-defined timestamp (#10819) · 7d26e4c5
      Committed by Yanqin Jin
      Summary:
      This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.
      
      Simplest usage:
      ```cpp
      // assume string append merge op is used with '.' as delimiter.
      // ts1 < ts2
      db->Put(WriteOptions(), "key", ts1, "v0");
      db->Merge(WriteOptions(), "key", ts2, "1");
      ReadOptions ro;
      ro.timestamp = &ts2;
      db->Get(ro, "key", &value);
      ASSERT_EQ("v0.1", value);
      ```
      
      Some code comments are added for clarity.
      
      Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D40603195
      
      Pulled By: riversand963
      
      fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
  33. 29 Oct, 2022 3 commits
    • Y
      Fix deletion counting in memtable stats (#10886) · 9079895a
      Committed by Yanqin Jin
      Summary:
      Currently, a memtable's stats `num_deletes_` is incremented only if the entry is a regular delete (kTypeDeletion). We need to fix it by accounting for kTypeSingleDeletion and kTypeDeletionWithTimestamp.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10886
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D40740754
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7bde62cd6df136585bc5bfb1c426c7a8276c08e1
    • J
      Fix a Windows build error (#10897) · 36f5e19e
      Committed by Jay Zhuang
      Summary:
      The `for` loop is flagged as unreachable code because the increment can never execute. Switch it to `if`.
      
      ```
      \table\merging_iterator.cc(823): error C2220: the following warning is treated as an error
      \table\merging_iterator.cc(823): warning C4702: unreachable code
      \table\merging_iterator.cc(1030): error C2220: the following warning is treated as an error
      \table\merging_iterator.cc(1030): warning C4702: unreachable code
      ```
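      A minimal sketch of the pattern behind warning C4702 (hypothetical code, not the actual `merging_iterator.cc` loop): when every path through the loop body returns, the loop's increment expression is unreachable, and `if` expresses the intent directly.

      ```cpp
      #include <cassert>
      #include <vector>

      // Every path through the body returns, so `++i` can never execute;
      // MSVC flags this as unreachable code (C4702).
      int FirstOrDefaultLoop(const std::vector<int>& v) {
        for (size_t i = 0; i < v.size(); ++i) {
          return v[i];
        }
        return -1;
      }

      // The fix: the loop never iterates more than once, so write it as `if`.
      int FirstOrDefaultIf(const std::vector<int>& v) {
        if (!v.empty()) {
          return v.front();
        }
        return -1;
      }

      int main() {
        std::vector<int> v{7, 8};
        assert(FirstOrDefaultLoop(v) == 7);
        assert(FirstOrDefaultIf(v) == 7);
        assert(FirstOrDefaultIf({}) == -1);
        return 0;
      }
      ```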
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10897
      
      Reviewed By: cbi42
      
      Differential Revision: D40811790
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: fe8fd3e7cf3d6f710360c402b79763854d5120df
    • L
      Handle Merges correctly in GetEntity (#10894) · 7867a111
      Committed by Levi Tamasi
      Summary:
      The PR fixes the handling of `Merge`s in `GetEntity`. Note that `Merge` is not yet
      supported for wide-column entities written using `PutEntity`; this change is
      about returning correct (i.e. consistent with `Get`) results in cases like when the
      base value is a plain old key-value written using `Put` or when there is no real base
      value because we hit either a tombstone or the beginning of history.
      
      Implementation-wise, the patch introduces a new wrapper around the existing
      `MergeHelper::TimedFullMerge` that can store the merge result in either a string
      (for the purposes of `Get`) or a `PinnableWideColumns` instance (for `GetEntity`).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10894
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D40782708
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 3d700d56b2ef81f02ba1e2d93f6481bf13abcc90
  34. 28 Oct, 2022 2 commits
    • C
      Reduce heap operations for range tombstone keys in iterator (#10877) · 56715350
      Committed by Changyu Bi
      Summary:
      Right now in MergingIterator, for each range tombstone start and end key, we pop one end from the heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely case where a range tombstone iterator emits relatively adjacent keys, those keys rank similarly among all keys in the heap. This can happen when there is a burst of consecutive range tombstones and most of the keys covered by them have already been dropped. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases.
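      The `replace_top()` optimization can be sketched with a minimal min-heap (illustrative only; RocksDB's actual `BinaryHeap` differs): replacing the root in place costs a single downheap, while `pop()` followed by `push()` costs a downheap plus an upheap.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <functional>
      #include <vector>

      // Minimal min-heap with a replace_top() that overwrites the root and
      // does one sift-down, instead of the pop()+push() pair.
      template <typename T>
      class MinHeap {
       public:
        void push(T v) {
          data_.push_back(v);
          std::push_heap(data_.begin(), data_.end(), std::greater<T>());
        }
        const T& top() const { return data_.front(); }
        void pop() {
          std::pop_heap(data_.begin(), data_.end(), std::greater<T>());
          data_.pop_back();
        }
        // Replace the root in place; one sift-down restores the heap property.
        void replace_top(T v) {
          data_.front() = v;
          sift_down(0);
        }

       private:
        void sift_down(size_t i) {
          size_t n = data_.size();
          while (true) {
            size_t smallest = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < n && data_[l] < data_[smallest]) smallest = l;
            if (r < n && data_[r] < data_[smallest]) smallest = r;
            if (smallest == i) break;
            std::swap(data_[i], data_[smallest]);
            i = smallest;
          }
        }
        std::vector<T> data_;
      };

      int main() {
        MinHeap<int> heap;
        for (int v : {5, 1, 9, 3}) heap.push(v);
        assert(heap.top() == 1);
        // The new key (4) ranks near the replaced key (1), so replace_top()
        // needs only a short downheap rather than pop()+push().
        heap.replace_top(4);
        assert(heap.top() == 3);
        heap.pop();
        assert(heap.top() == 4);
        return 0;
      }
      ```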
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10877
      
      Test Plan:
      - existing UT
      - ran all flavors of stress test through sandcastle
      - benchmark:
      ```
      # Set up: --writes_per_range_tombstone=1 means one point write and one delete range
      
      TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64
      
      Level Files Size(MB)
      --------------------
        0        8      152
        1        0        0
        2        0        0
        3        0        0
        4        0        0
        5        0        0
        6        0        0
      
      # Benchmark
      TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true
      
      # Pre PR
      readseq [AVG    5 runs] : 1432116 (± 59664) ops/sec;  224.0 (± 9.3) MB/sec
      readseq [MEDIAN 5 runs] : 1454886 ops/sec;  227.5 MB/sec
      
      # Post PR
      readseq [AVG    5 runs] : 1944425 (± 29521) ops/sec;  304.1 (± 4.6) MB/sec
      readseq [MEDIAN 5 runs] : 1959430 ops/sec;  306.5 MB/sec
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D40710936
      
      Pulled By: cbi42
      
      fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
    • S
      sst_dump --command=raw to add index offset information (#10873) · 3e686c7c
      Committed by sdong
      Summary:
      Add some extra information to the output of "sst_dump --command=raw" to help debug some issues. Right now, the encoded block handle is printed out; it is more useful to print the offset and size directly.
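      For context, an SST block handle is encoded as two varint64s (offset, then size), which is why the raw encoded bytes are hard to read at a glance. A minimal decoder sketch (a common LEB128-style varint scheme; not RocksDB's actual `GetVarint64` implementation):

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <string>

      // Decode one varint64 starting at *pos: 7 data bits per byte,
      // high bit set means "more bytes follow".
      bool GetVarint64(const std::string& input, size_t* pos, uint64_t* value) {
        uint64_t result = 0;
        for (int shift = 0; shift <= 63 && *pos < input.size(); shift += 7) {
          uint8_t byte = static_cast<uint8_t>(input[(*pos)++]);
          result |= static_cast<uint64_t>(byte & 0x7f) << shift;
          if ((byte & 0x80) == 0) {
            *value = result;
            return true;
          }
        }
        return false;  // truncated or overlong encoding
      }

      int main() {
        // Encoded handle: offset=300 (0xAC 0x02), size=5 (0x05).
        std::string handle = {'\xac', '\x02', '\x05'};
        size_t pos = 0;
        uint64_t offset = 0, size = 0;
        assert(GetVarint64(handle, &pos, &offset) && offset == 300);
        assert(GetVarint64(handle, &pos, &size) && size == 5);
        return 0;
      }
      ```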
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10873
      
      Test Plan: Manually run it against a file and check the output.
      
      Reviewed By: anand1976
      
      Differential Revision: D40742289
      
      fbshipit-source-id: 04d7de26e7f27e1595a7cc3ac1c1082e4e835b93
  35. 26 Oct, 2022 1 commit