1. 10 Feb 2023, 1 commit
    • Put Cache and CacheWrapper in new public header (#11192) · 3cacd4b4
      Committed by Peter Dillinger
      Summary:
      The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment.
      
      In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544
      
      Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job, except for removing the implementation of Name() so that derived classes must provide it.
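      
      As a rough illustration of the intended usage (a minimal sketch assuming the new advanced_cache.h header path and that CacheWrapper keeps a constructor taking the wrapped shared_ptr<Cache>; the wrapper class here is hypothetical):
      
      ```cpp
      #include <memory>
      
      #include "rocksdb/advanced_cache.h"
      
      // Hypothetical wrapper: CacheWrapper forwards every Cache call to the wrapped
      // target, so a derived class only overrides what it cares about, plus Name(),
      // which no longer has a default implementation.
      class CountingCache : public ROCKSDB_NAMESPACE::CacheWrapper {
       public:
        explicit CountingCache(std::shared_ptr<ROCKSDB_NAMESPACE::Cache> target)
            : CacheWrapper(std::move(target)) {}
      
        const char* Name() const override { return "CountingCache"; }
      };
      ```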
      
      Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192
      
      Test Plan: `make check`
      
      Reviewed By: anand1976
      
      Differential Revision: D43055487
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110
  2. 08 Feb 2023, 1 commit
  3. 31 Jan 2023, 1 commit
    • Use user key on sst file for blob verification for Get and MultiGet (#11105) · 24ac53d8
      Committed by Yu Zhang
      Summary:
      Use the user key in the SST file for blob verification for `Get` and `MultiGet`, instead of the user key passed in by the caller.
      
      Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105
      
      Test Plan:
      make V=1 db_blob_basic_test
      ./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*"
      
      Reviewed By: ltamasi
      
      Differential Revision: D42716487
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a
  4. 28 Jan 2023, 1 commit
    • Remove RocksDB LITE (#11147) · 4720ba43
      Committed by sdong
      Summary:
      We haven't been actively maintaining RocksDB LITE recently and its size must have gone up significantly. We are removing support for it.
      
      Most of the changes were made with the following command:
      
      unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
      
      by Peter Dillinger. Other changes were applied manually to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and db_stress_test_base.cc.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
      
      Test Plan: See CI
      
      Reviewed By: pdillinger
      
      Differential Revision: D42796341
      
      fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
  5. 27 Jan 2023, 1 commit
  6. 26 Jan 2023, 1 commit
  7. 25 Jan 2023, 1 commit
  8. 21 Jan 2023, 1 commit
    • Add API to limit blast radius of merge operator failure (#11092) · b7fbcefd
      Committed by Andrew Kryczka
      Summary:
      Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage.
      
      To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, merge operator failure handling is the same as before. When set to `kMustMerge`, a merge operator failure still causes failure for operations that must merge (`Get()`, iterators, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands.
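      
      For illustration, a minimal sketch of a merge operator using this API (the operator class and its failure condition are hypothetical; the field and enum names follow the description above, so treat the exact spelling as an assumption):
      
      ```cpp
      #include <string>
      
      #include "rocksdb/merge_operator.h"
      
      // Hypothetical operator: appends operands to the existing value, but when an
      // operand is malformed (empty here, for illustration) it scopes the failure
      // to operations that must merge, per the description above.
      class SafeAppendOperator : public ROCKSDB_NAMESPACE::MergeOperator {
       public:
        const char* Name() const override { return "SafeAppendOperator"; }
      
        bool FullMergeV2(const MergeOperationInput& merge_in,
                         MergeOperationOutput* merge_out) const override {
          std::string result =
              merge_in.existing_value ? merge_in.existing_value->ToString() : "";
          for (const auto& operand : merge_in.operand_list) {
            if (operand.empty()) {
              // Reads that must merge still fail, but flushes/compactions may
              // keep the unmerged operands instead of forcing read-only mode.
              merge_out->op_failure_scope = OpFailureScope::kMustMerge;
              return false;
            }
            result.append(operand.data(), operand.size());
          }
          merge_out->new_value = std::move(result);
          return true;
        }
      };
      ```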
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092
      
      Reviewed By: siying
      
      Differential Revision: D42525673
      
      Pulled By: ajkr
      
      fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee
  9. 20 Jan 2023, 1 commit
  10. 14 Jan 2023, 1 commit
  11. 12 Jan 2023, 1 commit
    • Major Cache refactoring, CPU efficiency improvement (#10975) · 9f7801c5
      Committed by Peter Dillinger
      Summary:
      This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to the Cache class, and it becomes more of a low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because it rather effectively decouples the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
      
      The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
      
      * static_cast lines of code +29 -35 (net removed 6)
      * reinterpret_cast lines of code +6 -32 (net removed 26)
      
      ## cache.h and secondary_cache.h
      * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
        * Simpler for implementations to deal with just one Insert and one Lookup.
        * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
        * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
        * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
        * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
        * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
      * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
      * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
      * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
      * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
      * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
      * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
      
      ## typed_cache.h
      Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
      
      The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
      * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
      * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
      * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
      * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
      
      These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
      
      Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
      
      ## block_cache.h
      This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
      
      ## block_based_table_reader.cc
      Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
      
      The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
      
      ## block_based_table_builder.cc, cache_dump_load_impl.cc
      Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
      
      ## Everything else
      Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
      
      Test Plan:
      tests updated
      
      Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
      
      34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
      34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
      34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
      34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
      34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
      34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
      34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
      34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
      233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
      233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
      233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
      233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
      233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
      233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
      233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
      233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
      1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
      1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
      1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
      1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
      1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
      1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
      1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
      1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
      
      Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
      
      Reviewed By: anand1976
      
      Differential Revision: D42417818
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
  12. 22 Dec 2022, 1 commit
    • Avoid mixing sync and async prefetch (#11050) · bec42648
      Committed by anand76
      Summary:
      Reading uncompression dict block always uses sync reads, while data blocks may use async reads and prefetching. This causes problems in FilePrefetchBuffer. So avoid mixing the two by reading the uncompression dict straight from the file.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11050
      
      Test Plan: Crash test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D42194682
      
      Pulled By: anand1976
      
      fbshipit-source-id: aaa8b396fdfe966b157e210f5ef8501c45b7b69e
  13. 16 Dec 2022, 1 commit
    • Consider range tombstone in compaction output file cutting (#10802) · f02c708a
      Committed by Changyu Bi
      Summary:
      This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that do not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR addresses the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvements after this PR include 1) cutting compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key, and 2) cutting a compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802
      
      Test Plan:
      - added unit tests in db_range_del_test.
      - stress test: `python3 tools/db_crashtest.py whitebox --[simple|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercent=21 --duration=36000 --range_deletion_width=1000000`
      
      Reviewed By: ajkr, jay-zhuang
      
      Differential Revision: D40308827
      
      Pulled By: cbi42
      
      fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df
  14. 13 Dec 2022, 1 commit
  15. 10 Dec 2022, 1 commit
    • Improve error messages for SST footer and size errors (#11009) · 433d7e45
      Committed by Peter Dillinger
      Summary:
      Previously, you could get a format_version error if SST file size was too small in manifest, or a weird "too short" error if too big in manifest. Now we ensure:
      * Magic number error is reported first if we attempt to open an SST file and the footer is completely bad.
      * Footer errors are reported with affected file.
      * If manifest file size doesn't match actual, then the error includes expected and actual sizes (if an error is reported; in some cases we allow the file to be too big)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11009
      
      Test Plan:
      unit tests added, some manual
      
      Previously, the code for "file too short" in footer processing was only covered by some tests attempting to verify SST checksums on non-SST files (fixed).
      
      Reviewed By: siying
      
      Differential Revision: D41656272
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 3da32702eb5aaedbea0e5e74742ad57edd7ad3df
  16. 29 Nov 2022, 1 commit
    • Remove copying of range tombstones keys in iterator (#10878) · 6cdb7af9
      Committed by Changyu Bi
      Summary:
      In MergingIterator, if a range tombstone's start or end key is added to minHeap/maxHeap, the key is copied. This PR removes the copying of range tombstone keys by adding an InternalKey comparator that compares a `Slice` internal key and a `ParsedInternalKey` directly.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10878
      
      Test Plan:
      - existing UT
      - ran all flavors of stress test through sandcastle
      - benchmarks: I did not see an improvement when compiling with DEBUG_LEVEL=0, and observed a lot of noise. With `OPTIMIZE_LEVEL="-O3" USE_LTO=1` I do see an improvement.
      ```
      # Favorable set up: half of the writes are DeleteRange.
      TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=1000000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=50
      
      # benchmark command
      TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472  --disable_auto_compactions=true --avoid_flush_during_recovery=true --seek_nexts=100 --reads=1000000 --num=1000000 --threads=25
      
      # main
      readseq [AVG    5 runs] : 26017977 (± 371077) ops/sec; 3721.9 (± 53.1) MB/sec
      readseq [MEDIAN 5 runs] : 26096905 ops/sec; 3733.2 MB/sec
      
      # this PR
      readseq [AVG    5 runs] : 27481724 (± 568758) ops/sec; 3931.3 (± 81.4) MB/sec
      readseq [MEDIAN 5 runs] : 27323957 ops/sec; 3908.7 MB/sec
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D40711170
      
      Pulled By: cbi42
      
      fbshipit-source-id: 708cb584e2bd085a9ce0d2ef6a420489f721717f
  17. 24 Nov 2022, 1 commit
    • Prevent iterating over range tombstones beyond `iterate_upper_bound` (#10966) · 534fb06d
      Committed by Changyu Bi
      Summary:
      Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966
      
      Test Plan:
      - added unit test
      - stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercent=15 --duration=36000 --range_deletion_width=100`
      - ran different stress tests over sandcastle
      - Falcon team ran some test traffic and saw reduced CPU usage on processing range tombstones.
      
      Reviewed By: ajkr
      
      Differential Revision: D41414172
      
      Pulled By: cbi42
      
      fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981
  18. 12 Nov 2022, 1 commit
    • Don't attempt to use SecondaryCache on block_cache_compressed (#10944) · f321e8fc
      Committed by Peter Dillinger
      Summary:
      Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom!
      
      This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed.
      
      Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache.
      
      This problem was discovered during refactoring, but I didn't want to combine the bug fix with that refactoring.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944
      
      Test Plan: test added that fails on base revision (at least with ASAN)
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D41205578
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f
  19. 10 Nov 2022, 1 commit
    • Revisit the interface of MergeHelper::TimedFullMerge(WithEntity) (#10932) · 2ea10952
      Committed by Levi Tamasi
      Summary:
      The patch refines/reworks `MergeHelper::TimedFullMerge(WithEntity)`
      a bit in two ways. First, it eliminates the recently introduced `TimedFullMerge`
      overload, which makes the responsibilities clearer by making sure the query
      result (`value` for `Get`, `columns` for `GetEntity`) is set uniformly in
      `SaveValue` and `GetContext`. Second, it changes the interface of
      `TimedFullMergeWithEntity` so it exposes its result in a serialized form; this
      is a more decoupled design which will come in handy when adding support
      for `Merge` with wide-column entities to `DBIter`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10932
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D41129399
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 69d8da358c77d4fc7e8c40f4dafc2c129a710677
  20. 08 Nov 2022, 1 commit
    • Fix a bug where GetContext does not update READ_NUM_MERGE_OPERANDS (#10925) · fbd9077d
      Committed by Levi Tamasi
      Summary:
      The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D41096453
      
      Pulled By: ltamasi
      
      fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8
  21. 03 Nov 2022, 1 commit
    • Support Merge for wide-column entities during point lookups (#10916) · 941d8347
      Committed by Levi Tamasi
      Summary:
      The patch adds `Merge` support for wide-column entities to the point lookup
      APIs, i.e. `Get`, `MultiGet`, `GetEntity`, and `GetMergeOperands`. (I plan to
      update the iterator and compaction logic in separate PRs.) In terms of semantics,
      the `Merge` operation is applied to the default (anonymous) column; any other
      columns in the entity are unaffected.
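      
      A minimal sketch of the semantics described above (assuming a string-append style merge operator is configured on the DB; the key and column names are made up):
      
      ```cpp
      #include "rocksdb/db.h"
      #include "rocksdb/wide_columns.h"
      
      // After these calls, the default (anonymous) column holds the merged value
      // while the "attr" column is left untouched, per the semantics above.
      ROCKSDB_NAMESPACE::Status MergeOnEntity(ROCKSDB_NAMESPACE::DB* db) {
        using namespace ROCKSDB_NAMESPACE;
        WideColumns columns{{kDefaultWideColumnName, "base"}, {"attr", "unchanged"}};
        Status s = db->PutEntity(WriteOptions(), db->DefaultColumnFamily(), "key",
                                 columns);
        if (!s.ok()) return s;
        s = db->Merge(WriteOptions(), "key", "suffix");
        if (!s.ok()) return s;
        PinnableWideColumns result;
        return db->GetEntity(ReadOptions(), db->DefaultColumnFamily(), "key",
                             &result);
      }
      ```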
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10916
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D40962311
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 244bc9d172be1af2f204796b2f89104e4d2fa373
  22. 01 Nov 2022, 1 commit
    • Basic Support for Merge with user-defined timestamp (#10819) · 7d26e4c5
      Committed by Yanqin Jin
      Summary:
      This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.
      
      Simplest usage:
      ```cpp
      // assume string append merge op is used with '.' as delimiter.
      // ts1 < ts2
      db->Put(WriteOptions(), "key", ts1, "v0");
      db->Merge(WriteOptions(), "key", ts2, "1");
      ReadOptions ro;
      ro.timestamp = &ts2;
      db->Get(ro, "key", &value);
      ASSERT_EQ("v0.1", value);
      ```
      
      Some code comments are added for clarity.
      
      Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D40603195
      
      Pulled By: riversand963
      
      fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
  23. 29 Oct 2022, 3 commits
    • Fix deletion counting in memtable stats (#10886) · 9079895a
      Committed by Yanqin Jin
      Summary:
      Currently, a memtable's stats `num_deletes_` is incremented only if the entry is a regular delete (kTypeDeletion). We need to fix it by accounting for kTypeSingleDeletion and kTypeDeletionWithTimestamp.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10886
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D40740754
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7bde62cd6df136585bc5bfb1c426c7a8276c08e1
    • Fix a Windows build error (#10897) · 36f5e19e
      Committed by Jay Zhuang
      Summary:
      The for loop is marked as unreachable code because it will never call the increment. Switch it to `if`.
      
      ```
      \table\merging_iterator.cc(823): error C2220: the following warning is treated as an error
      \table\merging_iterator.cc(823): warning C4702: unreachable code
      \table\merging_iterator.cc(1030): error C2220: the following warning is treated as an error
      \table\merging_iterator.cc(1030): warning C4702: unreachable code
      ```
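      
      The pattern, illustrated on hypothetical code rather than the actual merging_iterator.cc:
      
      ```cpp
      #include <vector>
      
      // When every path through the loop body returns before the increment can run,
      // MSVC flags the increment as unreachable (C4702). Expressing the single pass
      // as an `if` keeps the behavior and silences the warning.
      int FirstElementOr(const std::vector<int>& v, int fallback) {
        // Before (warns): for (size_t i = 0; i < v.size(); ++i) { return v[i]; }
        if (!v.empty()) {
          return v[0];
        }
        return fallback;
      }
      ```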
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10897
      
      Reviewed By: cbi42
      
      Differential Revision: D40811790
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: fe8fd3e7cf3d6f710360c402b79763854d5120df
    • Handle Merges correctly in GetEntity (#10894) · 7867a111
      Committed by Levi Tamasi
      Summary:
      The PR fixes the handling of `Merge`s in `GetEntity`. Note that `Merge` is not yet
      supported for wide-column entities written using `PutEntity`; this change is
      about returning correct (i.e. consistent with `Get`) results in cases like when the
      base value is a plain old key-value written using `Put` or when there is no real base
      value because we hit either a tombstone or the beginning of history.
      
      Implementation-wise, the patch introduces a new wrapper around the existing
      `MergeHelper::TimedFullMerge` that can store the merge result in either a string
      (for the purposes of `Get`) or a `PinnableWideColumns` instance (for `GetEntity`).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10894
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D40782708
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 3d700d56b2ef81f02ba1e2d93f6481bf13abcc90
  24. 28 Oct 2022, 2 commits
    • Reduce heap operations for range tombstone keys in iterator (#10877) · 56715350
      Committed by Changyu Bi
      Summary:
      Right now in MergingIterator, for each range tombstone start and end key, we pop one end from the heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely case that a range tombstone iterator emits relatively adjacent keys, these keys should have a similar order among all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them have already been dropped. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases.
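      
      For illustration, a standalone max-heap sketch of the idea (not RocksDB's BinaryHeap): replacing the top and doing a single sift-down is cheaper than a pop (down-heap) followed by a push (up-heap).
      
      ```cpp
      #include <utility>
      #include <vector>
      
      // Overwrite the root of a max-heap stored in `heap` and restore the heap
      // property with one sift-down pass.
      void ReplaceTop(std::vector<int>& heap, int new_value) {
        heap[0] = new_value;
        size_t i = 0;
        const size_t n = heap.size();
        while (true) {
          size_t largest = i;
          const size_t left = 2 * i + 1;
          const size_t right = 2 * i + 2;
          if (left < n && heap[left] > heap[largest]) largest = left;
          if (right < n && heap[right] > heap[largest]) largest = right;
          if (largest == i) break;  // heap property restored
          std::swap(heap[i], heap[largest]);
          i = largest;
        }
      }
      ```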
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10877
      
      Test Plan:
      - existing UT
      - ran all flavors of stress test through sandcastle
      - benchmark:
      ```
      # Set up: --writes_per_range_tombstone=1 means one point write and one delete range
      
      TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64
      
      Level Files Size(MB)
      --------------------
        0        8      152
        1        0        0
        2        0        0
        3        0        0
        4        0        0
        5        0        0
        6        0        0
      
      # Benchmark
      TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true
      
      # Pre PR
      readseq [AVG    5 runs] : 1432116 (± 59664) ops/sec;  224.0 (± 9.3) MB/sec
      readseq [MEDIAN 5 runs] : 1454886 ops/sec;  227.5 MB/sec
      
      # Post PR
      readseq [AVG    5 runs] : 1944425 (± 29521) ops/sec;  304.1 (± 4.6) MB/sec
      readseq [MEDIAN 5 runs] : 1959430 ops/sec;  306.5 MB/sec
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D40710936
      
      Pulled By: cbi42
      
      fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
    • sst_dump --command=raw to add index offset information (#10873) · 3e686c7c
      Committed by sdong
      Summary:
      Add some extra information to the output of `sst_dump --command=raw` to help debug some issues. Right now, the encoded block handle is printed out; it is more useful to directly print out the offset and size.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10873
      
      Test Plan: Manually run it against a file and check the output.
      
      Reviewed By: anand1976
      
      Differential Revision: D40742289
      
      fbshipit-source-id: 04d7de26e7f27e1595a7cc3ac1c1082e4e835b93
  25. 26 Oct 2022, 2 commits
    • Format files under table/ by clang-format (#10852) · 727bad78
      Committed by anand76
      Summary:
      Run clang-format on files under the `table` directory.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10852
      
      Reviewed By: ajkr
      
      Differential Revision: D40650732
      
      Pulled By: anand1976
      
      fbshipit-source-id: 2023a958e37fd6274040c5181130284600c9e0ef
    • Improve FragmentTombstones() speed by lazily initializing `seq_set_` (#10848) · 7a959388
      Committed by Changyu Bi
      Summary:
      FragmentedRangeTombstoneList has a member variable `seq_set_`, a set containing the sequence numbers of all range tombstones. The set is constructed in `FragmentTombstones()` and is used only in `FragmentedRangeTombstoneList::ContainsRange()`, which only happens during compaction. This PR moves the initialization of `seq_set_` to `FragmentedRangeTombstoneList::ContainsRange()`. This should speed up `FragmentTombstones()` when the range tombstone list is used for read/scan requests. Microbench shows the speed improvement to be ~45%.
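      
      A simplified sketch of the lazy-initialization pattern described above (illustrative only, not the actual FragmentedRangeTombstoneList code):
      
      ```cpp
      #include <cstdint>
      #include <set>
      #include <vector>
      
      class TombstoneSeqs {
       public:
        void Add(uint64_t seq) { seqs_.push_back(seq); }
      
        // Only the compaction-time query needs the set, so it is built on first
        // use rather than eagerly during fragmentation; the read/scan path never
        // pays for it.
        bool Contains(uint64_t seq) {
          if (!initialized_) {
            seq_set_.insert(seqs_.begin(), seqs_.end());
            initialized_ = true;
          }
          return seq_set_.count(seq) > 0;
        }
      
       private:
        std::vector<uint64_t> seqs_;
        std::set<uint64_t> seq_set_;
        bool initialized_ = false;
      };
      ```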
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10848
      
      Test Plan:
      - Existing tests and stress test: `python3 tools/db_crashtest.py whitebox --simple  --verify_iterator_with_expected_state_one_in=5`.
      - Microbench: update `range_del_aggregator_bench` to benchmark speed of `FragmentTombstones()`:
      ```
      ./range_del_aggregator_bench --num_range_tombstones=1000 --tombstone_start_upper_bound=50000000 --num_runs=10000 --tombstone_width_mean=200 --should_deletes_per_run=100 --use_compaction_range_del_aggregator=true
      
      Before this PR:
      =========================
      Fragment Tombstones:     270.286 us
      AddTombstones:           1.28933 us
      ShouldDelete (first):    0.525528 us
      ShouldDelete (rest):     0.0797519 us
      
      After this PR: time to fragment tombstones is pushed to AddTombstones() which only happen during compaction.
      =========================
      Fragment Tombstones:     149.879 us
      AddTombstones:           102.131 us
      ShouldDelete (first):    0.565871 us
      ShouldDelete (rest):     0.0729444 us
      ```
      - db_bench: this should improve speed for fragmenting range tombstones for mutable memtable:
      ```
      ./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=100 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=250000 --disable_auto_compactions --max_num_range_tombstones=100000 --finish_after_writes --write_buffer_size=1073741824 --threads=25
      
      Before this PR:
      readwhilewriting :      18.301 micros/op 1310445 ops/sec 4.769 seconds 6250000 operations;   28.1 MB/s (41001 of 250000 found)
      After this PR:
      readwhilewriting :      16.943 micros/op 1439376 ops/sec 4.342 seconds 6250000 operations;   23.8 MB/s (28977 of 250000 found)
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D40646227
      
      Pulled By: cbi42
      
      fbshipit-source-id: ea471667edb258f67d01cfd828588e80a89e4083
  26. 24 Oct 2022, 1 commit
    • Remove range tombstone test code from sst_file_reader (#10847) · deb6a24b
      Committed by Changyu Bi
      Summary:
      `#include "db/range_tombstone_fragmenter.h"` seems to break some internal test for 7.8 release. I'm removing it from sst_file_reader.h for now to unblock release. This should be fine as it is only used in a unit test for DeleteRange with timestamp. In addition, it does not seem to be useful to support delete range for sst file writer, since the range tombstone won't cover any key (its sequence number is 0). So maybe we can remove it in the future.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10847
      
      Test Plan: CI.
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D40620865
      
      Pulled By: cbi42
      
      fbshipit-source-id: be44b2f31e062bff87ed1b8d94482c3f7eaa370c
  27. 22 Oct 2022, 2 commits
    • Use kXXH3 as default checksum (CPU efficiency) (#10778) · 27c9705a
      Committed by Peter Dillinger
      Summary:
      Since this has been supported for about a year, I think it's time to make it the default. This should improve CPU efficiency slightly on most hardware.
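      
      The checksum type is configured per table factory; a sketch of opting back into the old default (assuming the field and enum names declared in `rocksdb/table.h`):
      
      ```cpp
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"
      
      ROCKSDB_NAMESPACE::Options MakeOptions() {
        ROCKSDB_NAMESPACE::BlockBasedTableOptions table_options;
        // kXXH3 is now the default; set kCRC32c explicitly to keep the old behavior.
        table_options.checksum = ROCKSDB_NAMESPACE::kCRC32c;
        ROCKSDB_NAMESPACE::Options options;
        options.table_factory.reset(
            ROCKSDB_NAMESPACE::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```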
      
      A current DB performance comparison using buck+clang build:
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -checksum_type={1,4} -benchmarks=fillseq[-X1000] -num=3000000 -disable_wal
      ```
      kXXH3 (+0.2% DB write throughput):
      `fillseq [AVG    1000 runs] : 822149 (± 1004) ops/sec;   91.0 (± 0.1) MB/sec`
      kCRC32c:
      `fillseq [AVG    1000 runs] : 820484 (± 1203) ops/sec;   90.8 (± 0.1) MB/sec`
      
      Micro benchmark comparison:
      ```
      ./db_bench --benchmarks=xxh3[-X20],crc32c[-X20]
      ```
      Machine 1, buck+clang build:
      `xxh3 [AVG    20 runs] : 3358616 (± 19091) ops/sec; 13119.6 (± 74.6) MB/sec`
      `crc32c [AVG    20 runs] : 2578725 (± 7742) ops/sec; 10073.1 (± 30.2) MB/sec`
      
      Machine 2, make+gcc build, DEBUG_LEVEL=0 PORTABLE=0:
      `xxh3 [AVG    20 runs] : 6182084 (± 137223) ops/sec; 24148.8 (± 536.0) MB/sec`
      `crc32c [AVG    20 runs] : 5032465 (± 42454) ops/sec; 19658.1 (± 165.8) MB/sec`
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10778
      
      Test Plan: make check, unit tests updated
      
      Reviewed By: ajkr
      
      Differential Revision: D40112510
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e59a8d50a60346137732f8668ba7cfac93be2b37
    • Refactor block cache tracing APIs (#10811) · 0e7b27bf
      Committed by akankshamahajan
      Summary:
      Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter.
      
      `DB::StartBlockTrace` will internally redirect to the changed `BlockCacheTrace::StartBlockCacheTrace`.
      A new `DB::StartBlockTrace` API is also added that directly takes a `BlockCacheTraceWriter` pointer.
      
      This same philosophy can be applied to KV and IO tracing as well.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811
      
      Test Plan:
      existing unit tests
      Old API DB::StartBlockTrace checked with db_bench tool
      create database
      ```
      ./db_bench --benchmarks="fillseq" \
      --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
      --cache_index_and_filter_blocks --cache_size=1048576 \
      --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
      --min_level_to_compress=-1 --compression_ratio=1 --num=10000000
      ```
      
      To trace block cache accesses when running readrandom benchmark:
      ```
      ./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \
      --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
      --cache_index_and_filter_blocks --cache_size=1048576 \
      --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
      --min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \
      --threads=16 \
      -block_cache_trace_file="/tmp/binary_trace_test_example" \
      -block_cache_trace_max_trace_file_size_in_bytes=1073741824 \
      -block_cache_trace_sampling_frequency=1
      
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D40435289
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282
  28. 19 Oct 2022, 1 commit
    • Refactor ShardedCache for more sharing, static polymorphism (#10801) · 7555243b
      Committed by Peter Dillinger
      Summary:
      The motivations for this change include
      * Free up space in ClockHandle so that we can add data for secondary cache handling while still keeping within single cache line (64 byte) size.
        * This change frees up space by eliminating the need for the `hash` field by making the fixed-size key itself a hash, using a 128-bit bijective (lossless) hash.
      * Generally more customizability of ShardedCache (such as hashing) without worrying about virtual call overheads
        * ShardedCache now uses static polymorphism (template) instead of dynamic polymorphism (virtual overrides) for the CacheShard. No obvious performance benefit is seen from the change (as mostly expected; most calls to virtual functions in CacheShard could already be optimized to static calls), but offers more flexibility without incurring the runtime cost of adhering to a common interface (without type parameters or static callbacks).
        * You'll also notice less `reinterpret_cast`ing and other boilerplate in the Cache implementations, as this can go in ShardedCache.
      
      More detail:
      * Don't have LRUCacheShard maintain `std::shared_ptr<SecondaryCache>` copies (extra refcount) when LRUCache can be in charge of keeping a `shared_ptr`.
      * Renamed `capacity_mutex_` to `config_mutex_` to better represent the scope of what it guards.
      * Some preparation for 64-bit hash and indexing in LRUCache, but didn't include the full change because of slight performance regression.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10801
      
      Test Plan:
      Unit test updates were non-trivial because of major changes to the ClockCacheShard interface in handling of key vs. hash.
      
      Performance:
      Create with `TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16`
      
      Test with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X1000] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=16
      ```
      
      Before: `readrandom [AVG 150 runs] : 321147 (± 253) ops/sec`
      After: `readrandom [AVG 150 runs] : 321530 (± 326) ops/sec`
      
      So possibly ~0.1% improvement.
      
      And with `-cache_type=hyper_clock_cache`:
      Before: `readrandom [AVG 30 runs] : 614126 (± 7978) ops/sec`
      After: `readrandom [AVG 30 runs] : 645349 (± 8087) ops/sec`
      
      So roughly 5% improvement!
      
      Reviewed By: anand1976
      
      Differential Revision: D40252236
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ff8fc70ef569585edc95bcbaaa0386f61355ae5b
  29. 18 Oct 2022, 1 commit
    • Print stack traces on frozen tests in CI (#10828) · e466173d
      Committed by Peter Dillinger
      Summary:
      Instead of the existing calls to ps from gnu_parallel, call a new wrapper that runs ps, looks for unit-test-like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.
      
      For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)
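      
      A sketch of the kind of opt-in this work-around relies on under `ptrace_scope = 1` (the exact mechanism inside `InstallStackTraceHandler()` may differ; this is illustrative):
      
      ```cpp
      #include <sys/prctl.h>
      
      // With yama ptrace_scope=1, only ancestors or explicitly allowed processes
      // may attach. Allowing any process to attach lets the CI wrapper run
      // pstack/gdb against a frozen unit test process.
      void AllowAnyProcessToAttachForDebugging() {
      #ifdef PR_SET_PTRACER
        prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, 0, 0, 0);
      #endif
      }
      ```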
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828
      
      Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI
      
      Reviewed By: hx235
      
      Differential Revision: D40447634
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
  30. 01 Oct 2022, 1 commit
    • User-defined timestamp support for `DeleteRange()` (#10661) · 9f2363f4
      Committed by Changyu Bi
      Summary:
      Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
      - internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
      - Garbage collection of range tombstones and range tombstone covered keys during compaction.
      - Get()/MultiGet() to return the timestamp of a range tombstone when needed.
      - Get/Iterator with range tombstones bounded by readoptions.timestamp.
      - timestamp crash test now issues DeleteRange by default.
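      
      A minimal sketch of the API shape described above, `DeleteRange(opt, cf, begin_key, end_key, ts)`; the keys are made up, and the timestamp slice must be encoded to match the column family's timestamp-aware comparator:
      
      ```cpp
      #include <string>
      
      #include "rocksdb/db.h"
      
      ROCKSDB_NAMESPACE::Status DeleteRangeAt(ROCKSDB_NAMESPACE::DB* db,
                                              const std::string& encoded_ts) {
        return db->DeleteRange(ROCKSDB_NAMESPACE::WriteOptions(),
                               db->DefaultColumnFamily(), "begin_key", "end_key",
                               encoded_ts);
      }
      ```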
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661
      
      Test Plan:
      - Added unit test: `make check`
      - Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
      - Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with the DB fitting in memory: `./db_bench --benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see a consistent regression in the no-timestamp case.
      
      | micros/op | fillrandom | seekrandom |
      | --- | --- | --- |
      |main| 2.58 |10.96|
      |PR 10661| 2.68 |10.63|
      
      Reviewed By: riversand963
      
      Differential Revision: D39441192
      
      Pulled By: cbi42
      
      fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
  31. 30 Sep 2022, 1 commit
    • Align compaction output file boundaries to the next level ones (#10655) · f3cc6663
      Committed by Jay Zhuang
      Summary:
      Try to align the compaction output file boundaries to the next level ones
      (grandparent level), to reduce the level compaction write-amplification.
      
      In level compaction, there is "wasted" data at the beginning and end of the
      output-level files. Aligning the file boundaries can avoid such "wasted" compaction.
      With this PR, it tries to align the non-bottommost-level file boundaries to those
      of the next level. It may cut a file when the file size is large enough (at least
      50% of target_file_size) and not too large (at most 2x target_file_size).
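      
      The cutting rule described above, as a hedged sketch (a hypothetical helper, not the actual CompactionOutputs code):
      
      ```cpp
      #include <cstdint>
      
      // Cut at a next-level (grandparent) boundary only when the current output is
      // already at least half the target size and has not yet exceeded 2x of it.
      bool ShouldCutAtNextLevelBoundary(uint64_t current_output_size,
                                        uint64_t target_file_size) {
        return current_output_size >= target_file_size / 2 &&
               current_output_size <= 2 * target_file_size;
      }
      ```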
      
      db_bench shows about 12.56% compaction reduction:
      ```
      TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432
      
      # baseline:
      Flush(GB): cumulative 25.882, interval 7.216
      Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds
      
      # with this change:
      Flush(GB): cumulative 25.882, interval 7.753
      Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
      ```
      
      The compaction simulator shows a similar result (14% with 100G random data).
      As a side effect, with this PR, the SST file size can exceed the
      target_file_size, but is capped at 2x target_file_size. And there will be
      smaller files. Here are file size statistics when loading 100GB with the target
      file size 32MB:
      ```
                baseline      this_PR
      count  1.656000e+03  1.705000e+03
      mean   3.116062e+07  3.028076e+07
      std    7.145242e+06  8.046139e+06
      ```
      
      The feature is enabled by default, to revert to the old behavior disable it
      with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`
      
      Also includes https://github.com/facebook/rocksdb/issues/1963 to cut a file before a skippable
      grandparent file. This is for use cases like a user adding 2 or more non-overlapping data
      ranges at the same time; it can reduce the overlap of the 2 datasets in the lower levels.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655
      
      Reviewed By: cbi42
      
      Differential Revision: D39552321
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
  32. 24 Sep 2022, 1 commit
    • Fix DBImpl::GetLatestSequenceForKey() for Merge (#10724) · 07249fea
      Committed by Yanqin Jin
      Summary:
      Currently, without this fix, DBImpl::GetLatestSequenceForKey() may not return the latest sequence number for merge operands of the key. This can cause conflict checking during optimistic transaction commit phase to fail. Fix it by always returning the latest sequence number of the key, also considering range tombstones.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10724
      
      Test Plan: make check
      
      Reviewed By: cbi42
      
      Differential Revision: D39756847
      
      Pulled By: riversand963
      
      fbshipit-source-id: 0764c3dd4cb24960b37e18adccc6e7feed0e6876
  33. 23 Sep 2022, 1 commit
    • Refactor to avoid confusing "raw block" (#10408) · ef443cea
      Committed by Peter Dillinger
      Summary:
      We have a lot of confusing code because of mixed, sometimes
      completely opposite uses of the term "raw block" or "raw contents",
      sometimes within the same source file. For example, in `BlockBasedTableBuilder`,
      `raw_block_contents` and `raw_size` generally referred to uncompressed block
      contents and size, while `WriteRawBlock` referred to writing a block that
      is already compressed if it is going to be. Meanwhile, in
      `BlockBasedTable`, `raw_block_contents` either referred to a (maybe
      compressed) block with trailer, or a maybe compressed block maybe
      without trailer. (Note: left as follow-up work to use C++ typing to
      better sort out the various kinds of BlockContents.)
      
      This change primarily tries to apply some consistent terminology around
      the kinds of block representations, avoiding the unclear "raw". (Any
      meaning of "raw" assumes some bias toward the storage layer or toward
      the logical data layer.) Preferred terminology:
      
      * **Serialized block** - bytes that go into storage. For block-based table
      (usually the case) this includes the block trailer. WART: block `size` may or
      may not include the trailer; need to be clear about whether it does or not.
      * **Maybe compressed block** - like a serialized block, but without the
      trailer (or no promise of including a trailer). Must be accompanied by a
      CompressionType.
      * **Uncompressed block** - "payload" bytes that are either stored with no
      compression, used as input to compression function, or result of
      decompression function.
      * **Parsed block** - an in-memory form of a block in block cache, as it is
      used by the table reader. Different C++ types are used depending on the
      block type (see block_like_traits.h).
      
      Other refactorings:
      * Misc corrections/improvements of internal API comments
      * Remove a few misleading / unhelpful / redundant comments.
      * Use move semantics in some places to simplify contracts
      * Use better parameter names to indicate which parameters are used for
      outputs
      * Remove some extraneous `extern`
      * Various clean-ups to `CacheDumperImpl` (mostly unnecessary code)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10408
      
      Test Plan: existing tests
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38172617
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ccb99299f324ac5ca46996d34c5089621a4f260c
  34. 22 Sep 2022, 2 commits
    • Fix platform 10 build with folly (#10708) · fb9a0258
      Committed by anand76
      Summary:
      Change the library order in PLATFORM_LDFLAGS to enable fbcode platform 10 build with folly. This PR also has a few fixes for platform 10 compiler errors.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10708
      
      Test Plan:
      ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_COROUTINES=1 make -j64 check
      ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_FOLLY=1 make -j64 check
      
      Reviewed By: ajkr
      
      Differential Revision: D39666590
      
      Pulled By: anand1976
      
      fbshipit-source-id: 256a1127ef561399cd6299a6a392ca29bd68ca44
    • Fix memtable-only iterator regression (#10705) · 749b849a
      Committed by Changyu Bi
      Summary:
      When there is a single memtable without range tombstones and no SST files in the database, DBIter should wrap the memtable iterator directly. Currently we create a merging iterator on top of the memtable iterator and have DBIter wrap around it. This causes an iterator regression, and this PR fixes the issue.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10705
      
      Test Plan:
      - `make check`
      - Performance:
        - Set up: `./db_bench -benchmarks=filluniquerandom -write_buffer_size=$((1 << 30)) -num=10000`
        - Benchmark: `./db_bench -benchmarks=seekrandom -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=$((1 << 30)) -num=10000 -threads=16 -duration=60 -seek_nexts=$seek_nexts`
      ```
      seek_nexts    main op/sec    PR 10705    RocksDB v7.6
      0             5746568        5749033     5786180
      30            2411690        3006466     2837699
      1000          102556         128902      124667
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D39644221
      
      Pulled By: cbi42
      
      fbshipit-source-id: 8063ff611ba31b0e5670041da3927c8c54b2097d