1. 16 March 2023 (1 commit)
    • Add new stat rocksdb.table.open.prefetch.tail.read.bytes,... · bab5f9a6
      Authored by Hui Xiao
      Add new stat rocksdb.table.open.prefetch.tail.read.bytes, rocksdb.table.open.prefetch.tail.{miss|hit} (#11265)
      
      Summary:
      **Context/Summary:**
      We are adding new stats to measure the prefetched tail size and lookups into this buffer.
      
      The stat collection is done in FilePrefetchBuffer, but for now only for the prefetched tail buffer during table open, distinguished via a FilePrefetchBuffer usage enum. This is cleaner than the alternative of implementing it in the upper-level call sites of FilePrefetchBuffer for table open, and it has the benefit of being extensible to other FilePrefetchBuffer usages if needed. See the db_bench results below for the perf regression concern.
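      For illustration, a minimal sketch of reading these stats from an application, assuming the ticker/histogram enums follow the stat names above (TABLE_OPEN_PREFETCH_TAIL_HIT, TABLE_OPEN_PREFETCH_TAIL_MISS, TABLE_OPEN_PREFETCH_TAIL_READ_BYTES):
      ```
      #include <iostream>
      #include "rocksdb/db.h"
      #include "rocksdb/statistics.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.statistics = rocksdb::CreateDBStatistics();
        rocksdb::DB* db = nullptr;
        if (!rocksdb::DB::Open(options, "/tmp/testdb", &db).ok()) return 1;
        // ... run a workload that opens table files ...
        std::cout << "tail hit: "
                  << options.statistics->getTickerCount(
                         rocksdb::TABLE_OPEN_PREFETCH_TAIL_HIT)
                  << " tail miss: "
                  << options.statistics->getTickerCount(
                         rocksdb::TABLE_OPEN_PREFETCH_TAIL_MISS)
                  << "\n"
                  << options.statistics->getHistogramString(
                         rocksdb::TABLE_OPEN_PREFETCH_TAIL_READ_BYTES);
        delete db;
        return 0;
      }
      ```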
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11265
      
      Test Plan:
      **- Piggyback on existing test**
      **- rocksdb.table.open.prefetch.tail.miss is harder to unit test, so I manually set the prefetched tail read size to be small and ran db_bench.**
      ```
      ./db_bench -db=/tmp/testdb -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3  -use_direct_reads=true
      ```
      ```
      rocksdb.table.open.prefetch.tail.read.bytes P50 : 4096.000000 P95 : 4096.000000 P99 : 4096.000000 P100 : 4096.000000 COUNT : 225 SUM : 921600
      rocksdb.table.open.prefetch.tail.miss COUNT : 91
      rocksdb.table.open.prefetch.tail.hit COUNT : 1034
      ```
      **- No perf regression observed in db_bench**
      
      SETUP command: create the same DB with ~900 files for both pre-change and post-change runs.
      ```
      ./db_bench -db=/tmp/testdb -benchmarks="fillseq" -key_size=32 -value_size=512 -num=500000 -write_buffer_size=655360  -disable_auto_compactions=true -target_file_size_base=16777216 -compression_type=none
      ```
      TEST command (60 runs or until convergence): as suggested by anand1976 and akankshamahajan15, vary `seek_nexts` and `async_io` in testing.
      ```
      ./db_bench -use_existing_db=true -db=/tmp/testdb -statistics=false -cache_size=0 -cache_index_and_filter_blocks=false -benchmarks=seekrandom[-X60] -num=50000 -seek_nexts={10, 500, 1000} -async_io={0|1} -use_direct_reads=true
      ```
      async io = 0, direct io read = true
      
        | seek_nexts = 10, 30 runs | seek_nexts = 500, 12 runs | seek_nexts = 1000, 6 runs
      -- | -- | -- | --
      pre-change | 4776 (± 28) ops/sec;   24.8 (± 0.1) MB/sec | 288 (± 1) ops/sec;   74.8 (± 0.4) MB/sec | 145 (± 4) ops/sec;   75.6 (± 2.2) MB/sec
      post-change | 4790 (± 32) ops/sec;   24.9 (± 0.2) MB/sec | 288 (± 3) ops/sec;   74.7 (± 0.8) MB/sec | 143 (± 3) ops/sec;   74.5 (± 1.6) MB/sec
      
      async io = 1, direct io read = true
        | seek_nexts = 10, 54 runs | seek_nexts = 500, 6 runs | seek_nexts = 1000, 4 runs
      -- | -- | -- | --
      pre-change | 3350 (± 36) ops/sec;   17.4 (± 0.2) MB/sec | 264 (± 0) ops/sec;   68.7 (± 0.2) MB/sec | 138 (± 1) ops/sec;   71.8 (± 1.0) MB/sec
      post-change | 3358 (± 27) ops/sec;   17.4 (± 0.1) MB/sec  | 263 (± 2) ops/sec;   68.3 (± 0.8) MB/sec | 139 (± 1) ops/sec;   72.6 (± 0.6) MB/sec
      
      Reviewed By: ajkr
      
      Differential Revision: D43781467
      
      Pulled By: hx235
      
      fbshipit-source-id: a706a18472a8edb2b952bac3af40eec803537f2a
  2. 25 January 2023 (1 commit)
  3. 12 January 2023 (1 commit)
    • Major Cache refactoring, CPU efficiency improvement (#10975) · 9f7801c5
      Authored by Peter Dillinger
      Summary:
      This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to the Cache class, and it becomes more of a low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because it rather effectively decouples the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
      
      The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
      
      * static_cast lines of code +29 -35 (net removed 6)
      * reinterpret_cast lines of code +6 -32 (net removed 26)
      
      ## cache.h and secondary_cache.h
      * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
        * Simpler for implementations to deal with just one Insert and one Lookup.
        * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
        * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
        * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
        * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
        * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
      * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
      * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
      * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
      * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
      * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
      * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
      
      ## typed_cache.h
      Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
      
      The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
      * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
      * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
      * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
      * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
      
      These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
      
      Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
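      As a rough illustration of the typed-interface idea (a simplified sketch, not the actual typed_cache.h signatures), a typed wrapper over an untyped void*-based cache removes casts and explicit deleters from call sites:
      ```
      #include <cassert>
      #include <memory>
      #include <string>
      #include <unordered_map>

      // Toy untyped cache: stores void* plus a deleter, like Cache does.
      struct UntypedCache {
        struct Entry { void* obj; void (*del)(void*); };
        std::unordered_map<std::string, Entry> map;
        ~UntypedCache() { for (auto& kv : map) kv.second.del(kv.second.obj); }
        void Insert(const std::string& k, void* obj, void (*del)(void*)) {
          map[k] = {obj, del};
        }
        void* Lookup(const std::string& k) {
          auto it = map.find(k);
          return it == map.end() ? nullptr : it->second.obj;
        }
      };

      // Typed wrapper in the spirit of BasicTypedCacheInterface<TValue>:
      // callers never see void* or write explicit deleters.
      template <typename TValue>
      class BasicTypedCache {
       public:
        explicit BasicTypedCache(UntypedCache* c) : cache_(c) {}
        void Insert(const std::string& key, std::unique_ptr<TValue> value) {
          cache_->Insert(key, value.release(), [](void* p) {
            std::default_delete<TValue>()(static_cast<TValue*>(p));
          });
        }
        TValue* Lookup(const std::string& key) {
          return static_cast<TValue*>(cache_->Lookup(key));
        }
       private:
        UntypedCache* cache_;
      };

      int main() {
        UntypedCache raw;
        BasicTypedCache<std::string> typed(&raw);
        typed.Insert("k", std::make_unique<std::string>("parsed block"));
        assert(*typed.Lookup("k") == "parsed block");
        return 0;
      }
      ```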
      
      ## block_cache.h
      This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
      
      ## block_based_table_reader.cc
      Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
      
      The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
      
      ## block_based_table_builder.cc, cache_dump_load_impl.cc
      Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
      
      ## Everything else
      Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
      
      Test Plan:
      tests updated
      
      Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
      
      34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
      34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
      34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
      34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
      34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
      34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
      34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
      34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
      233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
      233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
      233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
      233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
      233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
      233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
      233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
      233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
      1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
      1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
      1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
      1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
      1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
      1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
      1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
      1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
      
      Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
      
      Reviewed By: anand1976
      
      Differential Revision: D42417818
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
  4. 26 October 2022 (1 commit)
    • Improve FragmentTombstones() speed by lazily initializing `seq_set_` (#10848) · 7a959388
      Authored by Changyu Bi
      Summary:
      FragmentedRangeTombstoneList has a member variable `seq_set_` that contains the sequence numbers of all range tombstones in a set. The set is constructed in `FragmentTombstones()` and is used only in `FragmentedRangeTombstoneList::ContainsRange()` which only happens during compaction. This PR moves the initialization of `seq_set_` to `FragmentedRangeTombstoneList::ContainsRange()`. This should speed up `FragmentTombstones()` when the range tombstone list is used for read/scan requests. Microbench shows the speed improvement to be ~45%.
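      A minimal sketch of the lazy-initialization pattern being applied here (illustrative only, not the actual FragmentedRangeTombstoneList code): the set is built only the first time a ContainsRange()-style query needs it, so read/scan construction skips that cost entirely.
      ```
      #include <cassert>
      #include <cstdint>
      #include <set>
      #include <vector>

      class TombstoneListSketch {
       public:
        explicit TombstoneListSketch(std::vector<uint64_t> seqs)
            : seqs_(std::move(seqs)) {}

        bool ContainsSeqRange(uint64_t lo, uint64_t hi) {
          if (!seq_set_initialized_) {  // built lazily, on first use only
            seq_set_.insert(seqs_.begin(), seqs_.end());
            seq_set_initialized_ = true;
          }
          auto it = seq_set_.lower_bound(lo);
          return it != seq_set_.end() && *it <= hi;
        }

       private:
        std::vector<uint64_t> seqs_;
        std::set<uint64_t> seq_set_;
        bool seq_set_initialized_ = false;
      };

      int main() {
        TombstoneListSketch t({5, 9, 42});
        assert(t.ContainsSeqRange(6, 10));    // set built here
        assert(!t.ContainsSeqRange(10, 20));  // reuses the built set
        return 0;
      }
      ```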
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10848
      
      Test Plan:
      - Existing tests and stress test: `python3 tools/db_crashtest.py whitebox --simple  --verify_iterator_with_expected_state_one_in=5`.
      - Microbench: update `range_del_aggregator_bench` to benchmark speed of `FragmentTombstones()`:
      ```
      ./range_del_aggregator_bench --num_range_tombstones=1000 --tombstone_start_upper_bound=50000000 --num_runs=10000 --tombstone_width_mean=200 --should_deletes_per_run=100 --use_compaction_range_del_aggregator=true
      
      Before this PR:
      =========================
      Fragment Tombstones:     270.286 us
      AddTombstones:           1.28933 us
      ShouldDelete (first):    0.525528 us
      ShouldDelete (rest):     0.0797519 us
      
      After this PR: time to fragment tombstones is pushed to AddTombstones() which only happen during compaction.
      =========================
      Fragment Tombstones:     149.879 us
      AddTombstones:           102.131 us
      ShouldDelete (first):    0.565871 us
      ShouldDelete (rest):     0.0729444 us
      ```
      - db_bench: this should improve speed for fragmenting range tombstones for mutable memtable:
      ```
      ./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=100 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=250000 --disable_auto_compactions --max_num_range_tombstones=100000 --finish_after_writes --write_buffer_size=1073741824 --threads=25
      
      Before this PR:
      readwhilewriting :      18.301 micros/op 1310445 ops/sec 4.769 seconds 6250000 operations;   28.1 MB/s (41001 of 250000 found)
      After this PR:
      readwhilewriting :      16.943 micros/op 1439376 ops/sec 4.342 seconds 6250000 operations;   23.8 MB/s (28977 of 250000 found)
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D40646227
      
      Pulled By: cbi42
      
      fbshipit-source-id: ea471667edb258f67d01cfd828588e80a89e4083
  5. 23 September 2022 (1 commit)
    • Refactor to avoid confusing "raw block" (#10408) · ef443cea
      Authored by Peter Dillinger
      Summary:
      We have a lot of confusing code because of mixed, sometimes
      completely opposite uses of the term "raw block" or "raw contents",
      sometimes within the same source file. For example, in `BlockBasedTableBuilder`,
      `raw_block_contents` and `raw_size` generally referred to uncompressed block
      contents and size, while `WriteRawBlock` referred to writing a block that
      is already compressed if it is going to be. Meanwhile, in
      `BlockBasedTable`, `raw_block_contents` either referred to a (maybe
      compressed) block with trailer, or a maybe compressed block maybe
      without trailer. (Note: left as follow-up work to use C++ typing to
      better sort out the various kinds of BlockContents.)
      
      This change primarily tries to apply some consistent terminology around
      the kinds of block representations, avoiding the unclear "raw". (Any
      meaning of "raw" assumes some bias toward the storage layer or toward
      the logical data layer.) Preferred terminology:
      
      * **Serialized block** - bytes that go into storage. For block-based table
      (usually the case) this includes the block trailer. WART: block `size` may or
      may not include the trailer; need to be clear about whether it does or not.
      * **Maybe compressed block** - like a serialized block, but without the
      trailer (or no promise of including a trailer). Must be accompanied by a
      CompressionType.
      * **Uncompressed block** - "payload" bytes that are either stored with no
      compression, used as input to compression function, or result of
      decompression function.
      * **Parsed block** - an in-memory form of a block in block cache, as it is
      used by the table reader. Different C++ types are used depending on the
      block type (see block_like_traits.h).
      
      Other refactorings:
      * Misc corrections/improvements of internal API comments
      * Remove a few misleading / unhelpful / redundant comments.
      * Use move semantics in some places to simplify contracts
      * Use better parameter names to indicate which parameters are used for
      outputs
      * Remove some extraneous `extern`
      * Various clean-ups to `CacheDumperImpl` (mostly unnecessary code)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10408
      
      Test Plan: existing tests
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38172617
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ccb99299f324ac5ca46996d34c5089621a4f260c
  6. 08 September 2022 (1 commit)
    • Always verify SST unique IDs on SST file open (#10532) · 6de7081c
      Authored by Peter Dillinger
      Summary:
      Although we've been tracking SST unique IDs in the DB manifest
      unconditionally, checking has been opt-in and with an extra pass at DB::Open
      time. This changes the behavior of `verify_sst_unique_id_in_manifest` to
      check unique ID against manifest every time an SST file is opened through
      table cache (normal DB operations), replacing the explicit pass over files
      at DB::Open time. This change also enables the option by default and
      removes the "EXPERIMENTAL" designation.
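      A minimal sketch of the resulting usage; the option is now on by default, so it is shown here only to make the behavior explicit:
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Default after this change; set to false to opt out of verification.
        options.verify_sst_unique_id_in_manifest = true;
        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
        // With the option on, each SST opened through the table cache has its
        // unique ID checked against the manifest; a mismatch fails the open.
        delete db;
        return s.ok() ? 0 : 1;
      }
      ```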
      
      One possible criticism is that the option no longer ensures the integrity
      of a DB at Open time. This is far from an all-or-nothing issue. Verifying
      the IDs of all SST files hardly ensures all the data in the DB is readable.
      (VerifyChecksum is supposed to do that.) Also, with
      max_open_files=-1 (default, extremely common), all SST files are
      opened at DB::Open time anyway.
      
      Implementation details:
      * `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass
      that is now removed.
      * Unit tests that manipulate/corrupt table properties have to opt out of
      this check, because that corrupts the "actual" unique id. (And even for
      testing we don't currently have a mechanism to set "no unique id"
      in the in-memory file metadata for new files.)
      * A lot of other unit test churn relates to (a) default checking on, and
      (b) checking on SST open even without DB::Open (e.g. on flush)
      * Use `FileMetaData` for more `TableCache` operations (in place of
      `FileDescriptor`) so that we have access to the unique_id whenever
      we might need to open an SST file. **There is the possibility of
      performance impact because we can no longer use the more
      localized `fd` part of an `FdWithKeyRange` but instead follow the
      `file_metadata` pointer. However, this change (possible regression)
      is only done for `GetMemoryUsageByTableReaders`.**
      * Removed a completely unnecessary constructor overload of
      `TableReaderOptions`
      
      Possible follow-up:
      * Verification only happens when opening through table cache. Are there
      more places where this should happen?
      * Improve error message when there is a file size mismatch vs. manifest
      (FIXME added in the appropriate place).
      * I'm not sure there's a justification for `FileDescriptor` to be distinct from
      `FileMetaData`.
      * I'm skeptical that `FdWithKeyRange` really still makes sense for
      optimizing some data locality by duplicating some data in memory, but I
      could be wrong.
      * An unnecessary overload of NewTableReader was recently added, in
      the public API nonetheless (though unusable there). It should be cleaned
      up to put most things under `TableReaderOptions`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532
      
      Test Plan:
      updated unit tests
      
      Performance test showing no significant difference (just noise I think):
      `./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000`
      Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec
      After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38765551
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2
  7. 02 September 2022 (1 commit)
  8. 13 August 2022 (1 commit)
    • Derive cache keys from SST unique IDs (#10394) · 86a1e3e0
      Authored by Peter Dillinger
      Summary:
      ... so that cache keys can be derived from DB manifest data
      before reading the file from storage--so that every part of the file
      can potentially go in a persistent cache.
      
      See updated comments in cache_key.cc for technical details. Importantly,
      the new cache key encoding uses some fancy but efficient math to pack
      data into the cache key without depending on the sizes of the various
      pieces. This simplifies some existing code creating cache keys, like
      cache warming before the file size is known.
      
      This should provide us an essentially permanent mapping between SST
      unique IDs and base cache keys, with the ability to "upgrade" SST
      unique IDs (and thus cache keys) with new SST format_versions.
      
      These cache keys are of similar, perhaps indistinguishable quality to
      the previous generation. Before this change (see "corrected" days
      between collision):
      
      ```
      ./cache_bench -stress_cache_key -sck_keep_bits=43
      18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected)
      ```
      
      After this change (keep 43 bits, up through 50, to validate "trajectory"
      is ok on "corrected" days between collision):
      ```
      19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected)
      16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected)
      15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected)
      15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected)
      15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected)
      15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected)
      15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected)
      15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected)
      ```
      
      However, the change does modify (probably weaken) the "guaranteed unique" promise from this
      
      > SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86
      
      to this (see https://github.com/facebook/rocksdb/issues/10388)
      
      > With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits
      
      I don't think this is a practical concern, though.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394
      
      Test Plan: unit tests updated, see simulation results above
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38667529
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba
  9. 05 August 2022 (1 commit)
    • Break TableReader MultiGet into filter and lookup stages (#10432) · bf4532eb
      Authored by anand76
      Summary:
      This PR is the first step in enhancing the coroutines MultiGet to be able to lookup a batch in parallel across levels. By having a separate TableReader function for probing the bloom filters, we can quickly figure out which overlapping keys from a batch are definitely not in the file and can move on to the next level.
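      A conceptual sketch of the two-stage idea (illustrative only, not the actual TableReader interface): stage one probes each file's filter for the whole batch, and stage two does the expensive data-block lookups only for keys the filter did not rule out.
      ```
      #include <cstdio>
      #include <string>
      #include <unordered_set>
      #include <vector>

      struct FileSketch {
        std::unordered_set<std::string> filter;  // stand-in for a bloom filter
        std::unordered_set<std::string> data;
      };

      void MultiGetSketch(const std::vector<FileSketch>& files,
                          const std::vector<std::string>& keys) {
        for (const auto& file : files) {
          std::vector<const std::string*> maybe_present;
          // Stage 1: cheap filter probe; keys ruled out here can move on to
          // the next level immediately.
          for (const auto& k : keys) {
            if (file.filter.count(k)) maybe_present.push_back(&k);
          }
          // Stage 2: full lookup only for the surviving keys.
          for (const auto* k : maybe_present) {
            std::printf("%s: %s\n", k->c_str(),
                        file.data.count(*k) ? "found" : "false positive");
          }
        }
      }

      int main() {
        MultiGetSketch({{{"a", "b"}, {"a"}}}, {"a", "b", "c"});
        return 0;
      }
      ```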
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10432
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38245910
      
      Pulled By: anand1976
      
      fbshipit-source-id: 3d20db2350378c3fe6f086f0c7ba5ff01d7f04de
  10. 24 July 2022 (1 commit)
    • Improve SubCompaction Partitioning (#10393) · 252bea40
      Authored by sdong
      Summary:
      Unit tests still haven't been fixed, and more tests need to be added. But I ran some simple fillrandom db_bench and the partitioning feels reasonable.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393
      
      Test Plan:
      1. Make sure existing tests pass. This should verify that the basic subcompaction logic is correct and that the partitioning result is reasonable;
      2. Add a new unit test to ApproximateKeyAnchors()
      3. Run some db_bench with max_subcompaction = 4 and verify that the compaction is indeed partitioned evenly.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38043783
      
      fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f
  11. 25 June 2022 (1 commit)
  12. 17 June 2022 (2 commits)
    • More testing w/prefix extractor, small refactor (#10122) · fff302d9
      Authored by Peter Dillinger
      Summary:
      There was an interesting code path not covered by testing that
      is difficult to replicate in a unit test, which is now covered using a
      sync point. Specifically, the case of table_prefix_extractor == null and
      !need_upper_bound_check in `BlockBasedTable::PrefixMayMatch`, which
      can happen if table reader is open before extractor is registered with global
      object registry, but is later registered and re-set with SetOptions. (We
      don't have sufficient testing control over object registry to set that up
      repeatedly.)
      
      Also, this function has been renamed to `PrefixRangeMayMatch` for clarity
      vs. other functions that are not the same.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10122
      
      Test Plan: unit tests expanded
      
      Reviewed By: siying
      
      Differential Revision: D36944834
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 9e52d9da1929a3e42bbc230fcdc3599949de7bdb
    • Remove deprecated block-based filter (#10184) · 126c2237
      Authored by Peter Dillinger
      Summary:
      In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
      the public API, because of its inefficiency. Although we normally maintain read compatibility
      on old DBs forever, filters are not required for reading a DB, only for optimizing read
      performance. Thus, it should be acceptable to remove this code and the substantial
      maintenance burden it carries as useful features are developed and validated (such
      as user timestamp).
      
      This change completely removes the code for reading and writing the old block-based
      filters, net removing about 1370 lines of code no longer needed. Options removed from
      testing / benchmarking tools. The prior existence is only evident in a couple of places:
      * `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
      a major release to minimize source code incompatibilities.
      * A warning is logged when an old table file is opened that used the old block-based
      filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
      should suffice. Unfortunately, sst_dump does not tell you whether a file uses
      block-based filter, and the structure of the code makes it very difficult to fix.
      * To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
      for metaindex is maintained (for now).
      
      Other notes:
      * In some cases where numbers are associated with filter configurations, we have had to
      update the assigned numbers so that they all correspond to something that exists.
      * Fixed potential stat counting bug by assuming `filter_checked = false` for cases
      like `filter == nullptr` rather than assuming `filter_checked = true`
      * Removed obsolete `block_offset` and `prefix_extractor` parameters from several
      functions.
      * Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
      because the caller guarantees the prefix extractor exists and is compatible
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184
      
      Test Plan:
      tests updated, manually test new warning in LOG using base version to
      generate a DB
      
      Reviewed By: riversand963
      
      Differential Revision: D37212647
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
  13. 16 June 2022 (1 commit)
    • Add few optimizations in async_io for short scans (#10140) · 8353ae8b
      Authored by Akanksha Mahajan
      Summary:
      This PR adds few optimizations for async_io for shorter scans.
      1.  If async_io is enabled, Seek creates a FilePrefetchBuffer object to fetch the data asynchronously. However, `FilePrefetchBuffer::num_file_reads_` wasn't taken into consideration when Next is called after Seek, so Next would always go for prefetching. This PR fixes that: Next will prefetch only if `FilePrefetchBuffer::num_file_reads_` is greater than 2 and the blocks are sequential. This scenario applies only to implicit auto readahead.
      2. For Seek, when it calls TryReadFromCacheAsync to poll, it was also making a further async call because the TryReadFromCacheAsync flow wasn't changed. This PR updates it to return after the poll instead of prefetching any further data.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10140
      
      Test Plan:
      1. Added a unit test
      2. Ran crash_test with async_io = 1 to make sure nothing crashes.
      
      Reviewed By: anand1976
      
      Differential Revision: D37042242
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: b8e6b7cb2ee0886f37a8f53951948b9084e8ffda
  14. 10 June 2022 (1 commit)
    • Fix bug with kHashSearch and changing prefix_extractor with SetOptions (#10128) · d3a3b021
      Authored by Peter Dillinger
      Summary:
      When opening an SST file created using index_type=kHashSearch,
      the *current* prefix_extractor would be saved, and used with hash index
      if the *new current* prefix_extractor at query time is compatible with
      the SST file. This is a problem if the prefix_extractor at SST open time
      is not compatible but SetOptions later changes (back) to one that is
      compatible.
      
      This change fixes that by using the known compatible (or missing) prefix
      extractor we save for use with prefix filtering. Detail: I have moved the
      InternalKeySliceTransform wrapper to avoid some indirection and remove
      unnecessary fields.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10128
      
      Test Plan:
      expanded unit test (using some logic from https://github.com/facebook/rocksdb/issues/10122) that fails
      before fix and probably covers some other previously uncovered cases.
      
      Reviewed By: siying
      
      Differential Revision: D36955738
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 0c78a6b0d24054ef2f3cb237bf010c1c5589fb10
  15. 07 June 2022 (1 commit)
    • Refactor: Add BlockTypes to make them imply C++ type in block cache (#10098) · 4f78f969
      Authored by Peter Dillinger
      Summary:
      We have three related concepts:
      * BlockType: an internal enum conceptually indicating a type of SST file
      block
      * CacheEntryRole: a user-facing enum for categorizing block cache entries,
      which is also involved in associated cache entries with an appropriate
      deleter. Can include categories for non-block cache entries (e.g. memory
      reservations).
      * TBlocklike: a C++ type for the actual type behind a void* cache entry.
      
      We had some existing code ugliness because BlockType did not imply
      TBlocklike, because of various kinds of "filter" block. This refactoring
      fixes that with new BlockTypes.
      
      More clean-up can come in later work.
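      A simplified illustration of the "enum implies C++ type" idea (not RocksDB's actual block_like_traits.h): a trait maps each block-type enumerator to its in-memory type so templated cache code can pick the type statically instead of switch-casing on a dynamic value.
      ```
      #include <type_traits>

      enum class BlockKind { kData, kFilter, kIndex };

      struct DataBlock {};
      struct FilterBlock {};
      struct IndexBlock {};

      template <BlockKind kKind> struct BlockTraits;
      template <> struct BlockTraits<BlockKind::kData>   { using Type = DataBlock; };
      template <> struct BlockTraits<BlockKind::kFilter> { using Type = FilterBlock; };
      template <> struct BlockTraits<BlockKind::kIndex>  { using Type = IndexBlock; };

      template <BlockKind kKind>
      typename BlockTraits<kKind>::Type* ParseBlock(const char* /*serialized*/) {
        // The C++ type is chosen at compile time from the enum value.
        return new typename BlockTraits<kKind>::Type();
      }

      int main() {
        static_assert(std::is_same<BlockTraits<BlockKind::kFilter>::Type,
                                   FilterBlock>::value,
                      "enum implies C++ type");
        delete ParseBlock<BlockKind::kData>("...");
        return 0;
      }
      ```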
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10098
      
      Test Plan: existing tests
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D36897945
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 3ae496b5caa81e0a0ed85e873eb5b525e2d9a295
  16. 21 May 2022 (1 commit)
    • Seek parallelization (#9994) · 2db6a4a1
      Authored by Akanksha Mahajan
      Summary:
      The RocksDB iterator is a hierarchy of iterators. MergingIterator maintains a heap of LevelIterators, one for each L0 file and for each non-zero level. The Seek() operation naturally lends itself to parallelization, as it involves positioning every LevelIterator on the correct data block in the correct SST file. It looks up a level for the target key, finding the first key that is >= the target key. This typically involves reading one data block that is likely to contain the target key and scanning forward to find the first valid key. The forward scan may read more data blocks. In order to find the right data block, the iterator may read some metadata blocks (required for opening a file and searching the index).
      This flow can be parallelized.
      
      Design: Seek will be called two times under the async_io option. The first seek sends asynchronous requests to prefetch the data blocks at each level; the second seek follows the normal flow and, in FilePrefetchBuffer::TryReadFromCacheAsync, waits for Poll() to get the results and adds the iterator to the min_heap.
      - Status::TryAgain is passed down from FilePrefetchBuffer::PrefetchAsync to block_iter_.Status indicating asynchronous request has been submitted.
      - If for some reason asynchronous request returns error in submitting the request, it will fallback to sequential reading of blocks in one pass.
      - If the data already exists in prefetch_buffer, it will return the data without prefetching further and it will be treated as single pass of seek.
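      From the application side, the two-phase seek is exercised simply by enabling async_io in ReadOptions; a minimal sketch using the `async_io` and `adaptive_readahead` read options (the same knobs as the db_bench flags below):
      ```
      #include <memory>
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::DB* db = nullptr;
        rocksdb::Options options;
        options.create_if_missing = true;
        if (!rocksdb::DB::Open(options, "/tmp/prefix_scan_prefetch_main", &db).ok()) {
          return 1;
        }
        rocksdb::ReadOptions ro;
        ro.async_io = true;            // enables the asynchronous prefetch on Seek
        ro.adaptive_readahead = true;  // as used in the db_bench runs below
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
        for (it->Seek("some_prefix"); it->Valid(); it->Next()) {
          // consume it->key() / it->value()
        }
        it.reset();
        delete db;
        return 0;
      }
      ```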
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9994
      
      Test Plan:
      - **Run Regressions.**
      ```
      ./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216
      ```
      i) Previous release 7.0 run for normal prefetching with async_io disabled:
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.0
      Date:       Thu Mar 17 13:11:34 2022
      CPU:        24 * Intel Core Processor (Broadwell)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
      ```
      
      ii) normal prefetching after changes with async_io disable:
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Set seed to 1652922591315307 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.3
      Date:       Wed May 18 18:09:51 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483080.466 micros/op 2 ops/sec 120.287 seconds 249 operations;  340.8 MB/s (249 of 249 found)
      ```
      iii) db_bench with async_io enabled completed successfully
      
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 -async_io=1 -adaptive_readahead=1
      Set seed to 1652924062021732 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.3
      Date:       Wed May 18 18:34:22 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  553913.576 micros/op 1 ops/sec 120.199 seconds 217 operations;  293.6 MB/s (217 of 217 found)
      ```
      
      - db_stress with async_io disabled completed successfully
      ```
       export CRASH_TEST_EXT_ARGS=" --async_io=0"
       make crash_test -j
      ```
      
      **In Progress**: db_stress with async_io is failing; debugging/fixing it is in progress.
      
      Reviewed By: anand1976
      
      Differential Revision: D36459323
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: abb1cd944abe712bae3986ae5b16704b3338917c
  17. 20 May 2022 (1 commit)
    • Multi file concurrency in MultiGet using coroutines and async IO (#9968) · 57997dda
      Authored by anand76
      Summary:
      This PR implements a coroutine version of batched MultiGet in order to concurrently read from multiple SST files in a level using async IO, thus reducing the latency of the MultiGet. The API from the user perspective is still synchronous and single threaded, with the RocksDB part of the processing happening in the context of the caller's thread. In Version::MultiGet, the decision is made whether to call synchronous or coroutine code.
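      A minimal sketch of the caller-side usage: the batched MultiGet API is unchanged, and `ReadOptions::async_io` opts into the concurrent path when the build has coroutine support.
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::DB* db = nullptr;
        rocksdb::Options options;
        options.create_if_missing = true;
        if (!rocksdb::DB::Open(options, "/tmp/testdb", &db).ok()) return 1;

        rocksdb::ReadOptions ro;
        ro.async_io = true;  // allow MultiGet to overlap IO across SST files
        const size_t num_keys = 2;
        rocksdb::Slice keys[num_keys] = {"key1", "key2"};
        rocksdb::PinnableSlice values[num_keys];
        rocksdb::Status statuses[num_keys];
        // Synchronous, single-threaded from the caller's perspective.
        db->MultiGet(ro, db->DefaultColumnFamily(), num_keys, keys, values,
                     statuses);
        delete db;
        return 0;
      }
      ```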
      
      A good way to review this PR is to review the first 4 commits in order - de773b3, 70c2f70, 10b50e1, and 377a597 - before reviewing the rest.
      
      TODO:
      1. Figure out how to build it in CircleCI (requires some dependencies to be installed)
      2. Do some stress testing with coroutines enabled
      
      No regression in synchronous MultiGet between this branch and main -
      ```
      ./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=64 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=true -threads=16 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -statistics
      ```
      Branch - ```multireadrandom :       4.025 micros/op 3975111 ops/sec 60.001 seconds 238509056 operations; 2062.3 MB/s (14767808 of 14767808 found)```
      
      Main - ```multireadrandom :       3.987 micros/op 4013216 ops/sec 60.001 seconds 240795392 operations; 2082.1 MB/s (15231040 of 15231040 found)```
      
      More benchmarks in various scenarios are given below. The measurements were taken with ```async_io=false``` (no coroutines) and ```async_io=true``` (use coroutines). For an IO bound workload (with every key requiring an IO), the coroutines version shows a clear benefit, being ~2.6X faster. For CPU bound workloads, the coroutines version has ~6-15% higher CPU utilization, depending on how many keys overlap an SST file.
      
      1. Single thread IO bound workload on remote storage with sparse MultiGet batch keys (~1 key overlap/file) -
      No coroutines - ```multireadrandom :     831.774 micros/op 1202 ops/sec 60.001 seconds 72136 operations;    0.6 MB/s (72136 of 72136 found)```
      Using coroutines - ```multireadrandom :     318.742 micros/op 3137 ops/sec 60.003 seconds 188248 operations;    1.6 MB/s (188248 of 188248 found)```
      
      2. Single thread CPU bound workload (all data cached) with ~1 key overlap/file -
      No coroutines - ```multireadrandom :       4.127 micros/op 242322 ops/sec 60.000 seconds 14539384 operations;  125.7 MB/s (14539384 of 14539384 found)```
      Using coroutines - ```multireadrandom :       4.741 micros/op 210935 ops/sec 60.000 seconds 12656176 operations;  109.4 MB/s (12656176 of 12656176 found)```
      
      3. Single thread CPU bound workload with ~2 key overlap/file -
      No coroutines - ```multireadrandom :       3.717 micros/op 269000 ops/sec 60.000 seconds 16140024 operations;  139.6 MB/s (16140024 of 16140024 found)```
      Using coroutines - ```multireadrandom :       4.146 micros/op 241204 ops/sec 60.000 seconds 14472296 operations;  125.1 MB/s (14472296 of 14472296 found)```
      
      4. CPU bound multi-threaded (16 threads) with ~4 key overlap/file -
      No coroutines - ```multireadrandom :       4.534 micros/op 3528792 ops/sec 60.000 seconds 211728728 operations; 1830.7 MB/s (12737024 of 12737024 found) ```
      Using coroutines - ```multireadrandom :       4.872 micros/op 3283812 ops/sec 60.000 seconds 197030096 operations; 1703.6 MB/s (12548032 of 12548032 found) ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9968
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D36348563
      
      Pulled By: anand1976
      
      fbshipit-source-id: c0ce85a505fd26ebfbb09786cbd7f25202038696
  18. 26 April 2022 (1 commit)
    • Add stats related to async prefetching (#9845) · 3653029d
      Authored by Akanksha Mahajan
      Summary:
      Add stats PREFETCHED_BYTES_DISCARDED and POLL_WAIT_MICROS.
      PREFETCHED_BYTES_DISCARDED records the number of prefetched bytes discarded by
      FilePrefetchBuffer. POLL_WAIT_MICROS records the time taken by the underlying
      file_system Poll API.
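      A minimal sketch of reading the new stats, assuming the ticker/histogram enums match the names above:
      ```
      #include <iostream>
      #include "rocksdb/db.h"
      #include "rocksdb/statistics.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.statistics = rocksdb::CreateDBStatistics();
        rocksdb::DB* db = nullptr;
        if (!rocksdb::DB::Open(options, "/tmp/testdb", &db).ok()) return 1;
        // ... run an async_io read workload ...
        std::cout << "discarded prefetched bytes: "
                  << options.statistics->getTickerCount(
                         rocksdb::PREFETCHED_BYTES_DISCARDED)
                  << "\n"
                  << options.statistics->getHistogramString(
                         rocksdb::POLL_WAIT_MICROS);
        delete db;
        return 0;
      }
      ```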
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9845
      
      Test Plan: Update existing tests
      
      Reviewed By: anand1976
      
      Differential Revision: D35909694
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: e009ef940bb9ed72c9446f5529095caabb8a1e36
  19. 16 April 2022 (1 commit)
    • Make initial auto readahead_size configurable (#9836) · 0c7f455f
      Authored by Akanksha Mahajan
      Summary:
      Make initial auto readahead_size configurable
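      A minimal sketch of setting the knob, assuming it is exposed as `BlockBasedTableOptions::initial_auto_readahead_size`:
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      int main() {
        rocksdb::BlockBasedTableOptions table_options;
        // e.g. start implicit auto readahead at 16KB instead of the default
        table_options.initial_auto_readahead_size = 16 * 1024;
        rocksdb::Options options;
        options.create_if_missing = true;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        rocksdb::DB* db = nullptr;
        rocksdb::Status s =
            rocksdb::DB::Open(options, "/tmp/prefix_scan_prefetch_main", &db);
        delete db;
        return s.ok() ? 0 : 1;
      }
      ```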
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9836
      
      Test Plan:
      Added new unit test
      Ran regression:
      Without change:
      
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.0
      Date:       Thu Mar 17 13:11:34 2022
      CPU:        24 * Intel Core Processor (Broadwell)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
      ```
      
      With this change:
      ```
       ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Set seed to 1649895440554504 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.2
      Date:       Wed Apr 13 17:17:20 2022
      CPU:        24 * Intel Core Processor (Broadwell)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      ... finished 100 ops
      seekrandom   :  476892.488 micros/op 2 ops/sec;  344.6 MB/s (252 of 252 found)
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D35632815
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: c8057a88f9294c9d03b1d434b03affe02f74d796
  20. 13 April 2022 (1 commit)
    • Meta-internal folly integration with F14FastMap (#9546) · efd03516
      Authored by Peter Dillinger
      Summary:
      Especially after updating to C++17, I don't see a compelling case for
      *requiring* any folly components in RocksDB. I was able to purge the existing
      hard dependencies, and it can be quite difficult to strip out non-trivial components
      from folly for use in RocksDB. (The prospect of doing that on F14 has changed
      my mind on the best approach here.)
      
      But this change creates an optional integration where we can plug in
      components from folly at compile time, starting here with F14FastMap to replace
      std::unordered_map when possible (probably no public APIs for example). I have
      replaced the biggest CPU users of std::unordered_map with compile-time
      pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set.
      USE_FOLLY is always set in the Meta-internal buck build, and a simulation of
      that is in the Makefile for public CI testing. A full folly build is not needed, but
      checking out the full folly repo is much simpler for getting the dependency,
      and anything else we might want to optionally integrate in the future.
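      A simplified sketch of the compile-time pluggable alias described above (the alias name `UnorderedMap` and the namespace here are illustrative): folly's F14FastMap is used when USE_FOLLY is defined, std::unordered_map otherwise, and call sites compile either way.
      ```
      #ifdef USE_FOLLY
      #include <folly/container/F14Map.h>
      #else
      #include <unordered_map>
      #endif
      #include <string>

      namespace rocksdb_sketch {
      #ifdef USE_FOLLY
      template <typename K, typename V>
      using UnorderedMap = folly::F14FastMap<K, V>;
      #else
      template <typename K, typename V>
      using UnorderedMap = std::unordered_map<K, V>;
      #endif
      }  // namespace rocksdb_sketch

      int main() {
        rocksdb_sketch::UnorderedMap<std::string, int> m;
        m["filter_partition"] = 1;  // identical usage with either backing map
        return 0;
      }
      ```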
      
      Some picky details:
      * I don't think the distributed mutex stuff is actually used, so it was easy to remove.
      * I implemented an alternative to `folly::constexpr_log2` (which is much easier
      in C++17 than C++11) so that I could pull out the hard dependencies on
      `ConstexprMath.h`
      * I had to add noexcept move constructors/operators to some types to make
      F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a
      macro to make that easier in some common cases.
      * Updated Meta-internal buck build to use folly F14Map (always)
      
      No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a
      production integration for open source users.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546
      
      Test Plan:
      CircleCI tests updated so that a couple of them use folly.
      
      Most internal unit & stress/crash tests updated to use Meta-internal latest folly.
      (Note: they should probably use buck but they currently use Makefile.)
      
      Example performance improvement: when filter partitions are pinned in cache,
      they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build
      a test that exercises that heavily. Build DB with
      
      ```
      TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters
      ```
      
      and test with (simultaneous runs with & without folly, ~20 times each to see
      convergence)
      
      ```
      TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache
      ```
      
      Average ops/s no folly: 26229.2
      Average ops/s with folly: 26853.3 (+2.4%)
      
      Reviewed By: ajkr
      
      Differential Revision: D34181736
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
  21. 07 April 2022 (1 commit)
    • Account memory of big memory users in BlockBasedTable in global memory limit (#9748) · 49623f9c
      Authored by Hui Xiao
      Summary:
      **Context:**
      Through heap profiling, we discovered that `BlockBasedTableReader` objects can accumulate and lead to high memory usage (e.g., `max_open_file = -1`). This memory is currently not accounted for, not tracked, not constrained, and not cache-evictable. As a first step to improve this, similar to https://github.com/facebook/rocksdb/pull/8428, this PR tracks an estimate of each `BlockBasedTableReader` object's memory in block cache and fails future creation if the memory usage exceeds the available space of the cache at the time of creation.
      
      **Summary:**
      - Approximate big memory users  (`BlockBasedTable::Rep` and `TableProperties` )' memory usage in addition to the existing estimated ones (filter block/index block/un-compression dictionary)
      - Charge all of these memory usages to block cache on `BlockBasedTable::Open()` and release them on `~BlockBasedTable()` as there is no memory usage fluctuation of concern in between
      - Refactor on CacheReservationManager (and its call-sites) to add concurrent support for BlockBasedTable  used in this PR.
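      A minimal sketch of enabling the feature, assuming it is exposed as `BlockBasedTableOptions::reserve_table_reader_memory` (matching the db_bench flag used in the Test Plan below):
      ```
      #include "rocksdb/cache.h"
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      int main() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.block_cache = rocksdb::NewLRUCache(1 << 30);  // 1GB budget
        // Assumed option name: charge estimated table reader memory to the
        // block cache on BlockBasedTable::Open().
        table_options.reserve_table_reader_memory = true;
        rocksdb::Options options;
        options.create_if_missing = true;
        options.max_open_files = -1;  // the case where readers accumulate
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
        // Per the summary above, table opens can now fail once the estimated
        // reader memory would exceed the cache's available capacity.
        delete db;
        return s.ok() ? 0 : 1;
      }
      ```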
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9748
      
      Test Plan:
      - New unit tests
      - db bench: `OpenDb` : **-0.52% in ms**
        - Setup `./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -disable_auto_compactions=1 -write_buffer_size=1048576`
        - Repeated run with pre-change w/o feature and post-change with feature, benchmark `OpenDb`:  `./db_bench -benchmarks=readrandom -use_existing_db=1 -db=/dev/shm/testdb -reserve_table_reader_memory=true (remove this when running w/o feature) -file_opening_threads=3 -open_files=-1 -report_open_timing=true| egrep 'OpenDb:'`
      
      #-run | (feature-off) avg milliseconds | std milliseconds | (feature-on) avg milliseconds | std milliseconds | change (%)
      -- | -- | -- | -- | -- | --
      10 | 11.4018 | 5.95173 | 9.47788 | 1.57538 | -16.87382694
      20 | 9.23746 | 0.841053 | 9.32377 | 1.14074 | 0.9343477536
      40 | 9.0876 | 0.671129 | 9.35053 | 1.11713 | 2.893283155
      80 | 9.72514 | 2.28459 | 9.52013 | 1.0894 | -2.108041632
      160 | 9.74677 | 0.991234 | 9.84743 | 1.73396 | 1.032752389
      320 | 10.7297 | 5.11555 | 10.547 | 1.97692 | **-1.70275031**
      640 | 11.7092 | 2.36565 | 11.7869 | 2.69377 | **0.6635807741**
      
      -  db bench on write with cost to cache in WriteBufferManager (just in case this PR's CRM refactoring accidentally slows down anything in WBM) : `fillseq` : **+0.54% in micros/op**
      `./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -disable_auto_compactions=1 -cost_write_buffer_to_cache=true -write_buffer_size=10000000000 | egrep 'fillseq'`
      
      #-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  avg micros/op | std micros/op | change (%)
      -- | -- | -- | -- | -- | --
      10 | 6.15 | 0.260187 | 6.289 | 0.371192 | 2.260162602
      20 | 7.28025 | 0.465402 | 7.37255 | 0.451256 | 1.267813605
      40 | 7.06312 | 0.490654 | 7.13803 | 0.478676 | **1.060579461**
      80 | 7.14035 | 0.972831 | 7.14196 | 0.92971 | **0.02254791432**
      
      -  filter bench: `bloom filter`: **-0.78% in ms/key**
          - ` ./filter_bench -impl=2 -quick -reserve_table_builder_memory=true | grep 'Build avg'`
      
      #-run | (pre-PR) avg ns/key | std ns/key | (post-PR)  ns/key | std ns/key | change (%)
      -- | -- | -- | -- | -- | --
      10 | 26.4369 | 0.442182 | 26.3273 | 0.422919 | **-0.4145720565**
      20 | 26.4451 | 0.592787 | 26.1419 | 0.62451 | **-1.1465262**
      
      - Crash test `python3 tools/db_crashtest.py blackbox --reserve_table_reader_memory=1 --cache_size=1` killed as normal
      
      Reviewed By: ajkr
      
      Differential Revision: D35136549
      
      Pulled By: hx235
      
      fbshipit-source-id: 146978858d0f900f43f4eb09bfd3e83195e3be28
      49623f9c
  22. 05 April, 2022 1 commit
    • A
      Fix segfault in FilePrefetchBuffer with async_io enabled (#9777) · 36bc3da9
      Akanksha Mahajan committed
      Summary:
      If a FilePrefetchBuffer object is destroyed and Poll() later invokes the callback on that destroyed object, accessing it causes a segfault. This was caught after adding unit tests that exercise the POSIX implementation of the ReadAsync and Poll APIs.
      This PR also updates and fixes the existing IOUring tests, which were not running locally because the RocksDbIOUringEnable function wasn't defined and IOUring was disabled for those tests.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9777
      
      Test Plan: Added new unit test
      
      Reviewed By: anand1976
      
      Differential Revision: D35254002
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 68e80054ffb14ae25c255920ebc6548ca5f130a1
      36bc3da9
  23. 25 March, 2022 1 commit
  24. 21 March, 2022 1 commit
    • A
      Provide implementation to prefetch data asynchronously in FilePrefetchBuffer (#9674) · 49a10feb
      Akanksha Mahajan committed
      Summary:
      In FilePrefetchBuffer, if reads are sequential, then after prefetching we call the ReadAsync API to prefetch data asynchronously so that the data is already available for the next prefetch. The data prefetched asynchronously will be readahead_size/2. Two buffers are used, one for synchronous prefetching and one for asynchronous prefetching. In case the data overlaps, the data is copied from both buffers into a third buffer to make it contiguous.
      This feature is gated behind ReadOptions::async_io and is experimental.
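
      A minimal sketch of opting in from application code; `ReadOptions::async_io` is the only assumption beyond standard RocksDB iteration:

      ```
      #include <rocksdb/db.h>
      #include <rocksdb/options.h>

      #include <memory>

      void ScanWithAsyncPrefetch(rocksdb::DB* db) {
        rocksdb::ReadOptions read_options;
        // Experimental: allow FilePrefetchBuffer to issue the next prefetch
        // asynchronously while the current buffer is being consumed.
        read_options.async_io = true;

        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // consume it->key() / it->value()
        }
      }
      ```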
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9674
      
      Test Plan:
      1. Add new unit tests
      2. Run **db_stress** to make sure nothing crashes.
      
          -   Normal prefetch without `async_io` ran successfully:
      ```
      export CRASH_TEST_EXT_ARGS=" --async_io=0"
       make crash_test -j
       ```
      
      3. **Run Regressions**.
         i) Main branch without any change for normal prefetching with async_io disabled:
      
       ```
       ./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216
       ```
      
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.0
      Date:       Thu Mar 17 13:11:34 2022
      CPU:        24 * Intel Core Processor (Broadwell)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
      ```
      
        ii) normal prefetching after changes with async_io disable:
      
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_withchange -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.0
      Date:       Thu Mar 17 14:11:31 2022
      CPU:        24 * Intel Core Processor (Broadwell)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_withchange]
      seekrandom   :  471347.227 micros/op 2 ops/sec;  348.1 MB/s (255 of 255 found)
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D34731543
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 8e23aa93453d5fe3c672b9231ad582f60207937f
      49a10feb
  25. 02 March, 2022 1 commit
  26. 01 February, 2022 1 commit
    • P
      Ignore `total_order_seek` in DB::Get (#9427) · f6d7ec1d
      Peter Dillinger committed
      Summary:
      Apparently setting total_order_seek=true for DB::Get was
      intended to allow accurate read semantics if the current prefix
      extractor doesn't match what was used to generate SST files on
      disk. But since prefix_extractor was made a mutable option in 5.14.0, we
      have been able to detect this case and provide the correct semantics
      regardless of the total_order_seek option. Since that time, the option
      has only made Get() slower in a reasonably common case: prefix_extractor
      unchanged and whole_key_filtering=false.
      
      So this change primarily removes unnecessary effect of
      total_order_seek on Get. Also cleans up some related comments.
      
      Also adds a -total_order_seek option to db_bench and canonicalizes
      handling of ReadOptions in db_bench so that command line options have
      the expected association with library features. (There is potential
      for change in regression test behavior, but the old behavior is likely
      indefensible, or some other inconsistency would need to be fixed.)
      
      TODO in follow-up work: there should be no reason for Get() to depend on
      current prefix extractor at all.
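
      A sketch of the affected call pattern, assuming a standard point lookup; after this change, setting `total_order_seek` no longer slows down Get and only matters for iterators:

      ```
      #include <rocksdb/db.h>

      #include <string>

      rocksdb::Status PointLookup(rocksdb::DB* db, const std::string& key,
                                  std::string* value) {
        rocksdb::ReadOptions read_options;
        // Previously this could bypass the prefix bloom filter in Get and make it
        // ~20x slower; it is now ignored for point lookups.
        read_options.total_order_seek = true;
        return db->Get(read_options, key, value);
      }
      ```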
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9427
      
      Test Plan:
      Unit tests updated.
      
      Performance (using db_bench update)
      
      Create DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12 -whole_key_filtering=0`
      
      Test with and without `-total_order_seek` on `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=readrandom -num=10000000 -duration=40 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`
      
      Before this change, total_order_seek=false: 25188 ops/sec
      Before this change, total_order_seek=true:   1222 ops/sec (~20x slower)
      
      After this change, total_order_seek=false: 24570 ops/sec
      After this change, total_order_seek=true:  25012 ops/sec (indistinguishable)
      
      Reviewed By: siying
      
      Differential Revision: D33753458
      
      Pulled By: pdillinger
      
      fbshipit-source-id: bf892f34907a5e407d9c40bd4d42f0adbcbe0014
      f6d7ec1d
  27. 22 January, 2022 1 commit
    • P
      Fast path for detecting unchanged prefix_extractor (#9407) · fc9d4071
      Peter Dillinger committed
      Summary:
      Fixes a major performance regression in 6.26, where
      extra CPU is spent in SliceTransform::AsString when reads involve
      a prefix_extractor (Get, MultiGet, Seek). Common case performance
      is now better than 6.25.
      
      This change creates a "fast path" for verifying that the current prefix
      extractor is unchanged and compatible with what was used to
      generate a table file. This fast path detects the common case by
      pointer comparison on the current prefix_extractor and a "known
      good" prefix extractor (if applicable) that is saved at the time the
      table reader is opened. The "known good" prefix extractor is saved
      as another shared_ptr copy (in an existing field, however) to ensure
      the pointer is not recycled.
      
      When the prefix_extractor has changed to a different instance but
      same compatible configuration (rare, odd), performance is still a
      regression compared to 6.25, but this is likely acceptable because
      of the oddity of such a case. The performance of incompatible
      prefix_extractor is essentially unchanged.
      
      Also fixed a minor case (ForwardIterator) where a prefix_extractor
      could be used via a raw pointer after being freed as a shared_ptr,
      if replaced via SetOptions.
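
      A simplified sketch of the fast-path idea described above, not the actual RocksDB code; `PrefixExtractor` here is a hypothetical stand-in for SliceTransform:

      ```
      #include <memory>
      #include <string>

      // Hypothetical minimal interface standing in for the real SliceTransform.
      struct PrefixExtractor {
        virtual ~PrefixExtractor() = default;
        virtual std::string AsString() const = 0;
      };

      // Sketch: pointer equality with a saved "known good" extractor avoids the
      // costly AsString() comparison in the common (unchanged) case.
      bool PrefixExtractorChanged(
          const std::shared_ptr<const PrefixExtractor>& known_good,
          const std::shared_ptr<const PrefixExtractor>& current) {
        if (current.get() == known_good.get()) {
          return false;  // same instance: definitely compatible
        }
        if (known_good && current &&
            known_good->AsString() == current->AsString()) {
          return false;  // different instance, same configuration (rare, slower)
        }
        return true;
      }
      ```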
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9407
      
      Test Plan:
      ## Performance
      Populate DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`
      
      Running head-to-head comparisons simultaneously with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=seekrandom -num=10000000 -duration=20 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`
      
      Below each is compared by ops/sec vs. baseline which is version 6.25 (multiple baseline runs because of variable machine load)
      
      v6.26: 4833 vs. 6698 (<- major regression!)
      v6.27: 4737 vs. 6397 (still)
      New: 6704 vs. 6461 (better than baseline in common case)
      Disabled fastpath: 4843 vs. 6389 (e.g. if prefix extractor instance changes but is still compatible)
      Changed prefix size (no usable filter) in new: 787 vs. 5927
      Changed prefix size (no usable filter) in new & baseline: 773 vs. 784
      
      Reviewed By: mrambacher
      
      Differential Revision: D33677812
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 571d9711c461fb97f957378a061b7e7dbc4d6a76
      fc9d4071
  28. 17 December, 2021 1 commit
    • P
      New stable, fixed-length cache keys (#9126) · 0050a73a
      Peter Dillinger committed
      Summary:
      This change standardizes on a new 16-byte cache key format for
      block cache (incl compressed and secondary) and persistent cache (but
      not table cache and row cache).
      
      The goal is a really fast cache key with practically ideal stability and
      uniqueness properties without external dependencies (e.g. from FileSystem).
      A fixed key size of 16 bytes should enable future optimizations to the
      concurrent hash table for block cache, which is a heavy CPU user /
      bottleneck, but there appears to be measurable performance improvement
      even with no changes to LRUCache.
      
      This change replaces a lot of disjointed and ugly code handling cache
      keys with calls to a simple, clean new internal API (cache_key.h).
      (Preserving the old cache key logic under an option would be very ugly
      and likely negate the performance gain of the new approach. Complete
      replacement carries some inherent risk, but I think that's acceptable
      with sufficient analysis and testing.)
      
      The scheme for encoding new cache keys is complicated but explained
      in cache_key.cc.
      
      Also: EndianSwapValue is moved to math.h to be next to other bit
      operations. (Explains some new include "math.h".) ReverseBits operation
      added and unit tests added to hash_test for both.
      
      Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)
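
      A rough sketch of what a fixed 16-byte key enables; the real encoding is documented in cache_key.cc and is more involved, so the struct below is illustrative only:

      ```
      #include <cstdint>
      #include <cstring>

      // Illustrative only: a fixed-size, trivially comparable cache key.
      struct FixedCacheKey {
        uint64_t file_id_part;  // derived from DB session / file identity
        uint64_t offset_part;   // derived from the within-file offset
      };
      static_assert(sizeof(FixedCacheKey) == 16, "fixed 16-byte key");

      // A fixed-size key can be hashed or compared without touching variable-length
      // strings, which is what opens the door to hash-table optimizations.
      inline bool Equal(const FixedCacheKey& a, const FixedCacheKey& b) {
        return std::memcmp(&a, &b, sizeof(a)) == 0;
      }
      ```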
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126
      
      Test Plan:
      ### Basic correctness
      Several tests needed updates to work with the new functionality, mostly
      because we are no longer relying on filesystem for stable cache keys
      so table builders & readers need more context info to agree on cache
      keys. This functionality is so core, a huge number of existing tests
      exercise the cache key functionality.
      
      ### Performance
      Create db with
      `TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
      And test performance with
      `TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
      using DEBUG_LEVEL=0 and simultaneous before & after runs.
      Before ops/sec, avg over 100 runs: 121924
      After ops/sec, avg over 100 runs: 125385 (+2.8%)
      
      ### Collision probability
      I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
      over many months, by making some pessimistic simplifying assumptions:
      * Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
      * All of every file is cached for its entire lifetime
      
      We use a simple table with skewed address assignment and replacement on address collision
      to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
      with `./cache_bench -stress_cache_key -sck_keep_bits=40`:
      
      ```
      Total cache or DBs size: 32TiB  Writing 925.926 MiB/s or 76.2939TiB/day
      Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
      ```
      
      These come from default settings of 2.5M files per day of 32 MB each, and
      `-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
      the 128-bit cache key.  With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
      is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.
      
      More default assumptions, relatively pessimistic:
      * 100 DBs in same process (doesn't matter much)
      * Re-open DB in same process (new session ID related to old session ID) on average
      every 100 files generated
      * Restart process (all new session IDs unrelated to old) 24 times per day
      
      After enough data, we get a result at the end:
      
      ```
      (keep 40 bits)  17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
      ```
      
      If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:
      
      ```
      (keep 41 bits)  16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
      (keep 42 bits)  19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
      ```
      
      The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:
      
      ```
      197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
      ```
      
      I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D33171746
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
      0050a73a
  29. 11 December, 2021 1 commit
    • P
      More refactoring ahead of footer & meta changes (#9240) · 653c392e
      Peter Dillinger committed
      Summary:
      I'm working on a new format_version=6 to support context
      checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test
      updates to support that change.
      
      Test coverage data and manual inspection agree on dead code in
      block_based_table_reader.cc (removed).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240
      
      Test Plan:
      tests enhanced to cover more cases etc.
      
      Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible.
      
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true}
      
      (Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
      Before w/ wal: 454512
      After w/ wal: 444820 (-2.1%)
      Before w/o wal: 1004560
      After w/o wal: 998897 (-0.6%)
      
      Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case.
      
      This regression will be corrected in a follow-up PR.
      
      Reviewed By: ajkr
      
      Differential Revision: D32813769
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f
      653c392e
  30. 20 November, 2021 1 commit
    • L
      Support readahead during compaction for blob files (#9187) · dc5de45a
      Levi Tamasi committed
      Summary:
      The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
      that can be used to enable prefetching data from blob files during compaction.
      This is important when using storage with higher latencies like HDDs or remote filesystems.
      If enabled, prefetching is used for all cases when blobs are read during compaction,
      namely garbage collection, compaction filters (when the existing value has to be read from
      a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187
      
      Test Plan: Ran `make check` and the stress/crash test.
      
      Reviewed By: riversand963
      
      Differential Revision: D32565512
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
      dc5de45a
  31. 19 November, 2021 1 commit
    • P
      Improve / clean up meta block code & integrity (#9163) · 230660be
      Peter Dillinger committed
      Summary:
      * Checksums are now checked on meta blocks unless specifically
      suppressed or not applicable (e.g. plain table). (Was other way around.)
      This means a number of cases that were not checking checksums now are,
      including direct read TableProperties in Version::GetTableProperties
      (fixed in meta_blocks ReadTableProperties), reading any block from
      PersistentCache (fixed in BlockFetcher), read TableProperties in
      SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open,
      maybe more.
      * For that to work, I moved the global_seqno+TableProperties checksum
      logic to the shared table/ code, because that is used by many utilities
      such as SstFileDumper.
      * Also for that to work, we have to know when we're dealing with a block
      that has a checksum (trailer), so added that capability to Footer based
      on magic number, and from there BlockFetcher.
      * Knowledge of trailer presence has also fixed a problem where other
      table formats were reading blocks including bytes for a non-existent
      trailer--and awkwardly kind-of not using them, e.g. no shared code
      checking checksums. (BlockFetcher compression type was populated
      incorrectly.) Now we only read what is needed.
      * Minimized code duplication and differing/incompatible/awkward
      abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block
      without parsing block handle)
      * Moved some meta block handling code from table_properties*.*
      * Moved some code specific to block-based table from shared table/ code
      to BlockBasedTable class. The checksum stuff means we can't completely
      separate it, but things that don't need to be in shared table/ code
      should not be.
      * Use unique_ptr rather than raw ptr in more places. (Note: you can
      std::move from unique_ptr to shared_ptr.)
      
      Without enhancements to GetPropertiesOfAllTablesTest (see below),
      net reduction of roughly 100 lines of code.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9163
      
      Test Plan:
      existing tests and
      * Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that
      checksums are now checked on direct read of table properties by TableCache
      (new test would fail before this change)
      * Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test
      putting table properties under old meta name
      * Also generally enhanced that same test to actually test what it was
      supposed to be testing already, by kicking things out of table cache when
      we don't want them there.
      
      Reviewed By: ajkr, mrambacher
      
      Differential Revision: D32514757
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9
      230660be
  32. 20 October, 2021 1 commit
    • Z
      Add lowest_used_cache_tier to ImmutableDBOptions to enable or disable Secondary Cache (#9050) · 6d93b875
      Zhichao Cao committed
      Summary:
      Currently, if a Secondary Cache is provided to the LRU cache, it is used by default. We add CacheTier to advanced_options.h to describe the cache tier being used. Add a `lowest_used_cache_tier` option to `DBOptions` (immutable) and pass it to BlockBasedTableReader to decide whether the secondary cache will be used or not. By default it is `CacheTier::kNonVolatileTier`, which means we always use both the block cache (kVolatileTier) and the secondary cache (kNonVolatileTier). By setting it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache.
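
      A minimal sketch, assuming the option is set on `Options`/`DBOptions` with the enum and field names from the summary above:

      ```
      #include <rocksdb/advanced_options.h>
      #include <rocksdb/options.h>

      rocksdb::Options DisableSecondaryCacheUse() {
        rocksdb::Options options;
        // Default is kNonVolatileTier (block cache + secondary cache). Restricting
        // to kVolatileTier makes the DB use only the volatile block cache, even if
        // the configured LRU cache has a secondary cache attached.
        options.lowest_used_cache_tier = rocksdb::CacheTier::kVolatileTier;
        return options;
      }
      ```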
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9050
      
      Test Plan: added new tests
      
      Reviewed By: anand1976
      
      Differential Revision: D31744769
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: a0575ebd23e1c6dfcfc2b4c8578764e73b15bce6
      6d93b875
  33. 29 September, 2021 1 commit
    • M
      Cleanup includes in dbformat.h (#8930) · 13ae16c3
      mrambacher committed
      Summary:
      This header file was including everything and the kitchen sink when it did not need to.  This resulted in many places including this header when they needed other pieces instead.
      
      Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.
      
      Hopefully, this sort of code hygiene cleanup will speed up the builds...
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930
      
      Reviewed By: pdillinger
      
      Differential Revision: D31142788
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
      13ae16c3
  34. 08 September, 2021 1 commit
  35. 17 August, 2021 1 commit
    • P
      Stable cache keys using DB session ids in SSTs (#8659) · a207c278
      Peter Dillinger committed
      Summary:
      Use DB session ids in SST table properties to make cache keys
      stable across DB re-open and copy / move / restore / etc.
      
      These new cache keys are currently only enabled when FileSystem does not
      provide GetUniqueId. For now, they are typically larger, so slightly
      less efficient.
      
      Relevant to https://github.com/facebook/rocksdb/issues/7405
      
      This change has a minor regression in PersistentCache functionality:
      metaindex blocks are no longer cached in PersistentCache. Table properties
      blocks already were not, but ideally should be. I didn't spend effort to
      fix & test these issues because we don't believe PersistentCache is used much
      if at all and expect SecondaryCache to replace it. (Though PRs are welcome.)
      
      FIXME: there is more to be fixed for stable cache keys on external SST files
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8659
      
      Test Plan:
      new unit test added, which fails when disabling new
      functionality
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30297705
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e8539a5c8802a79340405629870f2e3fb3822d3a
      a207c278
  36. 19 June, 2021 1 commit
    • A
      Parallelize secondary cache lookup in MultiGet (#8405) · 8ea0a2c1
      anand76 committed
      Summary:
      Implement the ```WaitAll()``` interface in ```LRUCache``` to allow callers to issue multiple lookups in parallel and wait for all of them to complete. Modify ```MultiGet``` to use this to parallelize the secondary cache lookups in order to reduce the overall latency. A call to ```cache->Lookup()``` returns a handle that has an incomplete value (nullptr), and the caller can call ```cache->IsReady()``` to check whether the lookup is complete, and pass a vector of handles to ```WaitAll``` to wait for completion. If any of the lookups fail, ```MultiGet``` will read the block from the SST file.
      
      Another change in this PR is to rename ```SecondaryCacheHandle``` to ```SecondaryCacheResultHandle``` as it more accurately describes the return result of the secondary cache lookup, which is more like a future.
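
      A rough sketch of the caller-side pattern this enables, using the API names mentioned above (`Lookup`, `IsReady`, `WaitAll`); the signatures are simplified and should be checked against cache.h:

      ```
      #include <rocksdb/cache.h>
      #include <rocksdb/slice.h>

      #include <vector>

      // Simplified sketch: issue several lookups, then wait once for the ones that
      // are still pending (e.g. being fetched from a secondary cache).
      void LookupMany(rocksdb::Cache* cache,
                      const std::vector<rocksdb::Slice>& keys) {
        std::vector<rocksdb::Cache::Handle*> handles(keys.size(), nullptr);
        std::vector<rocksdb::Cache::Handle*> pending;
        for (size_t i = 0; i < keys.size(); ++i) {
          handles[i] = cache->Lookup(keys[i]);
          if (handles[i] != nullptr && !cache->IsReady(handles[i])) {
            pending.push_back(handles[i]);  // value not materialized yet
          }
        }
        cache->WaitAll(pending);  // block until all pending lookups complete
        for (auto* h : handles) {
          if (h != nullptr) {
            cache->Release(h);
          }
        }
      }
      ```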
      
      Tests:
      1. Add unit tests in lru_cache_test
      2. Benchmark results with no secondary cache configured
      Master -
      ```
      readrandom   :      41.175 micros/op 388562 ops/sec;  106.7 MB/s (7277999 of 7277999 found)
      readrandom   :      41.217 micros/op 388160 ops/sec;  106.6 MB/s (7274999 of 7274999 found)
      multireadrandom :      10.309 micros/op 1552082 ops/sec; (28908992 of 28908992 found)
      multireadrandom :      10.321 micros/op 1550218 ops/sec; (29081984 of 29081984 found)
      ```
      
      This PR -
      ```
      readrandom   :      41.158 micros/op 388723 ops/sec;  106.8 MB/s (7290999 of 7290999 found)
      readrandom   :      41.185 micros/op 388463 ops/sec;  106.7 MB/s (7287999 of 7287999 found)
      multireadrandom :      10.277 micros/op 1556801 ops/sec; (29346944 of 29346944 found)
      multireadrandom :      10.253 micros/op 1560539 ops/sec; (29274944 of 29274944 found)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8405
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D29190509
      
      Pulled By: anand1976
      
      fbshipit-source-id: 6f8eff6246712af8a297cfe22ea0d1c3b2a01bb0
      8ea0a2c1
  37. 18 June, 2021 1 commit
    • A
      Cache warming data blocks during flush (#8242) · 5ba1b6e5
      Akanksha Mahajan committed
      Summary:
      This PR prepopulates warm/hot data blocks, which are already in memory,
      into the block cache at the time of flush. On a flush, the data blocks that
      are in memory (in memtables) get flushed to the device. If using Direct IO,
      additional IO is incurred to read this data back into memory, which is
      avoided by enabling the newly added option.

      Right now, this is enabled only for data blocks during flush. We plan to
      expand this option to cover compactions and other block types in the future.
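
      A minimal sketch, assuming the option added here is the `prepopulate_block_cache` setting on `BlockBasedTableOptions`; names should be confirmed against the release:

      ```
      #include <rocksdb/cache.h>
      #include <rocksdb/options.h>
      #include <rocksdb/table.h>

      rocksdb::Options WarmCacheOnFlushOptions() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.block_cache = rocksdb::NewLRUCache(512 << 20 /* 512 MB */);
        // Assumed option name: insert data blocks into the block cache as they are
        // written during flush, so Direct IO readers don't re-read them from disk.
        table_options.prepopulate_block_cache =
            rocksdb::BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly;

        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```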
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242
      
      Test Plan: Add new unit test
      
      Reviewed By: anand1976
      
      Differential Revision: D28521703
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05
      5ba1b6e5
  38. 11 June, 2021 1 commit
    • Z
      Use DbSessionId as cache key prefix when secondary cache is enabled (#8360) · f44e69c6
      Zhichao Cao committed
      Summary:
      Currently, we either use the file system inode or a monotonically increasing runtime ID as the block cache key prefix. However, if we use a monotonically increasing runtime ID (in the case that the file system does not support inode ID generation), in some cases it cannot ensure uniqueness (e.g., when a secondary cache is migrated from host to host). We use the DbSessionID (20 bytes) + the current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled, which can accommodate scenarios such as transferring cache state across hosts.
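
      A sketch of the prefix composition idea only, not the actual RocksDB encoding; the helper below is hypothetical:

      ```
      #include <cstdint>
      #include <string>

      // Hypothetical helper for illustration: a cache key prefix built from the DB
      // session id (~20 bytes) plus the file number (at most 10 decimal digits), so
      // keys stay unique even when cached state migrates between hosts.
      std::string MakeCacheKeyPrefix(const std::string& db_session_id,
                                     uint64_t file_number) {
        return db_session_id + std::to_string(file_number);
      }
      ```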
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360
      
      Test Plan: add the test to lru_cache_test
      
      Reviewed By: pdillinger
      
      Differential Revision: D29006215
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814
      f44e69c6
  39. 22 May, 2021 1 commit
    • Z
      Use new Insert and Lookup APIs in table reader to support secondary cache (#8315) · 7303d02b
      Zhichao Cao committed
      Summary:
      A secondary cache is implemented to provide a secondary tier for the block cache. New Insert and Lookup APIs are introduced in https://github.com/facebook/rocksdb/issues/8271. To support and use the secondary cache in the block-based table reader, this PR introduces the corresponding callback functions that will be used by the secondary cache, and updates the Insert and Lookup APIs accordingly.
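
      A minimal configuration sketch, assuming a `SecondaryCache` implementation is attached through `LRUCacheOptions::secondary_cache`; header and member names should be verified against the release:

      ```
      #include <rocksdb/cache.h>
      #include <rocksdb/options.h>
      #include <rocksdb/secondary_cache.h>
      #include <rocksdb/table.h>

      #include <memory>
      #include <utility>

      rocksdb::Options OptionsWithSecondaryCache(
          std::shared_ptr<rocksdb::SecondaryCache> secondary_cache) {
        rocksdb::LRUCacheOptions cache_opts;
        cache_opts.capacity = 1 << 30;  // 1 GB primary (volatile) block cache
        // Blocks evicted from the primary cache can be demoted to this tier and
        // found again through the new Insert/Lookup callback plumbing.
        cache_opts.secondary_cache = std::move(secondary_cache);

        rocksdb::BlockBasedTableOptions table_options;
        table_options.block_cache = rocksdb::NewLRUCache(cache_opts);

        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```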
      
      benchmarking:
      ./db_bench --benchmarks="fillrandom" -num=1000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/tmp/rocks_t/db -partition_index_and_filters=true
      
      ./db_bench -db=/tmp/rocks_t/db -use_existing_db=true -benchmarks=readrandom -num=1000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=5 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -stats_dump_period_sec=30 -reads=50000000
      
      master benchmarking results:
      readrandom   :       3.923 micros/op 254881 ops/sec;   33.4 MB/s (23849796 of 50000000 found)
      rocksdb.db.get.micros P50 : 2.820992 P95 : 5.636716 P99 : 16.450553 P100 : 8396.000000 COUNT : 50000000 SUM : 179947064
      
      Current PR benchmarking results
      readrandom   :       4.083 micros/op 244925 ops/sec;   32.1 MB/s (23849796 of 50000000 found)
      rocksdb.db.get.micros P50 : 2.967687 P95 : 5.754916 P99 : 15.665912 P100 : 8213.000000 COUNT : 50000000 SUM : 187250053
      
      About 3.8% throughput reduction.
      P50: 5.2% increase, P95: 2.09% increase, P99: 4.77% improvement
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8315
      
      Test Plan: added the testing case
      
      Reviewed By: anand1976
      
      Differential Revision: D28599774
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 098c4df0d7327d3a546df7604b2f1602f13044ed
      7303d02b