1. 10 Feb, 2023 1 commit
    • Put Cache and CacheWrapper in new public header (#11192) · 3cacd4b4
      Peter Dillinger committed
      Summary:
      The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment.
      
      In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544
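
      The forward-declaration point can be illustrated with a minimal, self-contained sketch. `Options`, `ShareCount`, and `MakeOptions` are hypothetical names chosen for illustration, and the `Cache` class here is a toy stand-in, not the RocksDB class. Only the code that creates the object needs the full definition:

      ```cpp
      #include <cassert>
      #include <memory>
      #include <string>
      #include <utility>

      // Stand-in for a header that only forward-declares the class.
      class Cache;

      struct Options {
        std::shared_ptr<Cache> block_cache;
      };

      // Copying and destroying a shared_ptr works while Cache is incomplete,
      // because the deleter was captured when the pointer was first created.
      inline long ShareCount(const Options& opts) {
        Options copy = opts;  // second owner, alive until this function returns
        return copy.block_cache.use_count();
      }

      // Stand-in for the "advanced" header that carries the full definition.
      class Cache {
       public:
        explicit Cache(std::string name) : name_(std::move(name)) {}
        const std::string& name() const { return name_; }

       private:
        std::string name_;
      };

      // Only code that creates (or calls into) the object needs the definition.
      inline Options MakeOptions() {
        Options opts;
        opts.block_cache = std::make_shared<Cache>("demo");
        return opts;
      }
      ```

      Note that the same arrangement would not compile with `std::unique_ptr<Cache>` and its default deleter, which needs the complete type wherever the pointer is destroyed.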
      
      Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it.
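
      A rough sketch of the wrapping pattern follows. `MiniCache`, `MapCache`, `MiniCacheWrapper`, and `CountingCache` are hypothetical, heavily reduced stand-ins and not the RocksDB API. The wrapper forwards every call to its target but leaves `Name()` pure virtual, so each derived class has to name itself:

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <memory>
      #include <string>
      #include <unordered_map>
      #include <utility>

      // Hypothetical reduced cache interface (not the real Cache class).
      class MiniCache {
       public:
        virtual ~MiniCache() = default;
        virtual const char* Name() const = 0;
        virtual void Insert(const std::string& key, int value) = 0;
        virtual bool Lookup(const std::string& key, int* value) = 0;
      };

      // A trivial concrete implementation to wrap.
      class MapCache : public MiniCache {
       public:
        const char* Name() const override { return "MapCache"; }
        void Insert(const std::string& key, int value) override {
          map_[key] = value;
        }
        bool Lookup(const std::string& key, int* value) override {
          auto it = map_.find(key);
          if (it == map_.end()) return false;
          *value = it->second;
          return true;
        }

       private:
        std::unordered_map<std::string, int> map_;
      };

      // Forwards everything to the target; Name() is deliberately not
      // implemented, so derived classes must provide it.
      class MiniCacheWrapper : public MiniCache {
       public:
        explicit MiniCacheWrapper(std::shared_ptr<MiniCache> target)
            : target_(std::move(target)) {}
        void Insert(const std::string& key, int value) override {
          target_->Insert(key, value);
        }
        bool Lookup(const std::string& key, int* value) override {
          return target_->Lookup(key, value);
        }

       protected:
        std::shared_ptr<MiniCache> target_;
      };

      // Example customization: count lookups, delegate the rest.
      class CountingCache : public MiniCacheWrapper {
       public:
        using MiniCacheWrapper::MiniCacheWrapper;
        const char* Name() const override { return "CountingCache"; }
        bool Lookup(const std::string& key, int* value) override {
          ++lookups_;
          return MiniCacheWrapper::Lookup(key, value);
        }
        std::size_t lookups() const { return lookups_; }

       private:
        std::size_t lookups_ = 0;
      };
      ```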
      
      Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192
      
      Test Plan: `make check`
      
      Reviewed By: anand1976
      
      Differential Revision: D43055487
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110
  2. 12 Jan, 2023 1 commit
    • Major Cache refactoring, CPU efficiency improvement (#10975) · 9f7801c5
      Peter Dillinger committed
      Summary:
      This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to the Cache class, and it becomes more of a low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because it rather effectively decouples the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
      
      The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
      
      * static_cast lines of code +29 -35 (net removed 6)
      * reinterpret_cast lines of code +6 -32 (net removed 26)
      
      ## cache.h and secondary_cache.h
      * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
        * Simpler for implementations to deal with just one Insert and one Lookup.
        * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
        * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
        * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
        * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
        * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
      * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
      * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
      * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
      * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
      * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
      * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
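
      As a rough illustration of the helper-based design in the list above, the sketch below pairs each entry with a static helper that carries both a role and a deleter, where the deleter takes no key and receives an allocator. All names (`EntryRole`, `Allocator`, `ItemHelper`, `GetHelper`, `Entry`, `ReleaseEntry`) are hypothetical simplifications and not the actual `Cache::CacheItemHelper` definitions:

      ```cpp
      #include <cassert>
      #include <string>

      enum class EntryRole { kDataBlock, kFilterBlock, kMisc };

      // Toy allocator stand-in; it only records that the deleter saw it.
      struct Allocator {
        int deleter_calls = 0;
      };

      using ObjectPtr = void*;
      // No key parameter: the deleter needs only the object and the allocator.
      using DeleterFn = void (*)(ObjectPtr obj, Allocator* alloc);

      struct ItemHelper {
        DeleterFn del_cb;
        EntryRole role;  // the role travels with the helper, not the deleter
      };

      // One static helper per (type, role) pair.
      template <typename T, EntryRole kRole>
      const ItemHelper* GetHelper() {
        static const ItemHelper helper{
            [](ObjectPtr obj, Allocator* alloc) {
              delete static_cast<T*>(obj);
              if (alloc != nullptr) ++alloc->deleter_calls;
            },
            kRole};
        return &helper;
      }

      struct Entry {
        ObjectPtr obj;
        const ItemHelper* helper;
      };

      inline void ReleaseEntry(Entry& e, Allocator* alloc) {
        e.helper->del_cb(e.obj, alloc);
        e.obj = nullptr;
      }
      ```

      Because every entry has a helper, a cache implementation in this sketch can classify entries by `helper->role` without inspecting deleter addresses.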
      
      ## typed_cache.h
      Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
      
      The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
      * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
      * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
      * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
      * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
      
      These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
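
      The idea of a typed layer over an untyped cache can be sketched as follows. `UntypedCache` and `BasicTypedInterface` are hypothetical and far simpler than the interfaces in typed_cache.h (no handles, charges, or roles). The point is only that casts and deleters live in one place:

      ```cpp
      #include <cassert>
      #include <memory>
      #include <string>
      #include <unordered_map>

      // Hypothetical untyped cache: stores void* plus a deleter per entry.
      class UntypedCache {
       public:
        using Deleter = void (*)(void*);
        UntypedCache() = default;
        UntypedCache(const UntypedCache&) = delete;
        UntypedCache& operator=(const UntypedCache&) = delete;
        ~UntypedCache() {
          for (auto& kv : entries_) kv.second.del(kv.second.obj);
        }
        void Insert(const std::string& key, void* obj, Deleter del) {
          auto it = entries_.find(key);
          if (it != entries_.end()) {
            it->second.del(it->second.obj);  // destroy the replaced object
            entries_.erase(it);
          }
          entries_.emplace(key, Slot{obj, del});
        }
        void* Lookup(const std::string& key) const {
          auto it = entries_.find(key);
          return it == entries_.end() ? nullptr : it->second.obj;
        }

       private:
        struct Slot {
          void* obj;
          Deleter del;
        };
        std::unordered_map<std::string, Slot> entries_;
      };

      // Typed view: callers never write a cast or a deleter themselves.
      template <typename TValue>
      class BasicTypedInterface {
       public:
        explicit BasicTypedInterface(UntypedCache* cache) : cache_(cache) {}
        void Insert(const std::string& key, std::unique_ptr<TValue> value) {
          cache_->Insert(key, value.release(), [](void* p) {
            std::default_delete<TValue>()(static_cast<TValue*>(p));
          });
        }
        TValue* Lookup(const std::string& key) const {
          return static_cast<TValue*>(cache_->Lookup(key));
        }

       private:
        UntypedCache* cache_;  // not owned, like the non-"Shared" variants
      };
      ```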
      
      Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
      
      ## block_cache.h
      This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
      
      ## block_based_table_reader.cc
      Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
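
      A minimal sketch of what "BlockType as a static constant from the template parameter" can look like is given below. The types `DataBlock`, `FilterBlock`, `Stats`, and the function `RecordHit` are hypothetical and only illustrate the pattern:

      ```cpp
      #include <cassert>

      enum class BlockType { kData, kFilter };

      // Each block-like type carries its BlockType as a compile-time constant.
      struct DataBlock {
        static constexpr BlockType kBlockType = BlockType::kData;
      };
      struct FilterBlock {
        static constexpr BlockType kBlockType = BlockType::kFilter;
      };

      struct Stats {
        int data_hits = 0;
        int filter_hits = 0;
      };

      // The switch operates on a constant known per instantiation, so there
      // is no separate runtime BlockType argument that could disagree with
      // the template parameter.
      template <typename TBlocklike>
      void RecordHit(Stats& stats) {
        switch (TBlocklike::kBlockType) {
          case BlockType::kData:
            ++stats.data_hits;
            break;
          case BlockType::kFilter:
            ++stats.filter_hits;
            break;
        }
      }
      ```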
      
      The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
      
      ## block_based_table_builder.cc, cache_dump_load_impl.cc
      Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
      
      ## Everything else
      Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
      
      Test Plan:
      tests updated
      
      Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
      
      34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
      34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
      34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
      34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
      34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
      34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
      34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
      34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
      233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
      233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
      233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
      233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
      233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
      233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
      233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
      233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
      1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
      1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
      1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
      1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
      1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
      1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
      1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
      1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
      
      Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
      
      Reviewed By: anand1976
      
      Differential Revision: D42417818
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
  3. 26 Oct, 2022 1 commit
  4. 23 Sep, 2022 1 commit
    • Refactor to avoid confusing "raw block" (#10408) · ef443cea
      Peter Dillinger committed
      Summary:
      We have a lot of confusing code because of mixed, sometimes
      completely opposite uses of the term "raw block" or "raw contents",
      sometimes within the same source file. For example, in `BlockBasedTableBuilder`,
      `raw_block_contents` and `raw_size` generally referred to uncompressed block
      contents and size, while `WriteRawBlock` referred to writing a block that
      is already compressed if it is going to be. Meanwhile, in
      `BlockBasedTable`, `raw_block_contents` either referred to a (maybe
      compressed) block with trailer, or a maybe compressed block maybe
      without trailer. (Note: left as follow-up work to use C++ typing to
      better sort out the various kinds of BlockContents.)
      
      This change primarily tries to apply some consistent terminology around
      the kinds of block representations, avoiding the unclear "raw". (Any
      meaning of "raw" assumes some bias toward the storage layer or toward
      the logical data layer.) Preferred terminology:
      
      * **Serialized block** - bytes that go into storage. For block-based table
      (usually the case) this includes the block trailer. WART: block `size` may or
      may not include the trailer; need to be clear about whether it does or not.
      * **Maybe compressed block** - like a serialized block, but without the
      trailer (or no promise of including a trailer). Must be accompanied by a
      CompressionType.
      * **Uncompressed block** - "payload" bytes that are either stored with no
      compression, used as input to compression function, or result of
      decompression function.
      * **Parsed block** - an in-memory form of a block in block cache, as it is
      used by the table reader. Different C++ types are used depending on the
      block type (see block_like_traits.h).
      
      Other refactorings:
      * Misc corrections/improvements of internal API comments
      * Remove a few misleading / unhelpful / redundant comments.
      * Use move semantics in some places to simplify contracts
      * Use better parameter names to indicate which parameters are used for
      outputs
      * Remove some extraneous `extern`
      * Various clean-ups to `CacheDumperImpl` (mostly unnecessary code)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10408
      
      Test Plan: existing tests
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38172617
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ccb99299f324ac5ca46996d34c5089621a4f260c
  5. 13 Apr, 2022 1 commit
    • Meta-internal folly integration with F14FastMap (#9546) · efd03516
      Peter Dillinger committed
      Summary:
      Especially after updating to C++17, I don't see a compelling case for
      *requiring* any folly components in RocksDB. I was able to purge the existing
      hard dependencies, and it can be quite difficult to strip out non-trivial components
      from folly for use in RocksDB. (The prospect of doing that on F14 has changed
      my mind on the best approach here.)
      
      But this change creates an optional integration where we can plug in
      components from folly at compile time, starting here with F14FastMap to replace
      std::unordered_map when possible (probably no public APIs for example). I have
      replaced the biggest CPU users of std::unordered_map with compile-time
      pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set.
      USE_FOLLY is always set in the Meta-internal buck build, and a simulation of
      that is in the Makefile for public CI testing. A full folly build is not needed, but
      checking out the full folly repo is much simpler for getting the dependency,
      and anything else we might want to optionally integrate in the future.
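
      The compile-time pluggable map described above could look roughly like the sketch below. The exact form of the alias in RocksDB is not shown in this message, so the template parameters here are an assumption. Without `USE_FOLLY` defined, the alias resolves to `std::unordered_map`:

      ```cpp
      #include <cassert>
      #include <string>

      #ifdef USE_FOLLY
      #include <folly/container/F14Map.h>
      template <typename K, typename V>
      using UnorderedMap = folly::F14FastMap<K, V>;
      #else
      #include <unordered_map>
      template <typename K, typename V>
      using UnorderedMap = std::unordered_map<K, V>;
      #endif

      // Callers use the alias and are unaffected by which map backs it.
      inline int CountKeys(const UnorderedMap<std::string, int>& m) {
        return static_cast<int>(m.size());
      }
      ```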
      
      Some picky details:
      * I don't think the distributed mutex stuff is actually used, so it was easy to remove.
      * I implemented an alternative to `folly::constexpr_log2` (which is much easier
      in C++17 than C++11) so that I could pull out the hard dependencies on
      `ConstexprMath.h`
      * I had to add noexcept move constructors/operators to some types to make
      F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a
      macro to make that easier in some common cases.
      * Updated Meta-internal buck build to use folly F14Map (always)
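
      For reference, a constexpr floor-log2 is short to write once loops are allowed in constexpr functions (C++14 and later). The function below is an illustrative sketch named `ConstexprFloorLog2` here, not necessarily the replacement that was added to RocksDB:

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Floor of log2(v) for unsigned v; returns 0 for v == 0 and v == 1.
      template <typename T>
      constexpr int ConstexprFloorLog2(T v) {
        int rv = 0;
        while (v > 1) {
          v >>= 1;
          ++rv;
        }
        return rv;
      }

      // Usable in constant expressions.
      static_assert(ConstexprFloorLog2(1u) == 0, "log2(1) == 0");
      static_assert(ConstexprFloorLog2(64u) == 6, "log2(64) == 6");
      static_assert(ConstexprFloorLog2(std::uint64_t{1} << 40) == 40,
                    "works for 64-bit values");
      ```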
      
      No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a
      production integration for open source users.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546
      
      Test Plan:
      CircleCI tests updated so that a couple of them use folly.
      
      Most internal unit & stress/crash tests updated to use Meta-internal latest folly.
      (Note: they should probably use buck but they currently use Makefile.)
      
      Example performance improvement: when filter partitions are pinned in cache,
      they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build
      a test that exercises that heavily. Build DB with
      
      ```
      TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters
      ```
      
      and test with (simultaneous runs with & without folly, ~20 times each to see
      convergence)
      
      ```
      TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache
      ```
      
      Average ops/s no folly: 26229.2
      Average ops/s with folly: 26853.3 (+2.4%)
      
      Reviewed By: ajkr
      
      Differential Revision: D34181736
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
  6. 19 Jun, 2021 1 commit
    • Parallelize secondary cache lookup in MultiGet (#8405) · 8ea0a2c1
      anand76 committed
      Summary:
      Implement the `WaitAll()` interface in `LRUCache` to allow callers to issue multiple lookups in parallel and wait for all of them to complete. Modify `MultiGet` to use this to parallelize the secondary cache lookups in order to reduce the overall latency. A call to `cache->Lookup()` returns a handle that has an incomplete value (nullptr), and the caller can call `cache->IsReady()` to check whether the lookup is complete, and pass a vector of handles to `WaitAll` to wait for completion. If any of the lookups fail, `MultiGet` will read the block from the SST file.
      
      Another change in this PR is to rename `SecondaryCacheHandle` to `SecondaryCacheResultHandle` as it more accurately describes the return result of the secondary cache lookup, which is more like a future.
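
      The caller-side control flow (issue every lookup first, wait once for the batch, fall back on failures) can be sketched as below. `LookupHandle`, `StartLookup`, and `MultiGetSketch` are hypothetical names, and this `WaitAll` is a free function rather than the cache method. The pending work is modeled as a deferred callable that runs inside `WaitAll`, so the sketch shows the ordering of calls only, not real overlapping I/O:

      ```cpp
      #include <cassert>
      #include <functional>
      #include <string>
      #include <vector>

      // Hypothetical handle whose value may still be pending after lookup.
      struct LookupHandle {
        std::function<bool(std::string*)> pending;  // stand-in for a read
        bool ready = false;
        bool found = false;
        std::string value;
        bool IsReady() const { return ready; }
      };

      // Starts a lookup and returns immediately with an incomplete handle.
      // An empty key simulates a failed secondary cache lookup.
      inline LookupHandle StartLookup(const std::string& key) {
        LookupHandle h;
        h.pending = [key](std::string* out) {
          if (key.empty()) return false;
          *out = "cached:" + key;
          return true;
        };
        return h;
      }

      // Completes every handle that is not ready yet.
      inline void WaitAll(std::vector<LookupHandle>& handles) {
        for (auto& h : handles) {
          if (!h.ready) {
            h.found = h.pending(&h.value);
            h.ready = true;
          }
        }
      }

      inline std::vector<std::string> MultiGetSketch(
          const std::vector<std::string>& keys) {
        std::vector<LookupHandle> handles;
        for (const auto& k : keys) handles.push_back(StartLookup(k));
        WaitAll(handles);  // one wait for the whole batch
        std::vector<std::string> results;
        for (const auto& h : handles) {
          // A failed lookup falls back to reading from the SST file.
          results.push_back(h.found ? h.value : "from-sst");
        }
        return results;
      }
      ```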
      
      Tests:
      1. Add unit tests in lru_cache_test
      2. Benchmark results with no secondary cache configured
      Master -
      ```
      readrandom   :      41.175 micros/op 388562 ops/sec;  106.7 MB/s (7277999 of 7277999 found)
      readrandom   :      41.217 micros/op 388160 ops/sec;  106.6 MB/s (7274999 of 7274999 found)
      multireadrandom :      10.309 micros/op 1552082 ops/sec; (28908992 of 28908992 found)
      multireadrandom :      10.321 micros/op 1550218 ops/sec; (29081984 of 29081984 found)
      ```
      
      This PR -
      ```
      readrandom   :      41.158 micros/op 388723 ops/sec;  106.8 MB/s (7290999 of 7290999 found)
      readrandom   :      41.185 micros/op 388463 ops/sec;  106.7 MB/s (7287999 of 7287999 found)
      multireadrandom :      10.277 micros/op 1556801 ops/sec; (29346944 of 29346944 found)
      multireadrandom :      10.253 micros/op 1560539 ops/sec; (29274944 of 29274944 found)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8405
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D29190509
      
      Pulled By: anand1976
      
      fbshipit-source-id: 6f8eff6246712af8a297cfe22ea0d1c3b2a01bb0
  7. 21 Feb, 2020 1 commit
    • Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      sdong committed
      Summary:
      When two binaries that each embed a different build of RocksDB are dynamically linked together, the conflicting symbols can cause errors. To give users a way to solve this, the hard-coded `rocksdb` namespace is replaced with the macro ROCKSDB_NAMESPACE, which can be overridden at build time.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
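
      A minimal sketch of the mechanism: the macro defaults to `rocksdb`, and a build can override it (for example with `-DROCKSDB_NAMESPACE=my_rocksdb`, where `my_rocksdb` is an arbitrary example name). The function `ExampleVersion` is hypothetical and exists only to give the namespace some content:

      ```cpp
      #include <cassert>

      // Default to the traditional name unless the build overrides it.
      #ifndef ROCKSDB_NAMESPACE
      #define ROCKSDB_NAMESPACE rocksdb
      #endif

      namespace ROCKSDB_NAMESPACE {
      // All library symbols live in whichever namespace the macro names.
      inline int ExampleVersion() { return 42; }
      }  // namespace ROCKSDB_NAMESPACE

      // Code that refers to the library through the macro compiles
      // unchanged under either name.
      inline int CallThroughMacro() {
        return ROCKSDB_NAMESPACE::ExampleVersion();
      }
      ```

      With two copies built under different namespace names, their symbols no longer collide at link time.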
      
      Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
  8. 17 Jul, 2019 1 commit
    • Move the filter readers out of the block cache (#5504) · 3bde41b5
      Levi Tamasi committed
      Summary:
      Currently, when the block cache is used for the filter block, it is not
      really the block itself that is stored in the cache but a FilterBlockReader
      object. Since this object is not pure data (it has, for instance, pointers that
      might dangle, including in one case a back pointer to the TableReader), it's not
      really sharable. To avoid the issues around this, the current code erases the
      cache entries when the TableReader is closed (which, BTW, is not sufficient
      since a concurrent TableReader might have picked up the object in the meantime).
      Instead of doing this, the patch moves the FilterBlockReader out of the cache
      altogether, and decouples the filter reader object from the filter block.
      In particular, instead of the TableReader owning, or caching/pinning the
      FilterBlockReader (based on the customer's settings), with the change the
      TableReader unconditionally owns the FilterBlockReader, which in turn
      owns/caches/pins the filter block. This change also enables us to reuse the code
      paths historically used for data blocks for filters as well.
      
      Note:
      Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
      separate phase.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504
      
      Test Plan: make asan_check
      
      Differential Revision: D16036974
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
  9. 31 May, 2019 1 commit
  10. 11 May, 2019 1 commit
    • Turn CachableEntry into a proper resource handle (#5252) · f0bf3bf3
      Levi Tamasi committed
      Summary:
      CachableEntry is used in a variety of contexts: it may refer to a cached
      object (i.e. an object in the block cache), an owned object, or an
      unowned object; also, in some cases (most notably with iterators), the
      responsibility of managing the pointed-to object gets handed off to
      another object. Each of the above scenarios has different implications
      for the lifecycle of the referenced object. For the most part, the patch
      does not change the lifecycle of managed objects; however, it makes
      these relationships explicit, and it also enables us to eliminate some
      hacks and accident-prone code around releasing cache handles and
      deleting/cleaning up objects. (The only places where the patch changes
      how objects are managed are the partitions of partitioned indexes and
      filters.)
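
      To make the three ownership scenarios concrete, here is a simplified, move-only handle. `ToyCache` and `EntryHandle` are hypothetical and much smaller than the real CachableEntry. The handle releases the cache handle, deletes an owned object, or does nothing for an unowned one:

      ```cpp
      #include <cassert>
      #include <utility>

      // Toy cache that only counts Release() calls.
      struct ToyCache {
        int releases = 0;
        void Release(int* /*handle*/) { ++releases; }
      };

      template <typename T>
      class EntryHandle {
       public:
        EntryHandle() = default;
        static EntryHandle Cached(T* v, ToyCache* c, int* h) {
          return EntryHandle(v, c, h, false);
        }
        static EntryHandle Owned(T* v) {
          return EntryHandle(v, nullptr, nullptr, true);
        }
        static EntryHandle Unowned(T* v) {
          return EntryHandle(v, nullptr, nullptr, false);
        }

        EntryHandle(const EntryHandle&) = delete;
        EntryHandle& operator=(const EntryHandle&) = delete;
        EntryHandle(EntryHandle&& o) noexcept
            : value_(o.value_), cache_(o.cache_), handle_(o.handle_),
              own_(o.own_) {
          o.Clear();  // responsibility is handed off to the new handle
        }
        EntryHandle& operator=(EntryHandle&& o) noexcept {
          if (this != &o) {
            Reset();
            value_ = o.value_;
            cache_ = o.cache_;
            handle_ = o.handle_;
            own_ = o.own_;
            o.Clear();
          }
          return *this;
        }
        ~EntryHandle() { Reset(); }

        // Releases whatever this handle is responsible for.
        void Reset() {
          if (handle_ != nullptr) {
            cache_->Release(handle_);
          } else if (own_) {
            delete value_;
          }
          Clear();
        }
        T* get() const { return value_; }

       private:
        EntryHandle(T* v, ToyCache* c, int* h, bool own)
            : value_(v), cache_(c), handle_(h), own_(own) {}
        void Clear() {
          value_ = nullptr;
          cache_ = nullptr;
          handle_ = nullptr;
          own_ = false;
        }
        T* value_ = nullptr;
        ToyCache* cache_ = nullptr;
        int* handle_ = nullptr;
        bool own_ = false;
      };
      ```

      Moving the handle transfers responsibility and leaves the source empty, so a cache handle is released exactly once.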
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5252
      
      Differential Revision: D15101358
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 9eb59e9ae5a7230e3345789762d0ba1f189485be