1. 10 February 2023 (1 commit)
    • Put Cache and CacheWrapper in new public header (#11192) · 3cacd4b4
      Peter Dillinger authored
      Summary:
      The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment.
      
      In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544
      
      Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it.
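
      For illustration, here is a minimal sketch of the kind of wrapper this enables. The class name and its behavior are hypothetical; only CacheWrapper (now in advanced_cache.h) and the requirement that derived classes provide Name() come from this change, and exact signatures may differ between releases.

      ```
      // Hypothetical pass-through wrapper built on the public CacheWrapper.
      // Name() no longer has a default implementation, so it must be provided.
      #include <memory>
      #include <utility>

      #include "rocksdb/advanced_cache.h"

      class TracingCache : public ROCKSDB_NAMESPACE::CacheWrapper {
       public:
        explicit TracingCache(std::shared_ptr<ROCKSDB_NAMESPACE::Cache> target)
            : CacheWrapper(std::move(target)) {}
        const char* Name() const override { return "TracingCache"; }
        // Other Cache methods are forwarded to the wrapped cache by CacheWrapper;
        // override individual ones here to add tracing, stats, fault injection, etc.
      };
      ```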
      
      Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192
      
      Test Plan: `make check`
      
      Reviewed By: anand1976
      
      Differential Revision: D43055487
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110
  2. 28 January 2023 (1 commit)
    • Remove RocksDB LITE (#11147) · 4720ba43
      sdong authored
      Summary:
      We haven't been actively maintaining RocksDB LITE recently, and its size must have gone up significantly. We are removing the support.
      
      Most of the changes were made with the following command:
      
      unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
      
      by Peter Dillinger. Other changes were applied manually to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and the file db_stress_test_base.cc.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
      
      Test Plan: See CI
      
      Reviewed By: pdillinger
      
      Differential Revision: D42796341
      
      fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
  3. 25 January 2023 (1 commit)
  4. 12 January 2023 (1 commit)
    • Major Cache refactoring, CPU efficiency improvement (#10975) · 9f7801c5
      Peter Dillinger authored
      Summary:
      This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to the Cache class, and it becomes more of a low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because it rather effectively decouples the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
      
      The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
      
      * static_cast lines of code +29 -35 (net removed 6)
      * reinterpret_cast lines of code +6 -32 (net removed 26)
      
      ## cache.h and secondary_cache.h
      * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
        * Simpler for implementations to deal with just one Insert and one Lookup.
        * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
        * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
        * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
        * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
        * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
      * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
      * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
      * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
      * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
      * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
      * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
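
      To make the first bullet above concrete, here is a loose, self-contained toy (not RocksDB's actual declarations) of the idea that each entry carries a pointer to a helper bundling its role and callbacks, rather than a bare deleter:

      ```
      // Toy model only: one helper per kind of entry, bundling role + callbacks;
      // the cache stores a single helper pointer per entry.
      #include <string>
      #include <unordered_map>

      enum class ToyEntryRole { kDataBlock, kFilterBlock, kBlobValue, kMisc };

      struct ToyItemHelper {
        ToyEntryRole role;            // classification, no deleter-based hacks needed
        void (*del_cb)(void* obj);    // how to destroy the cached object
      };

      struct ToyEntry {
        void* obj;
        const ToyItemHelper* helper;  // must outlive the entry
      };

      class ToyCache {
       public:
        void Insert(const std::string& key, void* obj, const ToyItemHelper* helper) {
          map_[key] = ToyEntry{obj, helper};
        }
        ~ToyCache() {
          for (auto& kv : map_) kv.second.helper->del_cb(kv.second.obj);
        }

       private:
        std::unordered_map<std::string, ToyEntry> map_;
      };
      ```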
      
      ## typed_cache.h
      Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
      
      The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
      * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
      * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
      * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
      * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
      
      These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
      
      Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
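
      As a rough, self-contained illustration of the "typed interface over an untyped cache" idea (all names below are made up; the real typed_cache.h interfaces are richer and sit on top of the Cache API described above):

      ```
      // Toy only: the wrapper hides void* casting and supplies std::default_delete,
      // so call sites deal in TValue* instead of raw pointers and deleters.
      #include <memory>
      #include <string>
      #include <unordered_map>

      class ToyUntypedCache {
       public:
        void Insert(const std::string& key, void* obj, void (*del)(void*)) {
          entries_[key] = Entry{obj, del};
        }
        void* Lookup(const std::string& key) const {
          auto it = entries_.find(key);
          return it == entries_.end() ? nullptr : it->second.obj;
        }
        ~ToyUntypedCache() {
          for (auto& kv : entries_) kv.second.del(kv.second.obj);
        }

       private:
        struct Entry { void* obj; void (*del)(void*); };
        std::unordered_map<std::string, Entry> entries_;
      };

      template <class TValue>
      class BasicTypedToyInterface {
       public:
        explicit BasicTypedToyInterface(ToyUntypedCache* cache) : cache_(cache) {}
        void Insert(const std::string& key, std::unique_ptr<TValue> value) {
          cache_->Insert(key, value.release(), [](void* p) {
            std::default_delete<TValue>()(static_cast<TValue*>(p));
          });
        }
        TValue* Lookup(const std::string& key) const {
          return static_cast<TValue*>(cache_->Lookup(key));
        }

       private:
        ToyUntypedCache* cache_;
      };
      ```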
      
      ## block_cache.h
      This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
      
      ## block_based_table_reader.cc
      Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
      
      The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
      
      ## block_based_table_builder.cc, cache_dump_load_impl.cc
      Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
      
      ## Everything else
      Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
      
      Test Plan:
      tests updated
      
      Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (grouped by cache size; configurations marked "hyper" use HyperClockCache, the rest use LRUCache):
      
      34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
      34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
      34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
      34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
      34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
      34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
      34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
      34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
      233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
      233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
      233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
      233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
      233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
      233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
      233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
      233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
      1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
      1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
      1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
      1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
      1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
      1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
      1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
      1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
      
      Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
      
      Reviewed By: anand1976
      
      Differential Revision: D42417818
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
  5. 22 November 2022 (2 commits)
    • Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945) · e079d562
      Peter Dillinger authored
      Summary:
      Can simplify some ugly code in cache_dump_load_impl.cc by having an API in SecondaryCache that can directly consume persisted data.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10945
      
      Test Plan: existing tests for CacheDumper, added basic unit test
      
      Reviewed By: anand1976
      
      Differential Revision: D41231497
      
      Pulled By: pdillinger
      
      fbshipit-source-id: b8ec993ef7d3e7efd68aae8602fd3f858da58068
    • Observe and warn about misconfigured HyperClockCache (#10965) · 3182beef
      Peter Dillinger authored
      Summary:
      Background. One of the core risks of choosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to an under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to a fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused.
      
      Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file).
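
      As a reminder of what the knob looks like, a hedged configuration sketch follows (estimated_entry_charge and MakeSharedCache() are the pieces discussed here; exact constructor parameters may vary by release):

      ```
      // Sketch: size estimated_entry_charge to roughly match the configured
      // block size so the fixed-size hash table is neither under- nor
      // over-provisioned.
      #include <memory>

      #include "rocksdb/cache.h"

      std::shared_ptr<ROCKSDB_NAMESPACE::Cache> MakeTunedHyperClockCache() {
        ROCKSDB_NAMESPACE::HyperClockCacheOptions opts(
            /*capacity=*/512u << 20,
            /*estimated_entry_charge=*/8 * 1024);  // keep in sync with block_size
        return opts.MakeSharedCache();
      }
      ```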
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965
      
      Test Plan:
      unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not.
      
      Also manual testing with db_bench. Create db with
      ```
      ./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1
      ```
      Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge)
      ```
      ./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX
      ```
      Checked that warnings / errors appear (or not) as expected.
      
      Reviewed By: anand1976
      
      Differential Revision: D41406932
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7
  6. 18 November 2022 (1 commit)
    • Mark HyperClockCache as production-ready (#10963) · 8c0f5b1f
      Peter Dillinger authored
      Summary:
      After a couple of minor bug fixes and successful production roll-outs in a few places, I think we can mark this as production-ready. It has a clear value proposition for many workloads, even if we don't have clear advice for every workload yet.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10963
      
      Test Plan: existing tests, comment changes only
      
      Reviewed By: siying
      
      Differential Revision: D41384083
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 56359f01a57bb28de8697666b342382fac72ce6d
  7. 17 November 2022 (1 commit)
  8. 12 November 2022 (1 commit)
    • Don't attempt to use SecondaryCache on block_cache_compressed (#10944) · f321e8fc
      Peter Dillinger authored
      Summary:
      Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom!
      
      This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed.
      
      Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache.
      
      This problem was discovered during refactoring, but I didn't want to combine the bug fix with that refactoring.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944
      
      Test Plan: test added that fails on base revision (at least with ASAN)
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D41205578
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f
  9. 03 November 2022 (1 commit)
    • Refactor (Hyper)ClockCache code (#10887) · cc8c8f69
      Peter Dillinger authored
      Summary:
      For clean-up and in preparation for some other anticipated changes, including
      * A new dynamically-scaling variant of HyperClockCache
      * SecondaryCache support for HyperClockCache
      
      This change does some refactoring for current and future code sharing and reusability. (Including follow-up on https://github.com/facebook/rocksdb/issues/10843)
      
      ## clock_cache.h
      * TBD whether new variant will be a HyperClockCache or use some other name, so namespace is just clock_cache for the family of structures.
      * A number of helper functions introduced and used.
      * Pre-emptively split ClockHandle (shared among lock-free clock cache variants) and HandleImpl (specific to a kind of Table), and introduce template to plug new Table implementation into ClockCacheShard.
      
      ## clock_cache.cc
      * Mostly using helper functions. Some things like `Rollback()` and `FreeDataMarkEmpty()` were not combined because `Rollback()` is Table-specific while `FreeDataMarkEmpty()` can be used with different table implementations.
      * Performance testing indicated that despite more opportunities for parallelism, making a local copy of handle data for processing after marking an entry empty was slower than doing that processing before marking the entry empty (but after marking it "under construction"), thus avoiding a few words of copying data. At least for now, this answers the "TODO? Delay freeing?" questions (no).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10887
      
      Test Plan:
      fixed a unit testing gap; other minor test updates for refactoring
      
      No functionality change
      
      ## Performance
      Same setup as https://github.com/facebook/rocksdb/issues/10801:
      
      Before: `readrandom [AVG 81 runs] : 627992 (± 5124) ops/sec`
      After: `readrandom [AVG 81 runs] : 637512 (± 4866) ops/sec`
      
      I've been getting some inconsistent results on restarts, as if the system is not being fair to the two processes, so I'm not sure there's a real difference.
      
      Reviewed By: anand1976
      
      Differential Revision: D40959240
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 0a8f3646b3bdb5bc7aaad60b26790b0779189949
  10. 01 November 2022 (1 commit)
    • Fix compilation errors, clang++-15 (#10907) · 9f3475ec
      Denis Hananein authored
      Summary:
      I've tried to compile the main branch, but there are two minor things which cause compile errors (CE).
      I'm not sure about the second one (`num_empty_non_l0_level`); probably there should be an additional assert.
      
      ```
      -c ../cache/clock_cache.cc
      [build] ../cache/clock_cache.cc:855:15: error: variable 'i' set but not used [-Werror,-Wunused-but-set-variable]
      [build]   for (size_t i = 0; &array_[current] != h; i++) {
      [build]               ^
      ```
      
      ```
      [build] ../db/version_set.cc:3665:7: error: variable 'num_empty_non_l0_level' set but not used [-Werror,-Wunused-but-set-variable]
      [build]   int num_empty_non_l0_level = 0;
      [build]       ^
      [build] 1 error generated.
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10907
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D40866667
      
      Pulled By: ajkr
      
      fbshipit-source-id: 963b7bd56859d0b3b2779cd36fad229425cb7b17
  11. 28 October 2022 (1 commit)
  12. 27 October 2022 (1 commit)
  13. 22 October 2022 (1 commit)
    • Fix HyperClockCache Rollback bug in #10801 (#10843) · b6e33dbc
      Peter Dillinger authored
      Summary:
      In https://github.com/facebook/rocksdb/issues/10801 in ClockHandleTable::Evict, we saved a reference to the hash value (`const UniqueId64x2& hashed_key`) instead of saving the hash value itself before marking the handle as empty and thus free for use by other threads. This could lead to Rollback seeing the wrong hash value for updating the `displacements` after an entry is removed.
      
      The fix is (like other places) to copy the hash value before it's released. (We could Rollback while we own the entry, but that creates more dependences between atomic updates, because in that case, based on the code, the Rollback writes would have to happen before or after the entry is released by marking empty. By doing the relaxed Rollback after marking empty, there's more opportunity for re-ordering / ILP.)
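
      A stripped-down sketch of the pattern behind the fix (types and function names below are stand-ins, not the actual clock_cache.cc code):

      ```
      // Key point: take the hash value by copy before the slot is marked empty,
      // because another thread may reuse the slot immediately afterwards.
      #include <array>
      #include <cstdint>

      using ToyHash = std::array<uint64_t, 2>;

      struct ToySlot { ToyHash hashed_key; /* ... */ };

      void MarkEmpty(ToySlot&) { /* release the slot for reuse by other threads */ }
      void Rollback(const ToyHash&) { /* adjust displacement counters */ }

      void EvictOne(ToySlot& slot) {
        const ToyHash hashed_key = slot.hashed_key;  // copy, NOT `const ToyHash&`
        MarkEmpty(slot);       // slot contents may now be overwritten
        Rollback(hashed_key);  // safe: operates on the saved copy
      }
      ```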
      
      Intended follow-up: refactoring for better code sharing in clock_cache.cc
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10843
      
      Test Plan: watch for clean crash test, TSAN
      
      Reviewed By: siying
      
      Differential Revision: D40579680
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 258e43b3b80bc980a161d5c675ccc6708ecb8025
  14. 19 October 2022 (1 commit)
    • Refactor ShardedCache for more sharing, static polymorphism (#10801) · 7555243b
      Peter Dillinger authored
      Summary:
      The motivations for this change include
      * Free up space in ClockHandle so that we can add data for secondary cache handling while still keeping within single cache line (64 byte) size.
        * This change frees up space by eliminating the need for the `hash` field by making the fixed-size key itself a hash, using a 128-bit bijective (lossless) hash.
      * Generally more customizability of ShardedCache (such as hashing) without worrying about virtual call overheads
        * ShardedCache now uses static polymorphism (template) instead of dynamic polymorphism (virtual overrides) for the CacheShard. No obvious performance benefit is seen from the change (as mostly expected; most calls to virtual functions in CacheShard could already be optimized to static calls), but offers more flexibility without incurring the runtime cost of adhering to a common interface (without type parameters or static callbacks).
        * You'll also notice less `reinterpret_cast`ing and other boilerplate in the Cache implementations, as this can go in ShardedCache.
      
      More detail:
      * Don't have LRUCacheShard maintain `std::shared_ptr<SecondaryCache>` copies (extra refcount) when LRUCache can be in charge of keeping a `shared_ptr`.
      * Renamed `capacity_mutex_` to `config_mutex_` to better represent the scope of what it guards.
      * Some preparation for 64-bit hash and indexing in LRUCache, but didn't include the full change because of slight performance regression.
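
      A minimal sketch of the static-polymorphism shape (all names below are illustrative stand-ins, not the real ShardedCache/CacheShard types):

      ```
      // The shard type is a template parameter, so calls resolve at compile time
      // rather than through virtual dispatch, while the common sharding logic
      // stays in one place.
      #include <cstddef>
      #include <cstdint>
      #include <vector>

      template <class ShardT>
      class ToyShardedCache {
       public:
        explicit ToyShardedCache(std::size_t num_shards) : shards_(num_shards) {}
        bool Lookup(uint64_t hashed_key) {
          return ShardFor(hashed_key).Lookup(hashed_key);  // non-virtual call
        }

       private:
        ShardT& ShardFor(uint64_t h) { return shards_[h % shards_.size()]; }
        std::vector<ShardT> shards_;
      };

      struct ToyClockShard {
        bool Lookup(uint64_t /*hashed_key*/) { return false; }  // stand-in logic
      };

      // Usage: ToyShardedCache<ToyClockShard> cache(/*num_shards=*/16);
      ```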
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10801
      
      Test Plan:
      Unit test updates were non-trivial because of major changes to the ClockCacheShard interface in handling of key vs. hash.
      
      Performance:
      Create with `TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16`
      
      Test with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X1000] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=16
      ```
      
      Before: `readrandom [AVG 150 runs] : 321147 (± 253) ops/sec`
      After: `readrandom [AVG 150 runs] : 321530 (± 326) ops/sec`
      
      So possibly ~0.1% improvement.
      
      And with `-cache_type=hyper_clock_cache`:
      Before: `readrandom [AVG 30 runs] : 614126 (± 7978) ops/sec`
      After: `readrandom [AVG 30 runs] : 645349 (± 8087) ops/sec`
      
      So roughly 5% improvement!
      
      Reviewed By: anand1976
      
      Differential Revision: D40252236
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ff8fc70ef569585edc95bcbaaa0386f61355ae5b
  15. 18 October 2022 (1 commit)
    • Print stack traces on frozen tests in CI (#10828) · e466173d
      Peter Dillinger authored
      Summary:
      Instead of the existing calls to ps from gnu_parallel, call a new wrapper that runs ps, looks for unit-test-like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.
      
      For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)
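
      The usual way to relax the Yama restriction from inside a process is prctl(PR_SET_PTRACER, ...); a sketch of that class of work-around follows (assuming Linux; it illustrates the mechanism, not necessarily the exact code added to InstallStackTraceHandler()):

      ```
      // Allow any process of the same user to attach a debugger (pstack/gdb) to
      // this process even when /proc/sys/kernel/yama/ptrace_scope == 1.
      #include <sys/prctl.h>

      void AllowPtraceFromSameUser() {
      #ifdef PR_SET_PTRACER
        // PR_SET_PTRACER_ANY permits any (same-user) process to ptrace us.
        (void)prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, 0, 0, 0);
      #endif
      }
      ```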
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828
      
      Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI
      
      Reviewed By: hx235
      
      Differential Revision: D40447634
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
  16. 07 October 2022 (1 commit)
    • Fix bug in HyperClockCache ApplyToEntries; cleanup (#10768) · b205c6d0
      Peter Dillinger authored
      Summary:
      We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.)
      
      This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768
      
      Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test)
      
      Reviewed By: anand1976
      
      Differential Revision: D40034747
      
      Pulled By: anand1976
      
      fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68
  17. 04 October 2022 (1 commit)
    • Some clean-up of secondary cache (#10730) · 5f4391dd
      Peter Dillinger authored
      Summary:
      This is intended as a step toward possibly separating secondary cache integration from the
      Cache implementation as much as possible, to (hopefully) minimize code duplication in
      adding secondary cache support to HyperClockCache.
      * Major clarifications to API docs of secondary cache compatible parts of Cache. For example, previously the docs seemed to suggest that Wait() was not needed if IsReady()==true. And it wasn't clear what operations were actually supported on pending handles.
      * Add some assertions related to these requirements, such as that we don't Release() before Wait() (which would leak a secondary cache handle).
      * Fix a leaky abstraction with dummy handles, which are supposed to be internal to the Cache. Previously, these just used value=nullptr to indicate dummy handle, which meant that they could be confused with legitimate value=nullptr cases like cache reservations. Also fixed blob_source_test which was relying on this leaky abstraction.
      * Drop "incomplete" terminology, which was another name for "pending".
      * Split handle flags into "mutable" ones requiring mutex and "immutable" ones which do not. Because of single-threaded access to pending handles, the "Is Pending" flag can be in the "immutable" set. This allows removal of a TSAN work-around and removing a mutex acquire-release in IsReady().
      * Remove some unnecessary handling of charges on handles of failed lookups. Keeping total_charge=0 means no special handling needed. (Removed one unnecessary mutex acquire/release.)
      * Simplify handling of dummy handle in Lookup(). There is no need to explicitly Ref & Release w/Erase if we generally overwrite the dummy anyway. (Removed one mutex acquire/release, a call to Release().)
      
      Intended follow-up:
      * Clarify APIs in secondary_cache.h
        * Doesn't SecondaryCacheResultHandle transfer ownership of the Value() on success (implementations should not release the value in destructor)?
        * Does Wait() need to be called if IsReady() == true? (This would be different from Cache.)
        * Do Value() and Size() have undefined behavior if IsReady() == false?
        * Why have a custom API for what is essentially a std::future<std::pair<void*, size_t>>?
      * Improve unit testing of standalone handle case
      * Apparent null `e` bug in `free_standalone_handle` case
      * Clean up secondary cache testing in lru_cache_test
        * Why does TestSecondaryCacheResultHandle hold on to a Cache::Handle?
        * Why does TestSecondaryCacheResultHandle::Wait() do nothing? Shouldn't it establish the post-condition IsReady() == true?
        * (Assuming that is sorted out...) Shouldn't TestSecondaryCache::WaitAll simply wait on each handle in order (no casting required)? How about making that the default implementation?
        * Why does TestSecondaryCacheResultHandle::Size() check Value() first? If the API is intended to be returning 0 before IsReady(), then that is weird but should at least be documented. Otherwise, if it's intended to be undefined behavior, we should assert IsReady().
      * Consider replacing "standalone" and "dummy" entries with a single kind of "weak" entry that deletes its value when it reaches zero refs. Suppose you are using compressed secondary cache and have two iterators at similar places. It will probably be common for one iterator to have standalone results pinned (out of cache) when the second iterator needs those same blocks and has to re-load them from secondary cache and duplicate the memory. Combining the dummy and the standalone should fix this.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10730
      
      Test Plan:
      existing tests (minor update), and crash test with sanitizers and secondary cache
      
      Performance test for any regressions in LRUCache (primary only):
      Create DB with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
      ```
      Test before & after (run at same time) with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X100] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=233000000 -duration 30 -threads=16
      ```
      Before: readrandom [AVG    100 runs] : 22234 (± 63) ops/sec;    1.6 (± 0.0) MB/sec
      After: readrandom [AVG    100 runs] : 22197 (± 64) ops/sec;    1.6 (± 0.0) MB/sec
      That's within 0.2%, which is not significant by the confidence intervals.
      
      Reviewed By: anand1976
      
      Differential Revision: D39826010
      
      Pulled By: anand1976
      
      fbshipit-source-id: 3202b4a91f673231c97648ae070e502ae16b0f44
  18. 30 September 2022 (1 commit)
  19. 22 September 2022 (1 commit)
  20. 17 September 2022 (2 commits)
    • Add enable_split_merge option for CompressedSecondaryCache (#10690) · 2cc5b395
      gitbw95 authored
      Summary:
      `enable_custom_split_merge` is added for enabling the custom split and merge feature, which splits the compressed value into chunks so that they may better fit jemalloc bins.
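
      Roughly how the option is meant to be wired up (a hedged sketch; the option struct and factory shown here exist in this area of the API, but field names and signatures may differ slightly by release):

      ```
      // Sketch: enable chunk split/merge on the compressed secondary cache and
      // attach it to an LRU primary block cache.
      #include <memory>

      #include "rocksdb/cache.h"

      std::shared_ptr<ROCKSDB_NAMESPACE::Cache> MakeTieredBlockCache() {
        ROCKSDB_NAMESPACE::CompressedSecondaryCacheOptions sec_opts;
        sec_opts.capacity = 256u << 20;
        sec_opts.enable_custom_split_merge = true;  // split values into chunks

        ROCKSDB_NAMESPACE::LRUCacheOptions pri_opts;
        pri_opts.capacity = 128u << 20;
        pri_opts.secondary_cache =
            ROCKSDB_NAMESPACE::NewCompressedSecondaryCache(sec_opts);
        return ROCKSDB_NAMESPACE::NewLRUCache(pri_opts);
      }
      ```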
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10690
      
      Test Plan:
      Unit Tests
      Stress Tests
      
      Reviewed By: anand1976
      
      Differential Revision: D39567604
      
      Pulled By: gitbw95
      
      fbshipit-source-id: f6d1d46200f365220055f793514601dcb0edc4b7
    • Call experimental new clock cache HyperClockCache (#10684) · 0f91c72a
      Peter Dillinger authored
      Summary:
      This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons:
      * We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache."
      * We want to highlight the key feature: it's fast (especially under parallel load)
      * Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements.
      * We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.)
      
      Some API detail:
      * To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`.
      * Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated.
      
      For performance tests see https://github.com/facebook/rocksdb/pull/10626
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684
      
      Test Plan: no interesting functional changes; tests updated
      
      Reviewed By: anand1976
      
      Differential Revision: D39547800
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf
  21. 16 September 2022 (1 commit)
    • Revamp, optimize new experimental clock cache (#10626) · 57243486
      Peter Dillinger authored
      Summary:
      * Consolidates most metadata into a single word per slot so that more
      can be accomplished with a single atomic update. In the common case,
      Lookup was previously about 4 atomic updates, now just 1 atomic update.
      Common case Release was previously 1 atomic read + 1 atomic update,
      now just 1 atomic update.
      * Eliminate spins / waits / yields, which likely threaten some "lock free"
      benefits. Compare-exchange loops are only used in explicit Erase, and
      strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
      exchange.
      * Relaxes some aggressiveness and guarantees. For example,
        * Duplicate Inserts will sometimes go undetected and the shadow duplicate
          will age out with eviction.
        * In many cases, the older Inserted value for a given cache key will be kept
        (i.e. Insert does not support overwrite).
        * Entries explicitly erased (rather than evicted) might not be freed
        immediately in some rare cases.
        * With strict_capacity_limit=false, capacity limit is not tracked/enforced as
        precisely as LRUCache, but is self-correcting and should only deviate by a
        very small number of extra or fewer entries.
      * Use smaller "computed default" number of cache shards in many cases,
      because benefits to larger usage tracking / eviction pools outweigh the small
      cost of more lock-free atomic contention. The improvement in CPU and I/O
      is dramatic in some limit-memory cases.
      * Even without the sharding change, the eviction algorithm is likely more
      effective than LRU overall because it's more stateful, even though the
      "hot path" state tracking for it is essentially free with ref counting. It
      is like a generalized CLOCK with aging (see code comments). I don't have
      performance numbers showing a specific improvement, but in theory, for a
      Poisson access pattern to each block, keeping some state allows better
      estimation of time to next access (Poisson interval) than strict LRU. The
      bounded randomness in CLOCK can also reduce "cliff" effect for repeated
      range scans approaching and exceeding cache size.
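
      A toy sketch of the "one word per slot" idea from the first bullet above (the layout and constants here are invented for illustration; the real meta word carries more state):

      ```
      // Toy layout: acquire/release counters and a visibility bit share one
      // 64-bit word, so a Lookup probe or a Release is a single atomic RMW.
      #include <atomic>
      #include <cstdint>

      constexpr uint64_t kAcquireIncrement = uint64_t{1};        // low counter
      constexpr uint64_t kReleaseIncrement = uint64_t{1} << 21;  // middle counter
      constexpr uint64_t kVisibleBit = uint64_t{1} << 63;        // entry is live

      struct ToySlot {
        std::atomic<uint64_t> meta{0};

        // Optimistic acquire: one atomic update; the caller inspects the returned
        // pre-update word (e.g. against kVisibleBit) to see whether the reference
        // actually "took".
        uint64_t Acquire() {
          return meta.fetch_add(kAcquireIncrement, std::memory_order_acquire);
        }
        // Unconditional release: also a single atomic update.
        uint64_t Release() {
          return meta.fetch_add(kReleaseIncrement, std::memory_order_release);
        }
      };
      ```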
      
      ## Hot path algorithm comparison
      Rough descriptions, focusing on number and kind of atomic operations:
      * Old `Lookup()` (2-5 atomic updates per probe):
      ```
      Loop:
        Increment internal ref count at slot
        If possible hit:
          Check flags atomic (and non-atomic fields)
          If cache hit:
            Three distinct updates to 'flags' atomic
            Increment refs for internal-to-external
            Return
        Decrement internal ref count
      while atomic read 'displacements' > 0
      ```
      * New `Lookup()` (1-2 atomic updates per probe):
      ```
      Loop:
        Increment acquire counter in meta word (optimistic)
        If visible entry (already read meta word):
          If match (read non-atomic fields):
            Return
          Else:
            Decrement acquire counter in meta word
        Else if invisible entry (rare, already read meta word):
          Decrement acquire counter in meta word
      while atomic read 'displacements' > 0
      ```
      * Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
      ```
      Read atomic ref count
      If last reference and invisible (rare):
        Use CAS etc. to remove
        Return
      Else:
        Decrement ref count
      ```
      * New `Release()` (1 unconditional atomic update, rarely more):
      ```
      Increment release counter in meta word
      If last reference and invisible (rare):
        Use CAS etc. to remove
        Return
      ```
      
      ## Performance test setup
      Build DB with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
      ```
      Test with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
      ```
      Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
      
      base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
      folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
      gt_clock: experimental ClockCache before this change
      new_clock: experimental ClockCache with this change
      
      ## Performance test results
      First test "hot path" read performance, with block cache large enough for whole DB:
      4181MB 1thread base -> kops/s: 47.761
      4181MB 1thread folly -> kops/s: 45.877
      4181MB 1thread gt_clock -> kops/s: 51.092
      4181MB 1thread new_clock -> kops/s: 53.944
      
      4181MB 16thread base -> kops/s: 284.567
      4181MB 16thread folly -> kops/s: 249.015
      4181MB 16thread gt_clock -> kops/s: 743.762
      4181MB 16thread new_clock -> kops/s: 861.821
      
      4181MB 24thread base -> kops/s: 303.415
      4181MB 24thread folly -> kops/s: 266.548
      4181MB 24thread gt_clock -> kops/s: 975.706
      4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
      
      4181MB 32thread base -> kops/s: 311.251
      4181MB 32thread folly -> kops/s: 274.952
      4181MB 32thread gt_clock -> kops/s: 1045.98
      4181MB 32thread new_clock -> kops/s: 1370.38
      
      4181MB 48thread base -> kops/s: 310.504
      4181MB 48thread folly -> kops/s: 268.322
      4181MB 48thread gt_clock -> kops/s: 1195.65
      4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
      
      4181MB 64thread base -> kops/s: 307.839
      4181MB 64thread folly -> kops/s: 272.172
      4181MB 64thread gt_clock -> kops/s: 1204.47
      4181MB 64thread new_clock -> kops/s: 1615.37
      
      4181MB 128thread base -> kops/s: 310.934
      4181MB 128thread folly -> kops/s: 267.468
      4181MB 128thread gt_clock -> kops/s: 1188.75
      4181MB 128thread new_clock -> kops/s: 1595.46
      
      Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
      
      Now test a large block cache with low miss ratio, but some eviction is required:
      1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
      1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
      1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
      1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
      
      1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
      1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
      1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
      1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
      
      1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
      1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
      1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
      1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
      
      610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
      610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
      610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
      610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
      
      610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
      610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
      610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
      610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
      
      610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
      610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
      610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
      610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
      
      The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
      
      Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
      
      233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
      233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
      233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
      233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
      
      233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
      233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
      233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
      233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
      
      233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
      233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
      233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
      233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
      
      89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
      89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
      89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
      89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
      
      89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
      89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
      89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
      89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
      
      89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
      89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
      89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
      89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
      
      ^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
      
      Even smaller cache size:
      34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
      34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
      34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
      34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
      
      34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
      34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
      34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
      34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
      
      34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
      34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
      34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
      34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
      
      As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
      
      13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
      13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
      13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
      13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
      
      13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
      13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
      13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
      13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
      
      13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
      13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
      13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
      13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
      
      gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
      
      13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
      13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
      13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
      13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
      
      13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
      13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
      13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
      13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
      
      Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
      
      Reviewed By: anand1976
      
      Differential Revision: D39368406
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
  22. 09 September 2022 (1 commit)
  23. 08 September 2022 (1 commit)
    • Avoid recompressing cold block in CompressedSecondaryCache (#10527) · d490bfcd
      Bo Wang authored
      Summary:
      When a block is first looked up (`Lookup`) from the secondary cache, we just insert a dummy block in the primary cache (charging the actual size of the block) and don't erase the block from the secondary cache. A standalone handle is returned from `Lookup`. Only if the block is hit again do we erase it from the secondary cache and add it into the primary cache.
      
      When a block is first evicted from the primary cache to the secondary cache, we just insert a dummy block (size 0) in the secondary cache. When the block is evicted again, it is treated as a hot block and is inserted into the secondary cache.
      
      **Implementation Details**
      Add a new state of LRUHandle: The handle is never inserted into the LRUCache (both hash table and LRU list) and it doesn't experience the above three states. The entry can be freed when refs becomes 0.  (refs >= 1 && in_cache == false && IS_STANDALONE == true)
      
      The behaviors of `LRUCacheShard::Lookup()` are updated if the secondary_cache is CompressedSecondaryCache (a code sketch follows these lists):
      1. If a handle is found in primary cache:
        1.1. If the handle's value is not nullptr, it is returned immediately.
        1.2. If the handle's value is nullptr, this means the handle is a dummy one. For a dummy handle, if it was retrieved from secondary cache, it may still exist in secondary cache.
          - 1.2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
          - 1.2.2. If the handle from secondary cache is valid, erase it from the secondary cache and add it into the primary cache.
      2. If a handle is not found in primary cache:
        2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
        2.2.  If the handle from secondary cache is valid, insert a dummy block in the primary cache (charging the actual size of the block)  and return a standalone handle.
      
      The behaviors of `LRUCacheShard::Promote()` are updated as follows:
      1. If `e->sec_handle` has value, one of the following steps can happen:
        1.1. Insert a dummy handle and return a standalone handle to caller when `secondary_cache_` is `CompressedSecondaryCache` and e is a standalone handle.
        1.2. Insert the item into the primary cache and return the handle to caller.
        1.3. Exception handling.
      2. If `e->sec_handle` has no value, mark the item as not in cache and charge the cache only for its metadata, which will shortly be released.
      
      The behavior of  `CompressedSecondaryCache::Insert()` is updated:
      1. If a block is evicted from the primary cache for the first time, a dummy item is inserted.
      2. If a dummy item is found for a block, the block is inserted into the secondary cache.
      
      The behavior of  `CompressedSecondaryCache:::Lookup()` is updated:
      1. If a handle is not found or it is a dummy item, a nullptr is returned.
      2. If `erase_handle` is true, the handle is erased.
      
      The behaviors of  `LRUCacheShard::Release()` are adjusted for the standalone handles.
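      The decision flow above can be condensed into a small self-contained toy model. All names below are hypothetical and map-based; this is not the actual RocksDB code, just an illustration of the dummy/standalone handling:
      
      ```cpp
      // Toy model only: "primary" and "secondary" stand in for the two cache
      // tiers; an Entry with dummy == true models the dummy blocks described
      // above. None of these names match the real classes.
      #include <optional>
      #include <string>
      #include <unordered_map>
      
      struct Entry {
        std::string value;   // payload (unused for dummies)
        bool dummy = false;  // true -> placeholder only
      };
      
      std::unordered_map<std::string, Entry> primary, secondary;
      
      std::optional<std::string> Lookup(const std::string& key) {
        auto p = primary.find(key);
        if (p != primary.end()) {
          if (!p->second.dummy) return p->second.value;  // 1.1 real hit in primary
          auto s = secondary.find(key);                   // 1.2 dummy: consult secondary
          if (s == secondary.end() || s->second.dummy) return std::nullopt;  // 1.2.1
          p->second = Entry{s->second.value, false};      // 1.2.2 promote into primary...
          secondary.erase(s);                             // ...and erase from secondary
          return p->second.value;
        }
        auto s = secondary.find(key);                     // 2. not in primary
        if (s == secondary.end() || s->second.dummy) return std::nullopt;    // 2.1
        primary[key] = Entry{"", true};  // 2.2 dummy in primary (real code charges full size)
        return s->second.value;          //     "standalone" result; secondary keeps the block
      }
      
      // Eviction from primary to secondary: a dummy on the first eviction,
      // the real (now considered hot) block on a repeat eviction.
      void EvictToSecondary(const std::string& key, const std::string& value) {
        auto s = secondary.find(key);
        if (s == secondary.end()) {
          secondary[key] = Entry{"", true};
        } else if (s->second.dummy) {
          s->second = Entry{value, false};
        }
      }
      ```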
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10527
      
      Test Plan:
      1. stress tests.
      2. unit tests.
      3. CPU profiling for db_bench.
      
      Reviewed By: siying
      
      Differential Revision: D38747613
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 74a1eba7e1957c9affb2bd2ae3e0194584fa6eca
      d490bfcd
  24. 30 August 2022 (1 commit)
    • L
      Add a dedicated cache entry role for blobs (#10601) · 78185601
      Levi Tamasi authored
      Summary:
      The patch adds a dedicated cache entry role for blob values and switches
      to a registered deleter so that blobs show up as a separate bucket
      (as opposed to "Misc") in the cache occupancy statistics, e.g.
      
      ```
      Block cache entry stats(count,size,portion): DataBlock(133515,531.73 MB,13.6866%) BlobValue(1824855,3.10 GB,81.7071%) Misc(1,0.00 KB,0%)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10601
      
      Test Plan: Ran `make check` and tested the cache occupancy statistics using `db_bench`.
      
      Reviewed By: riversand963
      
      Differential Revision: D39107915
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 8446c3b190a41a144030df73f318eeda4398c125
      78185601
  25. 13 August 2022 (2 commits)
    • G
      Add a blob-specific cache priority (#10461) · 275cd80c
      Gang Liao authored
      Summary:
      RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
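      As a rough configuration sketch of how a single backing cache can serve both blocks and blobs with the new bottom-level priority: the `low_pri_pool_ratio` and `blob_cache` fields below are assumed from RocksDB releases of roughly this period, so treat this as illustrative rather than authoritative.
      
      ```cpp
      // Illustrative sketch, not authoritative: shares one LRUCache between
      // the block cache and the blob cache, reserving pools by priority.
      #include "rocksdb/cache.h"
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"
      
      rocksdb::Options SharedCacheOptions() {
        rocksdb::LRUCacheOptions co;
        co.capacity = 1 << 30;         // 1 GiB shared by blocks and blobs
        co.high_pri_pool_ratio = 0.2;  // index/filter blocks (HIGH priority)
        co.low_pri_pool_ratio = 0.5;   // data blocks (LOW); the remainder is left
                                       // for bottom-priority entries such as blobs
        std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(co);
      
        rocksdb::BlockBasedTableOptions bbto;
        bbto.block_cache = cache;
        bbto.cache_index_and_filter_blocks = true;
      
        rocksdb::Options options;
        options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
        options.enable_blob_files = true;
        options.blob_cache = cache;    // same backing cache as the block cache
        return options;
      }
      ```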
      
      This task is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461
      
      Reviewed By: siying
      
      Differential Revision: D38672823
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
      275cd80c
    • P
      Derive cache keys from SST unique IDs (#10394) · 86a1e3e0
      Peter Dillinger authored
      Summary:
      ... so that cache keys can be derived from DB manifest data
      before reading the file from storage--so that every part of the file
      can potentially go in a persistent cache.
      
      See updated comments in cache_key.cc for technical details. Importantly,
      the new cache key encoding uses some fancy but efficient math to pack
      data into the cache key without depending on the sizes of the various
      pieces. This simplifies some existing code creating cache keys, like
      cache warming before the file size is known.
      
      This should provide us an essentially permanent mapping between SST
      unique IDs and base cache keys, with the ability to "upgrade" SST
      unique IDs (and thus cache keys) with new SST format_versions.
      
      These cache keys are of similar, perhaps indistinguishable quality to
      the previous generation. Before this change (see "corrected" days
      between collision):
      
      ```
      ./cache_bench -stress_cache_key -sck_keep_bits=43
      18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected)
      ```
      
      After this change (keep 43 bits, up through 50, to validate "trajectory"
      is ok on "corrected" days between collision):
      ```
      19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected)
      16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected)
      15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected)
      15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected)
      15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected)
      15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected)
      15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected)
      15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected)
      ```
      
      However, the change does modify (probably weaken) the "guaranteed unique" promise from this
      
      > SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86
      
      to this (see https://github.com/facebook/rocksdb/issues/10388)
      
      > With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits
      
      I don't think this is a practical concern, though.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394
      
      Test Plan: unit tests updated, see simulation results above
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38667529
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba
      86a1e3e0
  26. 11 August 2022 (1 commit)
    • G
      Enable ClockCache in DB block cache test (#10482) · a0798f6f
      Guido Tagliavini Ponce authored
      Summary:
      A test in db_block_cache_test.cc was skipping ClockCache due to the 16-byte key length requirement. We fixed this. Along the way, we fixed a bug in ApplyToSomeEntries, which assumed the function being applied could modify handle metadata, and thus took an exclusive reference. This is incompatible with calls that need to inspect every element (including externally referenced ones) to gather stats.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10482
      
      Test Plan: ``make -j24 check``
      
      Reviewed By: anand1976
      
      Differential Revision: D38553073
      
      Pulled By: guidotag
      
      fbshipit-source-id: 0ed63fed4d3b89e5056b35b7091fce579f5647ae
      a0798f6f
  27. 10 August 2022 (1 commit)
    • G
      Fix the segdefault bug in CompressedSecondaryCache and its tests (#10507) · f060b47e
      gitbw95 authored
      Summary:
      This fix is to replace `AllocateBlock()` with `new`. Once I figure out why `AllocateBlock()` might cause the segfault, I will update the implementation.
      
      Fix the bug that causes ./compressed_secondary_cache_test to output the following test failures:
      
      ```
      Note: Google Test filter = CompressedSecondaryCacheTest.MergeChunksIntoValueTest
      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from CompressedSecondaryCacheTest
      [ RUN      ] CompressedSecondaryCacheTest.MergeChunksIntoValueTest
      [       OK ] CompressedSecondaryCacheTest.MergeChunksIntoValueTest (1 ms)
      [----------] 1 test from CompressedSecondaryCacheTest (1 ms total)
      
      [----------] Global test environment tear-down
      [==========] 1 test from 1 test case ran. (9 ms total)
      [  PASSED  ] 1 test.
      t/run-compressed_secondary_cache_test-CompressedSecondaryCacheTest.MergeChunksIntoValueTest: line 4: 1091086 Segmentation fault      (core dumped) TEST_TMPDIR=$d ./compressed_secondary_cache_test --gtest_filter=CompressedSecondaryCacheTest.MergeChunksIntoValueTest
      Note: Google Test filter = CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression
      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from CompressedSecondaryCacheTest
      [ RUN      ] CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression
      [       OK ] CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression (1 ms)
      [----------] 1 test from CompressedSecondaryCacheTest (1 ms total)
      
      [----------] Global test environment tear-down
      [==========] 1 test from 1 test case ran. (2 ms total)
      [  PASSED  ] 1 test.
      t/run-compressed_secondary_cache_test-CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression: line 4: 1090883 Segmentation fault      (core dumped) TEST_TMPDIR=$d ./compressed_secondary_cache_test --gtest_filter=CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression
      
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10507
      
      Test Plan:
      Test 1:
      ```
      $make -j 24
      $./compressed_secondary_cache_test
      ```
      Test 2:
      ```
      $COMPILE_WITH_ASAN=1  make -j 24
      $./compressed_secondary_cache_test
      ```
      Test 3:
      ```
      $COMPILE_WITH_TSAN=1 make -j 24
      $./compressed_secondary_cache_test
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D38529885
      
      Pulled By: gitbw95
      
      fbshipit-source-id: d903fa3fadbd4d29f9528728c63a4f61c4396890
      f060b47e
  28. 05 August 2022 (1 commit)
  29. 03 August 2022 (1 commit)
    • B
      Split cache to minimize internal fragmentation (#10287) · 87b82f28
      Bo Wang authored
      Summary:
      To minimize the internal fragmentation caused by the variable sizes of compressed blocks, the original block is split according to the jemalloc bin sizes in `Insert()` and then merged back in `Lookup()`. Based on the results of the following tests, this PR does mitigate the overall internal fragmentation.
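      A minimal self-contained sketch of the split/merge idea, with made-up bin sizes and helper names (not the actual CompressedSecondaryCache code):
      
      ```cpp
      // Sketch only: the real code uses the allocator's actual bins and stores
      // chunks as cache values rather than std::string.
      #include <algorithm>
      #include <cstddef>
      #include <string>
      #include <vector>
      
      static const size_t kBins[] = {64, 128, 256, 512, 1024, 2048, 4096};
      
      // Insert() side: cut the compressed block into pieces whose sizes line up
      // with allocator bin sizes, so each allocation wastes little slack space.
      std::vector<std::string> SplitValueIntoChunks(const char* data, size_t size) {
        std::vector<std::string> chunks;
        while (size > 0) {
          size_t pick = kBins[0];
          for (size_t bin : kBins) {
            if (bin <= size) pick = bin;  // largest bin not exceeding the remainder
          }
          size_t n = std::min(pick, size);
          chunks.emplace_back(data, n);
          data += n;
          size -= n;
        }
        return chunks;
      }
      
      // Lookup() side: reassemble the chunks into one contiguous value.
      std::string MergeChunksIntoValue(const std::vector<std::string>& chunks) {
        std::string value;
        for (const auto& c : chunks) value += c;
        return value;
      }
      ```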
      
      _Follow-up myshadow tests with the latest commit: I finished several myshadow A/B tests and the results are promising. For the config of a 4GB primary cache and a 3GB secondary cache, the jemalloc resident stat consistently shows a ~0.15GB memory saving; the allocated and active stats show similar savings. CPU usage is almost the same before and after this PR._
      
      To evaluate the memory fragmentation issue and the benefits of this PR, I conducted two sets of local tests, as follows.
      
      **T1**
      Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
      Values:     100 bytes each (50 bytes after compression)
      Entries:    90000000
      RawSize:    9956.4 MB (estimated)
      FileSize:   5664.8 MB (estimated)
      
      | Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
      | - | - | - |
      | T1_3 | 4000 | 4000 |
      | T1_4 | 2000 | 3000 |
      
      Populate the DB:
      ./db_bench --benchmarks=fillrandom --num=90000000 -db=/mem_fragmentation/db_bench_1
      Overwrite it to a stable state:
      ./db_bench --benchmarks=overwrite --num=90000000 -use_existing_db -db=/mem_fragmentation/db_bench_1
      
      Run read tests with different cache settings:
      T1_3:
      MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_3_20220710 -duration=1800 &
      
      T1_4:
      MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_4_20220710 -duration=1800 &
      
      For T1_3 and T1_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats.
      
      | Test Name | T1_3 | T1_3 after mem defrag | T1_4 | T1_4 after mem defrag |
      | - | - | - | - | - |
      | allocated (MB)  | 8728 | 8076 | 5518 | 5043 |
      | available (MB)  | 8753 | 8092 | 5536 | 5051 |
      | external fragmentation rate  | 0.003 | 0.002 | 0.003 | 0.0016 |
      | resident (MB)  | 8956 | 8365 | 5655 | 5235 |
      
      **T2**
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     256 bytes each (128 bytes after compression)
      Entries:    40000000
      RawSize:    10986.3 MB (estimated)
      FileSize:   6103.5 MB (estimated)
      
      | Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
      | - | - | - |
      | T2_3 | 4000 | 4000 |
      | T2_4 | 2000 | 3000 |
      
      Create DB (10GB):
      ./db_bench -benchmarks=fillrandom -use_direct_reads=true -num=40000000 -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2
      Overwrite it to a stable state:
      ./db_bench --benchmarks=overwrite --num=40000000 -use_existing_db -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2
      
      Run read tests with different cache settings:
      T2_3:
      MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_3 -duration=1800  &
      
      T2_4:
      MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_4 -duration=1800  &
      
      For T2_3 and T2_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats.
      
      | Test Name |  T2_3 | T2_3 after mem defrag | T2_4 | T2_4 after mem defrag |
      | -  | - | - | - | - |
      | allocated (MB)  | 8425 | 8093 | 5426 | 5149 |
      | available (MB)  | 8489 | 8138 | 5435 | 5158 |
      | external fragmentation rate  | 0.008 | 0.0055 | 0.0017 | 0.0017 |
      | resident (MB)  | 8676 | 8392 | 5541 | 5321 |
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10287
      
      Test Plan: Unit tests.
      
      Reviewed By: anand1976
      
      Differential Revision: D37743362
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 0010c5af08addeacc5ebbc4ffe5be882fb1d38ad
      87b82f28
  30. 30 July 2022 (1 commit)
    • A
      Fix cache metrics update when secondary cache is used (#10440) · 54aebb2c
      anand76 authored
      Summary:
      If a secondary cache is configured, it's possible that a cache lookup will get a hit in the secondary cache. In that case, ```LRUCacheShard::Lookup``` doesn't immediately update the ```total_charge``` for the item handle if the ```wait``` parameter is false (i.e., the caller will check completeness later). However, ```BlockBasedTable::GetEntryFromCache``` assumes the handle is complete and calls ```UpdateCacheHitMetrics```, which checks the usage of the cache item and fails the assert in https://github.com/facebook/rocksdb/blob/main/cache/lru_cache.h#L237 (```assert(total_charge >= meta_charge)```).
      
      To fix this, we call ```UpdateCacheHitMetrics``` later in ```MultiGet```, after waiting for all cache lookup completions.
      
      Test plan -
      Run crash test with changes from https://github.com/facebook/rocksdb/issues/10160
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10440
      
      Reviewed By: gitbw95
      
      Differential Revision: D38283968
      
      Pulled By: anand1976
      
      fbshipit-source-id: 31c54ef43517726c6e5fdda81899b364241dd7e1
      54aebb2c
  31. 29 July 2022 (1 commit)
  32. 28 July 2022 (2 commits)
    • G
      Add a blob-specific cache priority (#10309) · 8d178090
      Gang Liao authored
      Summary:
      RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
      
      This task is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309
      
      Reviewed By: ltamasi
      
      Differential Revision: D38211655
      
      Pulled By: gangliao
      
      fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
      8d178090
    • G
      Fix assertion failure and memory leak in ClockCache. (#10430) · d976f689
      Guido Tagliavini Ponce authored
      Summary:
      This fixes two issues:
      - [T127355728](https://www.internalfb.com/intern/tasks/?t=127355728): In the stress tests, when the ClockCache is operating close to full capacity and a burst of inserts are concurrently executed, every slot in the hash table may become occupied. This contradicts an assertion in the code, which is no longer valid in the lock-free setting. We are removing that assertion and handling the case of an insertion into a full table.
      - [T127427659](https://www.internalfb.com/intern/tasks/?t=127427659): There was a memory leak when an insertion was performed over capacity but no handle was provided. In that case, a handle was dynamically allocated, but the pointer wasn't stored anywhere, as sketched below.
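      A toy illustration of the leak pattern and its fix (hypothetical names, not the real ClockCache code):
      
      ```cpp
      // Toy illustration only; none of these names are the real ClockCache code.
      #include <cstddef>
      #include <string>
      
      struct ToyHandle {
        std::string key;
        void* value;
        size_t charge;
      };
      
      // Over-capacity insertion allocates a handle that lives outside the table.
      // Before the fix, when the caller passed out_handle == nullptr, the
      // allocation was simply dropped, leaking the handle.
      void InsertOverCapacity(const std::string& key, void* value, size_t charge,
                              ToyHandle** out_handle) {
        ToyHandle* h = new ToyHandle{key, value, charge};
        if (out_handle != nullptr) {
          *out_handle = h;  // caller owns the handle and must free it
        } else {
          delete h;         // the fix: with no caller to own it, free it here
                            // (the real code would also invoke the entry's deleter)
        }
      }
      ```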
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10430
      
      Test Plan:
      - ``make -j24 check``
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
      
      Reviewed By: pdillinger
      
      Differential Revision: D38226114
      
      Pulled By: guidotag
      
      fbshipit-source-id: 18f6ab7e6214e11e9721d5ff289db1bf795d0008
      d976f689
  33. 27 July 2022 (1 commit)
    • G
      Towards a production-quality ClockCache (#10418) · 9d7de651
      Guido Tagliavini Ponce authored
      Summary:
      In this PR we bring ClockCache closer to production quality. We implement the following changes:
      1. Fixed a few bugs in ClockCache.
      2. ClockCache now fully supports ``strict_capacity_limit == false``: When an insertion over capacity is commanded, we allocate a handle separately from the hash table.
      3. ClockCache now runs on almost every test in cache_test. The only exceptions are a test that requires the LRU policy and a test that dynamically increases the table capacity.
      4. ClockCache now supports dynamically decreasing capacity via SetCapacity. (This is easy: we shrink the capacity upper bound and run the clock algorithm.)
      5. Old FastLRUCache tests in lru_cache_test.cc are now also used on ClockCache.
      
      As a byproduct of 1. and 2. we are able to turn on ClockCache in the stress tests.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10418
      
      Test Plan:
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 check``
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 check``
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
      
      Reviewed By: pdillinger
      
      Differential Revision: D38170673
      
      Pulled By: guidotag
      
      fbshipit-source-id: 508987b9dc9d9d68f1a03eefac769820b680340a
      9d7de651
  34. 26 July 2022 (2 commits)
    • P
      Account for DB ID in stress testing block cache keys (#10388) · 01a2e202
      Peter Dillinger authored
      Summary:
      I recently discovered that block cache keys are slightly lower
      quality than previously thought, because my stress testing tool failed
      to simulate the effect of DB ID differences. This change updates the
      tool and gives us data to guide future developments. (No changes to
      production code here and now.)
      
      Nevertheless, the following promise still holds
      
      ```
      // In fact, if our SST files are all < 4TB (see
      // BlockBasedTable::kMaxFileSizeStandardEncoding), then SST files generated
      // in a single process are guaranteed to have unique cache keys, unless/until
      // number session ids * max file number = 2**86 ...
      ```
      
      because although different DB IDs could cause collision in file number
      and offset data, that would have to be using the same DB session (lower)
      to cause a block cache key collision, which is not possible in the same
      process. (A session is associated with only one DB ID.)
      
      This change fixes cache_bench -stress_cache_key to set and reset DB IDs in
      a parameterized way to evaluate the effect. Previous results assumed to
      be representative (using -sck_keep_bits=43):
      
      ```
      15 collisions after 15 x 90 days, est 90 days between (1.03763e+20 corrected)
      ```
      
      or expected collision on a single machine every 104 billion billion
      days (see "corrected" value).
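      For context, the "corrected" figure appears to extrapolate the observed interval from the reduced key space back to the full ~103 bits of key entropy, i.e. multiplying by 2^(103 - sck_keep_bits); this is inferred from the numbers quoted in this message, not taken from the tool's source:
      
      ```
      observed:  est 90 days between collisions at -sck_keep_bits=43
      corrected: 90 days x 2^(103 - 43) = 90 x 2^60 ≈ 1.04e+20 days  (the 1.03763e+20 above)
      ```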
      
      After accounting for DB IDs, testing cases where the DB ID essentially never changes, changes at an intermediate rate, and changes very frequently (using default -sck_db_count=100):
      frequently changing (using default -sck_db_count=100):
      
      ```
      -sck_newdb_nreopen=1000000000:
      15 collisions after 2 x 90 days, est 12 days between (1.38351e+19 corrected)
      -sck_newdb_nreopen=10000:
      17 collisions after 2 x 90 days, est 10.5882 days between (1.22074e+19 corrected)
      -sck_newdb_nreopen=100:
      19 collisions after 2 x 90 days, est 9.47368 days between (1.09224e+19 corrected)
      ```
      
      or roughly 10x more often than previously thought (still extremely if
      not impossibly rare), and better than random base cache keys
      (with -sck_randomize), though < 10x better than random:
      
      ```
      31 collisions after 1 x 90 days, est 2.90323 days between (3.34719e+18 corrected)
      ```
      
      If we simply fixed this by ignoring DB ID for cache keys, we would
      potentially have a shortage of entropy for some cases, such as small
      file numbers and offsets (e.g. many short-lived processes each using
      SstFileWriter to create a small file), because existing DB session IDs
      only provide ~103 bits of entropy. We could upgrade the entropy in DB
      session IDs to accommodate, but it's not known what all would be
      affected by changing from 20 digit session IDs to something larger.
      
      Instead, my plan is to
      1) Move to block cache keys derived from SST unique IDs (so that we can
      derive block cache keys from manifest data without reading file on
      storage), and show no significant regression in expected collision
      rate.
      2) Generate better SST unique IDs in format_version=6 (https://github.com/facebook/rocksdb/issues/9058),
      which should have ~100x lower expected/predicted collision rate based
      on simulations with this stress test:
      ```
      ./cache_bench -stress_cache_key -sck_keep_bits=39 -sck_newdb_nreopen=100 -sck_footer_unique_id
      ...
      15 collisions after 19 x 90 days, est 114 days between (2.10293e+21 corrected)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10388
      
      Test Plan: no production changes
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37986714
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e759b2469e3365cb01c6661a69e0ab849ef4c3df
      01a2e202
    • G
      Lock-free ClockCache (#10390) · 6a160e1f
      Guido Tagliavini Ponce authored
      Summary:
      ClockCache is now completely free of locks. As part of this PR we have also pushed the clock algorithm functionality out of ClockCacheShard into ClockHandleTable, so that ClockCacheShard acts more as an interface and less as an actual data structure.
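      For intuition, here is a minimal self-contained sketch of a lock-free CLOCK sweep over an open-addressed table (hypothetical names; the real ClockHandleTable is considerably more involved):
      
      ```cpp
      // Sketch of the lock-free CLOCK idea: lookups bump an atomic recency
      // counter, and any thread can advance the shared clock hand to find a
      // victim without taking a mutex.
      #include <atomic>
      #include <cstddef>
      #include <cstdint>
      #include <vector>
      
      struct Slot {
        std::atomic<bool> occupied{false};
        std::atomic<uint32_t> clock_bits{0};  // recency; bumped on lookup
      };
      
      class ClockTable {
       public:
        explicit ClockTable(size_t n) : slots_(n) {}
      
        void Touch(size_t i) {  // called on a cache hit
          slots_[i].clock_bits.store(1, std::memory_order_relaxed);
        }
      
        // Advances the clock hand until it claims a victim slot. Assumes the
        // table holds at least one occupied entry. Safe to call concurrently.
        size_t EvictOne() {
          for (;;) {
            size_t i = hand_.fetch_add(1, std::memory_order_relaxed) % slots_.size();
            Slot& s = slots_[i];
            if (!s.occupied.load(std::memory_order_acquire)) continue;
            uint32_t bits = s.clock_bits.load(std::memory_order_relaxed);
            if (bits > 0) {
              // Recently used: decrement and give it another lap of the clock.
              s.clock_bits.compare_exchange_weak(bits, bits - 1,
                                                 std::memory_order_relaxed);
              continue;
            }
            // Counter already zero: try to claim the slot for eviction.
            bool expected = true;
            if (s.occupied.compare_exchange_strong(expected, false,
                                                   std::memory_order_acq_rel)) {
              return i;  // caller frees whatever entry lived in slot i
            }
          }
        }
      
       private:
        std::vector<Slot> slots_;
        std::atomic<size_t> hand_{0};
      };
      ```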
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10390
      
      Test Plan:
      - ``make -j24 check``
      - ``make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache --cache_size=1073741824 --block_size=16384" blackbox_crash_test_with_atomic_flush``
      
      Reviewed By: pdillinger
      
      Differential Revision: D38106945
      
      Pulled By: guidotag
      
      fbshipit-source-id: 6cbf6bd2397dc9f582809ccff5118a8a33ea6cb1
      6a160e1f
  35. 19 July 2022 (1 commit)