1. 10 Aug 2022 (1 commit)
    • Fix the segfault bug in CompressedSecondaryCache and its tests (#10507) · f060b47e
      Committed by gitbw95
      Summary:
      This fix is to replace `AllocateBlock()` with `new`. Once I figure out why `AllocateBlock()` might cause the segfault, I will update the implementation.
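
      A minimal sketch of the substitution, with illustrative names (not the literal patch):

      ```
      // Before (could segfault in some test configurations):
      //   CacheAllocationPtr ptr = AllocateBlock(size, memory_allocator);
      // After: plain heap allocation as a temporary workaround.
      char* ptr = new char[size];
      memcpy(ptr, compressed_data, size);
      // Ownership passes to the cache entry and is released via delete[].
      ```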
      
      Fix the bug that causes ./compressed_secondary_cache_test to output the following test failures:
      
      ```
      Note: Google Test filter = CompressedSecondaryCacheTest.MergeChunksIntoValueTest
      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from CompressedSecondaryCacheTest
      [ RUN      ] CompressedSecondaryCacheTest.MergeChunksIntoValueTest
      [       OK ] CompressedSecondaryCacheTest.MergeChunksIntoValueTest (1 ms)
      [----------] 1 test from CompressedSecondaryCacheTest (1 ms total)
      
      [----------] Global test environment tear-down
      [==========] 1 test from 1 test case ran. (9 ms total)
      [  PASSED  ] 1 test.
      t/run-compressed_secondary_cache_test-CompressedSecondaryCacheTest.MergeChunksIntoValueTest: line 4: 1091086 Segmentation fault      (core dumped) TEST_TMPDIR=$d ./compressed_secondary_cache_test --gtest_filter=CompressedSecondaryCacheTest.MergeChunksIntoValueTest
      Note: Google Test filter = CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression
      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from CompressedSecondaryCacheTest
      [ RUN      ] CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression
      [       OK ] CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression (1 ms)
      [----------] 1 test from CompressedSecondaryCacheTest (1 ms total)
      
      [----------] Global test environment tear-down
      [==========] 1 test from 1 test case ran. (2 ms total)
      [  PASSED  ] 1 test.
      t/run-compressed_secondary_cache_test-CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression: line 4: 1090883 Segmentation fault      (core dumped) TEST_TMPDIR=$d ./compressed_secondary_cache_test --gtest_filter=CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression
      
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10507
      
      Test Plan:
      Test 1:
      ```
      $make -j 24
      $./compressed_secondary_cache_test
      ```
      Test 2:
      ```
      $COMPILE_WITH_ASAN=1  make -j 24
      $./compressed_secondary_cache_test
      ```
      Test 3:
      ```
      $COMPILE_WITH_TSAN=1 make -j 24
      $./compressed_secondary_cache_test
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D38529885
      
      Pulled By: gitbw95
      
      fbshipit-source-id: d903fa3fadbd4d29f9528728c63a4f61c4396890
  2. 05 Aug 2022 (1 commit)
  3. 03 Aug 2022 (1 commit)
    • Split cache to minimize internal fragmentation (#10287) · 87b82f28
      Committed by Bo Wang
      Summary:
      To minimize the internal fragmentation caused by the variable sizes of compressed blocks, the original block is split according to the jemalloc bin sizes in `Insert()` and then merged back in `Lookup()`. Based on the results of the following tests, this PR does mitigate the internal fragmentation issue from an overall perspective.
      
      _I ran additional myshadow A/B tests with the latest commit, and the results are promising. For the config of a 4GB primary cache and a 3GB secondary cache, the jemalloc resident stat consistently shows ~0.15GB of memory savings; the allocated and active stats show similar savings. CPU usage is almost the same before and after this PR._
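
      A hedged sketch of the splitting idea, with illustrative names (the real code threads the chunks through the cache's own structures):

      ```
      #include <algorithm>
      #include <cstddef>
      #include <string>
      #include <vector>

      // Split a compressed block into chunks whose sizes match jemalloc bin
      // sizes (assumed sorted ascending), so each allocation fills its bin.
      std::vector<std::string> SplitIntoBinSizedChunks(
          const char* data, size_t size, const std::vector<size_t>& bin_sizes) {
        std::vector<std::string> chunks;
        size_t pos = 0;
        while (pos < size) {
          size_t remaining = size - pos;
          size_t chunk = bin_sizes.front();
          for (size_t b : bin_sizes) {
            if (b <= remaining) chunk = b;  // largest bin that still fits
          }
          chunk = std::min(chunk, remaining);  // tail may undershoot every bin
          chunks.emplace_back(data + pos, chunk);
          pos += chunk;
        }
        return chunks;
      }
      // Lookup() concatenates the chunks back into a single value.
      ```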
      
      To evaluate the issue of memory fragmentations and the benefits of this PR, I conducted two sets of local tests as follows.
      
      **T1**
      Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
      Values:     100 bytes each (50 bytes after compression)
      Entries:    90000000
      RawSize:    9956.4 MB (estimated)
      FileSize:   5664.8 MB (estimated)
      
      | Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
      | - | - | - |
      | T1_3 | 4000 | 4000 |
      | T1_4 | 2000 | 3000 |
      
      Populate the DB:
      ./db_bench --benchmarks=fillrandom --num=90000000 -db=/mem_fragmentation/db_bench_1
      Overwrite it to a stable state:
      ./db_bench --benchmarks=overwrite --num=90000000 -use_existing_db -db=/mem_fragmentation/db_bench_1
      
      Run read tests with different cache settings:
      T1_3:
      MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_3_20220710 -duration=1800 &
      
      T1_4:
      MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_4_20220710 -duration=1800 &
      
      For T1_3 and T1_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats.
      
      | Test Name | T1_3 | T1_3 after mem defrag | T1_4 | T1_4 after mem defrag |
      | - | - | - | - | - |
      | allocated (MB)  | 8728 | 8076 | 5518 | 5043 |
      | available (MB)  | 8753 | 8092 | 5536 | 5051 |
      | external fragmentation rate  | 0.003 | 0.002 | 0.003 | 0.0016 |
      | resident (MB)  | 8956 | 8365 | 5655 | 5235 |
      
      **T2**
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     256 bytes each (128 bytes after compression)
      Entries:    40000000
      RawSize:    10986.3 MB (estimated)
      FileSize:   6103.5 MB (estimated)
      
      | Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
      | - | - | - |
      | T2_3 | 4000 | 4000 |
      | T2_4 | 2000 | 3000 |
      
      Create DB (10GB):
      ./db_bench -benchmarks=fillrandom -use_direct_reads=true -num=40000000 -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2
      Overwrite it to a stable state:
      ./db_bench --benchmarks=overwrite --num=40000000 -use_existing_db -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2
      
      Run read tests with different cache settings:
      T2_3:
      MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_3 -duration=1800  &
      
      T2_4:
      MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_4 -duration=1800  &
      
      For T2_3 and T2_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats.
      
      | Test Name |  T2_3 | T2_3 after mem defrag | T2_4 | T2_4 after mem defrag |
      | -  | - | - | - | - |
      | allocated (MB)  | 8425 | 8093 | 5426 | 5149 |
      | available (MB)  | 8489 | 8138 | 5435 | 5158 |
      | external fragmentation rate  | 0.008 | 0.0055 | 0.0017 | 0.0017 |
      | resident (MB)  | 8676 | 8392 | 5541 | 5321 |
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10287
      
      Test Plan: Unit tests.
      
      Reviewed By: anand1976
      
      Differential Revision: D37743362
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 0010c5af08addeacc5ebbc4ffe5be882fb1d38ad
  4. 30 Jul 2022 (1 commit)
    • Fix cache metrics update when secondary cache is used (#10440) · 54aebb2c
      Committed by anand76
      Summary:
      If a secondary cache is configured, it's possible that a cache lookup will get a hit in the secondary cache. In that case, ```LRUCacheShard::Lookup``` doesn't immediately update the ```total_charge``` for the item handle if the ```wait``` parameter is false (i.e., the caller will check for completeness later). However, ```BlockBasedTable::GetEntryFromCache``` assumes the handle is complete and calls ```UpdateCacheHitMetrics```, which checks the usage of the cache item and fails the assert in https://github.com/facebook/rocksdb/blob/main/cache/lru_cache.h#L237 (```assert(total_charge >= meta_charge)```).
      
      To fix this, we call ```UpdateCacheHitMetrics``` later in ```MultiGet```, after waiting for all cache lookup completions.
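
      A hedged sketch of the reordered flow (structure illustrative, not the literal code):

      ```
      // In MultiGet: issue lookups without waiting, collect pending handles.
      for (auto& ctx : lookup_contexts) {
        ctx.handle = LookupInCache(ctx.key, /*wait=*/false);
      }
      // Wait for all secondary-cache lookups; total_charge is now final.
      secondary_cache->WaitAll(pending_results);
      // Only now is it safe to read usage from the handles.
      for (auto& ctx : lookup_contexts) {
        if (ctx.handle != nullptr) {
          UpdateCacheHitMetrics(ctx.block_type, ctx.get_context);
        }
      }
      ```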
      
      Test Plan:
      Run crash test with changes from https://github.com/facebook/rocksdb/issues/10160
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10440
      
      Reviewed By: gitbw95
      
      Differential Revision: D38283968
      
      Pulled By: anand1976
      
      fbshipit-source-id: 31c54ef43517726c6e5fdda81899b364241dd7e1
  5. 29 Jul 2022 (1 commit)
  6. 28 Jul 2022 (2 commits)
    • Add a blob-specific cache priority (#10309) · 8d178090
      Committed by Gang Liao
      Summary:
      RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
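
      For reference, a sketch of the resulting three-level priority; the actual enum is `Cache::Priority`, and `BOTTOM` is the level this PR adds:

      ```
      enum class Priority { HIGH, LOW, BOTTOM };

      // Illustrative blob-cache insertion: blobs sit below data blocks (LOW).
      cache->Insert(key, blob_value, charge, &DeleteBlob, &handle,
                    Cache::Priority::BOTTOM);
      ```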
      
      This task is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309
      
      Reviewed By: ltamasi
      
      Differential Revision: D38211655
      
      Pulled By: gangliao
      
      fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
    • Fix assertion failure and memory leak in ClockCache (#10430) · d976f689
      Committed by Guido Tagliavini Ponce
      Summary:
      This fixes two issues:
      - [T127355728](https://www.internalfb.com/intern/tasks/?t=127355728): In the stress tests, when the ClockCache operates close to full capacity and a burst of inserts is executed concurrently, every slot in the hash table may become occupied. This contradicts an assertion in the code, which is no longer valid in the lock-free setting. We remove that assertion and handle the case of an insertion into a full table.
      - [T127427659](https://www.internalfb.com/intern/tasks/?t=127427659): There was a memory leak when an insertion was performed over capacity but no handle was provided: a handle was dynamically allocated, but the pointer wasn't stored anywhere.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10430
      
      Test Plan:
      - ``make -j24 check``
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
      
      Reviewed By: pdillinger
      
      Differential Revision: D38226114
      
      Pulled By: guidotag
      
      fbshipit-source-id: 18f6ab7e6214e11e9721d5ff289db1bf795d0008
  7. 27 Jul 2022 (1 commit)
    • Towards a production-quality ClockCache (#10418) · 9d7de651
      Committed by Guido Tagliavini Ponce
      Summary:
      In this PR we bring ClockCache closer to production quality. We implement the following changes:
      1. Fixed a few bugs in ClockCache.
      2. ClockCache now fully supports ``strict_capacity_limit == false``: When an insertion over capacity is commanded, we allocate a handle separately from the hash table.
      3. ClockCache now runs on almost every test in cache_test. The only exceptions are a test that requires the LRU policy and a test that dynamically increases the table capacity.
      4. ClockCache now supports dynamically decreasing capacity via SetCapacity. (This is easy: we shrink the capacity upper bound and run the clock algorithm.)
      5. Old FastLRUCache tests in lru_cache_test.cc are now also used on ClockCache.
      
      As a byproduct of 1. and 2. we are able to turn on ClockCache in the stress tests.
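
      A hedged sketch of the over-capacity path from item 2 above, with illustrative names:

      ```
      // When the table is full and strict_capacity_limit == false, hand the
      // caller a standalone handle that never occupies a hash-table slot; it
      // is freed on Release() rather than by clock eviction.
      ClockHandle* h = table_.Insert(key, value, charge);
      if (h == nullptr && !strict_capacity_limit_ && out_handle != nullptr) {
        h = new ClockHandle(key, value, charge);  // detached from the table
      }
      ```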
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10418
      
      Test Plan:
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 check``
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 check``
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
      
      Reviewed By: pdillinger
      
      Differential Revision: D38170673
      
      Pulled By: guidotag
      
      fbshipit-source-id: 508987b9dc9d9d68f1a03eefac769820b680340a
  8. 26 Jul 2022 (2 commits)
    • Account for DB ID in stress testing block cache keys (#10388) · 01a2e202
      Committed by Peter Dillinger
      Summary:
      I recently discovered that block cache keys are slightly lower
      quality than previously thought, because my stress testing tool failed
      to simulate the effect of DB ID differences. This change updates the
      tool and gives us data to guide future developments. (No changes to
      production code here and now.)
      
      Nevertheless, the following promise still holds
      
      ```
      // In fact, if our SST files are all < 4TB (see
      // BlockBasedTable::kMaxFileSizeStandardEncoding), then SST files generated
      // in a single process are guaranteed to have unique cache keys, unless/until
      // number session ids * max file number = 2**86 ...
      ```
      
      because although different DB IDs could cause collision in file number
      and offset data, that would have to be using the same DB session (lower)
      to cause a block cache key collision, which is not possible in the same
      process. (A session is associated with only one DB ID.)
      
      This change fixes cache_bench -stress_cache_key to set and reset DB IDs in
      a parameterized way to evaluate the effect. Previous results assumed to
      be representative (using -sck_keep_bits=43):
      
      ```
      15 collisions after 15 x 90 days, est 90 days between (1.03763e+20 corrected)
      ```
      
      or expected collision on a single machine every 104 billion billion
      days (see "corrected" value).
      
      After accounting for DB IDs that never really change, change at an intermediate rate, and change very
      frequently (using the default -sck_db_count=100):
      
      ```
      -sck_newdb_nreopen=1000000000:
      15 collisions after 2 x 90 days, est 12 days between (1.38351e+19 corrected)
      -sck_newdb_nreopen=10000:
      17 collisions after 2 x 90 days, est 10.5882 days between (1.22074e+19 corrected)
      -sck_newdb_nreopen=100:
      19 collisions after 2 x 90 days, est 9.47368 days between (1.09224e+19 corrected)
      ```
      
      or roughly 10x more often than previously thought (still extremely rare,
      if not practically impossible), and better than random base cache keys
      (with -sck_randomize), though < 10x better than random:
      
      ```
      31 collisions after 1 x 90 days, est 2.90323 days between (3.34719e+18 corrected)
      ```
      
      If we simply fixed this by ignoring DB ID for cache keys, we would
      potentially have a shortage of entropy for some cases, such as small
      file numbers and offsets (e.g. many short-lived processes each using
      SstFileWriter to create a small file), because existing DB session IDs
      only provide ~103 bits of entropy. We could upgrade the entropy in DB
      session IDs to accommodate, but it's not known what all would be
      affected by changing from 20 digit session IDs to something larger.
      
      Instead, my plan is to
      1) Move to block cache keys derived from SST unique IDs (so that we can
      derive block cache keys from manifest data without reading file on
      storage), and show no significant regression in expected collision
      rate.
      2) Generate better SST unique IDs in format_version=6 (https://github.com/facebook/rocksdb/issues/9058),
      which should have ~100x lower expected/predicted collision rate based
      on simulations with this stress test:
      ```
      ./cache_bench -stress_cache_key -sck_keep_bits=39 -sck_newdb_nreopen=100 -sck_footer_unique_id
      ...
      15 collisions after 19 x 90 days, est 114 days between (2.10293e+21 corrected)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10388
      
      Test Plan: no production changes
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37986714
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e759b2469e3365cb01c6661a69e0ab849ef4c3df
    • Lock-free ClockCache (#10390) · 6a160e1f
      Committed by Guido Tagliavini Ponce
      Summary:
      This PR makes ClockCache completely free of locks. As part of this change, we have also pushed the clock algorithm functionality out of ClockCacheShard into ClockHandleTable, so that ClockCacheShard acts more as an interface and less as an actual data structure.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10390
      
      Test Plan:
      - ``make -j24 check``
      - ``make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache --cache_size=1073741824 --block_size=16384" blackbox_crash_test_with_atomic_flush``
      
      Reviewed By: pdillinger
      
      Differential Revision: D38106945
      
      Pulled By: guidotag
      
      fbshipit-source-id: 6cbf6bd2397dc9f582809ccff5118a8a33ea6cb1
  9. 19 Jul 2022 (1 commit)
  10. 16 Jul 2022 (3 commits)
    • Support using secondary cache with the blob cache (#10349) · 95ef007a
      Committed by Gang Liao
      Summary:
      RocksDB supports a two-level cache hierarchy (see https://rocksdb.org/blog/2021/05/27/rocksdb-secondary-cache.html), where items evicted from the primary cache can be spilled over to the secondary cache, or items from the secondary cache can be promoted to the primary one. We have a CacheLib-based non-volatile secondary cache implementation that can be used to improve read latencies and reduce the amount of network bandwidth when using distributed file systems. In addition, we have recently implemented a compressed secondary cache that can be used as a replacement for the OS page cache when e.g. direct I/O is used. The goals of this task are to add support for using a secondary cache with the blob cache and to measure the potential performance gains using `db_bench`.
      
      This task is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10349
      
      Reviewed By: ltamasi
      
      Differential Revision: D37896773
      
      Pulled By: gangliao
      
      fbshipit-source-id: 7804619ce4a44b73d9e11ad606640f9385969c84
    • Lock-free Lookup and Release in ClockCache (#10347) · efdb428e
      Committed by Guido Tagliavini Ponce
      Summary:
      This is a prototype of a partially lock-free version of ClockCache. Roughly speaking, reads are lock-free and writes are lock-based:
      - Lookup is lock-free.
      - Release is lock-free, unless (i) no references to the element are left and (ii) it was marked for deletion or ``erase_if_last_ref`` is set.
      - Insert and Erase still use a per-shard lock.
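
      A hedged sketch of the lock-free Release fast path described above (field names illustrative):

      ```
      // Drop one external reference atomically. Only the thread that drops
      // the last reference of an entry marked for deletion (or when
      // erase_if_last_ref is set) takes the rare, locked slow path.
      uint32_t old_refs = h->refs.fetch_sub(1, std::memory_order_acq_rel);
      if (old_refs == 1 && (h->IsMarkedForDeletion() || erase_if_last_ref)) {
        std::lock_guard<std::mutex> guard(shard_mutex_);
        EraseIfStillUnreferenced(h);
      }
      ```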
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10347
      
      Test Plan:
      - ``make -j24 check``
      - ``make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache --cache_size=1073741824 --block_size=16384" blackbox_crash_test_with_atomic_flush``
      
      Reviewed By: pdillinger
      
      Differential Revision: D37898776
      
      Pulled By: guidotag
      
      fbshipit-source-id: 6418fd980f786d69b871bf2fe959398e44cd3d80
    • Add lean option to cache_bench (#10363) · a543773b
      Committed by Guido Tagliavini Ponce
      Summary:
      Sometimes we may not want to include extra computation in our cache_bench experiments, so we add a flag to skip any extra work. We also moved the timer start to after key generation.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10363
      
      Test Plan: Run cache_bench with and without the new flag and check that the appropriate code is being executed.
      
      Reviewed By: pdillinger
      
      Differential Revision: D37870416
      
      Pulled By: guidotag
      
      fbshipit-source-id: f853207b6643b9328e774251c3f679b1fd78a11a
  11. 14 Jul 2022 (2 commits)
  12. 13 Jul 2022 (1 commit)
    • Temporarily return an LRUCache from NewClockCache (#10351) · 9645e66f
      Committed by Guido Tagliavini Ponce
      Summary:
      ClockCache is still in experimental stage, and currently fails some pre-release fbcode tests. See https://www.internalfb.com/diff/D37772011. API calls to construct ClockCache are done via the function NewClockCache. For now, NewClockCache calls will return an LRUCache (with appropriate arguments), which is stable.
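
      A hedged sketch of the temporary behavior (parameter list abbreviated; not the literal code):

      ```
      std::shared_ptr<Cache> NewClockCache(size_t capacity, int num_shard_bits,
                                           bool strict_capacity_limit,
                                           CacheMetadataChargePolicy policy) {
        // ClockCache is experimental, so return the stable LRUCache with
        // equivalent parameters for now.
        return NewLRUCache(capacity, num_shard_bits, strict_capacity_limit,
                           /*high_pri_pool_ratio=*/0.5,
                           /*memory_allocator=*/nullptr,
                           kDefaultToAdaptiveMutex, policy);
      }
      ```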
      
      The idea that NewClockCache returns nullptr was also floated, but this would be interpreted as unsupported cache, and a default LRUCache would be constructed instead, potentially causing a performance regression that is harder to identify.
      
      A new version of the NewClockCache function was created for our internal tests.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10351
      
      Test Plan: ``make -j24 check`` and re-run the pre-release tests.
      
      Reviewed By: pdillinger
      
      Differential Revision: D37802685
      
      Pulled By: guidotag
      
      fbshipit-source-id: 0a8d10612ff21e576f7360cb13e20bc36e244972
  13. 07 Jul 2022 (3 commits)
    • Eliminate the copying of blobs when serving reads from the cache (#10297) · c987eb47
      Committed by Gang Liao
      Summary:
      The blob cache enables an optimization on the read path: when a blob is found in the cache, we can avoid copying it into the buffer provided by the application. Instead, we can simply transfer ownership of the cache handle to the target `PinnableSlice`. (Note: this relies on the `Cleanable` interface, which is implemented by `PinnableSlice`.)
      
      This has the potential to save a lot of CPU, especially with large blob values.
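
      A hedged sketch of the handle transfer; `PinSlice` comes from the `Cleanable`/`PinnableSlice` interface, while the callback shown is illustrative:

      ```
      // Cleanup callback invoked when the PinnableSlice is reset or destroyed.
      static void ReleaseCacheHandle(void* cache, void* handle) {
        static_cast<Cache*>(cache)->Release(
            static_cast<Cache::Handle*>(handle));
      }

      // On a blob cache hit: point the slice at the cached bytes and transfer
      // ownership of the cache handle instead of copying the blob.
      value->PinSlice(cached_blob, &ReleaseCacheHandle, blob_cache, handle);
      ```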
      
      This task is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10297
      
      Reviewed By: riversand963
      
      Differential Revision: D37640311
      
      Pulled By: gangliao
      
      fbshipit-source-id: 92de0e35cc703d06c87c5c1861cc2899ec52234a
    • Midpoint insertions in ClockCache (#10305) · c277aeb4
      Committed by Guido Tagliavini Ponce
      Summary:
      When an element is first inserted into the ClockCache, it is now assigned either medium or high clock priority, depending on whether its cache priority is low or high, respectively. This is a variant of LRUCache's midpoint insertions. The main difference is that LRUCache can specify the allocated capacity for high-priority elements via the ``high_pri_pool_ratio`` parameter. By contrast, in ClockCache, low- and high-priority elements compete for all cache slots, and one group can take over the other (of course, it takes more low-priority insertions to push out high-priority elements). However, just like LRUCache, ClockCache provides the following guarantee: a high-priority element will not be evicted before a low-priority element that was inserted earlier in time.
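
      A hedged sketch of the insertion-time mapping (the clock priority names are illustrative):

      ```
      // New entries never start at the lowest clock priority; low-cache-
      // priority entries start in the middle, mirroring LRUCache's midpoint
      // insertion.
      ClockPriority p = (priority == Cache::Priority::HIGH)
                            ? ClockPriority::HIGH
                            : ClockPriority::MEDIUM;
      h->SetClockPriority(p);
      ```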
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10305
      
      Test Plan: ``make -j24 check``
      
      Reviewed By: pdillinger
      
      Differential Revision: D37607787
      
      Pulled By: guidotag
      
      fbshipit-source-id: 24d9f2523d2f4e6415e7f0029cc061fa275c2040
    • Have Cache use Status::MemoryLimit (#10262) · e6c5e0ab
      Committed by Peter Dillinger
      Summary:
      I noticed it would clean up some things to have Cache::Insert()
      return our MemoryLimit Status instead of Incomplete for the case in
      which the capacity limit is reached. I suspect this fixes some existing but
      unknown bugs where this Incomplete could be confused with other uses
      of Incomplete, especially no_io cases. This is the most suspicious case I
      noticed, but was not able to reproduce a bug, in part because the existing
      code is not covered by unit tests (FIXME added): https://github.com/facebook/rocksdb/blob/57adbf0e9187331cb39bf5cdb5f5d67faeee5f63/table/get_context.cc#L397
      
      I audited all the existing uses of IsIncomplete and updated those that
      seemed relevant.
      
      HISTORY updated with a clear warning for users of strict_capacity_limit=true
      to update their uses of `IsIncomplete()`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10262
      
      Test Plan: updated unit tests
      
      Reviewed By: hx235
      
      Differential Revision: D37473155
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 4bd9d9353ccddfe286b03ebd0652df8ce20f99cb
  14. 02 Jul 2022 (1 commit)
    • Fix CalcHashBits (#10295) · 54f678cd
      Committed by Guido Tagliavini Ponce
      Summary:
      We fix two bugs in CalcHashBits. The first one is an off-by-one error: the desired number of table slots is the real number ``capacity / (kLoadFactor * handle_charge)``, which should not be rounded down. The second one is that we should disallow inputs that set the element charge to 0, namely ``estimated_value_size == 0 && metadata_charge_policy == kDontChargeCacheMetadata``.
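
      A hedged sketch of the corrected computation (names illustrative):

      ```
      #include <cassert>
      #include <cmath>

      // The desired slot count must be rounded up, not down, or the table can
      // come up one slot short of the target load factor.
      assert(handle_charge > 0);  // a zero element charge is now disallowed
      size_t num_slots = static_cast<size_t>(
          std::ceil(capacity / (kLoadFactor * handle_charge)));
      int hash_bits = 0;
      while ((size_t{1} << hash_bits) < num_slots) {
        ++hash_bits;  // smallest power of two covering num_slots
      }
      ```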
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10295
      
      Test Plan: CalcHashBits is tested by CalcHashBitsTest (in lru_cache_test.cc). The test now iterates over many more inputs; it covers, in particular, the rounding error edge case. Overall, the test is now more robust. Run ``make -j24 check``.
      
      Reviewed By: pdillinger
      
      Differential Revision: D37573797
      
      Pulled By: guidotag
      
      fbshipit-source-id: ea4f4439f7196ab1c1afb88f566fe92850537262
  15. 30 Jun 2022 (1 commit)
    • Clock cache (#10273) · 57a0e2f3
      Committed by Guido Tagliavini Ponce
      Summary:
      This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free: all operations use a shard lock. Besides the locking, there are other features left as future work:
      - Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
      - Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
      - Middle insertions into the clock list.
      - A test that exercises the clock eviction policy.
      - Update the Java API of ClockCache and Java calls to C++.
      
      Along the way, we improved the code and comment quality of FastLRUCache. These changes are relatively minor.
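
      A hedged sketch of the clock eviction algorithm introduced here, assuming a circular array of handles guarded by the shard mutex (field names illustrative):

      ```
      // Sweep the circular array: entries with the usage bit set get a second
      // chance; the first unpinned entry without it becomes the victim.
      ClockHandle* ClockHandleTable::EvictOne() {
        while (true) {
          ClockHandle* h = &array_[clock_pointer_];
          clock_pointer_ = (clock_pointer_ + 1) % array_.size();
          if (!h->IsElement() || h->HasExternalRefs()) {
            continue;  // skip empty slots and pinned entries
          }
          if (h->clock_usage) {
            h->clock_usage = false;  // recently used: second chance
          } else {
            return h;  // victim; the caller removes it from the table
          }
        }
      }
      ```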
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273
      
      Test Plan: ``make -j24 check``
      
      Reviewed By: pdillinger
      
      Differential Revision: D37522461
      
      Pulled By: guidotag
      
      fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
  16. 28 Jun 2022 (1 commit)
  17. 24 Jun 2022 (1 commit)
    • Fix key size in cache_bench (#10234) · b52620ab
      Committed by Guido Tagliavini Ponce
      Summary:
      cache_bench wasn't generating 16B keys, which are necessary for FastLRUCache. Also:
      - Added asserts in cache_bench, which assumes that inserts never fail. When they fail (for example, if we used keys of the wrong size), the memory allocated for the values is leaked, and eventually the program crashes.
      - Moved kCacheKeySize to the right spot.
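
      A hedged sketch of the fixed-size key generation and the new insert check (`EncodeFixed64` is RocksDB's helper; the rest is illustrative):

      ```
      // FastLRUCache requires keys of exactly kCacheKeySize (16) bytes.
      std::string key(16, '\0');
      EncodeFixed64(&key[0], key_number);       // low 8 bytes
      EncodeFixed64(&key[8], 0 /* padding */);  // high 8 bytes

      Status s = cache->Insert(key, value, charge, deleter);
      assert(s.ok());  // a failed insert would leak the value allocation
      ```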
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10234
      
      Test Plan:
      ``make -j24 check``. Also, run cache_bench with FastLRUCache and check that memory usage doesn't blow up:
      ``./cache_bench -cache_type=fast_lru_cache -num_shard_bits=6 -skewed=true \
                              -lookup_insert_percent=100 -lookup_percent=0 -insert_percent=0 -erase_percent=0 \
                              -populate_cache=true -cache_size=1073741824 -ops_per_thread=10000000 \
                              -value_bytes=8192 -resident_ratio=1 -threads=16``
      
      Reviewed By: pdillinger
      
      Differential Revision: D37382949
      
      Pulled By: guidotag
      
      fbshipit-source-id: b697a942ebb215de5d341f98dc8566763436ba9b
  18. 21 Jun 2022 (1 commit)
    • Replace per-shard chained hash tables with open-addressing scheme (#10194) · 3afed740
      Committed by Guido Tagliavini Ponce
      Summary:
      In FastLRUCache, we replace the current chained per-shard hash table by an open-addressing hash table. In particular, this allows us to preallocate all handles.
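
      A hedged sketch of a double-hashing probe over the preallocated handle array (names illustrative):

      ```
      // All handles live in one preallocated array; probing replaces chaining.
      LRUHandle* Lookup(const Slice& key, uint32_t h1, uint32_t h2) {
        size_t slot = h1 & (table_size_ - 1);        // table_size_: power of 2
        size_t step = (h2 | 1) & (table_size_ - 1);  // odd step => full cycle
        for (size_t i = 0; i < table_size_; ++i) {
          if (array_[slot].IsEmpty()) return nullptr;  // probe chain ends
          if (array_[slot].Matches(key)) return &array_[slot];
          slot = (slot + step) & (table_size_ - 1);
        }
        return nullptr;  // table fully probed
      }
      ```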
      
      Because all handles are preallocated, this implementation doesn't support strict_capacity_limit = false (i.e., allowing insertions beyond the predefined capacity). This clashes with current assumptions of some tests, namely two tests in cache_test and the crash tests. We have disabled these for now.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10194
      
      Test Plan: ``make -j24 check``
      
      Reviewed By: pdillinger
      
      Differential Revision: D37296770
      
      Pulled By: guidotag
      
      fbshipit-source-id: 232ff1b8260331d868ebf4e3e5d8ad709390b0ad
  19. 18 Jun 2022 (1 commit)
    • Use optimized folly DistributedMutex in LRUCache when available (#10179) · 1aac8145
      Committed by Peter Dillinger
      Summary:
      folly DistributedMutex is faster than standard mutexes, though it
      imposes some static obligations on usage. See
      https://github.com/facebook/folly/blob/main/folly/synchronization/DistributedMutex.h
      for details. Here we use this alternative for our Cache implementations
      (especially LRUCache) for better locking performance, when RocksDB is
      compiled with folly.
      
      Also added information about which distributed mutex implementation is
      being used to cache_bench output and to DB LOG.
      
      Intended follow-up:
      * Use DMutex in more places, perhaps improving API to support non-scoped
      locking
      * Fix linking with fbcode compiler (needs ROCKSDB_NO_FBCODE=1 currently)
      
      Credit: Thanks Siying for reminding me about this line of work that was previously
      left unfinished.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10179
      
      Test Plan:
      for correctness, existing tests. CircleCI config updated.
      Also Meta-internal buck build updated.
      
      For performance, ran simultaneous before & after cache_bench. Out of three
      comparison runs, the middle improvement to ops/sec was +21%:
      
      Baseline: USE_CLANG=1 DEBUG_LEVEL=0 make -j24 cache_bench (fbcode
      compiler)
      
      ```
      Complete in 20.201 s; Rough parallel ops/sec = 1584062
      Thread ops/sec = 107176
      
      Operation latency (ns):
      Count: 32000000 Average: 9257.9421  StdDev: 122412.04
      Min: 134  Median: 3623.0493  Max: 56918500
      Percentiles: P50: 3623.05 P75: 10288.02 P99: 30219.35 P99.9: 683522.04 P99.99: 7302791.63
      ```
      
      New: (add USE_FOLLY=1)
      
      ```
      Complete in 16.674 s; Rough parallel ops/sec = 1919135  (+21%)
      Thread ops/sec = 135487
      
      Operation latency (ns):
      Count: 32000000 Average: 7304.9294  StdDev: 108530.28
      Min: 132  Median: 3777.6012  Max: 91030902
      Percentiles: P50: 3777.60 P75: 10169.89 P99: 24504.51 P99.9: 59721.59 P99.99: 1861151.83
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D37182983
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a17eb05f25b832b6a2c1356f5c657e831a5af8d1
  20. 15 Jun 2022 (1 commit)
    • Account memory of FileMetaData in global memory limit (#9924) · d665afdb
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924
      
      Test Plan:
      - Previous `make check` verified there are only 2 places where the memory of  the allocated `FileMetaData` can be released
      - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)`
      - db bench (CPU cost of `charge_file_metadata` in write and compact)
         - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
         - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'`
      
      table 1 - write
      
      #-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
      -- | -- | -- | -- | -- | --
      10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
      20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465
      40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078
      80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734**
      160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978**
      
      table 2 - compact
      
      #-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
      -- | -- | -- | -- | -- | --
      10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67
      20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96
      40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96**
      80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78**
      
      - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1  --cache_size=1` killed as normal
      
      Reviewed By: ajkr
      
      Differential Revision: D36055583
      
      Pulled By: hx235
      
      fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
  21. 14 Jun 2022 (1 commit)
    • Make the per-shard hash table fixed-size (#10154) · f105e1a5
      Committed by Guido Tagliavini Ponce
      Summary:
      We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values.
      
      Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154
      
      Test Plan: `make -j24 check`
      
      Reviewed By: pdillinger
      
      Differential Revision: D37124451
      
      Pulled By: guidotag
      
      fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c
  22. 11 Jun 2022 (2 commits)
    • Assume fixed size key (#10137) · 415200d7
      Committed by Guido Tagliavini Ponce
      Summary:
      FastLRUCache now only supports 16B keys. The tests have changed to reflect this.
      
      Because the unit tests were designed for caches that accept any string as keys, some tests are no longer compatible with FastLRUCache. We have disabled those for runs with FastLRUCache. (We could potentially change all tests to use 16B keys, but we don't because the cache public API does not require this.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10137
      
      Test Plan: make -j24 check
      
      Reviewed By: gitbw95
      
      Differential Revision: D37083934
      
      Pulled By: guidotag
      
      fbshipit-source-id: be1719cf5f8364a9a32bc4555bce1a0de3833b0d
    • Enable SecondaryCache::CreateFromString to create sec cache based on the uri for CompressedSecondaryCache (#10132) · f4052d13
      Committed by gitbw95
      
      Summary:
      Update SecondaryCache::CreateFromString and enable it to create sec cache based on the uri for CompressedSecondaryCache.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10132
      
      Test Plan: Add unit tests.
      
      Reviewed By: anand1976
      
      Differential Revision: D36996997
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 882ad563cff6d38b306a53426ad7e47273f34edc
  23. 10 Jun 2022 (1 commit)
  24. 04 Jun 2022 (1 commit)
    • Add support for FastLRUCache in cache_bench (#10095) · eb99e080
      Committed by Guido Tagliavini Ponce
      Summary:
      cache_bench can now run with FastLRUCache.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10095
      
      Test Plan:
      - Temporarily add an ``assert(false)`` in the execution path that sets up the FastLRUCache. Run ``make -j24 cache_bench``. Then test the appropriate code is used by running ``./cache_bench -cache_type=fast_lru_cache`` and checking that the assert is called. Repeat for LRUCache.
      - Verify that FastLRUCache (currently a clone of LRUCache) has a latency distribution similar to LRUCache's, by comparing the outputs of ``./cache_bench -cache_type=fast_lru_cache`` and ``./cache_bench -cache_type=lru_cache``.
      
      Reviewed By: pdillinger
      
      Differential Revision: D36875834
      
      Pulled By: guidotag
      
      fbshipit-source-id: eb2ad0bb32c2717a258a6ac66ed736e06f826cd8
  25. 25 May 2022 (1 commit)
    • Avoid malloc_usable_size() call inside LRU Cache mutex (#10026) · c78a87cd
      Committed by sdong
      Summary:
      Inside the LRU cache mutex, we sometimes call malloc_usable_size() to calculate the memory used by the metadata object. We avoid this by saving the charge plus the metadata size, rather than just the charge, inside the metadata itself. Within the mutex, usually only the total charge is needed, so we don't have to recompute it.
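
      A hedged sketch of the idea (field and policy names illustrative):

      ```
      // At insert time, outside the shard mutex: compute the metadata size
      // once and store the combined value in the handle.
      e->total_charge =
          charge + (metadata_charge_policy_ == kFullChargeCacheMetadata
                        ? malloc_usable_size(static_cast<void*>(e))
                        : 0);

      // Later, under the shard mutex: no allocator call needed.
      usage_ += e->total_charge;
      ```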
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10026
      
      Test Plan: Run existing tests.
      
      Reviewed By: pdillinger
      
      Differential Revision: D36556253
      
      fbshipit-source-id: f60c96d13cde3af77732e5548e4eac4182fa9801
  26. 07 May 2022 (1 commit)
    • Remove own ToString() (#9955) · 736a7b54
      Committed by sdong
      Summary:
      ToString() was created because some platforms didn't support std::to_string(). However, we've already been using std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit simply removes ToString().
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
      
      Test Plan: Watch CI tests
      
      Reviewed By: riversand963
      
      Differential Revision: D36176799
      
      fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
  27. 04 May 2022 (1 commit)
    • Fork and simplify LRUCache for developing enhancements (#9917) · bb87164d
      Committed by Peter Dillinger
      Summary:
      To support a project to prototype and evaluate algorithmic
      enhancements and alternatives to LRUCache, here I have separated out
      LRUCache into internal-only "FastLRUCache" and cut it down to
      essentials, so that details like secondary cache handling and
      priorities do not interfere with prototyping. These can be
      re-integrated later as needed, along with refactoring to minimize code
      duplication (which would slow down prototyping for now).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9917
      
      Test Plan:
      unit tests updated to ensure basic functionality has (likely)
      been preserved
      
      Reviewed By: anand1976
      
      Differential Revision: D35995554
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d67b20b7ada3b5d3bfe56d897a73885894a1d9db
  28. 15 Apr 2022 (1 commit)
    • Expose `CacheEntryRole` and map keys for block cache stat collections (#9838) · d6e016be
      Committed by Andrew Kryczka
      Summary:
      This gives users the ability to examine the map populated by `GetMapProperty()` with property `kBlockCacheEntryStats`. It also sets us up for a possible future where cache reservations are configured according to `CacheEntryRole`s rather than flags coupled to roles.
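
      A hedged usage sketch of the newly exposed map property (error handling abbreviated):

      ```
      #include <cstdio>
      #include <map>
      #include <string>
      #include "rocksdb/db.h"

      std::map<std::string, std::string> stats;
      if (db->GetMapProperty(rocksdb::DB::Properties::kBlockCacheEntryStats,
                             &stats)) {
        // Keys are now public, stable identifiers (e.g., per-CacheEntryRole
        // counts and charges).
        for (const auto& kv : stats) {
          std::printf("%s = %s\n", kv.first.c_str(), kv.second.c_str());
        }
      }
      ```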
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9838
      
      Test Plan:
      - migrated test DBBlockCacheTest.CacheEntryRoleStats to use this API. That test verifies some of the contents are as expected
      - added a DBPropertiesTest to verify the public map keys are present, and nothing else
      
      Reviewed By: hx235
      
      Differential Revision: D35629493
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5c4356b8560e85d1f881fd32c44c15960b02fc68
  29. 13 Apr 2022 (1 commit)
    • Meta-internal folly integration with F14FastMap (#9546) · efd03516
      Committed by Peter Dillinger
      Summary:
      Especially after updating to C++17, I don't see a compelling case for
      *requiring* any folly components in RocksDB. I was able to purge the existing
      hard dependencies, and it can be quite difficult to strip out non-trivial components
      from folly for use in RocksDB. (The prospect of doing that on F14 has changed
      my mind on the best approach here.)
      
      But this change creates an optional integration where we can plug in
      components from folly at compile time, starting here with F14FastMap to replace
      std::unordered_map when possible (probably no public APIs for example). I have
      replaced the biggest CPU users of std::unordered_map with compile-time
      pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set.
      USE_FOLLY is always set in the Meta-internal buck build, and a simulation of
      that is in the Makefile for public CI testing. A full folly build is not needed, but
      checking out the full folly repo is much simpler for getting the dependency,
      and anything else we might want to optionally integrate in the future.
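
      A hedged sketch of the compile-time pluggable alias this describes (exact header paths may differ):

      ```
      #ifdef USE_FOLLY
      #include <folly/container/F14Map.h>
      template <typename K, typename V>
      using UnorderedMap = folly::F14FastMap<K, V>;
      #else
      #include <unordered_map>
      template <typename K, typename V>
      using UnorderedMap = std::unordered_map<K, V>;
      #endif
      ```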
      
      Some picky details:
      * I don't think the distributed mutex stuff is actually used, so it was easy to remove.
      * I implemented an alternative to `folly::constexpr_log2` (which is much easier
      in C++17 than C++11) so that I could pull out the hard dependencies on
      `ConstexprMath.h`
      * I had to add noexcept move constructors/operators to some types to make
      F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a
      macro to make that easier in some common cases.
      * Updated Meta-internal buck build to use folly F14Map (always)
      
      No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a
      production integration for open source users.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546
      
      Test Plan:
      CircleCI tests updated so that a couple of them use folly.
      
      Most internal unit & stress/crash tests updated to use Meta-internal latest folly.
      (Note: they should probably use buck but they currently use Makefile.)
      
      Example performance improvement: when filter partitions are pinned in cache,
      they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build
      a test that exercises that heavily. Build DB with
      
      ```
      TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters
      ```
      
      and test with (simultaneous runs with & without folly, ~20 times each to see
      convergence)
      
      ```
      TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache
      ```
      
      Average ops/s no folly: 26229.2
      Average ops/s with folly: 26853.3 (+2.4%)
      
      Reviewed By: ajkr
      
      Differential Revision: D34181736
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
  30. 12 Apr 2022 (1 commit)
    • Prevent double caching in the compressed secondary cache (#9747) · f241d082
      Committed by gitbw95
      Summary:
      When both LRU Cache and CompressedSecondaryCache are configured together, some data blocks may end up cached twice.
      
      **Changes include:**
      1. Update IS_PROMOTED to IS_IN_SECONDARY_CACHE to prevent confusion.
      2. This PR updates SecondaryCacheResultHandle and uses IsErasedFromSecondaryCache to determine whether the handle is erased from the secondary cache. The caller can then decide whether to call SetIsInSecondaryCache().
      3. Rename LRUSecondaryCache to CompressedSecondaryCache.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9747
      
      Test Plan:
      **Test Scripts:**
      1. Populate a DB. The on disk footprint is 482 MB. The data is set to be 50% compressible, so the total decompressed size is expected to be 964 MB.
      ./db_bench --benchmarks=fillrandom --num=10000000 -db=/db_bench_1
      
      2. Overwrite it to a stable state:
      ./db_bench --benchmarks=overwrite,stats --num=10000000 -use_existing_db -duration=10 --benchmark_write_rate_limit=2000000 -db=/db_bench_1
      
      3. Run read tests with different cache settings:
      
      T1:
      ./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=520000000  --statistics -db=/db_bench_1
      
      T2:
      ./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=320000000 -compressed_secondary_cache_size=400000000 --statistics -use_compressed_secondary_cache -db=/db_bench_1
      
      T3:
      ./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=520000000 -compressed_secondary_cache_size=400000000 --statistics -use_compressed_secondary_cache -db=/db_bench_1
      
      T4:
      ./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=20000000 -compressed_secondary_cache_size=500000000 --statistics -use_compressed_secondary_cache -db=/db_bench_1
      
      **Before this PR**
      | Cache Size | Compressed Secondary Cache Size | Cache Hit Rate |
      |------------|-------------------------------------|----------------|
      |520 MB | 0 MB | 85.5% |
      |320 MB | 400 MB | 96.2% |
      |520 MB | 400 MB | 98.3% |
      |20 MB | 500 MB | 98.8% |
      
      **After this PR**
      | Cache Size | Compressed Secondary Cache Size | Cache Hit Rate |
      |------------|-------------------------------------|----------------|
      |520 MB | 0 MB | 85.5% |
      |320 MB | 400 MB | 99.9% |
      |520 MB | 400 MB | 99.9% |
      |20 MB | 500 MB | 99.2% |
      
      Reviewed By: anand1976
      
      Differential Revision: D35117499
      
      Pulled By: gitbw95
      
      fbshipit-source-id: ea2657749fc13efebe91a8a1b56bc61d6a224a12
  31. 07 Apr 2022 (1 commit)
    • Account memory of big memory users in BlockBasedTable in global memory limit (#9748) · 49623f9c
      Committed by Hui Xiao
      Summary:
      **Context:**
      Through heap profiling, we discovered that `BlockBasedTableReader` objects can accumulate and lead to high memory usage (e.g., `max_open_file = -1`). This memory is currently not accounted for, not tracked, not constrained, and not cache-evictable. As a first step to improve this, similar to https://github.com/facebook/rocksdb/pull/8428, this PR tracks an estimate of each `BlockBasedTableReader` object's memory in the block cache and fails future creations if the memory usage exceeds the available space in the cache at the time of creation.
      
      **Summary:**
      - Approximate the memory usage of big memory users (`BlockBasedTable::Rep` and `TableProperties`) in addition to the existing estimates (filter block/index block/uncompression dictionary)
      - Charge all of these memory usages to the block cache on `BlockBasedTable::Open()` and release them on `~BlockBasedTable()`, as there is no memory usage fluctuation of concern in between
      - Refactor CacheReservationManager (and its call sites) to add concurrency support for the BlockBasedTable usage in this PR
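
      A hedged sketch of the charge-on-open flow; `CacheReservationManager::UpdateCacheReservation` is the existing API, while the surrounding logic is illustrative:

      ```
      // On BlockBasedTable::Open(): reserve the estimated reader footprint in
      // the block cache, and fail the open if the cache cannot absorb it.
      Status s = cache_res_mgr->UpdateCacheReservation(
          curr_reserved_ + approx_table_reader_mem);
      if (!s.ok()) {
        return s;  // creation fails instead of silently exceeding the limit
      }
      // ~BlockBasedTable() later shrinks the reservation by the same amount.
      ```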
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9748
      
      Test Plan:
      - New unit tests
      - db bench: `OpenDb` : **-0.52% in ms**
        - Setup `./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -disable_auto_compactions=1 -write_buffer_size=1048576`
        - Repeated run with pre-change w/o feature and post-change with feature, benchmark `OpenDb`:  `./db_bench -benchmarks=readrandom -use_existing_db=1 -db=/dev/shm/testdb -reserve_table_reader_memory=true (remove this when running w/o feature) -file_opening_threads=3 -open_files=-1 -report_open_timing=true| egrep 'OpenDb:'`
      
      #-run | (feature-off) avg milliseconds | std milliseconds | (feature-on) avg milliseconds | std milliseconds | change (%)
      -- | -- | -- | -- | -- | --
      10 | 11.4018 | 5.95173 | 9.47788 | 1.57538 | -16.87382694
      20 | 9.23746 | 0.841053 | 9.32377 | 1.14074 | 0.9343477536
      40 | 9.0876 | 0.671129 | 9.35053 | 1.11713 | 2.893283155
      80 | 9.72514 | 2.28459 | 9.52013 | 1.0894 | -2.108041632
      160 | 9.74677 | 0.991234 | 9.84743 | 1.73396 | 1.032752389
      320 | 10.7297 | 5.11555 | 10.547 | 1.97692 | **-1.70275031**
      640 | 11.7092 | 2.36565 | 11.7869 | 2.69377 | **0.6635807741**
      
      -  db bench on write with cost to cache in WriteBufferManager (just in case this PR's CRM refactoring accidentally slows down anything in WBM) : `fillseq` : **+0.54% in micros/op**
      `./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -disable_auto_compactions=1 -cost_write_buffer_to_cache=true -write_buffer_size=10000000000 | egrep 'fillseq'`
      
      #-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  avg micros/op | std micros/op | change (%)
      -- | -- | -- | -- | -- | --
      10 | 6.15 | 0.260187 | 6.289 | 0.371192 | 2.260162602
      20 | 7.28025 | 0.465402 | 7.37255 | 0.451256 | 1.267813605
      40 | 7.06312 | 0.490654 | 7.13803 | 0.478676 | **1.060579461**
      80 | 7.14035 | 0.972831 | 7.14196 | 0.92971 | **0.02254791432**
      
      -  filter bench: `bloom filter`: **-0.78% in ms/key**
          - ` ./filter_bench -impl=2 -quick -reserve_table_builder_memory=true | grep 'Build avg'`
      
      #-run | (pre-PR) avg ns/key | std ns/key | (post-PR)  ns/key | std ns/key | change (%)
      -- | -- | -- | -- | -- | --
      10 | 26.4369 | 0.442182 | 26.3273 | 0.422919 | **-0.4145720565**
      20 | 26.4451 | 0.592787 | 26.1419 | 0.62451 | **-1.1465262**
      
      - Crash test `python3 tools/db_crashtest.py blackbox --reserve_table_reader_memory=1 --cache_size=1` killed as normal
      
      Reviewed By: ajkr
      
      Differential Revision: D35136549
      
      Pulled By: hx235
      
      fbshipit-source-id: 146978858d0f900f43f4eb09bfd3e83195e3be28
  32. 02 Apr 2022 (1 commit)