1. 06 Jun, 2023 1 commit
    • Drop range tombstone during non-bottommost compaction (#11459) · 4aa52d89
      Committed by Changyu Bi
      Summary:
      Similar to point tombstones, we can drop a range tombstone during compaction when we know its range does not exist in any higher level. This PR adds this optimization. Some existing tests in db_range_del_test are fixed to work with this optimization.
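      
      The condition above can be sketched roughly as follows. This is a simplified model, not RocksDB code: it assumes that "higher level" means a level with a larger number than the compaction output level, and that each level is summarized by the closed key ranges of its files. The names `KeyRange`, `RangeTombstone`, `Overlaps`, and `CanDropRangeTombstone` are hypothetical.
      
      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <string>
      #include <vector>
      
      // Closed key range [start, end] covered by one file (illustrative model only).
      struct KeyRange {
        std::string start;
        std::string end;
      };
      
      // A range tombstone deletes keys in [start, end), i.e. end is exclusive.
      struct RangeTombstone {
        std::string start;
        std::string end;
      };
      
      // True if the tombstone's [start, end) intersects the file's [start, end].
      bool Overlaps(const RangeTombstone& t, const KeyRange& f) {
        return t.start <= f.end && f.start < t.end;
      }
      
      // Hypothetical check: the tombstone can be dropped at `output_level` when no
      // file in any level below it (larger level number) overlaps its range.
      bool CanDropRangeTombstone(const RangeTombstone& t,
                                 const std::vector<std::vector<KeyRange>>& levels,
                                 std::size_t output_level) {
        assert(output_level < levels.size());
        for (std::size_t level = output_level + 1; level < levels.size(); ++level) {
          for (const KeyRange& f : levels[level]) {
            if (Overlaps(t, f)) {
              return false;  // Some deeper data may still be covered by the tombstone.
            }
          }
        }
        return true;
      }
      ```
      
      In this model a tombstone written to the last level is always droppable, which matches the existing bottommost behavior; the new case is an output level above the bottom whose deeper levels happen to hold no overlapping files.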
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11459
      
      Test Plan:
      * Add unit test `DBRangeDelTest, NonBottommostCompactionDropRangetombstone`.
      * Ran crash test that issues range deletion for a few hours: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --delrangepercent=10 --writepercent=31 --readpercent=40`
      
      Reviewed By: ajkr
      
      Differential Revision: D46007904
      
      Pulled By: cbi42
      
      fbshipit-source-id: 3f37205b6778b7d55ed106369ca41b0632a6d0fd
  2. 03 Jun, 2023 2 commits
    • Small improvements to DBGet microbenchmark (#11498) · 687a2a0d
      Committed by Andrew Kryczka
      Summary:
      Followed a couple of best practices:
      
      - Allowed Google benchmark to decide the number of iterations. Previously we hardcoded a value, which circumvented the benchmark's heuristic for iterating until the result is stable.
      - Made each iteration do similar work. Previously, an iteration could do different work depending on whether the key was found in the first, second, third, or no L0 file.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11498
      
      Test Plan: none as I am unable to prove it is better
      
      Reviewed By: hx235
      
      Differential Revision: D46339050
      
      Pulled By: ajkr
      
      fbshipit-source-id: fcfc6da4111c5b3ae86d79d908afc5f61f96675b
    • Some fixes to unreleased_history/ (#11504) · 7a9b264f
      Committed by Peter Dillinger
      Summary:
      * Add a "Performance Improvements" section
      * Add required copyright headers
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11504
      
      Test Plan: manual
      
      Reviewed By: hx235
      
      Differential Revision: D46405128
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 4f878dfd0170d381d3051a44c13479c860e812c0
  3. 02 Jun, 2023 4 commits
    • Log correct compaction score for Universal Compaction (#11487) · 71ca9a1d
      Committed by Changyu Bi
      Summary:
      Currently, 0 is incorrectly logged as the compaction score for L0 when num_levels > 1. This PR fixes the issue so that the correct score is logged.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11487
      
      Test Plan:
      ```
      ./db_bench --benchmarks=fillrandom --max_background_jobs=8 --num=1000000  --compaction_style=1 --stats_dump_period_sec=20 --num_levels=7 --write_buffer_size=1048576
      
      grep "L0   " /tmp/rocksdbtest-543376/dbbench/LOG
      
      before:
      ** Compaction Stats [default] **
      Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
      ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0      9.9      0.42              0.33         9    0.046       0      0       0.0       0.0
      L0      3/1    1.37 MB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.1      0.6      9.6      3.76              3.03        76    0.050     34K    140       0.0       0.0
      L0      2/0    2.26 MB   0.0      0.0     0.0      0.0       0.1      0.1       0.0   1.6      3.2      8.2     12.59             11.17       163    0.077    619K   5499       0.0       0.0
      
      after: compaction scores are non-zero
      L0      0/0    0.00 KB   0.8      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0      9.6      0.43              0.34         9    0.048       0      0       0.0       0.0
      L0      2/1   937.08 KB   1.0      0.0     0.0      0.0       0.0      0.0       0.0   1.1      0.6      9.3      3.85              3.07        75    0.051     34K    165       0.0       0.0
      L0      2/2    1.82 MB   1.0      0.0     0.0      0.0       0.1      0.1       0.0   1.6      3.0      8.0     12.45             10.99       160    0.078    577K   5399       0.0       0.0
      
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D46293993
      
      Pulled By: cbi42
      
      fbshipit-source-id: 19753f7df68c5f54a84c4ed52794f83e510c9721
    • `CompactRange()` always compacts to bottommost level for leveled compaction (#11468) · e95cc121
      Committed by Changyu Bi
      Summary:
      Currently, for leveled compaction, the max output level of a call to `CompactRange()` is pre-computed before compacting each level. This max output level is the max level whose key range overlaps with the manual compaction key range. However, during manual compaction, files in the max output level may be compacted down further by some background compaction. When this background compaction is a trivial move, there is a race condition and the manual compaction may not be able to compact all keys in the specified key range. This PR updates `CompactRange()` to always compact to the bottommost level to make this race condition less likely (it can still happen; see the comment here: https://github.com/cbi42/rocksdb/blob/796f58f42ad1bdbf49e5fcf480763f11583b790e/db/db_impl/db_impl_compaction_flush.cc#L1180C29-L1184).
      
      This PR also changes the behavior of CompactRange() when `bottommost_level_compaction=kIfHaveCompactionFilter` (the default option). The old behavior is that, if a compaction filter is provided, CompactRange() always does an intra-level compaction at the final output level for all files in the manual compaction key range. The only exception is when `first_overlapped_level = 0` and `max_overlapped_level = 0`. It is awkward to maintain the same behavior after this PR since we no longer compute max_overlapped_level. So the new behavior is similar to kForceOptimized: always do an intra-level compaction at the bottommost level, but do not include new files generated during this manual compaction.
      
      Several unit tests are updated to work with this new manual compaction behavior.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11468
      
      Test Plan: Add new unit tests `DBCompactionTest.ManualCompactionCompactAllKeysInRange*`
      
      Reviewed By: ajkr
      
      Differential Revision: D46079619
      
      Pulled By: cbi42
      
      fbshipit-source-id: 19d844ba4ec8dc1a0b8af5d2f36ff15820c6e76f
    • Add support to strip / pad timestamp when creating / reading a block based table (#11495) · 9f7877f2
      Committed by Yu Zhang
      Summary:
      Add support to strip timestamp in block based table builder and pad timestamp in block based table reader.
      
      On the write path, use the per column family option `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` to indicate whether user-defined timestamps should be stripped for all block based tables created for the column family.
      
      On the read path, added a per table `TableReadOption.user_defined_timestamps_persisted` to flag whether the user keys in the table contain user-defined timestamps.
      
      This patch mostly passes the related flags down to the block building/parsing level, with the exception of handling the `first_internal_key` in `IndexValue`, which is done at the `IndexBuilder` level. The value part of range deletion entries should get similar handling. I haven't decided where this piece of logic fits best, so I will do it in a follow-up.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11495
      
      Test Plan:
      Existing test `BlockBasedTableReaderTest` is parameterized to run with:
      1) different UDT test modes: kNone, kNormal, kStripUserDefinedTimestamp
      2) all four index types; when the index type is `kTwoLevelIndexSearch`, partitioned filters are also enabled
      3) parallel vs non-parallel compression
      4) enable/disable compression dictionary.
      
      Also added tests for API `BlockBasedTableReader::NewIterator`.
      
      `PartitionedFilterBlockTest` is parameterized to run with different UDT test modes: kNone, kNormal, kStripUserDefinedTimestamp.
      
      ```
      make all check
      ./block_based_table_reader_test
      ./partitioned_filter_block_test
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D46344577
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 93ac8542b19319d1298712b8bed908c8831ba675
    • Make `unreleased_history/release.sh` work on macOS (#11494) · 9f1ce6d8
      Committed by Changyu Bi
      Summary:
      I got the following errors when running `unreleased_history/release.sh` on my Mac. This is because macOS does not ship the GNU versions of awk and find by default. This PR updates the script to work on macOS.
      ```
      awk: calling undefined function strftime
       input record number 43, file
       source line number 4
      
      find: -regextype: unknown primary or operator
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11494
      
      Test Plan: manually run `DRY_RUN=1 unreleased_history/release.sh | less` on macOS and CentOS8 machines.
      
      Reviewed By: ajkr
      
      Differential Revision: D46328442
      
      Pulled By: cbi42
      
      fbshipit-source-id: a7570cd3480fcd25ac1438beb0d59fe655f9a71a
  4. 01 Jun, 2023 2 commits
    • Support single delete help message in ldb (#11493) · 68a9cd21
      Committed by shadowlux
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11493
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D46325687
      
      Pulled By: ajkr
      
      fbshipit-source-id: ebf08477f5209104aee605496d751c857f4bb0a2
    • Flush option in WaitForCompact() (#11483) · 87bc929d
      Committed by Jay Huh
      Summary:
      Context:
      
      As mentioned in https://github.com/facebook/rocksdb/issues/11436, this PR introduces a `flush` option in `WaitForCompactOptions` to flush before waiting for compactions to finish. It must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB.
      1. `bool flush = false` added to `WaitForCompactOptions`
      2. `DBImpl::FlushAllColumnFamilies()` is introduced and `DBImpl::FlushForGetLiveFiles()` is refactored to call it.
      3. `DBImpl::FlushAllColumnFamilies()` gets called before waiting in `WaitForCompact()` if `flush` option is `true`
      4. Some previous WaitForCompact tests were parameterized to include both cases for `abort_on_pause_` being true/false as well as `flush_` being true/false
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11483
      
      Test Plan:
      - `DBCompactionTest::WaitForCompactWithOptionToFlush` added
      - Changed existing DBCompactionTest::WaitForCompact tests to `DBCompactionWaitForCompactTest` to include params
      
      Reviewed By: pdillinger
      
      Differential Revision: D46289770
      
      Pulled By: jaykorean
      
      fbshipit-source-id: 70d3f461d96a6e06390be60170dd7c4d0d38f8b0
  5. 31 May, 2023 4 commits
    • Logging timestamp size record in WAL and use it during recovery (#11471) · 56ca9e31
      Committed by Yu Zhang
      Summary:
      Start logging the timestamp size record in WAL and use the record during recovery.  Currently, user comparator cannot be different from what was used to create a column family, so the timestamp size record is just used to confirm it's consistent with the timestamp size the running user comparator indicates.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11471
      
      Test Plan:
      ```
      make all check
      ./db_secondary_test
      ./db_wal_test --gtest_filter="*WithTimestamp*"
      ./repair_test --gtest_filter="*WithTimestamp*"
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D46236769
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: f6c60b5c8defdb05021c63df302ccc0be1275ad0
    • Better management of unreleased HISTORY (#11481) · 8848ec92
      Committed by Peter Dillinger
      Summary:
      See new NOTE in HISTORY.md and unreleased_history/README.txt
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11481
      
      Test Plan: some manual testing on my CentOS 8 system
      
      Reviewed By: jaykorean
      
      Differential Revision: D46233342
      
      Pulled By: pdillinger
      
      fbshipit-source-id: daf59cf3dc907f450b469090dcc481a30a7d7c0d
    • Fix flaky test: `DBCompactionTest.WaitForCompactShutdownWhileWaiting` (#11488) · e1c7209b
      Committed by Changyu Bi
      Summary:
      TSAN complains with the following error message. This is likely because the DB object is destroyed while WaitForCompact() is still running.
      ```
      [ RUN      ] DBCompactionTest.WaitForCompactShutdownWhileWaiting
      ==================
      WARNING: ThreadSanitizer: data race (pid=1128703)
        Atomic read of size 1 at 0x7b8c00000740 by thread T4:
          #0 pthread_cond_wait <null> (db_compaction_test+0x46970a)
          #1 rocksdb::port::CondVar::Wait() /root/project/port/port_posix.cc:119:23 (librocksdb.so.8.4+0x7c4c60)
          #2 rocksdb::InstrumentedCondVar::WaitInternal() /root/project/monitoring/instrumented_mutex.cc:69:9 (librocksdb.so.8.4+0x75f697)
          #3 rocksdb::InstrumentedCondVar::Wait() /root/project/monitoring/instrumented_mutex.cc:62:3 (librocksdb.so.8.4+0x75f697)
          #4 rocksdb::DBImpl::WaitForCompact(rocksdb::WaitForCompactOptions const&) /root/project/db/db_impl/db_impl_compaction_flush.cc:3978:14 (librocksdb.so.8.4+0x494174)
          #5 rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30::operator()() const /root/project/db/db_compaction_test.cc:3479:26 (db_compaction_test+0x5cdc90)
          #6 void std::__invoke_impl<void, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30>(std::__invoke_other, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 (db_compaction_test+0x5cdc90)
          #7 std::__invoke_result<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30>::type std::__invoke<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30>(rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14 (db_compaction_test+0x5cdc90)
          #8 void std::thread::_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13 (db_compaction_test+0x5cdc90)
          #9 std::thread::_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11 (db_compaction_test+0x5cdc90)
          #10 std::thread::_State_impl<std::thread::_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13 (db_compaction_test+0x5cdc90)
          #11 <null> <null> (libstdc++.so.6+0xda6b3)
      
        Previous write of size 1 at 0x7b8c00000740 by thread T5:
          #0 pthread_mutex_destroy <null> (db_compaction_test+0x46a4f8)
          #1 rocksdb::port::Mutex::~Mutex() /root/project/port/port_posix.cc:77:48 (librocksdb.so.8.4+0x7c480e)
          #2 rocksdb::InstrumentedMutex::~InstrumentedMutex() /root/project/./monitoring/instrumented_mutex.h:20:7 (librocksdb.so.8.4+0x41fda6)
          #3 rocksdb::DBImpl::~DBImpl() /root/project/db/db_impl/db_impl.cc:755:1 (librocksdb.so.8.4+0x41fda6)
          #4 rocksdb::DBImpl::~DBImpl() /root/project/db/db_impl/db_impl.cc:737:19 (librocksdb.so.8.4+0x4203d9)
          #5 rocksdb::DBTestBase::Close() /root/project/db/db_test_util.cc:670:3 (librocksdb_test_debug.so+0x57413)
          #6 rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31::operator()() const /root/project/db/db_compaction_test.cc:3485:49 (db_compaction_test+0x5cdf03)
          #7 void std::__invoke_impl<void, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31>(std::__invoke_other, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 (db_compaction_test+0x5cdf03)
          #8 std::__invoke_result<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31>::type std::__invoke<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31>(rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14 (db_compaction_test+0x5cdf03)
          #9 void std::thread::_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13 (db_compaction_test+0x5cdf03)
          #10 std::thread::_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11 (db_compaction_test+0x5cdf03)
          #11 std::thread::_State_impl<std::thread::_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13 (db_compaction_test+0x5cdf03)
          #12 <null> <null> (libstdc++.so.6+0xda6b3)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11488
      
      Test Plan:
      ```
      COMPILE_WITH_TSAN=1 CC=clang-13 CXX=clang++-13 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 db_compaction_test
      
      gtest-parallel --repeat=10000 ./db_compaction_test --gtest_filter="*WaitForCompactShutdownWhileWaiting*" -w200
      ```
      
      Reviewed By: jaykorean
      
      Differential Revision: D46293891
      
      Pulled By: cbi42
      
      fbshipit-source-id: 8ca259cb1e09a9e4f4095b2d084f2ba92b710b97
    • Integrate CacheReservationManager with compressed secondary cache (#11449) · fcc358ba
      Committed by anand76
      Summary:
      This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained: clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e., entries with a null value and a non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows.
      
      When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter``` ensures that the reservation is distributed proportionally across the primary/secondary caches.
      
      The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows -
        |---------    Primary Cache Configured Capacity  -----------|
        |---Secondary Cache Budget----|----Primary Cache Budget-----|
      
      A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired.
      
        |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---|
      
      When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation.
      
      For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion -
      
        |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-|
      
      Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount.
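      
      The bookkeeping described above can be illustrated with a small arithmetic model. This is only a sketch and not the adapter's implementation: `TieredBudgetModel` and its members are hypothetical names, and it assumes the secondary share of a reservation is `charge * sec_budget / (pri_budget + sec_budget)`, which is one reading of "distributed proportionally".
      
      ```cpp
      #include <cassert>
      #include <cstdint>
      
      // Toy model of the reservation accounting; all names are illustrative.
      struct TieredBudgetModel {
        uint64_t pri_budget;        // Capacity the user wants for the primary cache.
        uint64_t sec_budget;        // Capacity the user wants for the secondary cache.
        uint64_t pri_cache_res;     // Models pri_cache_res_: held against the primary.
        uint64_t sec_capacity;      // Current usable capacity of the secondary cache.
        uint64_t placeholders = 0;  // Sum of user placeholder charges in the primary.
      
        TieredBudgetModel(uint64_t pri, uint64_t sec)
            : pri_budget(pri),
              sec_budget(sec),
              pri_cache_res(sec),  // Initially the whole secondary budget is reserved.
              sec_capacity(sec) {}
      
        // The primary cache is configured with the sum of both budgets.
        uint64_t ConfiguredPrimaryCapacity() const { return pri_budget + sec_budget; }
      
        // What is left in the primary cache for actual cached blocks.
        uint64_t PrimaryUsable() const {
          return ConfiguredPrimaryCapacity() - pri_cache_res - placeholders;
        }
      
        // Assumed proportional split (integer arithmetic, may overflow for huge inputs).
        uint64_t SecondaryShare(uint64_t charge) const {
          return charge * sec_budget / ConfiguredPrimaryCapacity();
        }
      
        // Placeholder goes fully into the primary; the secondary is deflated by its
        // share and the same amount is released from pri_cache_res.
        void InsertPlaceholder(uint64_t charge) {
          const uint64_t share = SecondaryShare(charge);
          assert(share <= sec_capacity && share <= pri_cache_res);
          placeholders += charge;
          sec_capacity -= share;   // Deflate()
          pri_cache_res -= share;
        }
      
        // Reverse of InsertPlaceholder: inflate the secondary and re-reserve.
        void ReleasePlaceholder(uint64_t charge) {
          assert(charge <= placeholders);
          const uint64_t share = SecondaryShare(charge);
          placeholders -= charge;
          sec_capacity += share;   // Inflate()
          pri_cache_res += share;
        }
      };
      ```
      
      With a 50/50 split of 100 units, inserting a placeholder of 20 leaves 40 usable units in each cache, so each side gives up half of the reservation.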
      
      Other alternatives -
      1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead.
      2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them.
      
      The PR contains the following changes -
      1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```.
      2. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementation returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```.
      3. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```.
      4. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted.
      
      Benchmarks -
      Baseline
      ```
      time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true
      ```
      ```
      readseq      :       3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s
      readwhilewriting :      22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found)
      
      real    6m31.052s
      user    152m5.660s
      sys     26m18.738s
      ```
      With TieredVolatileCache
      ```
      time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true
      ```
      ```
      readseq      :       4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s
      readwhilewriting :      20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found)
      
      real    6m42.743s
      user    157m58.972s
      sys     33m16.671
      ```
      ```
      readseq      :       3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s
      readwhilewriting :      21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found)
      
      real    6m31.469s
      user    155m16.570s
      sys     27m47.834s
      ```
      
      ToDo -
      1. Add to db_stress
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449
      
      Reviewed By: pdillinger
      
      Differential Revision: D46197388
      
      Pulled By: anand1976
      
      fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
  6. 27 May, 2023 2 commits
    • add WriteBatch::Release() (#11482) · 3e7fc881
      Committed by Andrew Kryczka
      Summary:
      Together with the existing constructor,
      `explicit WriteBatch(std::string&& rep)`, this enables transferring `WriteBatch` via its `std::string` representation. Associated info like KV checksums is dropped, but the caller can use `WriteBatch::VerifyChecksum()` before taking ownership if needed.
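      
      The ownership transfer can be pictured with a toy class. This is a sketch, not the RocksDB `WriteBatch`: `ToyBatch`, `Append`, `Data`, and `Transfer` are hypothetical, and it assumes `Release()` hands back the serialized representation by value and leaves the source empty.
      
      ```cpp
      #include <cassert>
      #include <string>
      #include <utility>
      
      // Toy stand-in for a batch that owns a serialized representation.
      class ToyBatch {
       public:
        ToyBatch() = default;
      
        // Mirrors the shape of `explicit WriteBatch(std::string&& rep)`.
        explicit ToyBatch(std::string&& rep) : rep_(std::move(rep)) {}
      
        // Appends a record to the serialized representation.
        void Append(const std::string& record) { rep_ += record; }
      
        // Gives up ownership of the representation; the batch is empty afterwards.
        std::string Release() {
          std::string out = std::move(rep_);
          rep_.clear();  // A moved-from string is valid but unspecified, so reset it.
          return out;
        }
      
        const std::string& Data() const { return rep_; }
      
       private:
        std::string rep_;
      };
      
      // Moves the contents of `source` into a new batch without copying the bytes.
      ToyBatch Transfer(ToyBatch& source) { return ToyBatch(source.Release()); }
      ```
      
      Any side information kept next to the representation (the KV checksums mentioned above, in the real class) does not travel with the string, which is why verification has to happen before the release.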
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11482
      
      Reviewed By: cbi42
      
      Differential Revision: D46233884
      
      Pulled By: ajkr
      
      fbshipit-source-id: 6bc64a6e75fb7bbf61d08c09520fc3705a7b44d8
    • Tweak on IsTrivialMove() (#11467) · de1dd4ca
      Committed by Soli
      Summary:
      `output_level_` and `number_levels_` do not change while iterating over the `inputs_` files.
      
      Moving the check out of the `for` loop could slightly improve performance.
      
      The diff is easier to review when whitespace changes are ignored.
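      
      The kind of change described above is ordinary loop-invariant hoisting. The sketch below is not the actual `IsTrivialMove()` body: `FileInfo`, the size check, and both function names are hypothetical, and only the shape (a level-only condition tested inside versus outside the loop) is taken from the summary.
      
      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <vector>
      
      // Minimal per-file record for the illustration.
      struct FileInfo {
        uint64_t size;
      };
      
      // Before: the condition on the levels is re-evaluated for every input file,
      // although neither value changes during the iteration.
      bool CheckFilesBefore(const std::vector<FileInfo>& inputs, int output_level,
                            int number_levels, uint64_t limit) {
        for (const FileInfo& f : inputs) {
          if (output_level + 1 < number_levels) {
            if (f.size > limit) {
              return false;
            }
          }
        }
        return true;
      }
      
      // After: the invariant condition is tested once, outside the loop.
      bool CheckFilesAfter(const std::vector<FileInfo>& inputs, int output_level,
                           int number_levels, uint64_t limit) {
        if (output_level + 1 < number_levels) {
          for (const FileInfo& f : inputs) {
            if (f.size > limit) {
              return false;
            }
          }
        }
        return true;
      }
      ```
      
      Both versions return the same result for every input; the second simply skips the loop entirely when the level condition is false, which also explains why the real diff is mostly re-indentation.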
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11467
      
      Reviewed By: cbi42
      
      Differential Revision: D46155962
      
      Pulled By: ajkr
      
      fbshipit-source-id: 45ec80b13152b3bed7305e6f707cb9b187d5f315
  7. 26 May, 2023 4 commits
    • Move WaitForCompact() change entry to Unreleased in History file (#11479) · 23f4e9ad
      Committed by Jay Huh
      Summary:
      Context:
      
      Because of the branch cut, the History change made it into the previous release. This PR moves the entry to Unreleased.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11479
      
      Test Plan: History change. Not needed.
      
      Reviewed By: pdillinger
      
      Differential Revision: D46226237
      
      Pulled By: jaykorean
      
      fbshipit-source-id: 33e7d84a05db254d227f05d76038fc6d225dbabf
    • Add WaitForCompact with WaitForCompactOptions to public API (#11436) · 81aeb159
      Committed by Jay Huh
      Summary:
      Context:
      
      This is the first PR for the WaitForCompact() implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In later PRs, we will introduce the following two options in `WaitForCompactOptions`:
      
      1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB.
      2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing.
      
      1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct.
      2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h`
      3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436
      
      Test Plan:
      Following tests added
      - `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish`
      - `DBCompactionTest::WaitForCompactAbortOnPauseAborted`
      - `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted`
      - `DBCompactionTest::WaitForCompactShutdownWhileWaiting`
      - `TransactionTest::WaitForCompactAbortOnPause`
      
      NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place.
      
      Reviewed By: pdillinger
      
      Differential Revision: D45799659
      
      Pulled By: jaykorean
      
      fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b
    • Add support to strip / pad timestamp when writing / reading a block (#11472) · d1ae7f6c
      Committed by Yu Zhang
      Summary:
      This patch adds support in `BlockBuilder` to strip the user-defined timestamp from the `key` added via `Add(key, value)` and its equivalent APIs. The stripping logic differs depending on whether the key is a user key or an internal key, so the `BlockBuilder` is created with a flag to indicate that. This patch also adds support on the read path, in the APIs `NewIndexIterator` and `NewDataIterator`, for padding a min timestamp.
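      
      Why the two key kinds need different stripping logic can be shown with a small sketch. It rests on assumptions that are not stated in this summary: the timestamp is a fixed-size suffix of the user key, an internal key is the user key followed by an 8-byte footer, and the min timestamp is all zero bytes. The helper names below are hypothetical and are not the `BlockBuilder` API.
      
      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <string>
      
      // Assumed size of the internal-key footer that follows the user key.
      constexpr std::size_t kFooterSize = 8;
      
      // User key layout (assumed): <key bytes><timestamp>.
      std::string StripUserKeyTimestamp(const std::string& key, std::size_t ts_sz) {
        assert(key.size() >= ts_sz);
        return key.substr(0, key.size() - ts_sz);
      }
      
      // Internal key layout (assumed): <key bytes><timestamp><footer>.
      // The timestamp sits in the middle, so the footer must be kept.
      std::string StripInternalKeyTimestamp(const std::string& key,
                                            std::size_t ts_sz) {
        assert(key.size() >= ts_sz + kFooterSize);
        const std::size_t user_len = key.size() - kFooterSize - ts_sz;
        return key.substr(0, user_len) + key.substr(key.size() - kFooterSize);
      }
      
      // Read path: append a min timestamp (assumed to be ts_sz zero bytes).
      std::string PadUserKeyMinTimestamp(const std::string& key, std::size_t ts_sz) {
        return key + std::string(ts_sz, '\0');
      }
      
      // Read path for internal keys: insert the min timestamp before the footer.
      std::string PadInternalKeyMinTimestamp(const std::string& key,
                                             std::size_t ts_sz) {
        assert(key.size() >= kFooterSize);
        const std::size_t user_len = key.size() - kFooterSize;
        return key.substr(0, user_len) + std::string(ts_sz, '\0') +
               key.substr(user_len);
      }
      ```
      
      Under these assumptions, padding followed by stripping returns the original key, and a key written with the min timestamp survives a strip-then-pad round trip unchanged, which is consistent with the tests below reading and writing with the min timestamp.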
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11472
      
      Test Plan:
      Three test modes are added to parameterize existing tests:
      UserDefinedTimestampTestMode::kNone -> UDT feature is not enabled
      UserDefinedTimestampTestMode::kNormal -> UDT feature enabled, write / read with min timestamp
      UserDefinedTimestampTestMode::kStripUserDefinedTimestamps -> UDT feature enabled, write / read with min timestamp, set `persist_user_defined_timestamps` where it applies to false.
      The tests read/write with min timestamp so that point read and range scan can correctly read values in all three test modes.
      
      `block_test` is parameterized to run with the above three test modes and some additional parameterization.
      
      ```
      make all check
      ./block_test --gtest_filter="P/BlockTest*"
      ./block_test --gtest_filter="P/IndexBlockTest*"
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D46200539
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 59f5d6b584639976b69c2943eba723bd47d9b3c0
      d1ae7f6c
    • H
      Fix StopWatch bug; Remove setting `record_read_stats` (#11474) · dcc6fc99
      Hui Xiao committed
      Summary:
      **Context/Summary:**
      - StopWatch enables stats even when `StatsLevel::kExceptTimers` is set. It's a harmless bug, though, since `reportTimeToHistogram()` will not report it anyway according to https://github.com/facebook/rocksdb/blob/main/include/rocksdb/statistics.h#L705
      - https://github.com/facebook/rocksdb/pull/11288 should have removed the logic that sets `record_read_stats = !for_compaction`, as we no longer differentiate `RandomAccessFileReader`'s stats behavior based on whether the read is for compaction (instead we now report stats of different IO activities, including compaction, to different stats). Fixing this should report more compaction-related file read micros that weren't reported previously due to `for_compaction==true`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11474
      
      Test Plan:
      - DB bench pre vs post fix with small max_open_files
      
      Setup command
      `./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks=fillseq -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=true -compression_type=none -bloom_bits=3`
      
      Run command
      `./db_bench --open_files=1 -use_existing_db=true -db=/dev/shm/testdb2/ -statistics=true -benchmarks=compactall -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=true -compression_type=none -bloom_bits=3`
      
      Pre-fix
      ```
      rocksdb.sst.read.micros P50 : 2.056175 P95 : 4.647739 P99 : 8.948475 P100 : 25.000000 COUNT : 4451 SUM : 12827
      rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
      rocksdb.file.read.compaction.micros P50 : 2.057397 P95 : 4.625253 P99 : 8.749474 P100 : 25.000000 COUNT : 4382 SUM : 12608
      rocksdb.file.read.db.open.micros P50 : 1.985294 P95 : 9.100000 P99 : 13.000000 P100 : 13.000000 COUNT : 69 SUM : 219
      ```
      
      Post-fix (with a higher `rocksdb.file.read.compaction.micros` count)
      ```
      rocksdb.sst.read.micros P50 : 1.858968 P95 : 3.653086 P99 : 5.968000 P100 : 21.000000 COUNT : 3548 SUM : 9119
      rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
      rocksdb.file.read.compaction.micros P50 : 1.857027 P95 : 3.627614 P99 : 5.738621 P100 : 21.000000 COUNT : 3479 SUM : 8904
      rocksdb.file.read.db.open.micros P50 : 2.000000 P95 : 6.733333 P99 : 11.000000 P100 : 11.000000 COUNT : 69 SUM : 215
      ```
      - CI
      
      Reviewed By: ajkr
      
      Differential Revision: D46137221
      
      Pulled By: hx235
      
      fbshipit-source-id: e5b4ee7001af26f2ee0377bc6334f43b2a527388
      dcc6fc99
  8. 25 May, 2023 3 commits
    • P
      Document SyncPoint::LoadDependency (#11477) · e8710303
      Peter Dillinger committed
      Summary:
      It's easy to mix up the ordering when it's undocumented. For an example of the meaning of the order, see DBTest.ThreadStatusFlush.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11477
      
      Test Plan: comments only
      
      Reviewed By: jaykorean
      
      Differential Revision: D46166683
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 33118ba7ef1b08eab7b077548fe2e70f2c309e3f
      e8710303
    • P
      Improve memory efficiency of many OptimisticTransactionDBs (#11439) · 17bc2774
      Peter Dillinger committed
      Summary:
      Currently it's easy to use a ton of memory with many small OptimisticTransactionDB instances, because each one by default allocates a million mutexes (40 bytes each on my compiler) for validating transactions. It even puts a lot of pressure on the allocator by allocating each one individually!
      
      In this change:
      * Create a new object and option that enables sharing these buckets of mutexes between instances. This is generally good for load balancing potential contention as various DBs become hotter or colder with txn writes. About the only cases where this sharing wouldn't make sense (e.g. each DB usually written by one thread) are cases that would be better off with OccValidationPolicy::kValidateSerial which doesn't use the buckets anyway.
      * Allocate the mutexes in a contiguous array, for efficiency
      * Add an option to ensure the mutexes are cache-aligned. In several other places we use cache-aligned mutexes but OptimisticTransactionDB historically does not. It should be a space-time trade-off the user can choose.
      * Provide some visibility into the memory used by the mutex buckets with an ApproximateMemoryUsage() function (also used in unit testing)
      * Share code with other users of "striped" mutexes, appropriate refactoring for customization & efficiency (e.g. using FastRange instead of modulus)
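      
      As a rough illustration of the ideas in the list above (a contiguous array of mutexes, bucket selection by multiply-and-shift instead of modulus, and a memory usage estimate), here is a self-contained sketch. `MutexBuckets`, `FastRange32`, and `LockFor` are hypothetical names chosen for this example, not the RocksDB classes or signatures.
      
      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <cstdint>
      #include <functional>
      #include <memory>
      #include <mutex>
      #include <string>
      
      // Maps a 32-bit hash onto [0, range) with a multiply and a shift,
      // avoiding the division that `hash % range` would need.
      inline std::size_t FastRange32(std::uint32_t hash, std::uint32_t range) {
        return static_cast<std::size_t>((static_cast<std::uint64_t>(hash) * range) >> 32);
      }
      
      // A fixed set of "striped" mutexes held in one contiguous allocation.
      class MutexBuckets {
       public:
        explicit MutexBuckets(std::uint32_t count)
            : count_(count), locks_(new std::mutex[count]) {
          assert(count > 0);
        }
      
        // Returns the mutex guarding the bucket that `key` hashes to.
        std::mutex& LockFor(const std::string& key) {
          const auto h = static_cast<std::uint32_t>(std::hash<std::string>{}(key));
          return locks_[FastRange32(h, count_)];
        }
      
        // Size of the mutex array only; allocator overhead is ignored.
        std::size_t ApproximateMemoryUsage() const {
          return sizeof(std::mutex) * count_;
        }
      
       private:
        std::uint32_t count_;
        std::unique_ptr<std::mutex[]> locks_;
      };
      ```
      
      Sharing between instances then amounts to handing the same `std::shared_ptr<MutexBuckets>` to each DB, so that many small DBs pay for one array instead of one array each.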
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11439
      
      Test Plan: unit tests added. Ran sized-up versions of stress test in unit test, including a before-and-after performance test showing no consistent difference. (NOTE: OptimisticTransactionDB not currently covered by db_stress!)
      
      Reviewed By: ltamasi
      
      Differential Revision: D45796393
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ae2b3a26ad91ceeec15debcdc63ff48df6736a54
      17bc2774
    • A
      Implement missing compactrangeoptions from Java API (#10880) · 93e0715f
      Alan Paxton committed
      Summary:
      Add the following missing options to `src/main/java/org/rocksdb/CompactRangeOptions.java` and in `java/rocksjni/options.cc` in RocksJava.
      
      For the descriptions and API see the C++ file `include/rocksdb/options.h`, specifically the struct `CompactRangeOptions`
      
      * full_history_ts_low
      * canceled
      
      We changed the handle to return an object (of class `Java_org_rocksdb_CompactRangeOptions`) containing a `ROCKSDB_NAMESPACE::CompactRangeOptions` at (almost certainly) 0-offset, rather than a raw `ROCKSDB_NAMESPACE::CompactRangeOptions`.
      
      The `Java_org_rocksdb_CompactRangeOptions` contains as supplementary fields objects (std::string, std::atomic<bool>) which are passed as pointers to the `ROCKSDB_NAMESPACE::CompactRangeOptions` and which must therefore live for as long as the `ROCKSDB_NAMESPACE::CompactRangeOptions`. By placing them in a `Java_org_rocksdb_CompactRangeOptions` we achieve this.
      
      Because the field offset of the `ROCKSDB_NAMESPACE::CompactRangeOptions` member is (very probably) 0, casting the handle to ROCKSDB_NAMESPACE::CompactRangeOptions works (i.e. old methods didn’t have to be changed), but really that’s a minefield and the correct answer is to cast to the correct type (Java_org_rocksdb_CompactRangeOptions) and then use the ROCKSDB_NAMESPACE::CompactRangeOptions field in that. So the get/set methods for existing parameters have this change.
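      
      The ownership pattern described above can be sketched without JNI as follows. The `CompactRangeOptions` struct here is a reduced stand-in (it uses a `std::string*` for the timestamp field as a simplification), and `CompactRangeOptionsHolder`, `ToHandle`, and `FromHandle` are hypothetical names; the point is only that the handle is cast back to the wrapper type, which owns the objects the options point to.
      
      ```cpp
      #include <atomic>
      #include <cassert>
      #include <cstdint>
      #include <string>
      #include <utility>
      
      // Reduced stand-in for the native options struct: it only stores pointers.
      struct CompactRangeOptions {
        std::string* full_history_ts_low = nullptr;
        std::atomic<bool>* canceled = nullptr;
      };
      
      // Wrapper that owns the pointed-to objects, so they live as long as the options.
      class CompactRangeOptionsHolder {
       public:
        CompactRangeOptionsHolder() { options_.canceled = &canceled_; }
      
        // options_ points into this object, so copying would leave dangling pointers.
        CompactRangeOptionsHolder(const CompactRangeOptionsHolder&) = delete;
        CompactRangeOptionsHolder& operator=(const CompactRangeOptionsHolder&) = delete;
      
        void SetFullHistoryTsLow(std::string ts) {
          ts_low_ = std::move(ts);
          options_.full_history_ts_low = &ts_low_;
        }
        void SetCanceled(bool value) { canceled_.store(value); }
        const CompactRangeOptions& options() const { return options_; }
      
       private:
        CompactRangeOptions options_;
        std::string ts_low_;
        std::atomic<bool> canceled_{false};
      };
      
      // The Java side keeps an integer handle; always cast it back to the wrapper
      // type instead of relying on the options member being at offset 0.
      inline std::intptr_t ToHandle(CompactRangeOptionsHolder* holder) {
        return reinterpret_cast<std::intptr_t>(holder);
      }
      inline CompactRangeOptionsHolder* FromHandle(std::intptr_t handle) {
        assert(handle != 0);
        return reinterpret_cast<CompactRangeOptionsHolder*>(handle);
      }
      ```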
      
      Testing
      -------
      We added unit tests for getting and setting the newly implemented fields to `CompactRangeOptionsTest`
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10880
      
      Reviewed By: ajkr
      
      Differential Revision: D41482476
      
      Pulled By: anand1976
      
      fbshipit-source-id: c70795e790436fb3544655920adf6fca62ed34e2
      93e0715f
  9. 24 May, 2023 2 commits
  10. 23 May, 2023 6 commits
    • Y
      Fix stress test failure caused by #11424 (#11470) · 68cc429b
      Yu Zhang committed
      Summary:
      The `ryw_expected_values` check applies only when transactions are used.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11470
      
      Reviewed By: hx235
      
      Differential Revision: D46085614
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 4757896c3a62975641adcf97db077a04a0f33030
      68cc429b
    • A
      Fix regression script for async_io benchmarks (#11462) · 53e0b2fe
      akankshamahajan committed
      Summary:
      Fix the regression script for async_io benchmarks to report the correct ops/sec and latency.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11462
      
      Test Plan: Verified locally
      
      Reviewed By: anand1976
      
      Differential Revision: D46031147
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 33ba587e6569ab2f834381ac2538e61da6876405
      53e0b2fe
    • Y
      Add utils to use for handling user defined timestamp size record in WAL (#11451) · 11ebddb1
      Yu Zhang committed
      Summary:
      Add a util method `HandleWriteBatchTimestampSizeDifference` to handle a `WriteBatch` read from the WAL when a user-defined timestamp size record is written and read. Two check modes are added. `kVerifyConsistency` just verifies that the recorded timestamp sizes are consistent with the running ones. This mode is to be used by `db_impl_secondary` for opening a DB as a secondary instance. It will also be used by `db_impl_open` until user comparator switch support is added, which will allow a column family to switch between enabling and disabling the UDT feature. The other mode, `kReconcileInconsistency`, will be used by `db_impl_open` later, when the user comparator can be changed.
      
      Another change is to extract a method `CollectColumnFamilyIdsFromWriteBatch` in db_secondary_impl.h into its standalone util file so it can be shared.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11451
      
      Test Plan:
      ```
      make check
      ./udt_util_test
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D45894386
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: b96790777f154cddab6d45d9ba2e5d20ebc6fe9d
      11ebddb1
    • Y
      Refactor WriteUnpreparedStressTest to be a unit test (#11424) · ffb5f1f4
      Yu Zhang committed
      Summary:
      This patch removes the "stress" aspect from WriteUnpreparedStressTest and leaves it as a unit test for some correctness testing w.r.t. snapshot functionality. I added some read-your-write verification to the transaction test in db_stress.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11424
      
      Test Plan:
      `./write_unprepared_transaction_test`
      `./db_crashtest.py whitebox --txn`
      `./db_crashtest.py blackbox --txn`
      
      Reviewed By: hx235
      
      Differential Revision: D45551521
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 20c3d510eb4255b08ddd7b6c85bdb4945436f6e8
      ffb5f1f4
    • F
      fix typo in detecting HAVE_AUXV_GETAUXVAL (#10913) · 5b945adf
      FishAndBird committed
      Summary:
      crc32 uses the CPU heavily; arm64 and ppc will benefit from crc32 acceleration.
      
      Only builds via `cmake` are affected.
      
      - Arm64: tested OK; with crc32 accelerated, throughput when writing 30GB of data improved by 30%.
      - ppc: not tested.
      - x86_64: seems unaffected.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10913
      
      Reviewed By: pdillinger
      
      Differential Revision: D45959843
      
      Pulled By: ajkr
      
      fbshipit-source-id: 93c91f2702fec33cca69139a2544d7c5ebeac4c6
      5b945adf
    • A
      Repair/instate jemalloc build on M1 (#11257) · 6eb3770b
      Alan Paxton committed
      Summary:
      jemalloc was not building on M1 Macs. This makes it work.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11257
      
      Reviewed By: anand1976
      
      Differential Revision: D45959570
      
      Pulled By: ajkr
      
      fbshipit-source-id: 08c2b81b399f5003a2c159d037f9bcc5d0059556
      6eb3770b
  11. 20 May, 2023 2 commits
    • Y
      Update HISTORY.md/version.h/format compatibility test for 8.3 release (#11464) · 509116c5
      Yu Zhang committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11464
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D46041333
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 7d83cf9e611451fcc7f7e4a837681ed0d4271df4
      509116c5
    • P
      Much better stats for seeks and prefix filtering (#11460) · 39f5846e
      Peter Dillinger committed
      Summary:
      We want to know more about opportunities for better range filters, and the effectiveness of our own range filters. Currently the stats are very limited, essentially logging just hits and misses against prefix filters for range scans in BLOOM_FILTER_PREFIX_* without tracking the false positive rate. Perhaps confusingly, when prefix filters are used for point queries, the stats are currently going into the non-PREFIX tickers.
      
      This change does several things:
      * Introduce new stat tickers for seeks and related filtering, \*LEVEL_SEEK\*
        * Most importantly, allows us to see opportunities for range filtering. Specifically, we can count how many times a seek in an SST file accesses at least one data block, and how many times at least one value() is then accessed. If a data block was accessed but no value(), we can generally assume that the key(s) seen was(were) not of interest so could have been filtered with the right kind of filter, avoiding the data block access.
        * We can get the same level of detail when a filter (for now, prefix Bloom/ribbon) is used, or not. Specifically, we can infer a false positive rate for prefix filters (not available before) from the seek "false positive" rate: when a data block is accessed but no value() is called. (There can be other explanations for a seek false positive, but in typical iterator usage it would indicate a filter false positive.)
        * For efficiency, I wanted to avoid making additional calls to the prefix extractor (or key comparisons, etc.), which would be required if we wanted to more precisely detect filter false positives. I believe that instrumenting value() is the best balance of efficiency vs. accurately measuring what we are often interested in.
        * The stats are divided between last level and non-last levels, to help understand potential tiered storage use cases.
      * The old BLOOM_FILTER_PREFIX_* stats have a different meaning: no longer referring to iterators but to point queries using prefix filters. BLOOM_FILTER_PREFIX_TRUE_POSITIVE is added for computing the prefix false positive rate on point queries, which can be due to filter false positives as well as different keys with the same prefix.
      * Similarly, the non-PREFIX BLOOM_FILTER stats are now for whole key filtering only.
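      
      To make the inference above concrete, here is a small sketch of how a seek "false positive" rate might be computed from such counters. The struct and field names are hypothetical (they are not the ticker names), and the formula, false positives divided by all seeks that turned out not to be useful, is this example's assumption about how to combine the counts.
      
      ```cpp
      #include <cassert>
      #include <cstdint>
      
      // Hypothetical per-level counters, all restricted to seeks where a filter was consulted.
      struct SeekCounters {
        std::uint64_t filtered = 0;     // seeks skipped because the filter said "no"
        std::uint64_t data = 0;         // seeks that accessed at least one data block
        std::uint64_t data_useful = 0;  // of those, seeks where value() was then called
      };
      
      // A seek counts as a false positive when it read a data block but never called value().
      inline std::uint64_t SeekFalsePositives(const SeekCounters& c) {
        assert(c.data_useful <= c.data);
        return c.data - c.data_useful;
      }
      
      // Estimated false positive rate: false positives over all "negative" seeks,
      // i.e. false positives plus seeks correctly filtered out.
      inline double SeekFalsePositiveRate(const SeekCounters& c) {
        const std::uint64_t fp = SeekFalsePositives(c);
        const std::uint64_t negatives = fp + c.filtered;
        return negatives == 0 ? 0.0
                              : static_cast<double>(fp) / static_cast<double>(negatives);
      }
      ```
      
      As the summary notes, a seek false positive can have other explanations than a filter false positive, so this is an upper-bound style estimate under typical iterator usage.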
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11460
      
      Test Plan:
      unit tests updated, including updating many to pop the stat value since last read to improve test
      readability and maintainability.
      
      Performance test shows a consistent small improvement with these changes, both with clang and with gcc. CPU profile indicates that RecordTick is using less CPU, and this makes sense at least for a high filter miss rate. Before, we were recording two ticks per filter miss in iterators (CHECKED & USEFUL) and now recording just one (FILTERED).
      
      Create DB with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
      ```
      And run simultaneous before&after with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -readonly -benchmarks=seekrandom[-X1000] -num=10000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -seek_nexts=1 -duration=20 -seed=43 -threads=8 -cache_size=1000000000 -statistics
      ```
      Before: seekrandom [AVG 275 runs] : 189680 (± 222) ops/sec;   18.4 (± 0.0) MB/sec
      After: seekrandom [AVG 275 runs] : 197110 (± 208) ops/sec;   19.1 (± 0.0) MB/sec
      
      Reviewed By: ajkr
      
      Differential Revision: D46029177
      
      Pulled By: pdillinger
      
      fbshipit-source-id: cdace79a2ea548d46c5900b068c5b7c3a02e5822
      39f5846e
  12. 19 May, 2023 4 commits
    • P
      Compatibility step for separating BlockCache and GeneralCache APIs (#11450) · 4067acab
      Peter Dillinger committed
      Summary:
      Add two type aliases for Cache: BlockCache and GeneralCache, and add LRUCacheOptions::MakeSharedGeneralCache(). This will ease upgrade to an intended future change to separate the cache API between block cache and other (general) uses, including row cache. Separating the APIs will make it easier to expose more details of block caching for customization. For example, it would be nice to pass the file unique ID and offset as the logical cache key instead of using a Slice, which could facilitate some file-specific customizations in block cache. This would also make it clear that HyperClockCache is not usable as a general cache, because it can only deal with fixed-size block cache keys.
      
      block_cache, row_cache, and blob_cache are the uses of Cache in the public API. blob_cache should be able to use BlockCache while row_cache is a GeneralCache user, as its keys are of arbitrary size.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11450
      
      Test Plan: see updated unit test.
      
      Reviewed By: anand1976
      
      Differential Revision: D45882067
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ff5d9f0b644f87ae337a29a7728ce3ed07b2a4b2
      4067acab
    • M
      Support Clip DB to KeyRange (#11379) · 8d8eb0e7
      mayue.fight committed
      Summary:
      This PR is part of the request https://github.com/facebook/rocksdb/issues/11317.
      (Another part is https://github.com/facebook/rocksdb/pull/11378)
      
      ClipDB() clips the entries in the CF to the range [begin_key, end_key). All entries outside this range are completely deleted (including tombstones).
      This feature is mainly used to ensure that there are no overlapping keys when calling CreateColumnFamilyWithImports() to import multiple CFs.
      
      Calling ClipDB on [begin, end) performs the following steps:
      
      1. Quickly and directly delete files that do not overlap the range:
         DeleteFilesInRanges(nullptr, begin) + DeleteFilesInRanges(end, nullptr)
      2. Delete the keys outside the range:
         Delete[smallest_key, begin) + Delete[end, largest_key]
      3. Delete the tombstones through a manual compaction:
         CompactRange(option, nullptr, nullptr)
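      
      The intended end result can be modeled on an ordered in-memory map. This toy sketch only shows the half-open range semantics of [begin, end); it does not involve SST files, tombstones, or compaction, and `ToyDb` and `ClipToRange` are hypothetical names.
      
      ```cpp
      #include <cassert>
      #include <map>
      #include <string>
      
      using ToyDb = std::map<std::string, std::string>;
      
      // Keeps only the keys in [begin, end); everything else is erased.
      void ClipToRange(ToyDb& db, const std::string& begin, const std::string& end) {
        assert(begin <= end);
        // Drop keys strictly below `begin`.
        db.erase(db.begin(), db.lower_bound(begin));
        // Drop keys at or above `end` (the end bound is exclusive).
        db.erase(db.lower_bound(end), db.end());
      }
      ```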
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11379
      
      Reviewed By: ajkr
      
      Differential Revision: D45840358
      
      Pulled By: cbi42
      
      fbshipit-source-id: 54152e8a45fd8ede137f99787eb252f0b51440a4
      8d8eb0e7
    • H
      Improve comment of ExpectedValue in db stress (#11456) · 7263f51d
      Hui Xiao committed
      Summary:
      **Context/Summary:**
      https://github.com/facebook/rocksdb/pull/11424 made me realize there are a couple gaps in my `ExpectedValue` comments so I updated them, along with separating `ExpectedValue` into separate files so it's clearer that `ExpectedValue` can be used without updating `ExpectedState` (e.g, TestMultiGet() where we care about value base of expected value but not updating the ExpectedState).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11456
      
      Test Plan: CI
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D45965070
      
      Pulled By: hx235
      
      fbshipit-source-id: dcee690c13b00a3119757ea9d43b646f9644e1a9
      7263f51d
    • H
      Add `rocksdb.file.read.db.open.micros` (#11455) · 50046869
      Hui Xiao committed
      Summary:
      **Context/Summary:**
      `rocksdb.file.read.db.open.micros` was left out of https://github.com/facebook/rocksdb/pull/11288
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11455
      
      Test Plan:
      - db bench
      Setup: `./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3`
      Run:
      `./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/  -benchmarks=readrandom  -key_size=3200 -value_size=512 -num=0 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none -file_checksum=1 -cache_size=1`
      
      ```
      rocksdb.sst.read.micros P50 : 3.979798 P95 : 9.738420 P99 : 19.566667 P100 : 39.000000 COUNT : 2360 SUM : 12148
      rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
      rocksdb.file.read.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
      rocksdb.file.read.db.open.micros P50 : 3.979798 P95 : 9.738420 P99 : 19.566667 P100 : 39.000000 COUNT : 2360 SUM : 12148
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D45951934
      
      Pulled By: hx235
      
      fbshipit-source-id: 6c88639dc1b10d98ecccc963ce32a8800495f55b
      50046869
  13. 18 May, 2023 3 commits
    • A
      Minimal RocksJava compliance with Java 8 language level (EB 1046) (#10951) · e110d713
      Alan Paxton committed
      Summary:
      Apply a small (and automatic) set of IntelliJ Java inspections/repairs to the Java interface to RocksDB and its tests.
      Partly enabled by the fact that we now (from RocksDB 7) require Java 8.
      
      Explicit <p> in empty lines in javadoc comments.
      
      Parameters and variables made final where possible.
      Anonymous subclasses converted to lambdas.
      
      Some tests which previously used other assertion models were converted to assertj, e.g. `assertThat(actual).isEqualTo(expected)`
      
      In a very few cases tests were found to be inoperative or broken, and were repaired. No problems with actual RocksDB behaviour were observed.
      
      This PR is intended to replace https://github.com/facebook/rocksdb/pull/9618 - that PR was not merged, and attempts to rebase it have yielded a questionable looking diff, so we choose to go back to square 1 here, and implement a conservative set of changes.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10951
      
      Reviewed By: anand1976
      
      Differential Revision: D45057849
      
      Pulled By: ajkr
      
      fbshipit-source-id: e4ea46bfc80518ae86f37702b03ca9352bc11c3d
      e110d713
    • J
      Remove wait_unscheduled from waitForCompact internal API (#11443) · 586d78b3
      Jay Huh committed
      Summary:
      Context:
      
      In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.
      
      In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.
      
      This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.
      
      Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.
      
      Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.
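      
      The return-status rules described above can be summarized as a single decision function. This is a toy model under this example's own assumptions: `BackgroundState`, `WaitStatus`, and `CheckWaitForCompact` are hypothetical names, and the real implementation waits on a condition variable in a loop rather than returning a "keep waiting" value.
      
      ```cpp
      #include <cassert>
      
      enum class WaitStatus { kOk, kShutdownInProgress, kAborted, kKeepWaiting };
      
      // Hypothetical snapshot of the background job state.
      struct BackgroundState {
        bool shutting_down = false;
        bool background_work_paused = false;
        int scheduled_jobs = 0;    // compactions/flushes already scheduled or running
        int unscheduled_jobs = 0;  // queued work not yet scheduled
      };
      
      // One evaluation of the wait condition; a caller would re-check after each wakeup.
      WaitStatus CheckWaitForCompact(const BackgroundState& s, bool abort_on_pause) {
        assert(s.scheduled_jobs >= 0 && s.unscheduled_jobs >= 0);
        if (s.shutting_down) {
          return WaitStatus::kShutdownInProgress;
        }
        // Paused background work may never drain the queue, so optionally bail out.
        if (s.background_work_paused && abort_on_pause) {
          return WaitStatus::kAborted;
        }
        // Unscheduled work counts as outstanding, not only scheduled jobs.
        if (s.scheduled_jobs > 0 || s.unscheduled_jobs > 0) {
          return WaitStatus::kKeepWaiting;
        }
        return WaitStatus::kOk;
      }
      ```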
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443
      
      Test Plan:
      Existing tests that involve a shutdown in progress:
      
      - DBCompactionTest::CompactRangeShutdownWhileDelayed
      - DBTestWithParam::PreShutdownMultipleCompaction
      - DBTestWithParam::PreShutdownCompactionMiddle
      
      Reviewed By: pdillinger
      
      Differential Revision: D45923426
      
      Pulled By: jaykorean
      
      fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
      586d78b3
    • P
      Change internal headers with duplicate names (#11408) · 206fdea3
      Peter Dillinger committed
      Summary:
      In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement.
      
      No public API changes.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408
      
      Test Plan: existing tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D45456696
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373
      206fdea3
  14. 16 May, 2023 1 commit
    • H
      Support parallel read and write/delete to same key in NonBatchedOpsStressTest (#11058) · 5fc57eec
      Hui Xiao committed
      Summary:
      **Context:**
      The current `NonBatchedOpsStressTest` does not allow multi-threaded reads (i.e., Get, Iterator) and writes (i.e., Put, Merge) or deletes to the same key. Every read or write/delete operation acquires a lock (`GetLocksForKeyRange`) on the target key to gain exclusive access to it. This does not align with RocksDB's nature of allowing multi-threaded reads and writes/deletes to the same key; that is, concurrent threads can issue reads/writes/deletes to RocksDB without external locking. This is therefore a gap in our testing coverage.
      
      To close the gap, the biggest challenge is verifying a db value against the expected state in the presence of parallel reads and writes/deletes. The challenge arises because a read/write/delete to the db and the corresponding read/write to the expected state do not happen within one atomic operation. We therefore may not know the exact expected state for a given db read: by the time we read the expected state for that db read, another write to the expected state, made for another db write to the same key, might have changed it.
      
      **Summary:**
      Credit to ajkr for the idea: we now solve this challenge by breaking the 32-bit expected value of a key into different parts that can be read and written in parallel.
      
      Basically, we divide the 32-bit expected value into `value_base` (corresponding to the previous whole 32 bits, but now with some shrinking of the value base range we allow), `pending_write` (i.e., whether there is an ongoing concurrent write), `del_counter` (i.e., the number of times a value has been deleted, analogous to value_base for writes), `pending_delete` (similar to pending_write) and `deleted` (i.e., whether a key is deleted).
      
      Also, we need to use an incremental `value_base` instead of a random value base as before, because we want to control the range of value bases that a correct db read result can fall in when reads and writes run in parallel. That way, we can verify the correctness of a read against the expected state more easily. This comes at the cost of reducing the randomness of the values generated in NonBatchedOpsStressTest, which we are willing to accept.
      
      (For detailed algorithm of how to use these parts to infer expected state of a key, see the PR)
      
      Misc: hide value_base detail from callers of ExpectedState by abstracting related logics into ExpectedValue class
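      
      One way to picture the split of the 32-bit expected value is the bit-packing sketch below. The field widths and positions here are assumptions made for illustration (15 bits of `value_base`, one `pending_write` bit, 14 bits of `del_counter`, one `pending_delete` bit, one `deleted` bit), and `ExpectedValueSketch` is a hypothetical name; see the PR for the actual layout and verification algorithm.
      
      ```cpp
      #include <cassert>
      #include <cstdint>
      
      // Assumed layout, low bits to high bits.
      constexpr std::uint32_t kValueBaseMask = 0x00007fffu;      // bits 0-14
      constexpr std::uint32_t kPendingWriteMask = 1u << 15;      // bit 15
      constexpr std::uint32_t kDelCounterMask = 0x3fff0000u;     // bits 16-29
      constexpr int kDelCounterShift = 16;
      constexpr std::uint32_t kPendingDeleteMask = 1u << 30;     // bit 30
      constexpr std::uint32_t kDeletedMask = 1u << 31;           // bit 31
      
      struct ExpectedValueSketch {
        std::uint32_t raw = 0;
      
        std::uint32_t value_base() const { return raw & kValueBaseMask; }
        std::uint32_t del_counter() const {
          return (raw & kDelCounterMask) >> kDelCounterShift;
        }
        bool pending_write() const { return (raw & kPendingWriteMask) != 0; }
        bool pending_delete() const { return (raw & kPendingDeleteMask) != 0; }
        bool deleted() const { return (raw & kDeletedMask) != 0; }
      
        // Sets or clears one of the single-bit flags without touching other fields.
        void SetFlag(std::uint32_t mask, bool on) {
          assert(mask == kPendingWriteMask || mask == kPendingDeleteMask ||
                 mask == kDeletedMask);
          raw = on ? (raw | mask) : (raw & ~mask);
        }
      
        // Incremental (not random) value base, wrapping within its field.
        void NextValueBase() {
          const std::uint32_t next = (value_base() + 1) & kValueBaseMask;
          raw = (raw & ~kValueBaseMask) | next;
        }
      
        // Deletion counter, also wrapping within its own field.
        void NextDelCounter() {
          const std::uint32_t next =
              ((del_counter() + 1) << kDelCounterShift) & kDelCounterMask;
          raw = (raw & ~kDelCounterMask) | next;
        }
      };
      ```
      
      Because each field occupies its own bits, a reader can observe, for example, that a write is pending and bound the acceptable `value_base` range instead of requiring one exact value.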
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11058
      
      Test Plan:
      - Manual test with a small number of keys (i.e., high chance of parallel reads and writes/deletes to the same key) with equally distributed reads/writes/deletes for 30 min
      ```
      python3 tools/db_crashtest.py --simple {blackbox|whitebox} --sync_fault_injection=1 --skip_verifydb=0 --continuous_verification_interval=1000 --clear_column_family_one_in=0 --max_key=10 --column_families=1 --threads=32 --readpercent=25 --writepercent=25 --nooverwritepercent=0 --iterpercent=25 --verify_iterator_with_expected_state_one_in=1 --num_iterations=5 --delpercent=15 --delrangepercent=10 --range_deletion_width=5 --use_merge={0|1} --use_put_entity_one_in=0 --use_txn=0 --verify_before_write=0 --user_timestamp_size=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --flush_one_in=1000 --get_property_one_in=1000 --ingest_external_file_one_in=100 --backup_one_in=100 --checkpoint_one_in=100 --approximate_size_one_in=0 --acquire_snapshot_one_in=100 --use_multiget=0 --prefixpercent=0 --get_live_files_one_in=1000 --manual_wal_flush_one_in=1000 --pause_background_one_in=1000 --target_file_size_base=524288 --write_buffer_size=524288 --verify_checksum_one_in=1000 --verify_db_one_in=1000
      ```
      - Rehearsal stress test with normal and aggressive parameters to see whether this change can find what the existing stress test can find (i.e., no regression in testing capability)
      - [Ongoing] Try to find new bugs with this change that are not found by the current NonBatchedOpsStressTest, which has no parallel reads and writes/deletes to the same key
      
      Reviewed By: ajkr
      
      Differential Revision: D42257258
      
      Pulled By: hx235
      
      fbshipit-source-id: e6fdc18f1fad3753e5ac91731483a644d9b5b6eb
      5fc57eec