1. Aug 13, 2022 (4 commits)
    • Fix two extra headers (#10525) · bc575c61
      Committed by sdong
      Summary:
      Fix copyright for two more extra headers to make internal tool happy.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10525
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38661390
      
      fbshipit-source-id: ab2d055bfd145dfe82b5bae7a6c25cc338c8de94
    • Add memtable per key-value checksum (#10281) · fd165c86
      Committed by Changyu Bi
      Summary:
      Append a per key-value checksum to the internal key. These checksums are verified on read paths including Get and Iterator, as well as during Flush. Get and Iterator will return `Corruption` status if there is a checksum verification failure. Flush will make the DB read-only upon a memtable entry checksum verification failure.
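
      A minimal usage sketch (the option name `memtable_protection_bytes_per_key` is the one introduced by this PR; the DB path is illustrative):

      ```cpp
      #include <cassert>
      #include <string>

      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // 0 (default) disables protection; 1, 2, 4 or 8 enable a per
        // key-value checksum of that many bytes in the memtable.
        options.memtable_protection_bytes_per_key = 2;

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/kv_checksum_db", &db);
        assert(s.ok());
        s = db->Put(rocksdb::WriteOptions(), "k", "v");
        assert(s.ok());

        std::string value;
        s = db->Get(rocksdb::ReadOptions(), "k", &value);
        if (s.IsCorruption()) {
          // A memtable entry failed checksum verification.
        }
        delete db;
        return 0;
      }
      ```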
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10281
      
      Test Plan:
      - Added new unit test cases: `make check`
      - Benchmark on memtable insert
      ```
      TEST_TMPDIR=/dev/shm/memtable_write ./db_bench -benchmarks=fillseq -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100
      
      # avg over 10 runs
      Baseline: 1166936 ops/sec
      memtable 2 bytes kv checksum : 1.11674e+06 ops/sec (-4%)
      memtable 2 bytes kv checksum + write batch 8 bytes kv checksum: 1.08579e+06 ops/sec (-6.95%)
      write batch 8 bytes kv checksum: 1.17979e+06 ops/sec (+1.1%)
      ```
      -  Benchmark on memtable read only: ops/sec dropped 31% for `readseq` due to time spent verifying checksums.
      ops/sec for `readrandom` dropped ~6.8%.
      ```
      # Readseq
      sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillseq,readseq"[-X20]" -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100
      
      readseq [AVG    20 runs] : 7432840 (± 212005) ops/sec;  822.3 (± 23.5) MB/sec
      readseq [MEDIAN 20 runs] : 7573878 ops/sec;  837.9 MB/sec
      
      With -memtable_protection_bytes_per_key=2:
      
      readseq [AVG    20 runs] : 5134607 (± 119596) ops/sec;  568.0 (± 13.2) MB/sec
      readseq [MEDIAN 20 runs] : 5232946 ops/sec;  578.9 MB/sec
      
      # Readrandom
      sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillrandom,readrandom"[-X10]" -disable_wal=true -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=100
      readrandom [AVG    10 runs] : 140236 (± 3938) ops/sec;    9.8 (± 0.3) MB/sec
      readrandom [MEDIAN 10 runs] : 140545 ops/sec;    9.8 MB/sec
      
      With -memtable_protection_bytes_per_key=2:
      readrandom [AVG    10 runs] : 130632 (± 2738) ops/sec;    9.1 (± 0.2) MB/sec
      readrandom [MEDIAN 10 runs] : 130341 ops/sec;    9.1 MB/sec
      ```
      
      - Stress test: `python3 -u tools/db_crashtest.py whitebox --duration=1800`
      
      Reviewed By: ajkr
      
      Differential Revision: D37607896
      
      Pulled By: cbi42
      
      fbshipit-source-id: fdaefb475629d2471780d4a5f5bf81b44ee56113
    • Derive cache keys from SST unique IDs (#10394) · 86a1e3e0
      Committed by Peter Dillinger
      Summary:
      ... so that cache keys can be derived from DB manifest data
      before reading the file from storage, letting every part of the file
      potentially go in a persistent cache.
      
      See updated comments in cache_key.cc for technical details. Importantly,
      the new cache key encoding uses some fancy but efficient math to pack
      data into the cache key without depending on the sizes of the various
      pieces. This simplifies some existing code creating cache keys, like
      cache warming before the file size is known.
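
      As a rough illustration of the packing idea (this is a generic mixed-radix sketch, not the actual math in cache_key.cc; see that file's comments for the real scheme):

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Reversibly pack two values into one integer when only an upper bound
      // on `a` is known: unique as long as b * a_limit + a fits in 64 bits.
      uint64_t Pack(uint64_t a, uint64_t a_limit, uint64_t b) {
        assert(a < a_limit);
        return b * a_limit + a;
      }

      void Unpack(uint64_t code, uint64_t a_limit, uint64_t* a, uint64_t* b) {
        *a = code % a_limit;
        *b = code / a_limit;
      }
      ```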
      
      This should provide us an essentially permanent mapping between SST
      unique IDs and base cache keys, with the ability to "upgrade" SST
      unique IDs (and thus cache keys) with new SST format_versions.
      
      These cache keys are of similar, perhaps indistinguishable quality to
      the previous generation. Before this change (see "corrected" days
      between collision):
      
      ```
      ./cache_bench -stress_cache_key -sck_keep_bits=43
      18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected)
      ```
      
      After this change (keep 43 bits, up through 50, to validate "trajectory"
      is ok on "corrected" days between collision):
      ```
      19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected)
      16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected)
      15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected)
      15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected)
      15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected)
      15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected)
      15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected)
      15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected)
      ```
      
      However, the change does modify (probably weaken) the "guaranteed unique" promise from this
      
      > SST files generated in a single process are guaranteed to have unique cache keys, unless/until number of session ids * max file number = 2**86
      
      to this (see https://github.com/facebook/rocksdb/issues/10388)
      
      > With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits
      
      I don't think this is a practical concern, though.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394
      
      Test Plan: unit tests updated, see simulation results above
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38667529
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba
    • LOG more info on oldest snapshot and sequence numbers (#10454) · 9fa5c146
      Committed by Peter Dillinger
      Summary:
      The info LOG file does not currently give any direct
      information about the existence of old, live snapshots, nor how to
      estimate wall time from a sequence number within the scope of LOG
      history. This change addresses both with:
      * Logging smallest and largest seqnos for generated SST files, which can
      help associate sequence numbers with write time (based on flushes).
      * Logging oldest_snapshot_seqno for each compaction, which (along with
      that seqno info) helps us determine how much old data might be kept
      around for old (leaked?) snapshots. I thought including the date here
      might be excessive.
      
      I wanted to log the date and seqno of the oldest snapshot with periodic
      stats, but the current structure of the code doesn't really support that
      because `DumpDBStats` doesn't have access to the DB object.
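
      For reference, a small sketch using the existing public API (not part of this change) to correlate the logged seqnos with an application's own snapshots:

      ```cpp
      #include "rocksdb/db.h"

      void InspectSnapshotAge(rocksdb::DB* db) {
        const rocksdb::Snapshot* snap = db->GetSnapshot();
        // Seqno captured by the snapshot; comparable to the smallest/largest
        // seqnos now logged for each generated SST file.
        rocksdb::SequenceNumber snap_seq = snap->GetSequenceNumber();
        rocksdb::SequenceNumber latest = db->GetLatestSequenceNumber();
        // A large (latest - snap_seq) gap suggests an old, possibly leaked,
        // snapshot holding back garbage collection.
        (void)snap_seq;
        (void)latest;
        db->ReleaseSnapshot(snap);
      }
      ```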
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10454
      
      Test Plan:
      manually inspect the LOG from
      `KEEP_DB=1 ./db_basic_test --gtest_filter=*CompactBetweenSnapshots*`
      
      Reviewed By: ajkr
      
      Differential Revision: D38326948
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 294918ffc04a419844146cd826045321b4d5c038
  2. Aug 12, 2022 (5 commits)
  3. Aug 11, 2022 (3 commits)
    • Migrate to docker for CI run (#10496) · 5d3aefb6
      Committed by Jay Zhuang
      Summary:
      Moved Linux builds to Docker to avoid the CI instability caused by dependency installation sites going down.
      Added the `Dockerfile` which is used to build the image.
      The build time is also significantly reduced, because no dependency installation is needed and 2xlarge+ instances are used for slow builds (like the tsan test).
      Also fixed a few issues detected while building this (a minimal sketch of the first fix follows the list):
      * `DestroyDB()` Status not checked for a few tests
      * nullptr might be used in `inlineskiplist.cc`
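
      A minimal sketch of the first fix, assuming a typical test cleanup path:

      ```cpp
      #include <cassert>
      #include <string>

      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      void CleanupTestDB(const std::string& path) {
        rocksdb::Options options;
        // Previously the returned Status was silently dropped.
        rocksdb::Status s = rocksdb::DestroyDB(path, options);
        assert(s.ok());
      }
      ```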
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10496
      
      Test Plan: CI
      
      Reviewed By: ajkr
      
      Differential Revision: D38554200
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 16e8fb2bf07b9c84bb27fb18421c4d54f2f248fd
    • Enable ClockCache in DB block cache test (#10482) · a0798f6f
      Committed by Guido Tagliavini Ponce
      Summary:
      A test in db_block_cache_test.cc was skipping ClockCache due to the 16-byte key length requirement. We fixed this. Along the way, we fixed a bug in ApplyToSomeEntries, which assumed the function being applied could modify handle metadata, and thus took an exclusive reference. This is incompatible with calls that need to inspect every element (including externally referenced ones) to gather stats.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10482
      
      Test Plan: ``make -j24 check``
      
      Reviewed By: anand1976
      
      Differential Revision: D38553073
      
      Pulled By: guidotag
      
      fbshipit-source-id: 0ed63fed4d3b89e5056b35b7091fce579f5647ae
    • WritableFileWriter tries to skip operations after failure (#10489) · 911c0208
      Committed by sdong
      Summary:
      A flag is introduced in WritableFileWriter to remember that an error has happened. Subsequent operations will fail with an assertion. Those operations, except Close(), are not supposed to be called anyway. This change will help catch bugs in tests and stress tests, and limit the damage of a potential bug that keeps writing to a file after a failure.
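
      A rough sketch of the described pattern (illustrative class and method names, not the actual WritableFileWriter code):

      ```cpp
      #include <cassert>

      #include "rocksdb/slice.h"
      #include "rocksdb/status.h"

      class FailureRememberingWriter {
       public:
        rocksdb::Status Append(const rocksdb::Slice& data) {
          // Assert so tests and stress tests catch callers that keep writing
          // after a failure.
          assert(!seen_error_);
          rocksdb::Status s = AppendImpl(data);
          if (!s.ok()) {
            seen_error_ = true;
          }
          return s;
        }

        // Close() remains callable after an error, per the commit message.
        rocksdb::Status Close() { return CloseImpl(); }

       private:
        rocksdb::Status AppendImpl(const rocksdb::Slice& data);  // real I/O
        rocksdb::Status CloseImpl();

        bool seen_error_ = false;
      };
      ```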
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10489
      
      Test Plan: Fix existing unit tests and watch crash tests for a while.
      
      Reviewed By: anand1976
      
      Differential Revision: D38473277
      
      fbshipit-source-id: 09aafb971e56cfd7f9ef92ad15b883f54acf1366
  4. Aug 10, 2022 (5 commits)
    • Revert "Add CompressedSecondaryCache into stress test" #10442 (#10509) · b57155a0
      Committed by gitbw95
      Summary:
      Revert https://github.com/facebook/rocksdb/pull/10442 until I find the root cause of, and fix, the memory leak in db_stress tests caused by `FaultInjectionSecondaryCache`.

      A memory leak is reported during crash tests; one example is shown below:
      ```
      ==70722==ERROR: LeakSanitizer: detected memory leaks
      
      Direct leak of 6648240 byte(s) in 83103 object(s) allocated from:
          #0 0x13de9d7 in operator new(unsigned long) (/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/buck-out/dbgo/gen/aab7ed39/internal_repo_rocksdb/repo/db_stress+0x13de9d7)
          #1 0x9084c7 in rocksdb::BlocklikeTraits<rocksdb::Block>::Create(rocksdb::BlockContents&&, unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*) internal_repo_rocksdb/repo/table/block_based/block_like_traits.h:128
          #2 0x9084c7 in std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)::operator()(void const*, unsigned long, void**, unsigned long*) const internal_repo_rocksdb/repo/table/block_based/block_like_traits.h:34
          #3 0x9082c9 in rocksdb::Block std::__invoke_impl<rocksdb::Status, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>(std::__invoke_other, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:61
          #4 0x90825d in std::enable_if<is_invocable_r_v<rocksdb::Block, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>, rocksdb::Block>::type std::__invoke_r<rocksdb::Status, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>(std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:114
          #5 0x9081b0 in std::_Function_handler<rocksdb::Status (void const*, unsigned long, void**, unsigned long*), std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)>::_M_invoke(std::_Any_data const&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_function.h:291
          #6 0x991f2c in std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)>::operator()(void const*, unsigned long, void**, unsigned long*) const third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_function.h:560
          #7 0x990277 in rocksdb::CompressedSecondaryCache::Lookup(rocksdb::Slice const&, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, bool, bool&) internal_repo_rocksdb/repo/cache/compressed_secondary_cache.cc:77
          #8 0xd3aa4d in rocksdb::FaultInjectionSecondaryCache::Lookup(rocksdb::Slice const&, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, bool, bool&) internal_repo_rocksdb/repo/utilities/fault_injection_secondary_cache.cc:92
          #9 0xeadaab in rocksdb::lru_cache::LRUCacheShard::Lookup(rocksdb::Slice const&, unsigned int, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority, bool, rocksdb::Statistics*) internal_repo_rocksdb/repo/cache/lru_cache.cc:445
          #10 0x1064573 in rocksdb::ShardedCache::Lookup(rocksdb::Slice const&, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority, bool, rocksdb::Statistics*) internal_repo_rocksdb/repo/cache/sharded_cache.cc:89
          #11 0x8be0df in rocksdb::BlockBasedTable::GetEntryFromCache(rocksdb::CacheTier const&, rocksdb::Cache*, rocksdb::Slice const&, rocksdb::BlockType, bool, rocksdb::GetContext*, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:389
          #12 0x905790 in rocksdb::Status rocksdb::BlockBasedTable::GetDataBlockFromCache<rocksdb::Block>(rocksdb::Slice const&, rocksdb::Cache*, rocksdb::Cache*, rocksdb::ReadOptions const&, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::UncompressionDict const&, rocksdb::BlockType, bool, rocksdb::GetContext*) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1263
          #13 0x8b9259 in rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, bool, bool, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*, bool) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1559
          #14 0x8b710c in rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool, bool, bool) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1726
          #15 0x8c329f in rocksdb::DataBlockIter* rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>(rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::DataBlockIter*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::FilePrefetchBuffer*, bool, bool, rocksdb::Status&) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader_impl.h:58
          #16 0x920117 in rocksdb::BlockBasedTableIterator::InitDataBlock() internal_repo_rocksdb/repo/table/block_based/block_based_table_iterator.cc:262
          #17 0x920d42 in rocksdb::BlockBasedTableIterator::MaterializeCurrentBlock() internal_repo_rocksdb/repo/table/block_based/block_based_table_iterator.cc:332
          #18 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78
          #19 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78
          #20 0xef9f6c in rocksdb::MergingIterator::PrepareValue() internal_repo_rocksdb/repo/table/merging_iterator.cc:260
          #21 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78
          #22 0xc67bcd in rocksdb::DBIter::FindNextUserEntryInternal(bool, rocksdb::Slice const*) internal_repo_rocksdb/repo/db/db_iter.cc:326
          #23 0xc66d36 in rocksdb::DBIter::FindNextUserEntry(bool, rocksdb::Slice const*) internal_repo_rocksdb/repo/db/db_iter.cc:234
          #24 0xc7ab47 in rocksdb::DBIter::Next() internal_repo_rocksdb/repo/db/db_iter.cc:161
          #25 0x70d938 in rocksdb::BatchedOpsStressTest::TestPrefixScan(rocksdb::ThreadState*, rocksdb::ReadOptions const&, std::vector<int, std::allocator<int> > const&, std::vector<long, std::allocator<long> > const&) internal_repo_rocksdb/repo/db_stress_tool/batched_ops_stress.cc:320
          #26 0x6dc6a8 in rocksdb::StressTest::OperateDb(rocksdb::ThreadState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:907
          #27 0x6867de in rocksdb::ThreadBody(void*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_driver.cc:33
          #28 0xce4cc2 in rocksdb::(anonymous namespace)::StartThreadWrapper(void*) internal_repo_rocksdb/repo/env/env_posix.cc:461
          #29 0x7f23f9068c0e in start_thread /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/nptl/pthread_create.c:434:8
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10509
      
      Test Plan:
      ```
      $COMPILE_WITH_ASAN=1  make -j 24
      $db_stress J=40 crash_test_with_txn
      ```
      
      Reviewed By: siying
      
      Differential Revision: D38540648
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 703948e3a7ba40828a6445d00f3e73c184e34bf7
    • Include minimal contextual information in `CompactionIterator` (#10505) · fee2c472
      Committed by Yanqin Jin
      Summary:
      The main purpose is to make debugging easier without sacrificing performance.
      
      Instead of using a boolean variable for `CompactionIterator::valid_`, we can extend it to a `uint8_t`,
      using the LSB to denote if the compaction iterator is valid and 4 additional bits to denote where
      the iterator is set valid inside `NextFromInput()`. Therefore, when the control flow reaches
      `PrepareOutput()` and hits the assertion there, we can have a better idea of what has gone wrong.
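
      A distilled sketch of the bit layout (member and method names are illustrative, not the actual `CompactionIterator` code):

      ```cpp
      #include <cstdint>

      class IterValidity {
       public:
        // context is 0..15: identifies which call site marked the iterator valid.
        void SetValid(uint8_t context) {
          valid_ = static_cast<uint8_t>((context << 1) | 1);
        }
        void Invalidate() { valid_ = 0; }
        bool Valid() const { return (valid_ & 1) != 0; }
        // Recover the call site that set validity, useful in assertion messages.
        uint8_t Context() const { return valid_ >> 1; }

       private:
        uint8_t valid_ = 0;  // LSB = validity, next 4 bits = context
      };
      ```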
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10505
      
      Test Plan:
      make check
      ```
      TEST_TMPDIR=/dev/shm/rocksdb time ./db_bench -compression_type=none -write_buffer_size=1073741824 -benchmarks=fillseq,flush
      ```
      The above command has a 'flush' benchmark which uses `CompactionIterator`. I haven't observed any CPU regression, drop in throughput, or latency increase.
      
      Reviewed By: ltamasi
      
      Differential Revision: D38551615
      
      Pulled By: riversand963
      
      fbshipit-source-id: 1250848fc118bb753d71fa9ff8ba840df999f5e0
    • Fix the segfault bug in CompressedSecondaryCache and its tests (#10507) · f060b47e
      Committed by gitbw95
      Summary:
      This fix replaces `AllocateBlock()` with `new`. Once I figure out why `AllocateBlock()` might cause the segfault, I will update the implementation.

      Fix the bug that causes ./compressed_secondary_cache_test to output the following test failures:
      
      ```
      Note: Google Test filter = CompressedSecondaryCacheTest.MergeChunksIntoValueTest
      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from CompressedSecondaryCacheTest
      [ RUN      ] CompressedSecondaryCacheTest.MergeChunksIntoValueTest
      [       OK ] CompressedSecondaryCacheTest.MergeChunksIntoValueTest (1 ms)
      [----------] 1 test from CompressedSecondaryCacheTest (1 ms total)
      
      [----------] Global test environment tear-down
      [==========] 1 test from 1 test case ran. (9 ms total)
      [  PASSED  ] 1 test.
      t/run-compressed_secondary_cache_test-CompressedSecondaryCacheTest.MergeChunksIntoValueTest: line 4: 1091086 Segmentation fault      (core dumped) TEST_TMPDIR=$d ./compressed_secondary_cache_test --gtest_filter=CompressedSecondaryCacheTest.MergeChunksIntoValueTest
      Note: Google Test filter = CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression
      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from CompressedSecondaryCacheTest
      [ RUN      ] CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression
      [       OK ] CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression (1 ms)
      [----------] 1 test from CompressedSecondaryCacheTest (1 ms total)
      
      [----------] Global test environment tear-down
      [==========] 1 test from 1 test case ran. (2 ms total)
      [  PASSED  ] 1 test.
      t/run-compressed_secondary_cache_test-CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression: line 4: 1090883 Segmentation fault      (core dumped) TEST_TMPDIR=$d ./compressed_secondary_cache_test --gtest_filter=CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression
      
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10507
      
      Test Plan:
      Test 1:
      ```
      $make -j 24
      $./compressed_secondary_cache_test
      ```
      Test 2:
      ```
      $COMPILE_WITH_ASAN=1  make -j 24
      $./compressed_secondary_cache_test
      ```
      Test 3:
      ```
      $COMPILE_WITH_TSAN=1 make -j 24
      $./compressed_secondary_cache_test
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D38529885
      
      Pulled By: gitbw95
      
      fbshipit-source-id: d903fa3fadbd4d29f9528728c63a4f61c4396890
    • Fix MultiGet range deletion handling and a memory leak (#10513) · 0b02960d
      Committed by anand76
      Summary:
      This PR fixes 2 bugs introduced in https://github.com/facebook/rocksdb/issues/10432 -
      1. If the bloom filter returned a negative result for all MultiGet keys in a file, the range tombstones in that file were being ignored, resulting in incorrect results if those tombstones covered a key in a higher level.
      2. If all the keys in a file were filtered out in `TableCache::MultiGetFilter`, the table cache handle was not being released.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10513
      
      Test Plan: Add a new unit test that fails without this fix
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38548739
      
      Pulled By: anand1976
      
      fbshipit-source-id: a741a1e25d2e991d63f038100f126c2dc404a87c
    • Reset blob value as soon as it's not needed in DBIter (#10490) · 06b04127
      Committed by Levi Tamasi
      Summary:
      We have recently added caching support to BlobDB, and separately,
      implemented an optimization where reading blobs from the cache
      results in the cache handle being transferred to the target `PinnableSlice`
      (as opposed to the contents getting copied). With these changes,
      it makes sense to reset the `PinnableSlice` storing the blob value in
      `DBIter` as soon as we move to a different iterator position to prevent
      us from holding on to the cache handle any longer than necessary.
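
      At the user level, the same principle looks like this (standard public API; the early `Reset()` is the point being illustrated):

      ```cpp
      #include "rocksdb/db.h"

      void ReadThenRelease(rocksdb::DB* db, const rocksdb::Slice& key) {
        rocksdb::PinnableSlice value;
        rocksdb::Status s = db->Get(rocksdb::ReadOptions(),
                                    db->DefaultColumnFamily(), key, &value);
        if (s.ok()) {
          // ... consume value ...
        }
        // Release the pinned resource (e.g., a transferred cache handle) as
        // soon as the value is no longer needed.
        value.Reset();
      }
      ```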
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10490
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38473630
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 84c045ffac76436c6152fd0f5775b007f4051386
  5. Aug 09, 2022 (7 commits)
  6. Aug 08, 2022 (1 commit)
  7. Aug 06, 2022 (6 commits)
    • Remove local static string (#8103) · e446bc65
      Committed by Burton Li
      Summary:
      A local static string is not friendly to a jemalloc arena-aware implementation: it is allocated on the arena of the first caller, which causes a crash if that arena is refunded before the string is destroyed.

      P.S. In a jemalloc arena-aware implementation, each RocksDB instance uses only certain jemalloc arenas, and an arena is refunded after the associated DB instance is destroyed.
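
      A distilled sketch of the hazard (illustrative names):

      ```cpp
      #include <string>

      // The std::string's heap buffer is allocated from whichever arena
      // serves the first caller; if that arena is refunded along with its DB
      // instance, later callers touch freed memory.
      const std::string& BadKindName() {
        static std::string kName = "compaction";
        return kName;
      }

      // Safer: a string literal involves no arena allocation at all.
      const char* GoodKindName() { return "compaction"; }
      ```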
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8103
      
      Reviewed By: ajkr
      
      Differential Revision: D38477235
      
      Pulled By: ltamasi
      
      fbshipit-source-id: a58d32cb647ed64c144b4736fb2d5db27c2c28f9
    • Close the Logger before rolling to next one in AutoRollLogger (#10488) · ce370d6b
      Committed by Akanksha Mahajan
      Summary:
      Close the existing logger first to release the existing
      handle before renaming the file using the file system.
      Since `AutoRollLogger::Flush` pins down `logger_`, `logger_` can't be closed unless it is
      the last reference; otherwise, Flush segfaults on a file
      that has already been closed.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10488
      
      Test Plan: CircleCI jobs
      
      Reviewed By: ajkr
      
      Differential Revision: D38469249
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: dfbdb89b4ac37639aefcc503526f24753445fd3f
    • Include some legal contents in website (#10491) · 2259bb9c
      Committed by sdong
      Summary:
      We were asked to include the TOS, Privacy Policy and copyright on the website, so they are added here.
      Also changed the GitHub and Twitter links to RocksDB's rather than Facebook Open Source's, and linked to Meta Open Source's home page.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10491
      
      Test Plan: Test the website locally.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38475212
      
      fbshipit-source-id: f73622f8f3d361b4586221ffb6deac4f4a11bb15
    • Re-enable SuggestCompactRangeTest and add Universal Compaction test (#10473) · edae671c
      Committed by Jay Zhuang
      Summary:
      The feature `SuggestCompactRange()` is still experimental. Just
      re-add the test.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10473
      
      Test Plan: CI
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38427153
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 0b4491c947cbce6c18ff147b167e3c678633129a
    • Deflake ChargeFileMetadataTestWithParam/ChargeFileMetadataTestWithParam.Basic/0 (#10481) · 56dbcb4f
      Committed by Hui Xiao
      Summary:
      **Context/summary:**
      `ChargeFileMetadataTestWithParam/ChargeFileMetadataTestWithParam.Basic/0` relies on `DBImpl::BackgroundCallCompaction:PurgedObsoleteFiles` happening before verifying `EXPECT_EQ(file_metadata_charge_only_cache->GetCacheCharge(), 1 * CacheReservationManagerImpl<CacheEntryRole::kFileMetadata>::GetDummyEntrySize());` or `EXPECT_EQ(file_metadata_charge_only_cache->GetCacheCharge(), 0);`, to ensure the appropriate cache reservation release has happened before checking.

      However, this might not be the case under some timing delay and spurious wake-up, as coerced below.
      
      ```
      diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
      index 4378f3212..3e4f60853 100644
      --- a/db/db_impl/db_impl_compaction_flush.cc
      +++ b/db/db_impl/db_impl_compaction_flush.cc
      @@ -2989,6 +2989,8 @@ void DBImpl::BackgroundCallCompaction(PrepickedCompaction* prepicked_compaction,
           if (job_context.HaveSomethingToClean() ||
               job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
             mutex_.Unlock();
      +      bg_cv_.SignalAll();
      +      usleep(1000);
               // Have to flush the info logs before bg_compaction_scheduled_--
              // because if bg_flush_scheduled_ becomes 0 and the lock is
              // released, the deconstructor of DB can kick in and destroy all the
              // states of DB so info_log might not be available after that point.
              // It also applies to access other states that DB owns.
              log_buffer.FlushBufferToLog();
              if (job_context.HaveSomethingToDelete()) {
                PurgeObsoleteFiles(job_context);
                TEST_SYNC_POINT("DBImpl::BackgroundCallCompaction:PurgedObsoleteFiles");
              }
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10481
      
      Test Plan:
      The test of interest often failed under the above coercion.

      After the fix, the test of interest passed under the above coercion.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38438256
      
      Pulled By: hx235
      
      fbshipit-source-id: de80ecdb250174f00e7c2f5e4d952695ed56f51e
    • Fragment memtable range tombstone in the write path (#10380) · 9d77bf8f
      Committed by Changyu Bi
      Summary:
      - Right now each read fragments the memtable range tombstones (https://github.com/facebook/rocksdb/issues/4808). This PR explores the idea of fragmenting memtable range tombstones in the write path, so that reads can use the cached fragmented tombstones without any fragmenting cost. This PR only does the caching for immutable memtables, right before a memtable is added to the immutable memtable list. The fragmentation is done without holding the mutex to minimize its performance impact. (A usage sketch follows below.)
      - db_bench is updated to print out the number of range deletions executed, if any.
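
      A usage sketch of the affected code path (standard public API; keys are illustrative):

      ```cpp
      #include <cassert>
      #include <memory>

      #include "rocksdb/db.h"

      void WriteAndReadRangeTombstones(rocksdb::DB* db) {
        // Each DeleteRange puts a range tombstone into the active memtable.
        // With this change, the tombstones are fragmented once, when the
        // memtable becomes immutable, instead of once per read.
        rocksdb::Status s =
            db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                            "key000", "key100");
        assert(s.ok());

        std::unique_ptr<rocksdb::Iterator> it(
            db->NewIterator(rocksdb::ReadOptions()));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // Keys covered by ["key000", "key100") are skipped here.
        }
      }
      ```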
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380
      
      Test Plan:
      - CI; added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
      - Benchmark: as this PR only optimizes the immutable memtable path, the number of writes in the benchmark is chosen such that an immutable memtable is created and range tombstones are in that memtable.
      
      ```
      single thread:
      ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100
      
      multi_thread
      ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
      ```
      Commit 99cdf16464a057ca44de2f747541dedf651bae9e is included in the benchmark results. It was an earlier attempt where tombstones are fragmented for each write operation; reader threads share them using a shared_ptr, which slows down multi-threaded read performance, as seen in the benchmark results.
      Results are averaged over 5 runs.
      
      Single thread result:
      | Max # tombstones  | main fillrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR | main readrandom micros/op |  99cdf16464a057ca44de2f747541dedf651bae9e | Post PR |
      | ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
      | 0    |6.68     |6.57     |6.72     |4.72     |4.79     |4.54     |
      | 1    |6.67     |6.58     |6.62     |5.41     |4.74     |4.72     |
      | 10   |6.59     |6.5      |6.56     |7.83     |4.69     |4.59     |
      | 100  |6.62     |6.75     |6.58     |29.57    |5.04     |5.09     |
      | 1000 |6.54     |6.82     |6.61     |320.33   |5.22     |5.21     |
      
      32-thread result: note that "Max # tombstones" is per thread.
      | Max # tombstones  | main fillrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR | main readrandom micros/op |  99cdf16464a057ca44de2f747541dedf651bae9e | Post PR |
      | ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
      | 0    |234.52   |260.25   |239.42   |5.06     |5.38     |5.09     |
      | 1    |236.46   |262.0    |231.1    |19.57    |22.14    |5.45     |
      | 10   |236.95   |263.84   |251.49   |151.73   |21.61    |5.73     |
      | 100  |268.16   |296.8    |280.13   |2308.52  |22.27    |6.57     |
      
      Reviewed By: ajkr
      
      Differential Revision: D37916564
      
      Pulled By: cbi42
      
      fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
  8. Aug 05, 2022 (3 commits)
    • Fix data race reported on SetIsInSecondaryCache in LRUCache (#10472) · f28d0c20
      Committed by Bo Wang
      Summary:
      Currently, `SetIsInSecondaryCache` is called after `Promote`. After `Promote`, a handle can already be accessed and have its flags set concurrently, which causes a data race.
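
      A generic sketch of the fix pattern (illustrative types, not the LRUCache code): fully initialize a handle's flags before publishing it to other threads.

      ```cpp
      #include <atomic>

      struct Handle {
        bool in_secondary_cache = false;  // written only before publication
      };

      std::atomic<Handle*> g_published{nullptr};

      void PrepareAndPublish(Handle* h, bool in_secondary) {
        h->in_secondary_cache = in_secondary;             // set flags first...
        g_published.store(h, std::memory_order_release);  // ...then publish
      }
      ```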
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10472
      
      Test Plan:
      unit tests
      stress tests
      
      Reviewed By: pdillinger
      
      Differential Revision: D38403991
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 0aaa2d2edeaf5bc799fcce605648fe49eb7119c2
    • Break TableReader MultiGet into filter and lookup stages (#10432) · bf4532eb
      Committed by anand76
      Summary:
      This PR is the first step in enhancing the coroutines MultiGet to be able to lookup a batch in parallel across levels. By having a separate TableReader function for probing the bloom filters, we can quickly figure out which overlapping keys from a batch are definitely not in the file and can move on to the next level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10432
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38245910
      
      Pulled By: anand1976
      
      fbshipit-source-id: 3d20db2350378c3fe6f086f0c7ba5ff01d7f04de
    • Deflake DBWALTest.RaceInstallFlushResultsWithWalObsoletion (#10456) · 538df26f
      Committed by Yanqin Jin
      Summary:
      Existing DBWALTest.RaceInstallFlushResultsWithWalObsoletion test relies
      on a specific interleaving of two background flush threads. We call them
      bg1 and bg2, and assume bg1 starts to install flush results ahead of
      bg2. After bg1 enters `ProcessManifestWrites`, bg1 waits for bg2 to also
      enter `MemTableList::TryInstallMemtableFlushResults()` before bg1 can
      proceed with MANIFEST write. However, if bg2 called `SyncClosedLogs()`
      and needed to commit to the MANIFEST but falls behind bg1, then bg2
      needs to wait for bg1 to finish writing to MANIFEST. This is a circular
      dependency.
      
      Fix this by allowing bg2 to start only after bg1 grabs the chance to
      sync the WAL and commit to MANIFEST.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10456
      
      Test Plan:
      1. make check
      
      2. export TEST_TMPDIR=/dev/shm && gtest-parallel -r 1000 -w 32 ./db_wal_test --gtest_filter=DBWALTest.RaceInstallFlushResultsWithWalObsoletion
      
      Reviewed By: ltamasi
      
      Differential Revision: D38391856
      
      Pulled By: riversand963
      
      fbshipit-source-id: 55f647d5b94e534c008a4dd2fb082675ddf58c96
  9. Aug 04, 2022 (2 commits)
    • Avoid allocations/copies for large `GetMergeOperands()` results (#10458) · 504fe4de
      Committed by Andrew Kryczka
      Summary:
      This PR avoids allocations and copies for the result of `GetMergeOperands()` when the average operand size is at least 256 bytes and the total operands size is at least 32KB. The `GetMergeOperands()` already included `PinnableSlice` but was calling `PinSelf()` (i.e., allocating and copying) for each operand. When this optimization takes effect, we instead call `PinSlice()` to skip that allocation and copy. Resources are pinned in order for the `PinnableSlice` to point to valid memory even after `GetMergeOperands()` returns.
      
      The pinned resources include a referenced `SuperVersion`, a `MergingContext`, and a `PinnedIteratorsManager`. They are bundled into a `GetMergeOperandsState`. We use `SharedCleanablePtr` to share that bundle among all `PinnableSlice`s populated by `GetMergeOperands()`. That way, the last `PinnableSlice` to be `Reset()` will cleanup the bundle, including unreferencing the `SuperVersion`.
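
      A usage sketch of the public API whose results are now pinned rather than copied (the operand count here is arbitrary):

      ```cpp
      #include <vector>

      #include "rocksdb/db.h"

      void ReadOperands(rocksdb::DB* db, const rocksdb::Slice& key) {
        const int kMaxOperands = 100;
        std::vector<rocksdb::PinnableSlice> operands(kMaxOperands);
        rocksdb::GetMergeOperandsOptions opts;
        opts.expected_max_number_of_operands = kMaxOperands;
        int num_operands = 0;
        rocksdb::Status s = db->GetMergeOperands(
            rocksdb::ReadOptions(), db->DefaultColumnFamily(), key,
            operands.data(), &opts, &num_operands);
        if (s.ok()) {
          // Each PinnableSlice may now point into pinned shared state; the
          // last one Reset() or destroyed releases the bundle (SuperVersion,
          // MergingContext, PinnedIteratorsManager).
          for (int i = 0; i < num_operands; ++i) {
            // ... use operands[i] ...
          }
        }
      }
      ```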
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10458
      
      Test Plan:
      - new DB level test
      - measured benefit/regression in a number of memtable scenarios
      
      Setup command:
      ```
      $ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -num=$num -writes=16384 -key_size=16 -value_size=$value_sz -compression_type=none -write_buffer_size=1048576000
      ```
      
      Benchmark command:
      ```
      ./db_bench -threads=$threads -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=1048576000 -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -num=$num -duration=10
      ```
      
      Worst regression is when a key has many tiny operands:
      
      - Parameters: num=1 (implying 16384 operands per key), value_sz=8, threads=1
      - `GetMergeOperands()` latency increases 682 micros -> 800 micros (+17%)
      
      The regression disappears into the noise (<1% difference) if we remove the `Reset()` loop and the size counting loop. The former is arguably needed regardless of this PR as the convention in `Get()` and `MultiGet()` is to `Reset()` the input `PinnableSlice`s at the start. The latter could be optimized to count the size as we accumulate operands rather than after the fact.
      
      Best improvement is when a key has large operands and high concurrency:
      
      - Parameters: num=4 (implying 4096 operands per key), value_sz=2KB, threads=32
      - `GetMergeOperands()` latency decreases 11492 micros -> 437 micros (-96%).
      
      Reviewed By: cbi42
      
      Differential Revision: D38336578
      
      Pulled By: ajkr
      
      fbshipit-source-id: 48146d127e04cb7f2d4d2939a2b9dff3aba18258
    • Fix the error path of PLUGIN_ROOT (#10446) · d23752f6
      Committed by Qiaolin Yu
      Summary:
      When we try to use RocksDB with plugins as a third-party library for other databases, the plugin folder cannot be compiled correctly because of a wrong PLUGIN_ROOT variable. So we fix this error to ensure that it works correctly when the directory of RocksDB is not the root directory.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10446
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38371321
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0801b7b7dfa87751c8332fb52aac569dcdd72b5d
      Co-authored-by: SuperMT <supertempler@gmail.com>
  10. Aug 03, 2022 (4 commits)
    • increase buffer size in PosixFileSystem::GetAbsolutePath to PATH_MAX (#10413) · 8d664ccb
      Committed by Vladimir Kikhtenko
      Summary:
      RocksDB fails to open a database with a relative path when the length of the cwd
      is longer than 256 bytes, due to ERANGE in the getcwd call.
      Here we simply increase the buffer size to the most common PATH_MAX value.
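
      A sketch of the failure mode in portable terms (not the RocksDB code itself):

      ```cpp
      #include <cerrno>
      #include <climits>  // PATH_MAX on most platforms
      #include <string>

      #include <unistd.h>

      std::string CurrentDir() {
        char buf[PATH_MAX];  // previously a 256-byte buffer
        if (getcwd(buf, sizeof(buf)) == nullptr) {
          // errno == ERANGE: the buffer is too small for the cwd.
          return std::string();
        }
        return std::string(buf);
      }
      ```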
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10413
      
      Reviewed By: riversand963
      
      Differential Revision: D38189254
      
      Pulled By: ajkr
      
      fbshipit-source-id: 8a0d3a78bbe87645499fbf29fb12bd3d04cd4657
    • Split cache to minimize internal fragmentation (#10287) · 87b82f28
      Committed by Bo Wang
      Summary:
      ### **Summary:**
      To minimize the internal fragmentation caused by the variable size of compressed blocks, the original block is split according to the jemalloc bin sizes in `Insert()` and then merged back in `Lookup()`. Based on the results of the following tests, this PR does mitigate the overall internal fragmentation issue.
      
      _I did more myshadow tests with the latest commit and finished several myshadow A/B tests; the results are promising. For the config of a 4GB primary cache and a 3GB secondary cache, jemalloc resident stats show a consistent ~0.15GB memory saving; the allocated and active stats show similar memory savings. The CPU usage is almost the same before and after this PR._
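
      A toy sketch of the split/merge idea (illustrative helpers, not the real CompressedSecondaryCache code; assumes `bin_sizes` is sorted ascending):

      ```cpp
      #include <algorithm>
      #include <cstddef>
      #include <string>
      #include <vector>

      // Carve a value into chunks sized to allocator bins on Insert().
      std::vector<std::string> SplitIntoBins(const std::string& value,
                                             const std::vector<size_t>& bin_sizes) {
        std::vector<std::string> chunks;
        size_t pos = 0;
        while (pos < value.size()) {
          const size_t remaining = value.size() - pos;
          // Largest bin that fits the remaining bytes (else the smallest bin).
          size_t chunk = bin_sizes.front();
          for (size_t b : bin_sizes) {
            if (b <= remaining) {
              chunk = b;
            }
          }
          chunk = std::min(chunk, remaining);
          chunks.emplace_back(value, pos, chunk);
          pos += chunk;
        }
        return chunks;
      }

      // Stitch the chunks back together on Lookup().
      std::string MergeChunks(const std::vector<std::string>& chunks) {
        std::string merged;
        for (const auto& c : chunks) {
          merged += c;
        }
        return merged;
      }
      ```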
      
      To evaluate the issue of memory fragmentations and the benefits of this PR, I conducted two sets of local tests as follows.
      
      **T1**
      Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
      Values:     100 bytes each (50 bytes after compression)
      Entries:    90000000
      RawSize:    9956.4 MB (estimated)
      FileSize:   5664.8 MB (estimated)
      
      | Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
      | - | - | - |
      | T1_3 | 4000 | 4000 |
      | T1_4 | 2000 | 3000 |
      
      Populate the DB:
      ./db_bench --benchmarks=fillrandom --num=90000000 -db=/mem_fragmentation/db_bench_1
      Overwrite it to a stable state:
      ./db_bench --benchmarks=overwrite --num=90000000 -use_existing_db -db=/mem_fragmentation/db_bench_1
      
      Run read tests with different cache settings:
      T1_3:
      MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_3_20220710 -duration=1800 &
      
      T1_4:
      MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_4_20220710 -duration=1800 &
      
      For T1_3 and T1_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats.
      
      | Test Name | T1_3 | T1_3 after mem defrag | T1_4 | T1_4 after mem defrag |
      | - | - | - | - | - |
      | allocated (MB)  | 8728 | 8076 | 5518 | 5043 |
      | available (MB)  | 8753 | 8092 | 5536 | 5051 |
      | external fragmentation rate  | 0.003 | 0.002 | 0.003 | 0.0016 |
      | resident (MB)  | 8956 | 8365 | 5655 | 5235 |
      
      **T2**
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     256 bytes each (128 bytes after compression)
      Entries:    40000000
      RawSize:    10986.3 MB (estimated)
      FileSize:   6103.5 MB (estimated)
      
      | Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
      | - | - | - |
      | T2_3 | 4000 | 4000 |
      | T2_4 | 2000 | 3000 |
      
      Create DB (10GB):
      ./db_bench -benchmarks=fillrandom -use_direct_reads=true -num=40000000 -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2
      Overwrite it to a stable state:
      ./db_bench --benchmarks=overwrite --num=40000000 -use_existing_db -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2
      
      Run read tests with different cache settings:
      T2_3:
      MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_3 -duration=1800  &
      
      T2_4:
      MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_4 -duration=1800  &
      
      For T2_3 and T2_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats.
      
      | Test Name |  T2_3 | T2_3 after mem defrag | T2_4 | T2_4 after mem defrag |
      | -  | - | - | - | - |
      | allocated (MB)  | 8425 | 8093 | 5426 | 5149 |
      | available (MB)  | 8489 | 8138 | 5435 | 5158 |
      | external fragmentation rate  | 0.008 | 0.0055 | 0.0017 | 0.0017 |
      | resident (MB)  | 8676 | 8392 | 5541 | 5321 |
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10287
      
      Test Plan: Unit tests.
      
      Reviewed By: anand1976
      
      Differential Revision: D37743362
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 0010c5af08addeacc5ebbc4ffe5be882fb1d38ad
    • Fix race in ExitAsBatchGroupLeader with pipelined writes (#9944) · bef3127b
      Committed by mpoeter
      Summary:
      Resolves https://github.com/facebook/rocksdb/issues/9692
      
      This PR adds a unit test that reproduces the race described in https://github.com/facebook/rocksdb/issues/9692 and an according fix.
      
      The unit test does not have any assertions, because I could not find a reliable and safe way to assert that the writers list does not form a cycle. So with the old (buggy) code, the test would simply hang, while with the fix the test passes successfully.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9944
      
      Reviewed By: pdillinger
      
      Differential Revision: D36134604
      
      Pulled By: riversand963
      
      fbshipit-source-id: ef636c5a79ddbef18658ab2f19ca9210a427324a
    • Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460) · 27f3af59
      Committed by Peter Dillinger
      Summary:
      TL;DR: due to a recent change, if you drop a column family,
      often that DB will no longer fsync after writing new SST files
      to remaining or new column families, which could lead to data
      loss on power loss.
      
      More bug detail:
      The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at
      DB::Close time rather than waiting for DB object destruction.
      Unfortunately, it also closes shared FSDirectory objects on
      DropColumnFamily (& destroy remaining handles), which can lead
      to use-after-Close on FSDirectory shared with remaining column
      families. Those "uses" are only Fsyncs (or redundant Closes). In
      the default Posix filesystem, an Fsync on a closed FSDirectory is a
      quiet no-op. Consequently, under most configurations, if you drop
      a column family, that DB will no longer fsync after writing new SST
      files to column families sharing the same directory.
      
      More fix detail:
      Basically, this removes unnecessary Close ops on destroying
      ColumnFamilyData. We let `shared_ptr` take care of calling the
      destructor at the right time. If the intent was to require Close be
      called before destroying FSDirectory, that was not made clear by the
      author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which
      could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did
      not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow
      timely destruction of FSDirectory to suffice as Close (in
      CountedFileSystem). Added a TODO to revisit that.
      
      Also in this PR:
      * Added a TODO to share FSDirectory instances between DB and its column
      families. (Already shared among column families.)
      * Made DB::Close attempt to close all its open FSDirectory objects even
      if there is a failure in closing one. Also code clean-up around this
      logic.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460
      
      Test Plan:
      add an assert to check for use-after-Close. With that,
      existing tests can detect the misuse. With the fix, tests pass (except for the noted
      relaxation of the unit test for https://github.com/facebook/rocksdb/issues/10049).
      
      Reviewed By: ajkr
      
      Differential Revision: D38357922
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137