1. 27 April 2022, 4 commits
    • Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) · 9d0cae71
      Committed by Peter Dillinger
      Summary:
      When MultiGet() determines that multiple query keys can be
      served by examining the same data block in block cache (one Lookup()),
      each PinnableSlice referring to data in that data block needs to hold
      on to the block in cache so that they can be released at arbitrary
      times by the API user. Historically this is accomplished with extra
      calls to Ref() on the Handle from Lookup(), with each PinnableSlice
      cleanup calling Release() on the Handle, but this creates extra
      contention on the block cache for the extra Ref()s and Release()es,
      especially because they hit the same cache shard repeatedly.
      
      In the case of merge operands (possibly more cases?), the problem was
      compounded by doing an extra Ref()+eventual Release() for each merge
      operand for a key reusing a block (which could be the same key!), rather
      than one Ref() per key. (Note: the non-shared case with `biter` was
      already one per key.)
      
      This change optimizes MultiGet not to rely on these extra, contentious
      Ref()+Release() calls by instead, in the shared block case, wrapping
      the cache Release() cleanup in a refcounted object referenced by the
      PinnableSlices, such that after the last wrapped reference is released,
      the cache entry is Release()ed. Relaxed atomic refcounts should be
      much faster than mutex-guarded Ref() and Release(), and much less prone
      to a performance cliff when MultiGet() does a lot of block sharing.
      
      Note that I did not use std::shared_ptr, because that would require an
      extra indirection object (shared_ptr itself new/delete) in order to
      associate a ref increment/decrement with a Cleanable cleanup entry. (If
      I assumed it was the size of two pointers, I could do some hackery to
      make it work without the extra indirection, but that's too fragile.)
      
      Some details:
      * Fixed (removed) extra block cache tracing entries in cases of cache
      entry reuse in MultiGet, but it's likely that in some other cases traces
      are missing (XXX comment inserted)
      * Moved existing implementations for cleanable.h from iterator.cc to
      new cleanable.cc
      * Improved API comments on Cleanable
      * Added a public SharedCleanablePtr class to cleanable.h in case others
      could benefit from the same pattern (potentially many Cleanables and/or
      smart pointers referencing a shared Cleanable)
      * Add a typedef for MultiGetContext::Mask
      * Some variable renaming for clarity
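      The shared-cleanup idea can be sketched as a minimal refcounted wrapper
      (hypothetical names and simplified details; RocksDB's actual
      SharedCleanablePtr/Cleanable machinery differs):

      ```cpp
      #include <atomic>
      #include <cassert>
      #include <functional>

      // Sketch: a refcounted wrapper that runs a single cleanup action
      // (e.g. cache->Release(handle)) after the last reference is dropped.
      class SharedCleanup {
       public:
        explicit SharedCleanup(std::function<void()> cleanup)
            : refs_(1), cleanup_(std::move(cleanup)) {}

        void Ref() {
          // Increment can be relaxed: it only needs atomicity, no ordering.
          refs_.fetch_add(1, std::memory_order_relaxed);
        }

        void Unref() {
          // Decrement uses acq_rel so the cleanup observes all prior writes.
          if (refs_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            cleanup_();
            delete this;
          }
        }

       private:
        std::atomic<int> refs_;
        std::function<void()> cleanup_;
      };
      ```

      Each PinnableSlice sharing the block would Ref()/Unref() this wrapper
      instead of the cache Handle, so only the final Unref() touches the
      contended, sharded cache.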
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9899
      
      Test Plan:
      Added unit tests for SharedCleanablePtr.
      
      Greatly enhanced ability of existing tests to detect cache use-after-free.
      * Release PinnableSlices from MultiGet as they are read rather than in
      bulk (in db_test_util wrapper).
      * In ASAN build, default to using a trivially small LRUCache for block_cache
      so that entries are immediately erased when unreferenced. (Updated two
      tests that depend on caching.) New ASAN testsuite running time seems
      OK to me.
      
      If I introduce a bug into my implementation where we skip the shared
      cleanups on block reuse, ASAN detects the bug in
      `db_basic_test *MultiGet*`. If I remove either of the above testing
      enhancements, the bug is not detected.
      
      Consider for follow-up work: manipulate or randomize ordering of
      PinnableSlice use and release from MultiGet db_test_util wrapper. But in
      typical cases, natural ordering gives pretty good functional coverage.
      
      Performance test:
      In the extreme (but possible) case of MultiGetting the same or adjacent keys
      in a batch, throughput can improve by an order of magnitude.
      `./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb -readonly -num=5 -duration=10 -threads=20 -multiread_batched -batch_size=200`
      Before ops/sec, num=5: 1,384,394
      Before ops/sec, num=500: 6,423,720
      After ops/sec, num=500: 10,658,794
      After ops/sec, num=5: 16,027,257
      
      Also note that previously, with high parallelism, having query keys
      concentrated in a single block was worse than spreading them out a bit. Now
      concentrated in a single block is faster than spread out, which is hopefully
      consistent with natural expectation.
      
      Random query performance: with num=1000000, over 999 x 10s runs running before & after simultaneously (each -threads=12):
      Before: multireadrandom [AVG    999 runs] : 1088699 (± 7344) ops/sec;  120.4 (± 0.8 ) MB/sec
      After: multireadrandom [AVG    999 runs] : 1090402 (± 7230) ops/sec;  120.6 (± 0.8 ) MB/sec
      Possibly better, possibly in the noise.
      
      Reviewed By: anand1976
      
      Differential Revision: D35907003
      
      Pulled By: pdillinger
      
      fbshipit-source-id: bbd244d703649a8ca12d476f2d03853ed9d1a17e
      9d0cae71
    • fix clang-analyze in corruption_test (#9908) · ce2d8a42
      Committed by Andrew Kryczka
      Summary:
      This PR fixes a clang-analyze error that I introduced in https://github.com/facebook/rocksdb/issues/9906:
      
      ```
      db/corruption_test.cc:358:15: warning: Called C++ object pointer is null
          ASSERT_OK(db_->Put(WriteOptions(), cfhs[0], "k", "v"));
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ./test_util/testharness.h:76:62: note: expanded from macro 'ASSERT_OK'
        ASSERT_PRED_FORMAT1(ROCKSDB_NAMESPACE::test::AssertStatus, s)
                                                                   ^
      third-party/gtest-1.8.1/fused-src/gtest/gtest.h:19909:36: note: expanded
      from macro 'ASSERT_PRED_FORMAT1'
        GTEST_PRED_FORMAT1_(pred_format, v1, GTEST_FATAL_FAILURE_)
                                         ^~
      third-party/gtest-1.8.1/fused-src/gtest/gtest.h:19892:34: note: expanded
      from macro 'GTEST_PRED_FORMAT1_'
        GTEST_ASSERT_(pred_format(#v1, v1), \
                                       ^~
      third-party/gtest-1.8.1/fused-src/gtest/gtest.h:19868:52: note: expanded
      from macro 'GTEST_ASSERT_'
        if (const ::testing::AssertionResult gtest_ar = (expression)) \
                                                         ^~~~~~~~~~
      1 warning generated.
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9908
      
      Reviewed By: riversand963
      
      Differential Revision: D35953147
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9b837bd7581c6e1e2cdbc961c099652256eb9d4b
      ce2d8a42
    • Add mmap DBGet microbench parameters (#9903) · 1eb279dc
      Committed by Andrew Kryczka
      Summary:
      I tried evaluating https://github.com/facebook/rocksdb/issues/9611 using DBGet microbenchmarks but mostly found the change is well within the noise even for hundreds of repetitions; meanwhile, the InternalKeyComparator CPU it saves is 1-2% according to perf, so it should be measurable. In this PR I tried adding a mmap mode that will bypass compression/checksum/block cache/file read to focus more on the block lookup paths, and also increased the Get() count.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9903
      
      Reviewed By: jay-zhuang, riversand963
      
      Differential Revision: D35907375
      
      Pulled By: ajkr
      
      fbshipit-source-id: 69490d5040ef0863e1ce296724104d0aa7667215
      1eb279dc
    • Revert open logic changes in #9634 (#9906) · c5d367f4
      Committed by Andrew Kryczka
      Summary:
      This reverts the open logic changes of #9634 but keeps its HISTORY.md entry and unit tests.
      Added a new unit test to repro the corruption scenario that this PR fixes, and a HISTORY.md line for that.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9906
      
      Reviewed By: riversand963
      
      Differential Revision: D35940093
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9816f99e1ce405ba36f316beb4f6378c37c8c86b
      c5d367f4
  2. 26 April 2022, 4 commits
    • Add stats related to async prefetching (#9845) · 3653029d
      Committed by Akanksha Mahajan
      Summary:
      Add stats PREFETCHED_BYTES_DISCARDED and POLL_WAIT_MICROS.
      PREFETCHED_BYTES_DISCARDED records the number of prefetched bytes discarded by
      FilePrefetchBuffer. POLL_WAIT_MICROS records the time taken by the underlying
      file system's Poll API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9845
      
      Test Plan: Update existing tests
      
      Reviewed By: anand1976
      
      Differential Revision: D35909694
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: e009ef940bb9ed72c9446f5529095caabb8a1e36
      3653029d
    • Bugfix/fix manual flush blocking bug (#9893) · 6d2577e5
      Committed by RoeyMaor
      Summary:
      Fix https://github.com/facebook/rocksdb/issues/9892
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9893
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D35880959
      
      Pulled By: ajkr
      
      fbshipit-source-id: dad1139ad0983cfbd5c5cd6fa6b71022f889735a
      6d2577e5
    • Add 95% confidence intervals to db_bench output (#9882) · fb9a167a
      Committed by Jaromir Vanek
      Summary:
      Enhancing `db_bench` output with 95% statistical confidence intervals for better performance evaluation. The goal is to unambiguously separate random variance when running benchmark over multiple iterations.
      
      Output enhanced with confidence intervals exposed in brackets:
      
      ```
      $ ./db_bench --benchmarks=fillseq[-X10]
      
      Running benchmark for 10 times
      fillseq      :       4.961 micros/op 201578 ops/sec;   22.3 MB/s
      fillseq      :       5.030 micros/op 198824 ops/sec;   22.0 MB/s
      fillseq [AVG 2 runs] : 200201 (± 2698) ops/sec;   22.1 (± 0.3) MB/sec
      fillseq      :       4.963 micros/op 201471 ops/sec;   22.3 MB/s
      fillseq [AVG 3 runs] : 200624 (± 1765) ops/sec;   22.2 (± 0.2) MB/sec
      fillseq      :       5.035 micros/op 198625 ops/sec;   22.0 MB/s
      fillseq [AVG 4 runs] : 200124 (± 1586) ops/sec;   22.1 (± 0.2) MB/sec
      fillseq      :       4.979 micros/op 200861 ops/sec;   22.2 MB/s
      fillseq [AVG 5 runs] : 200272 (± 1262) ops/sec;   22.2 (± 0.1) MB/sec
      fillseq      :       4.893 micros/op 204367 ops/sec;   22.6 MB/s
      fillseq [AVG 6 runs] : 200954 (± 1688) ops/sec;   22.2 (± 0.2) MB/sec
      fillseq      :       4.914 micros/op 203502 ops/sec;   22.5 MB/s
      fillseq [AVG 7 runs] : 201318 (± 1595) ops/sec;   22.3 (± 0.2) MB/sec
      fillseq      :       4.998 micros/op 200074 ops/sec;   22.1 MB/s
      fillseq [AVG 8 runs] : 201163 (± 1415) ops/sec;   22.3 (± 0.2) MB/sec
      fillseq      :       4.946 micros/op 202188 ops/sec;   22.4 MB/s
      fillseq [AVG 9 runs] : 201277 (± 1267) ops/sec;   22.3 (± 0.1) MB/sec
      fillseq      :       5.093 micros/op 196331 ops/sec;   21.7 MB/s
      fillseq [AVG 10 runs] : 200782 (± 1491) ops/sec;   22.2 (± 0.2) MB/sec
      fillseq [AVG    10 runs] : 200782 (± 1491) ops/sec;   22.2 (± 0.2) MB/sec
      fillseq [MEDIAN 10 runs] : 201166 ops/sec;   22.3 MB/s
      ```
      
      For more explicit interval representation, use `--confidence_interval_only` flag:
      
      ```
      $ ./db_bench --benchmarks=fillseq[-X10] --confidence_interval_only
      
      Running benchmark for 10 times
      fillseq      :       4.935 micros/op 202648 ops/sec;   22.4 MB/s
      fillseq      :       5.078 micros/op 196943 ops/sec;   21.8 MB/s
      fillseq [CI95 2 runs] : (194205, 205385) ops/sec; (21.5, 22.7) MB/sec
      fillseq      :       5.159 micros/op 193816 ops/sec;   21.4 MB/s
      fillseq [CI95 3 runs] : (192735, 202869) ops/sec; (21.3, 22.4) MB/sec
      fillseq      :       4.947 micros/op 202158 ops/sec;   22.4 MB/s
      fillseq [CI95 4 runs] : (194721, 203061) ops/sec; (21.5, 22.5) MB/sec
      fillseq      :       4.908 micros/op 203756 ops/sec;   22.5 MB/s
      fillseq [CI95 5 runs] : (196113, 203615) ops/sec; (21.7, 22.5) MB/sec
      fillseq      :       5.063 micros/op 197528 ops/sec;   21.9 MB/s
      fillseq [CI95 6 runs] : (196319, 202631) ops/sec; (21.7, 22.4) MB/sec
      fillseq      :       5.214 micros/op 191799 ops/sec;   21.2 MB/s
      fillseq [CI95 7 runs] : (194953, 201803) ops/sec; (21.6, 22.3) MB/sec
      fillseq      :       5.260 micros/op 190095 ops/sec;   21.0 MB/s
      fillseq [CI95 8 runs] : (193749, 200937) ops/sec; (21.4, 22.2) MB/sec
      fillseq      :       5.076 micros/op 196992 ops/sec;   21.8 MB/s
      fillseq [CI95 9 runs] : (194134, 200474) ops/sec; (21.5, 22.2) MB/sec
      fillseq      :       5.388 micros/op 185603 ops/sec;   20.5 MB/s
      fillseq [CI95 10 runs] : (192487, 199781) ops/sec; (21.3, 22.1) MB/sec
      fillseq [AVG    10 runs] : 196134 (± 3647) ops/sec;   21.7 (± 0.4) MB/sec
      fillseq [MEDIAN 10 runs] : 196968 ops/sec;   21.8 MB/sec
      ```
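      A minimal sketch of the interval computation (not db_bench's actual
      implementation): the bracketed bounds are consistent with a standard
      normal-approximation 95% CI, mean ± 1.96 · s/√n with sample standard
      deviation s. Fed the first two runs above, this approximately reproduces
      the first `[CI95 2 runs]` line.

      ```cpp
      #include <cassert>
      #include <cmath>
      #include <vector>

      struct ConfidenceInterval {
        double lower;
        double upper;
      };

      // 95% confidence interval for the mean of repeated benchmark runs,
      // using the normal approximation (z = 1.96).
      ConfidenceInterval CI95(const std::vector<double>& runs) {
        const double n = static_cast<double>(runs.size());
        double sum = 0.0;
        for (double r : runs) sum += r;
        const double mean = sum / n;
        double sq = 0.0;
        for (double r : runs) sq += (r - mean) * (r - mean);
        const double stddev = std::sqrt(sq / (n - 1));  // sample stddev
        const double margin = 1.96 * stddev / std::sqrt(n);
        return {mean - margin, mean + margin};
      }
      ```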
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9882
      
      Reviewed By: pdillinger
      
      Differential Revision: D35796148
      
      Pulled By: vanekjar
      
      fbshipit-source-id: 8313712d16728ff982b8aff28195ee56622385b8
      fb9a167a
    • Add experimental new FS API AbortIO to cancel read request (#9901) · 5bd374b3
      Committed by Akanksha Mahajan
      Summary:
      Add an experimental new API, AbortIO, in FileSystem to abort
      read requests submitted asynchronously through the ReadAsync API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9901
      
      Test Plan: Existing tests
      
      Reviewed By: anand1976
      
      Differential Revision: D35885591
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: df3944e6e9e6e487af1fa688376b4abb6837fb02
      5bd374b3
  3. 23 April 2022, 1 commit
  4. 22 April 2022, 1 commit
  5. 21 April 2022, 4 commits
    • Add rollback_deletion_type_callback to TxnDBOptions (#9873) · d13825e5
      Committed by Yanqin Jin
      Summary:
      This PR does not affect write-committed.
      
      Add a member, `rollback_deletion_type_callback` to TransactionDBOptions
      so that a write-prepared transaction, when rolling back, can call this
      callback to decide if a `Delete` or `SingleDelete` should be used to
      cancel a prior `Put` written to the database during prepare phase.
      
      The purpose of this PR is to prevent mixing `Delete` and `SingleDelete`
      for the same key, causing undefined behaviors. Without this PR, the
      following can happen:
      
      ```
      // The application always issues SingleDelete when deleting keys.
      
      txn1->Put('a');
      txn1->Prepare(); // writes to memtable and potentially gets flushed/compacted to Lmax
      txn1->Rollback();  // inserts DELETE('a')
      
      txn2->Put('a');
      txn2->Commit();  // writes to memtable and potentially gets flushed/compacted
      ```
      
      In the database, we may have
      ```
      L0:   [PUT('a', s=100)]
      L1:   [DELETE('a', s=90)]
      Lmax: [PUT('a', s=0)]
      ```
      
      If a compaction compacts L0 and L1, then we have
      ```
      L1:    [PUT('a', s=100)]
      Lmax:  [PUT('a', s=0)]
      ```
      
      If a future transaction issues a SingleDelete, we have
      ```
      L0:    [SD('a', s=110)]
      L1:    [PUT('a', s=100)]
      Lmax:  [PUT('a', s=0)]
      ```
      
      Then, a compaction including L0, L1 and Lmax leads to
      ```
      Lmax:  [PUT('a', s=0)]
      ```
      
      which is incorrect.
      
      Similar bugs were reported and addressed in
      https://github.com/cockroachdb/pebble/issues/1255. Based on our team's
      current priorities, we have decided to take this approach for now. We may
      come back and revisit in the future.
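
      The decision the callback enables can be sketched as follows. The names
      and the callback signature here are illustrative stand-ins, not RocksDB's
      exact API: during rollback of a prepared `Put`, the transaction layer asks
      the application whether the key is always deleted with `SingleDelete`.

      ```cpp
      #include <cassert>
      #include <functional>
      #include <string>

      enum class DeletionType { kDelete, kSingleDelete };

      // Hypothetical callback: returns true if the application always uses
      // SingleDelete for this key, so rollback should also use SingleDelete.
      using RollbackDeletionTypeCallback =
          std::function<bool(const std::string& /*key*/)>;

      DeletionType RollbackDeletionFor(const std::string& key,
                                       const RollbackDeletionTypeCallback& cb) {
        // Without a callback, fall back to the old behavior: a regular Delete.
        if (cb && cb(key)) {
          return DeletionType::kSingleDelete;
        }
        return DeletionType::kDelete;
      }
      ```

      An application that always issues SingleDelete would supply a callback
      returning true, so rolling back txn1's Put('a') writes SD('a') rather
      than DELETE('a'), avoiding the mixed Delete/SingleDelete sequence above.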
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9873
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D35762170
      
      Pulled By: riversand963
      
      fbshipit-source-id: b28d56eefc786b53c9844b9ef4a7807acdd82c8d
      d13825e5
    • Mark GetLiveFilesStorageInfo ready for production use (#9868) · 1bac873f
      Committed by Peter Dillinger
      Summary:
      ... by filling out the remaining testing hole: handling of
      db_paths + cf_paths. (Note that while GetLiveFilesStorageInfo works
      with db_paths / cf_paths, Checkpoint and BackupEngine do not and
      are marked appropriately.)
      
      Also improved comments for "live files" APIs, and grouped them
      together in db.h.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9868
      
      Test Plan: Adding to existing unit tests
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D35752254
      
      Pulled By: pdillinger
      
      fbshipit-source-id: c70eb67748fad61826e2f554b674638700abefb2
      1bac873f
    • Add 7.2 to compatible check (#9858) · 2ea4205a
      Committed by Jay Zhuang
      Summary:
      Add 7.2 to the compatibility check (this should be changed with each version update).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9858
      
      Reviewed By: riversand963
      
      Differential Revision: D35722897
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 08c782b9344599d7296543eb0c61afcd9a869a1a
      2ea4205a
    • Add --decode_blob_index option to idump and dump commands (#9870) · 9b5790f0
      Committed by yuzhangyu
      Summary:
      This patch completes the first part of the task: "Extend all three commands so they can decode and print blob references if a new option --decode_blob_index is specified"
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9870
      
      Reviewed By: ltamasi
      
      Differential Revision: D35753932
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 9d2bbba0eef2ed86b982767eba9de1b4881f35c9
      9b5790f0
  6. 20 April 2022, 4 commits
  7. 19 April 2022, 5 commits
    • Avoid overwriting OPTIONS file settings in db_bench (#9862) · 690f1edf
      Committed by Andrew Kryczka
      Summary:
      `InitializeOptionsGeneral()` was overwriting many options that were already configured by the OPTIONS file, potentially with the flags' default values. This PR changes that function to only overwrite options in limited scenarios, as described at the top of its definition. The block cache setting is still a violation.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9862
      
      Test Plan: ran under various scenarios (multi-DB, single DB, OPTIONS file, flags) and verified options are set as expected
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D35736960
      
      Pulled By: ajkr
      
      fbshipit-source-id: 75b77740af37e6f5741618f8a8f5685df2417d03
      690f1edf
    • Misc CI improvements / additions (#9859) · 1601433b
      Committed by Peter Dillinger
      Summary:
      * Add valgrind test to nightly CircleCI (in case it can catch something that
      ASAN/UBSAN does not)
      * Add clang13+asan+ubsan+folly test to nightly CircleCI, for broader testing
      * Consolidate many copies of ASAN_OPTIONS= while also allowing it to be
      inherited from parent environment rather than always overridden.
      * Move UBSAN exclusion from Makefile into options_settable_test.cc
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9859
      
      Test Plan: CI
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D35730903
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6f5464034e8115f9a07f6f7aec1de9219ec2837c
      1601433b
    • Conditionally declare and define variable that is unused in LITE mode (#9854) · e83c5543
      Committed by Hui Xiao
      Summary:
      Context:
      As mentioned in https://github.com/facebook/rocksdb/issues/9701, running `LITE=1 make static_lib` on v7.0.2 produces the following:
      ```
        CC       file/sequence_file_reader.o
        CC       file/sst_file_manager_impl.o
        CC       file/writable_file_writer.o
      In file included from file/writable_file_writer.cc:10:
      ./file/writable_file_writer.h:163:15: error: private field 'temperature_' is not used [-Werror,-Wunused-private-field]
        Temperature temperature_;
                    ^
      1 error generated.
      make: *** [file/writable_file_writer.o] Error 1
      ```
      
      The fix is as described in the title.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9854
      
      Test Plan:
      - Local `LITE=1 make static_lib` reveals the same error and error is gone after this fix
      - CI
      
      Reviewed By: ajkr, jay-zhuang
      
      Differential Revision: D35706585
      
      Pulled By: hx235
      
      fbshipit-source-id: 7743310298231ad6866304ffa2225c8abdc91d9a
      e83c5543
    • Add "no compression" job to CircleCI (#9850) · 41237dd3
      Committed by Peter Dillinger
      Summary:
      Since they operate at distinct abstraction layers, I thought it
      was prudent to combine this with the EncryptedEnv CI test for each PR, for
      efficiency in testing. Also added the supported compressions to the sst_dump --help
      output so that the CI job can verify no compiled-in compression support.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9850
      
      Test Plan: CI, some manual stuff
      
      Reviewed By: riversand963
      
      Differential Revision: D35682346
      
      Pulled By: pdillinger
      
      fbshipit-source-id: be9879c1533fed304ee32c89fd9ba4b07c2b90cc
      41237dd3
    • Update main version.h to NEXT release (7.3) (#9852) · 3d473235
      Committed by Jay Zhuang
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9852
      
      Reviewed By: ajkr
      
      Differential Revision: D35694753
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 729d416afc588e5db2367e899589bbb5419820d6
      3d473235
  8. 17 April 2022, 1 commit
  9. 16 April 2022, 7 commits
    • Add Aggregation Merge Operator (#9780) · 4f9c0fd0
      Committed by sdong
      Summary:
      Add a merge operator that allows users to register a specific aggregation function so that they can do per-key aggregation using different aggregation types.
      See comments of function CreateAggMergeOperator() for actual usage.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9780
      
      Test Plan: Add a unit test to coverage various cases.
      
      Reviewed By: ltamasi
      
      Differential Revision: D35267444
      
      fbshipit-source-id: 5b02f31c4f3e17e96dd4025cdc49fca8c2868628
      4f9c0fd0
    • Propagate errors from UpdateBoundaries (#9851) · db536ee0
      Committed by Levi Tamasi
      Summary:
      In `FileMetaData`, we keep track of the lowest-numbered blob file
      referenced by the SST file in question for the purposes of BlobDB's
      garbage collection in the `oldest_blob_file_number` field, which is
      updated in `UpdateBoundaries`. However, with the current code,
      `BlobIndex` decoding errors (or invalid blob file numbers) are swallowed
      in this method. The patch changes this by propagating these errors
      and failing the corresponding flush/compaction. (Note that since blob
      references are generated by the BlobDB code and also parsed by
      `CompactionIterator`, in reality this can only happen in the case of
      memory corruption.)
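
      The pattern of the fix can be sketched as follows. The types and names
      here are simplified stand-ins for RocksDB's `Status`/`BlobIndex`, not the
      actual implementation: the boundary update returns a status instead of
      swallowing a decode failure.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <string>

      // Simplified stand-in for rocksdb::Status.
      struct Status {
        bool ok;
        std::string msg;
        static Status OK() { return {true, ""}; }
        static Status Corruption(std::string m) { return {false, std::move(m)}; }
      };

      constexpr uint64_t kInvalidBlobFileNumber = 0;

      Status UpdateOldestBlobFile(uint64_t decoded_blob_file_number,
                                  uint64_t& oldest_blob_file_number) {
        if (decoded_blob_file_number == kInvalidBlobFileNumber) {
          // Previously a case like this was silently ignored; now it fails
          // the flush/compaction that encountered the corrupt blob reference.
          return Status::Corruption("invalid blob file number in blob index");
        }
        if (oldest_blob_file_number == kInvalidBlobFileNumber ||
            decoded_blob_file_number < oldest_blob_file_number) {
          oldest_blob_file_number = decoded_blob_file_number;
        }
        return Status::OK();
      }
      ```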
      
      This change necessitated updating some unit tests that involved
      fake/corrupt `BlobIndex` objects. Some of these just used a dummy string like
      `"blob_index"` as a placeholder; these were replaced with real `BlobIndex`es.
      Some were relying on the earlier behavior to simulate corruption; these
      were replaced with `SyncPoint`-based test code that corrupts a valid
      blob reference at read time.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9851
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D35683671
      
      Pulled By: ltamasi
      
      fbshipit-source-id: f7387af9945c48e4d5c4cd864f1ba425c7ad51f6
      db536ee0
    • Add a `fail_if_not_bottommost_level` to IngestExternalFileOptions (#9849) · be81609b
      Committed by Yanqin Jin
      Summary:
      This new option allows the application to specify that files must be
      ingested to the bottommost level; otherwise the ingestion fails instead
      of silently ingesting to a non-bottommost level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9849
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D35680307
      
      Pulled By: riversand963
      
      fbshipit-source-id: 01cf54ef6c76198f7654dc06b5544631dea1be1e
      be81609b
    • Make initial auto readahead_size configurable (#9836) · 0c7f455f
      Committed by Akanksha Mahajan
      Summary:
      Make initial auto readahead_size configurable
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9836
      
      Test Plan:
      Added new unit test
      Ran regression:
      Without change:
      
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.0
      Date:       Thu Mar 17 13:11:34 2022
      CPU:        24 * Intel Core Processor (Broadwell)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
      ```
      
      With this change:
      ```
       ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Set seed to 1649895440554504 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.2
      Date:       Wed Apr 13 17:17:20 2022
      CPU:        24 * Intel Core Processor (Broadwell)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      ... finished 100 ops
      seekrandom   :  476892.488 micros/op 2 ops/sec;  344.6 MB/s (252 of 252 found)
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D35632815
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: c8057a88f9294c9d03b1d434b03affe02f74d796
      0c7f455f
    • Upgrade development environment. (#9843) · d5dfa8c6
      Committed by sdong
      Summary:
      This upgrades the development environment to support Meta's internal platform010. GCC still doesn't work, but USE_CLANG=1 should.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9843
      
      Test Plan: Run `make`, and `ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_CLANG=1 make`
      
      Reviewed By: pdillinger
      
      Differential Revision: D35652507
      
      fbshipit-source-id: a4a14b2fa4a2d6ca6fbf1b65060e81c39f079363
      d5dfa8c6
    • Remove flaky servicelab metrics DBPut P95/P99 (#9844) · e91ec64c
      Committed by Jay Zhuang
      Summary:
      The P95 and P99 metrics are flaky, similar to the DBGet ones that were
      removed in https://github.com/facebook/rocksdb/issues/9742 .
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9844
      
      Test Plan: `$ ./buckifier/buckify_rocksdb.py`
      
      Reviewed By: ajkr
      
      Differential Revision: D35655531
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: c1409f0fba4e23d461a65f988c27ac5e2ae85d13
      e91ec64c
    • Add option --decode_blob_index to dump_live_files command (#9842) · 082eb042
      Committed by yuzhangyu
      Summary:
      This change only adds --decode_blob_index support to the dump_live_files command, which is part of a task to add blob support to a few commands.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9842
      
      Reviewed By: ltamasi
      
      Differential Revision: D35650167
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: a78151b98bc38ac6f52c6e01ca6927a3429ddd14
      082eb042
  10. 15 April 2022, 4 commits
    • Add checks to GetUpdatesSince (#9459) · fe63899d
      Committed by Yanqin Jin
      Summary:
      Make `DB::GetUpdatesSince` return early if told to scan WALs generated by transactions
      with write-prepared or write-unprepared policies (`seq_per_batch` is true), as indicated by
      the API comment.
      
      Also add checks to `TransactionLogIterator` to clarify some conditions.
      
      No API change.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9459
      
      Test Plan:
      make check
      
      Closing https://github.com/facebook/rocksdb/issues/1565
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D33821243
      
      Pulled By: riversand963
      
      fbshipit-source-id: c8b155d020ce0980e2d3b3b1da40b96e65b48d79
      fe63899d
    • CompactionIterator sees consistent view of which keys are committed (#9830) · 0bd4dcde
      Committed by Yanqin Jin
      Summary:
      **This PR does not affect the functionality of `DB` and write-committed transactions.**
      
      `CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed.
      As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if
      it is committed. In fact, the implementation of `KeyCommitted()` is as follows:
      
      ```
      inline bool KeyCommitted(SequenceNumber seq) {
        // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
        return snapshot_checker_ == nullptr ||
               snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot;
      }
      ```
      
      With that being said, we focus on write-prepared/write-unprepared transactions.
      
      A few notes:
      - A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database.
      - `CompactionIterator` outputs a key as long as the key is uncommitted.
      
      Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without
      doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, it may have
      become committed, because `snapshot_checker_->CheckInSnapshot(seq, kMaxSequence)` in `KeyCommitted()` now returns
      `kInSnapshot`. `CompactionIterator` will then try to zero the key's sequence number and hit an assertion error if the key is a tombstone.
      
      To fix this issue, we should make `CompactionIterator` see a consistent view of the input keys. Note that
      for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before they start
      processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot
      to determine whether a key is committed with a minor change to `KeyCommitted()`:
      
      ```
      inline bool KeyCommitted(SequenceNumber sequence) {
        // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
        return snapshot_checker_ == nullptr ||
               snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) ==
                   SnapshotCheckerResult::kInSnapshot;
      }
      ```
      
      As a result, whether a key is committed remains constant throughout the compaction, causing no trouble
      for `CompactionIterator`'s assertions.
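      A toy sketch of why the snapshot bound makes the answer stable. All names here are
      hypothetical simplifications (the real `SnapshotChecker` maps a key's sequence number
      to commit state internally; this mock takes the commit sequence directly): a key
      committed *after* the job snapshot was taken is treated as uncommitted for the whole
      job, even if the commit lands mid-compaction.

      ```cpp
      #include <cstdint>

      using SequenceNumber = uint64_t;
      constexpr SequenceNumber kMaxSeq = UINT64_MAX;

      // Illustrative stand-in for snapshot_checker_->CheckInSnapshot():
      // a key is "in snapshot" if it was committed (commit_seq != 0)
      // no later than the snapshot bound.
      bool CheckInSnapshot(SequenceNumber commit_seq, SequenceNumber snapshot) {
        return commit_seq != 0 && commit_seq <= snapshot;
      }
      ```

      With a key committed at sequence 150 while the job runs: bounding by `kMaxSeq` flips
      the answer to "committed" mid-job, whereas bounding by a job snapshot taken at 100
      keeps it "uncommitted" for the entire compaction.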
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D35561162
      
      Pulled By: riversand963
      
      fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f
    • J
      Fix minimum libzstd version that supports ZSTD_STREAMING (#9841) · 844a3510
      Committed by Jonathan Albrecht
      Summary:
      The minimum libzstd version that provides `ZSTD_compressStream2` is
      1.4.0, so define `ZSTD_STREAMING` only in that case.
      
      Fixes building on Ubuntu 18.04, which ships libzstd 1.3.3 as its
      repository version.
      
      Fixes https://github.com/facebook/rocksdb/issues/9795
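      The version gate rests on how libzstd encodes its version: `ZSTD_VERSION_NUMBER` is
      `MAJOR*100*100 + MINOR*100 + RELEASE`, so 1.4.0 (the first release with
      `ZSTD_compressStream2`) is 10400 and Ubuntu 18.04's 1.3.3 is 10303. A real build would
      test the macro from `<zstd.h>`; this sketch just reproduces the arithmetic (the helper
      function names are illustrative).

      ```cpp
      // Reproduces libzstd's ZSTD_VERSION_NUMBER encoding:
      // MAJOR*100*100 + MINOR*100 + RELEASE.
      constexpr int ZstdVersionNumber(int major, int minor, int release) {
        return major * 100 * 100 + minor * 100 + release;
      }

      // ZSTD_compressStream2 first appeared in libzstd 1.4.0 (10400),
      // so streaming support should only be enabled at or above it.
      constexpr bool SupportsZstdStreaming(int version_number) {
        return version_number >= 10400;
      }
      ```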
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9841
      
      Test Plan:
      Build and test on Ubuntu 18.04 with:
        apt-get install libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev \
          libzstd-dev libgflags-dev g++ make curl
      
      Reviewed By: ajkr
      
      Differential Revision: D35648738
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 2a9e969bcc17a7dc10172f3817283409de885811
    • A
      Expose `CacheEntryRole` and map keys for block cache stat collections (#9838) · d6e016be
      Committed by Andrew Kryczka
      Summary:
      This gives users the ability to examine the map populated by `GetMapProperty()` with property `kBlockCacheEntryStats`. It also sets us up for a possible future where cache reservations are configured according to `CacheEntryRole`s rather than flags coupled to roles.
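      A toy illustration of the role-to-map-key idea. This is a hedged mock, not RocksDB's
      API: the enum, function, and string values below are illustrative stand-ins for how an
      exposed `CacheEntryRole` might map to the string keys users would look up in a
      `GetMapProperty()` result.

      ```cpp
      #include <string>

      // Illustrative stand-in for an exposed cache-entry-role enum.
      enum class CacheEntryRoleSketch { kDataBlock, kFilterBlock, kIndexBlock, kMisc };

      // Maps each role to the (hypothetical) string key a user would
      // look up in the stats map returned by a map-property query.
      std::string RoleToMapKey(CacheEntryRoleSketch role) {
        switch (role) {
          case CacheEntryRoleSketch::kDataBlock:   return "DataBlock";
          case CacheEntryRoleSketch::kFilterBlock: return "FilterBlock";
          case CacheEntryRoleSketch::kIndexBlock:  return "IndexBlock";
          default:                                 return "Misc";
        }
      }
      ```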
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9838
      
      Test Plan:
      - Migrated DBBlockCacheTest.CacheEntryRoleStats to use this API; that test verifies some of the contents are as expected.
      - Added a DBPropertiesTest to verify that the public map keys are present, and nothing else.
      
      Reviewed By: hx235
      
      Differential Revision: D35629493
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5c4356b8560e85d1f881fd32c44c15960b02fc68
  11. 14 Apr 2022, 5 commits