1. June 16, 2022 (5 commits)
    • Verify write batch checksum before WAL (#10114) · 9882652b
      Committed by Changyu Bi
      Summary:
      Context: a WriteBatch can carry key-value checksums when it is created with `protection_bytes_per_key > 0`.
      This PR adds checksum verification for write batches before they are written to the WAL.
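
      As a usage sketch (not code from this PR), protection can be requested through `WriteOptions::protection_bytes_per_key`, so the batch built internally carries the checksums that are now verified before the WAL write; the path and sizes below are illustrative.
      ```cpp
      #include <cassert>

      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::DB* db = nullptr;
        rocksdb::Options options;
        options.create_if_missing = true;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/kv_protection_demo", &db);
        assert(s.ok());

        rocksdb::WriteOptions wo;
        // Attach 8 bytes of key-value checksum per entry in the internal
        // WriteBatch; with this PR the checksums are verified before the
        // batch is appended to the WAL.
        wo.protection_bytes_per_key = 8;
        s = db->Put(wo, "key", "value");
        assert(s.ok());

        delete db;
        return 0;
      }
      ```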
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10114
      
      Test Plan:
      - Added new unit tests to db_kv_checksum_test.cc: `make check -j32`
      - benchmark on performance regression: `./db_bench --benchmarks=fillrandom[-X20] -db=/dev/shm/test_rocksdb -write_batch_protection_bytes_per_key=8`
        - Pre-PR: `fillrandom [AVG    20 runs] : 198875 (± 3006) ops/sec;   22.0 (± 0.3) MB/sec`
        - Post-PR: `fillrandom [AVG    20 runs] : 196487 (± 2279) ops/sec;   21.7 (± 0.3) MB/sec`
        Mean regressed about 1% (198875 -> 196487 ops/sec).
      
      Reviewed By: ajkr
      
      Differential Revision: D36917464
      
      Pulled By: cbi42
      
      fbshipit-source-id: 29beb74edf65f04b1a890b4f650d873dc7ed790d
    • Change the instruction used for a pause on arm64 (#10118) · 2e5a323d
      Committed by Ali Saidi
      Summary:
      While the yield instruction conceptually sounds correct, on most platforms it is
      a simple nop that doesn't delay execution anywhere close to what an x86
      pause instruction does. In other projects with spin-wait loops, an isb has been
      observed to be much closer to the x86 behavior.

      On a Graviton3 system the following test improves on average by 2x with this
      change (averaged over 20 runs):
      
      ```
      ./db_bench  -benchmarks=fillrandom -threads=64 -batch_size=1
      -memtablerep=skip_list -value_size=100 --num=100000
      level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
      -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8
      --disable_wal --write_buffer_size=160000000 --block_size=16384
      --allow_concurrent_memtable_write -compression_type none
      ```
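
      For illustration only (not RocksDB's actual port code), a spin-wait pause helper along the lines this change describes could look like the sketch below; the helper name is made up.
      ```cpp
      #include <atomic>
      #if defined(__x86_64__) || defined(__i386__)
      #include <emmintrin.h>  // _mm_pause
      #endif

      // Hypothetical helper: hint to the core that we are in a spin-wait loop.
      inline void CpuRelax() {
      #if defined(__x86_64__) || defined(__i386__)
        _mm_pause();          // x86: pause
      #elif defined(__aarch64__)
        asm volatile("isb");  // arm64: isb stalls closer to x86 pause than yield
      #endif
      }

      void SpinUntilReady(const std::atomic<bool>& ready) {
        while (!ready.load(std::memory_order_acquire)) {
          CpuRelax();
        }
      }
      ```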
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10118
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37120578
      
      fbshipit-source-id: c20bde4298222edfab7ff7cb6d42497e7012400d
    • Use madvise() for mmaped file advise (#10170) · 69a32eec
      Committed by sdong
      Summary:
      A recent PR, https://github.com/facebook/rocksdb/pull/10142, enabled fadvise for mmaped files. However, we were told that it might not take effect, and that madvise() should be used instead.
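
      As an illustrative sketch (not the RocksDB code), posix_fadvise() targets the file descriptor's page-cache behavior, while for an mmaped region the advice has to be applied to the mapping via madvise():
      ```cpp
      #include <fcntl.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <unistd.h>

      // Map a file read-only and mark the mapping for random access.
      void* MapWithRandomAdvice(const char* path, size_t* len_out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return nullptr;
        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
        void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);  // the mapping stays valid after closing the fd
        if (base == MAP_FAILED) return nullptr;
        // The advice is applied to the mapped region, not the fd; best-effort.
        madvise(base, st.st_size, MADV_RANDOM);
        *len_out = static_cast<size_t>(st.st_size);
        return base;  // caller munmap()s base when done
      }
      ```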
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10170
      
      Test Plan:
      Run existing tests
      Run a benchmark using mmap with advise random and see I/O size is indeed small.
      
      Reviewed By: anand1976
      
      Differential Revision: D37158582
      
      fbshipit-source-id: 8b3a74f0e89d2e16aac78ee4124c05841d4135c3
    • Allow db_bench and db_stress to set `allow_data_in_errors` (#10171) · ce419c0f
      Committed by Yanqin Jin
      Summary:
      There is `Options::allow_data_in_errors` that controls whether RocksDB
      is allowed to log data, e.g. keys and values, in LOG files. It is false
      by default. However, in db_bench and db_stress, it is often fine to log
      data because there is no concern about privacy.

      This PR allows db_stress and db_bench to set this option on the command
      line, while it remains false by default. Furthermore, the crash/recovery
      test driven by db_crashtest.py now opts in to it.
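
      A minimal sketch of the underlying option (the exact db_bench/db_stress flag spelling is not shown in this summary, so treat the command-line side as an assumption):
      ```cpp
      #include "rocksdb/options.h"

      // For test tools where privacy is not a concern, allow keys/values to
      // appear in error messages and the LOG file.
      rocksdb::Options MakeDebugFriendlyOptions() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.allow_data_in_errors = true;  // default: false
        return options;
      }
      ```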
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10171
      
      Test Plan: Stress test and db_bench
      
      Reviewed By: hx235
      
      Differential Revision: D37163787
      
      Pulled By: riversand963
      
      fbshipit-source-id: 0242f24d292ba15b6faf8ff903963b85d3e011f8
    • fix cancel argument for latest liburing (#10168) · 19345de6
      Committed by Akanksha Mahajan
      Summary:
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10168
      
      The second argument of io_uring_prep_cancel changed to `__u64` in the latest liburing.
      
      Reviewed By: ajkr
      
      Differential Revision: D37155407
      
      fbshipit-source-id: 464eab2806675f148fce075a6fea369fa3d7a9bb
  2. June 15, 2022 (8 commits)
    • Fix C4702 on windows (#10146) · 40dfa260
      Committed by iseki
      Summary:
      This code is unreachable when `ROCKSDB_LITE` is not defined, and it causes a build failure (C4702) in my environment, VS2019 16.11.15.
      ```
      -- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19044.
      -- The CXX compiler identification is MSVC 19.29.30145.0
      -- The C compiler identification is MSVC 19.29.30145.0
      -- The ASM compiler identification is MSVC
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10146
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D37112916
      
      Pulled By: ajkr
      
      fbshipit-source-id: e0b2bf3055d6fac1b3fb40b9f02c4cbae3f82757
    • Fix potential leak when reusing PinnableSlice instances. (#10166) · 77f47995
      Committed by mpoeter
      Summary:
      `PinnableSlice` may hold a handle to a cache value which must be released to correctly decrement the ref-counter. However, when `PinnableSlice` variables are reused, e.g. like this:
      ```
      PinnableSlice pin_slice;
      db.Get("foo", &pin_slice);
      db.Get("foo", &pin_slice);
      ```
      then the second `Get` simply overwrites the old value in `pin_slice` and the handle returned by the first `Get` is _not_ released.
      
      This PR adds `Reset` calls to the `Get`/`MultiGet` calls that accept `PinnableSlice` arguments to ensure proper cleanup of old values.
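
      For reference, a sketch of the reuse pattern in question; before this fix, the explicit `Reset()` shown below was needed to release the cache handle pinned by the first lookup:
      ```cpp
      #include "rocksdb/db.h"

      void ReuseExample(rocksdb::DB* db) {
        rocksdb::ReadOptions ro;
        rocksdb::PinnableSlice pin_slice;
        rocksdb::Status s =
            db->Get(ro, db->DefaultColumnFamily(), "foo", &pin_slice);
        pin_slice.Reset();  // releases any pinned cache handle; now done by Get itself
        s = db->Get(ro, db->DefaultColumnFamily(), "foo", &pin_slice);
        (void)s;
      }
      ```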
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10166
      
      Reviewed By: hx235
      
      Differential Revision: D37151632
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9dd3c3288300f560531b843f67db11aeb569a9ff
    • Modify the instructions emitted for PREFETCH on arm64 (#10117) · b550fc0b
      Committed by Ali Saidi
      Summary:
      `__builtin_prefetch(..., 1)` prefetches into the L2 cache on x86, while the same
      call emits a pldl3keep instruction on arm64, which doesn't seem to prefetch close enough.

      Testing on Graviton3 and M1 systems with memtablerep_bench fillrandom and
      skiplist, throughput increased as follows when adjusting the 1 to 2 or 3:
      ```
                 1 -> 2     1 -> 3
      ----------------------------
      Graviton3   +10%        +15%
      M1          +10%        +10%
      ```
      
      Given that prefetching into the L1 cache seems to help, I chose that conversion.
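
      For reference, the builtin in question (a sketch, not RocksDB's PREFETCH macro); the third argument is the temporal-locality hint, 0..3:
      ```cpp
      void PrefetchExample(const void* addr) {
        // locality 1: roughly L2 on x86, pldl3keep on arm64 (too far away).
        __builtin_prefetch(addr, /*rw=*/0, /*locality=*/1);
        // locality 3: all cache levels on x86; on arm64 this keeps the line
        // closer (L1), which is the direction this change takes.
        __builtin_prefetch(addr, /*rw=*/0, /*locality=*/3);
      }
      ```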
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10117
      
      Reviewed By: pdillinger
      
      Differential Revision: D37120475
      
      fbshipit-source-id: db1ef43f941445019c68316500a2250acc643d5e
    • mingw: remove no-asynchronous-unwind-tables (#9963) · 751d1a3e
      Committed by James Tucker
      Summary:
      This default is generally incompatible with other parts of mingw, and
      can be applied by outside users as-needed.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9963
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D36302813
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9456b41a96bde302bacbc39e092ccecfcb42f34f
    • Add blob cache option in the column family options (#10155) · cba398df
      Committed by Gang Liao
      Summary:
      There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
      This PR is a part of https://github.com/facebook/rocksdb/issues/10156
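
      A hedged configuration sketch using the new option (the cache size and blob threshold below are arbitrary):
      ```cpp
      #include "rocksdb/cache.h"
      #include "rocksdb/options.h"

      rocksdb::Options MakeBlobOptions() {
        rocksdb::Options options;
        options.enable_blob_files = true;
        options.min_blob_size = 4096;  // store values >= 4 KB as blobs
        // New in this PR: a dedicated cache for blob reads.
        options.blob_cache = rocksdb::NewLRUCache(256 << 20);  // 256 MB
        return options;
      }
      ```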
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10155
      
      Reviewed By: ltamasi
      
      Differential Revision: D37150819
      
      Pulled By: gangliao
      
      fbshipit-source-id: b807c7916ea5d411588128f8e22a49f171388fe2
    • fix a false positive case of parsing table factory from options file (#10094) · 1d2950b8
      Committed by tabokie
      Summary:
      During options file parsing, reset table factory before attempting to parse it
      from string. This avoids mistakenly treating the default table factory as a
      newly created one.
      Signed-off-by: tabokie <xy.tao@outlook.com>
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10094
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D36945378
      
      Pulled By: ajkr
      
      fbshipit-source-id: 94b2604e5e87682063b4b78f6370f3e8f101dc44
    • Account memory of FileMetaData in global memory limit (#9924) · d665afdb
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924
      
      Test Plan:
      - Previous `make check` verified there are only 2 places where the memory of  the allocated `FileMetaData` can be released
      - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)`
      - db bench (CPU cost of `charge_file_metadata` in write and compact)
         - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
         - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'`
      
      table 1 - write
      
      #-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
      -- | -- | -- | -- | -- | --
      10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
      20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465
      40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078
      80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734**
      160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978**
      
      table 2 - compact
      
      #-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
      -- | -- | -- | -- | -- | --
      10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67
      20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96
      40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96**
      80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78**
      
      - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1  --cache_size=1` killed as normal
      
      Reviewed By: ajkr
      
      Differential Revision: D36055583
      
      Pulled By: hx235
      
      fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
    • Fix the failure related to io_uring_prep_cancel (#10165) · 40d19bc1
      Committed by Akanksha Mahajan
      Summary:
      Fix for internal jobs failing with:
      ```
       error: no matching function for call to 'io_uring_prep_cancel'
            io_uring_prep_cancel(sqe, posix_handle, 0);
            ^~~~~~~~~~~~~~~~~~~~
      note: candidate function not viable: no known conversion from 'rocksdb::Posix_IOHandle *' to '__u64' (aka 'unsigned long long') for 2nd argument
      static inline void io_uring_prep_cancel(struct io_uring_sqe *sqe,
      ```
      
      User data is set using `io_uring_set_data` API so no need to pass posix_handle here.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10165
      
      Test Plan: CircleCI jobs
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37145233
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 05da650e1240e9c6fcc8aed5f0067308dccb164a
  3. June 14, 2022 (5 commits)
    • Make the per-shard hash table fixed-size. (#10154) · f105e1a5
      Committed by Guido Tagliavini Ponce
      Summary:
      We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values.
      
      Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154
      
      Test Plan: `make -j24 check`
      
      Reviewed By: pdillinger
      
      Differential Revision: D37124451
      
      Pulled By: guidotag
      
      fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c
    • Fix a race condition in transaction stress test (#10157) · bfaf8291
      Committed by Yanqin Jin
      Summary:
      Before this PR, there can be a race condition between the thread calling
      `StressTest::Open()` and a background compaction thread calling
      `MultiOpsTxnsStressTest::VerifyPkSkFast()`.
      
      ```
      Time   thread1                             bg_compact_thr
       |     TransactionDB::Open(..., &txn_db_)
       |     db_ is still nullptr
       |                                         db_->GetSnapshot()  // segfault
       |     db_ = txn_db_
       V
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10157
      
      Test Plan: CI
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D37121653
      
      Pulled By: riversand963
      
      fbshipit-source-id: 6a53117f958e9ee86f77297fdeb843e5160a9331
    • Implement AbortIO using io_uring (#10125) · c0e0f306
      Committed by Akanksha Mahajan
      Summary:
      Implement AbortIO in posix using io_uring to cancel any pending read requests submitted. They are cancelled using io_uring_prep_cancel, which sets the IORING_OP_ASYNC_CANCEL flag.
      
      To cancel a request, the sqe must have ->addr set to the user_data of the request it wishes to cancel. If the request is cancelled successfully, the original request is completed with -ECANCELED and the cancel request is completed with a result of 0. If the request was already running, the original may or may not complete in error. The cancel request will complete with -EALREADY for that case. And finally, if the request to cancel wasn't found, the cancel request is completed with -ENOENT.
      
      Reference: https://kernel.dk/io_uring-whatsnew.pdf,
      https://lore.kernel.org/io-uring/d9a8d76d23690842f666c326631ecc2d85b6c1bc.1615566409.git.asml.silence@gmail.com/
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10125
      
      Test Plan: Existing Posix tests.
      
      Reviewed By: anand1976
      
      Differential Revision: D36946970
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 3bc1f1521b3151d01a348fc6431eb3fc85db3a14
    • Increase num_levels for universal from 8 to 40 (#10158) · 04bd3479
      Committed by Mark Callaghan
      Summary:
      See https://github.com/facebook/rocksdb/issues/10082 for more details. Trivial move
      isn't done for universal compaction when the compaction is from L0 into L0. So too small a value for
      num_levels with db_bench means there will be fewer trivial moves with universal, and
      that means that write-amp will increase.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10158
      
      Test Plan: run it
      
      Reviewed By: siying
      
      Differential Revision: D37122519
      
      Pulled By: mdcallag
      
      fbshipit-source-id: 1cb39049676f68a6cc3ea8d105a9965f89d4d09e
    • Document design/specification bugs with auto_prefix_mode (#10144) · ad135f3f
      Committed by Peter Dillinger
      Summary:
      auto_prefix_mode is designed to use prefix filtering in a
      particular "safe" set of cases where the upper bound and the seek key
      have different prefixes: where the upper bound is the "same length
      immediate successor". These conditions are not sufficient to guarantee
      the same iteration results as total_order_seek if the DB contains
      "short" keys, less than the "full" (maximum) prefix length.
      
      We are not simply disabling the optimization in these successor cases
      because it is likely that users are essentially getting what they want
      out of existing usage. Especially if users are constructing successor
      bounds with the intention of doing a prefix-bounded seek, the existing
      behavior is more expected than the total_order_seek behavior.
      Consequently, for now we reconcile the bad specification of behavior by
      documenting the existing mismatch with total_order_seek.
      
      A closely related issue affects hypothetical comparators like
      ReverseBytewiseComparator: if they "correctly" implement
      IsSameLengthImmediateSuccessor, auto_prefix_mode could omit more
      entries (other than "short" keys noted above). Luckily, the built-in
      ReverseBytewiseComparator has an "incorrect" implementation of
      IsSameLengthImmediateSuccessor that effectively prevents prefix
      optimization and, thus, the bug. This is now documented as a new
      constraint on IsSameLengthImmediateSuccessor, and the implementation
      tweaked to be simply "safe" rather than "incorrect".
      
      This change also includes unit test updates to demonstrate the above
      issues. (Test was cleaned up for readability and simplicity.)
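
      To make the conditions concrete, here is a hedged usage sketch of the "safe" case auto_prefix_mode targets (an upper bound that is the same-length immediate successor of the seek key's prefix); with "short" keys present, results can differ from total_order_seek as described above:
      ```cpp
      #include <memory>

      #include "rocksdb/db.h"
      #include "rocksdb/slice_transform.h"

      void AutoPrefixSeek(rocksdb::DB* db) {
        // Assumes the DB was opened with something like:
        //   options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(4));
        rocksdb::Slice upper_bound("abce");  // same-length immediate successor of "abcd"
        rocksdb::ReadOptions ro;
        ro.auto_prefix_mode = true;          // allow prefix filtering in this case
        ro.iterate_upper_bound = &upper_bound;
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
        for (it->Seek("abcd"); it->Valid(); it->Next()) {
          // Keys shorter than the full 4-byte prefix (e.g. "abc") may be handled
          // differently than with total_order_seek, per the caveat above.
        }
      }
      ```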
      
      Intended follow-up:
      * Tweak documented axioms for prefix_extractor (more details then)
      * Consider some sort of fix for this case. I don't know what that would
      look like without breaking the performance of existing code. Perhaps
      if all keys in an SST file have prefixes that are "full length," we can track
      that fact and use it to allow optimization with the "same length
      immediate successor", but that would only apply to new files.
      * Consider a better system of specifying prefix bounds
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10144
      
      Test Plan: test updates included
      
      Reviewed By: siying
      
      Differential Revision: D37052710
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5f63b7d65f3f214e4b143e0f9aa1749527c587db
  4. June 13, 2022 (1 commit)
  5. June 11, 2022 (4 commits)
    • Assume fixed size key (#10137) · 415200d7
      Committed by Guido Tagliavini Ponce
      Summary:
      FastLRUCache now only supports 16B keys. The tests have changed to reflect this.
      
      Because the unit tests were designed for caches that accept any string as keys, some tests are no longer compatible with FastLRUCache. We have disabled those for runs with FastLRUCache. (We could potentially change all tests to use 16B keys, but we don't because the cache public API does not require this.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10137
      
      Test Plan: make -j24 check
      
      Reviewed By: gitbw95
      
      Differential Revision: D37083934
      
      Pulled By: guidotag
      
      fbshipit-source-id: be1719cf5f8364a9a32bc4555bce1a0de3833b0d
    • Run fadvise with mmap file (#10142) · 80afa776
      Committed by sdong
      Summary:
      Right now, with mmap files, we don't issue fadvise following users' requests. There is no reason for that, so this diff does it.
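
      For context, a sketch of the option combination this affects (both options already exist; the pairing is illustrative):
      ```cpp
      #include "rocksdb/options.h"

      rocksdb::Options MakeMmapReadOptions() {
        rocksdb::Options options;
        options.allow_mmap_reads = true;       // read SST files via mmap
        options.advise_random_on_open = true;  // request random-access advice on open
        // With this change the advice is actually issued for mmaped files too.
        return options;
      }
      ```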
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10142
      
      Test Plan:
      A simple readrandom against files with the page cache dropped shows latency improvement from 7.8 us to 2.8 us:
      
      ./db_bench -use_existing_db --benchmarks=readrandom --num=100
      
      Reviewed By: anand1976
      
      Differential Revision: D37074975
      
      fbshipit-source-id: ccc72bcac1b5fd634eb8fa2b6a5d9afe332e0bf6
    • Snapshots with user-specified timestamps (#9879) · 1777e5f7
      Committed by Yanqin Jin
      Summary:
      In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
      to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.
      
      It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
      do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
      and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.
      
      This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
      https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
      Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.
      
      In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
      object is created with the last published sequence number of the super-version. You can see that the reader actually
      has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
      an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
      sequence number is written.
      
      This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
      exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
      commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
      ensure any two snapshots with timestamps should satisfy the following:
      ```
      snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
      ```
      
      If the application can guarantee that when a reader takes a timestamped snapshot, there are no active writes going on
      in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
      a snapshot with associated timestamp.
      
      Code example
      ```cpp
      // Create a timestamped snapshot when committing transaction.
      txn->SetCommitTimestamp(100);
      txn->SetSnapshotOnNextOperation();
      txn->Commit();
      
      // A wrapper API for convenience
      Status Transaction::CommitAndTryCreateSnapshot(
          std::shared_ptr<TransactionNotifier> notifier,
          TxnTimestamp ts,
          std::shared_ptr<const Snapshot>* ret);
      
      // Create a timestamped snapshot if caller guarantees no concurrent writes
      std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
      ```
      
      The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
      other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
      ```cpp
      // Return the timestamped snapshot corresponding to the given timestamp. If ts is
      // kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
      // Otherwise, we return the snapshot whose timestamp is equal to `ts`. If no
      // such snapshot exists, then we return null.
      std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
      // Return the latest timestamped snapshot if present.
      std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
      ```
      
      We also provide two additional APIs for stats collection and reporting purposes.
      
      ```cpp
      Status TransactionDB::GetAllTimestampedSnapshots(
          std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
      // Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
      Status TransactionDB::GetTimestampedSnapshots(
          TxnTimestamp ts_lb,
          TxnTimestamp ts_ub,
          std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
      ```
      
      To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
      timestamped snapshots whose timestamps are older than or equal to a given threshold.
      ```cpp
      void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
      ```
      
      Before shutdown, RocksDB will release all timestamped snapshots.
      
      Comparison with user-defined timestamp and how they can be combined:
      User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
      mapping between snapshots (sequence numbers) and timestamps.
      Different internal keys with the same user key but different timestamps will be treated as different by compaction,
      thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
      In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
      this snapshot from being dropped by compaction. Here, visible means (seq < snapshot and most recent).
      The timestamped snapshot supports the semantics of reading at an exact point in time.
      
      Timestamped snapshots can also be used with user-defined timestamp.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879
      
      Test Plan:
      ```
      make check
      TEST_TMPDIR=/dev/shm make crash_test_with_txn
      ```
      
      Reviewed By: siying
      
      Differential Revision: D35783919
      
      Pulled By: riversand963
      
      fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
    • Enable SecondaryCache::CreateFromString to create sec cache based on the uri for CompressedSecondaryCache (#10132) · f4052d13
      Committed by gitbw95
      
      Summary:
      Update SecondaryCache::CreateFromString and enable it to create sec cache based on the uri for CompressedSecondaryCache.
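
      A hedged sketch of the call being extended; the URI string below is an assumed example, so check the PR's tests for the exact registered name and supported fields:
      ```cpp
      #include <memory>

      #include "rocksdb/convenience.h"  // ConfigOptions
      #include "rocksdb/secondary_cache.h"

      rocksdb::Status MakeSecondaryCacheFromUri(
          std::shared_ptr<rocksdb::SecondaryCache>* sec_cache) {
        rocksdb::ConfigOptions config_options;
        // Hypothetical URI; the registered id and parameters may differ.
        return rocksdb::SecondaryCache::CreateFromString(
            config_options, "compressed_secondary_cache://capacity=268435456",
            sec_cache);
      }
      ```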
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10132
      
      Test Plan: Add unit tests.
      
      Reviewed By: anand1976
      
      Differential Revision: D36996997
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 882ad563cff6d38b306a53426ad7e47273f34edc
  6. June 10, 2022 (5 commits)
    • Fix bug with kHashSearch and changing prefix_extractor with SetOptions (#10128) · d3a3b021
      Committed by Peter Dillinger
      Summary:
      When opening an SST file created using index_type=kHashSearch,
      the *current* prefix_extractor would be saved, and used with hash index
      if the *new current* prefix_extractor at query time is compatible with
      the SST file. This is a problem if the prefix_extractor at SST open time
      is not compatible but SetOptions later changes (back) to one that is
      compatible.
      
      This change fixes that by using the known compatible (or missing) prefix
      extractor we save for use with prefix filtering. Detail: I have moved the
      InternalKeySliceTransform wrapper to avoid some indirection and remove
      unnecessary fields.
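
      A hedged sketch of the scenario (the option-string form "rocksdb.FixedPrefix.N" follows RocksDB's usual naming; treat the exact sequence as illustrative):
      ```cpp
      #include <string>
      #include <unordered_map>

      #include "rocksdb/db.h"

      void SwitchPrefixExtractorBack(rocksdb::DB* db) {
        // SST files were written with index_type = kHashSearch and a 3-byte
        // prefix extractor. Switch to an incompatible extractor...
        rocksdb::Status s =
            db->SetOptions({{"prefix_extractor", "rocksdb.FixedPrefix.5"}});
        // ...then back to a compatible one. Before this fix the hash index
        // could keep using the extractor seen at file-open time; now the
        // compatible extractor saved for prefix filtering is used.
        s = db->SetOptions({{"prefix_extractor", "rocksdb.FixedPrefix.3"}});
        (void)s;
      }
      ```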
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10128
      
      Test Plan:
      expanded unit test (using some logic from https://github.com/facebook/rocksdb/issues/10122) that fails
      before fix and probably covers some other previously uncovered cases.
      
      Reviewed By: siying
      
      Differential Revision: D36955738
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 0c78a6b0d24054ef2f3cb237bf010c1c5589fb10
    • Return try again when full_history_ts_low is higher than requested ts (#10126) · 693dffd8
      Committed by Yu Zhang
      Summary:
      This PR helps handle the race condition mentioned in this comment thread: https://github.com/facebook/rocksdb/pull/7884#discussion_r572402281 In the case where the actual full_history_ts_low is higher than the user's requested ts, return a "try again" message so they don't have the misconception that data between [ts, full_history_ts_low) is kept.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10126
      
      Test Plan:
      ```
      $COMPILE_WITH_ASAN=1 make -j24 all
      $./db_with_timestamp_basic_test --gtest_filter=UpdateFullHistoryTsLowTest.ConcurrentUpdate
      $ make -j24 check
      ```
      
      Reviewed By: riversand963
      
      Differential Revision: D37055368
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 787fd0984a246540fa03ac227b1d232590d27828
    • Fix fragile CacheTest::ApplyToAllEntriesDuringResize (#10145) · 5fa6ef7f
      Committed by Peter Dillinger
      Summary:
      As seen in https://github.com/facebook/rocksdb/issues/10137, simply churning the cache key hashes (e.g.
      by changing the raw cache keys) could trigger failure in this test, due
      to possibility of some cache shard exceeding its portion of capacity
      and evicting entries. Updated the test to be less fragile by using
      greater margins, and added a pre-check for evictions, which doesn't
      manifest as a race condition, before the main check that can race.
      
      Also added stack trace handler to cache_test for debugging.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10145
      
      Test Plan:
      test thousands of iterations with gtest-parallel, including
      with changes in https://github.com/facebook/rocksdb/issues/10137 that were surfacing the problem. Pre-check
      without the fix would always fail with https://github.com/facebook/rocksdb/issues/10137
      
      Reviewed By: guidotag
      
      Differential Revision: D37058771
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a7cf137967aef49c07ae9602d8523c63e7388fab
    • Update jemalloc version for platform009 (#10143) · 1a3e23a2
      Committed by Bo Wang
      Summary:
      Update jemalloc version for platform009. Current one is a bit old and the new one can bring some quick wins (e.g. new heap profiling features on devserver).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10143
      
      Test Plan:
      1. The building and testing on devserver should work.
      2. `db_bench` with `--dump_malloc_stats`
      `./db_bench --benchmarks=fillrandom --num=10000000 -db=/db_bench_1 `
      `./db_bench --benchmarks=overwrite,stats --num=10000000 -use_existing_db -duration=10 --benchmark_write_rate_limit=2000000 -db=/db_bench_1 `
      `./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=520000000  --statistics -db=/db_bench_1 --dump_malloc_stats=true`
      
      Before this PR: jemalloc Version: "5.2.1-1303-g73b8faa7149e452f93e52005c89459da08343570"
      After this PR: jemalloc Version:
      
      Reviewed By: anand1976
      
      Differential Revision: D37049347
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 3fcd82cca989047b4bbdfdebe5beba2c4c255ed8
    • Enable wal_compression in crash_tests (#10141) · ecfd4aef
      Committed by Akanksha Mahajan
      Summary:
      Same as title
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10141
      
      Test Plan:
      ```
      export CRASH_TEST_EXT_ARGS=" --wal_compression=zstd"
       make crash_test -j
      ```
      
      Reviewed By: riversand963
      
      Differential Revision: D37042810
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 53f0793d78241f1b5c954dcc808cb4c0a3e9172a
  7. June 9, 2022 (2 commits)
    • Fix bug for WalManager with compressed WAL (#10130) · f85b31a2
      Committed by Akanksha Mahajan
      Summary:
      RocksDB uses WalManager to manage WAL files. In WalManager::ReadFirstLine(), the assumption is that reading the first record of a valid WAL file will return OK status and set the output sequence to non-zero value.
      This assumption has been broken by WAL compression which writes a `kSetCompressionType` record which is not associated with any sequence number.
      Consequently, WalManager::GetSortedWalsOfType() will skip these WALs and not return them to caller, e.g. Checkpoint, Backup, causing the operations to fail.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10130
      
      Test Plan: - Newly Added test
      
      Reviewed By: riversand963
      
      Differential Revision: D36985744
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: dfde7b3be68b6a30b75b49479779748eedf29f7f
    • Fix parsing of db_bench output (#10124) · 9efae144
      Committed by Mark Callaghan
      Summary:
      A recent diff added a few more fields to one of the db_bench output lines that gets parsed.
      This diff updates tools/benchmark.sh to handle that. Before and after:
      
      overwrite    :       7.939 micros/op 125963 ops/sec;   50.5 MB/s
      
      overwrite    :       7.854 micros/op 127320 ops/sec 1800.001 seconds 229176999 operations;   51.0 MB/s
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10124
      
      Test Plan: Run it
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D36945137
      
      Pulled By: mdcallag
      
      fbshipit-source-id: 9c96f79491411da997e369a3be9c6b921a21d0fa
  8. June 8, 2022 (4 commits)
    • Update test for secondary instance in stress test (#10121) · f890527b
      Committed by Yanqin Jin
      Summary:
      This PR updates secondary instance testing in stress test by default.
      
      A background thread will be started (disabled by default), running a secondary instance tailing the logs of the primary.
      
      Periodically (every 1 sec), this thread calls `TryCatchUpWithPrimary()` and uses point lookup or range scan
      to read some random keys with only very basic verification to make sure no assertion failure is triggered.
      
      Thanks to https://github.com/facebook/rocksdb/issues/10061 , we can enable secondary instance when user-defined timestamp is enabled.
      
      Also removed a less useful test configuration, `secondary_catch_up_one_in`. This is very similar to the periodic
      catch-up.
      
      In the last commit, I decided not to enable it now, but just update the tests, since secondary instance does not
      work well when the underlying file is renamed by primary, e.g. SstFileManager.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10121
      
      Test Plan:
      ```
      TEST_TMPDIR=/dev/shm/rocksdb make crash_test
      TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
      TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_atomic_flush
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D36939458
      
      Pulled By: riversand963
      
      fbshipit-source-id: 1c065b7efc3690fc341569b9d369a5cbd8ef6b3e
    • Set db_stress defaults for TSAN deadlock detector (#10131) · ff323464
      Committed by Andrew Kryczka
      Summary:
      After https://github.com/facebook/rocksdb/issues/9357 we began seeing the following error attempting to acquire
      locks for file ingestion:
      
      ```
      FATAL: ThreadSanitizer CHECK failed: /home/engshare/third-party2/llvm-fb/12/src/llvm/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40)
      ```
      
      The command was using default values for `ingest_external_file_width`
      (1000) and `log2_keys_per_lock` (2). The expected number of locks needed
      to update those keys is then (1000 / 2^2) = 250, which is above the 0x40 (64)
      limit. This PR reduces the default value of `ingest_external_file_width`
      to 100 so the expected number of locks is 25, which is within the limit.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10131
      
      Reviewed By: ltamasi
      
      Differential Revision: D36986307
      
      Pulled By: ajkr
      
      fbshipit-source-id: e918cdb2fcc39517d585f1e5fd2539e185ada7c1
    • Add unit test to verify that the dynamic priority can be passed from compaction to FS (#10088) · 5cbee1f6
      Committed by gitbw95
      Summary:
      Add unit tests to verify that the dynamic priority can be passed from compaction to FS. Compaction reads&writes and other DB reads&writes share the same read&write paths to FSRandomAccessFile or FSWritableFile, so a MockTestFileSystem is added to replace the default filesystem from Env to intercept and verify the io_priority. To prepare the compaction input files, use the default filesystem from Env. To test the io priority of the compaction reads and writes, db_options_.fs is set as MockTestFileSystem.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10088
      
      Test Plan: Add unit tests.
      
      Reviewed By: anand1976
      
      Differential Revision: D36882528
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 120adc15801966f2b8c9fc45285f590a3fff96d1
    • Handle "NotSupported" status by default implementation of Close() in … (#10127) · b6de139d
      Committed by zczhu
      Summary:
      The default implementation of the Close() function in the Directory/FSDirectory classes returns a `NotSupported` status. However, we don't want operations that worked in older versions to begin failing after upgrading when run on FileSystems that have not implemented Directory::Close() yet. So we require that the upper level that calls Close() properly handle the "NotSupported" status instead of treating it as an error status.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10127
      
      Reviewed By: ajkr
      
      Differential Revision: D36971112
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 100f0e6ad1191e1acc1ba6458c566a11724cf466
  9. June 7, 2022 (5 commits)
    • Consolidate manual_compaction_paused_ check (#10070) · 3ee6c9ba
      Committed by zczhu
      Summary:
      As pointed out by [https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422](https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422), check `manual_compaction_paused` and `manual_compaction_canceled` can be reduced by setting `*canceled` to be true in `DisableManualCompaction()` and `*canceled` to be false in the last time calling `EnableManualCompaction()`.
      
      Changed Tests: The original `DBTest2.PausingManualCompaction1` uses a callback function to increase `manual_compaction_paused`, and the original CompactionJob/CompactionIterator with `manual_compaction_paused` can detect this. I changed the callback function so that it sets `*canceled` to true if `canceled` is not `nullptr` (to notify CompactionJob/CompactionIterator that the compaction has been canceled).
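
      For reference, a hedged sketch of the two cancellation paths this consolidation ties together (the global disable switch and the per-call `canceled` flag):
      ```cpp
      #include <atomic>

      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      void ManualCompactionCancelExample(rocksdb::DB* db) {
        std::atomic<bool> canceled{false};
        rocksdb::CompactRangeOptions cro;
        cro.canceled = &canceled;

        // From another thread, either of these cancels the running compaction:
        //   canceled.store(true, std::memory_order_release);
        //   db->DisableManualCompaction();  // with this change, also sets *canceled
        rocksdb::Status s = db->CompactRange(cro, /*begin=*/nullptr, /*end=*/nullptr);
        (void)s;

        db->EnableManualCompaction();  // per this change, resets *canceled to false
      }
      ```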
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10070
      
      Test Plan: This change does not introduce new features, but some slight difference in compaction implementation. Run the same manual compaction unit tests as before (e.g., PausingManualCompaction[1-4], CancelManualCompaction[1-2], CancelManualCompactionWithListener in db_test2, and db_compaction_test).
      
      Reviewed By: ajkr
      
      Differential Revision: D36949133
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: c5dc4c956fbf8f624003a0f5ad2690240063a821
    • Return "invalid argument" when read timestamp is too old (#10109) · a101c9de
      Committed by Yu Zhang
      Summary:
      With this change, when a given read timestamp is smaller than the column family's full_history_ts_low, the Get(), MultiGet() and iterator APIs will return Status::InvalidArgument().
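
      A hedged sketch of the failing read (assumes a column family using the built-in u64-timestamp comparator; the hand-rolled little-endian encoding below is illustrative):
      ```cpp
      #include <cstdint>
      #include <cstring>
      #include <string>

      #include "rocksdb/db.h"

      void ReadBelowFullHistoryTsLow(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
        uint64_t read_ts = 50;  // assume full_history_ts_low was already raised past 50
        std::string ts_buf(sizeof(read_ts), '\0');
        std::memcpy(&ts_buf[0], &read_ts, sizeof(read_ts));  // fixed64, little-endian host
        rocksdb::Slice ts(ts_buf);

        rocksdb::ReadOptions ro;
        ro.timestamp = &ts;
        rocksdb::PinnableSlice value;
        rocksdb::Status s = db->Get(ro, cf, "some_key", &value);
        // Expected with this change: s.IsInvalidArgument() == true.
        (void)s;
      }
      ```
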
      Test plan
      ```
      $COMPILE_WITH_ASAN=1 make -j24 all
      $./db_with_timestamp_basic_test --gtest_filter=DBBasicTestWithTimestamp.UpdateFullHistoryTsLow
      $ make -j24 check
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10109
      
      Reviewed By: riversand963
      
      Differential Revision: D36901126
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 255feb1a66195351f06c1d0e42acb1ff74527f86
    • Fix default implementation of close() function for Directory/FSDirecto… (#10123) · 9f244b21
      Committed by zczhu
      Summary:
      As pointed by anand1976 in his [comment](https://github.com/facebook/rocksdb/pull/10049#pullrequestreview-994255819), previous implementation (adding Close() function in Directory/FSDirectory class) is not backward-compatible. And we mistakenly added the default implementation `return Status::NotSupported("Close")` or `return IOStatus::NotSupported("Close")` in WritableFile class in this [pull request](https://github.com/facebook/rocksdb/pull/10101). This pull request fixes the above issue.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10123
      
      Reviewed By: ajkr
      
      Differential Revision: D36943661
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 9dc45f4d2ab3a9d51c30bdfde679f1d13c4d5509
    • Fix overflow bug in standard deviation computation. (#10100) · 2af132c3
      Committed by Guido Tagliavini Ponce
      Summary:
      There was an overflow bug when computing the variance in the HistogramStat class.
      
      This manifests, for instance, when running cache_bench with default arguments. This executes 32M lookups/inserts/deletes in a block cache, and then computes (among other things) the variance of the latencies. The variance is computed as ``variance = (cur_sum_squares * cur_num - cur_sum * cur_sum) / (cur_num * cur_num)``, where ``cur_sum_squares`` is the sum of the squares of the samples, ``cur_num`` is the number of samples, and ``cur_sum`` is the sum of the samples. Because the median latency in a typical run is around 3800 nanoseconds, both the ``cur_sum_squares * cur_num`` and ``cur_sum * cur_sum`` terms overflow as uint64_t.
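
      A quick back-of-the-envelope check of the overflow, using the numbers above (a sketch computed in long double only to show the magnitudes):
      ```cpp
      #include <cstdint>
      #include <cstdio>

      int main() {
        const long double n = 32e6L;       // cur_num: ~32M samples
        const long double mean = 3800.0L;  // ~median latency in nanoseconds
        const long double sum = n * mean;  // ~1.2e11
        // cur_sum * cur_sum alone is ~1.5e22, far beyond UINT64_MAX (~1.8e19),
        // so the old uint64_t arithmetic wraps around.
        std::printf("sum*sum ~ %.3Le, UINT64_MAX ~ %.3Le\n",
                    sum * sum, static_cast<long double>(UINT64_MAX));
        return 0;
      }
      ```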
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10100
      
      Test Plan: Added a unit test. Run ``make -j24 histogram_test && ./histogram_test``.
      
      Reviewed By: pdillinger
      
      Differential Revision: D36942738
      
      Pulled By: guidotag
      
      fbshipit-source-id: 0af5fb9e2a297a284e8e74c24e604d302906006e
    • Refactor: Add BlockTypes to make them imply C++ type in block cache (#10098) · 4f78f969
      Committed by Peter Dillinger
      Summary:
      We have three related concepts:
      * BlockType: an internal enum conceptually indicating a type of SST file
      block
      * CacheEntryRole: a user-facing enum for categorizing block cache entries,
      which is also involved in associating cache entries with an appropriate
      deleter. Can include categories for non-block cache entries (e.g. memory
      reservations).
      * TBlocklike: a C++ type for the actual type behind a void* cache entry.
      
      We had some existing code ugliness because BlockType did not imply
      TBlocklike, because of various kinds of "filter" block. This refactoring
      fixes that with new BlockTypes.
      
      More clean-up can come in later work.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10098
      
      Test Plan: existing tests
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D36897945
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 3ae496b5caa81e0a0ed85e873eb5b525e2d9a295
  10. June 6, 2022 (1 commit)