1. 13 Mar 2022 — 2 commits
    • Fix a timer crash caused by invalid memory management (#9656) · 09b0e8f2
      Jay Zhuang committed
      Summary:
      The timer crashed when multiple DB instances were doing heavy DB open and close
      operations concurrently. The crash was caused by adding a timer task with a
      smaller timestamp than the currently running task. Fix it by moving the
      computation of the new task's timestamp inside the timer mutex protection.
      And other fixes:
      - Disallow adding duplicated function name to timer
      - Fix a minor memory leak in timer when a running task is cancelled
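The fix above can be illustrated with a minimal, hypothetical sketch (none of these names are RocksDB's actual Timer internals): the next execution time of a newly added task is computed while holding the timer mutex, so it can never come out earlier than a task the timer has already dispatched, and duplicate function names are rejected.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Minimal sketch of the race-free ordering: Add() computes the task's
// timestamp under the same mutex that guards dispatch, so a new task can
// never be scheduled before the task currently being run.
class SimpleTimer {
 public:
  // Returns false if a task with the same name is already registered.
  bool Add(const std::string& name, uint64_t period_us, uint64_t now_us) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (names_.count(name)) return false;  // disallow duplicated names
    names_.insert(name);
    // Timestamp computed under the mutex: cannot undercut a dispatched task.
    uint64_t next = std::max(now_us, last_dispatched_us_) + period_us;
    heap_.push({next, name});
    return true;
  }

  // Pops the earliest task; records its time so later Add()s stay ordered.
  std::pair<uint64_t, std::string> RunNext() {
    std::lock_guard<std::mutex> lock(mutex_);
    auto top = heap_.top();
    heap_.pop();
    last_dispatched_us_ = top.first;
    return top;
  }

 private:
  using Entry = std::pair<uint64_t, std::string>;
  std::mutex mutex_;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap_;
  std::set<std::string> names_;
  uint64_t last_dispatched_us_ = 0;
};
```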
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9656
      
      Reviewed By: ajkr
      
      Differential Revision: D34626296
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 6b6d96a5149746bf503546244912a9e41a0c5f6b
    • Reduce Windows build parallelism number (#9687) · 91372328
      Jay Zhuang committed
      Summary:
      To avoid OOM issues in the VS2017 build.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9687
      
      Test Plan: Ran the VS2017 build 5 times; seems fine.
      
      Reviewed By: ajkr
      
      Differential Revision: D34845073
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 60f84885e391e878ee6f3b1945376323baf47ec5
  2. 12 Mar 2022 — 2 commits
  3. 11 Mar 2022 — 2 commits
    • Posix API support for Async Read and Poll APIs (#9578) · 8465cccd
      Akanksha Mahajan committed
      Summary:
      Provide support for Async Read and Poll in Posix file system using IOUring.
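The submit/poll split this PR implements can be shown with a conceptual, single-threaded sketch (all class and member names below are illustrative stand-ins, not RocksDB's interfaces, and plain closures stand in for io_uring submissions): an async read returns immediately, and a later poll call completes the submitted requests.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Illustrative model of the pattern: ReadAsync() only records ("submits")
// the request, and Poll() later performs and completes all submitted reads.
struct ReadRequest {
  size_t offset = 0;
  size_t len = 0;
  std::string result;  // filled in on completion
};

class AsyncFile {
 public:
  explicit AsyncFile(std::string data) : data_(std::move(data)) {}

  // Submit a read; returns immediately without performing the I/O.
  void ReadAsync(ReadRequest* req) {
    submitted_.push_back([this, req] {
      req->result = data_.substr(req->offset, req->len);
    });
  }

  // Complete all reads submitted so far.
  void Poll() {
    for (auto& fn : submitted_) fn();
    submitted_.clear();
  }

 private:
  std::string data_;
  std::vector<std::function<void()>> submitted_;
};
```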
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9578
      
      Test Plan: In progress
      
      Reviewed By: anand1976
      
      Differential Revision: D34690256
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 291cbd1380a3cb904b726c34c0560d1b2ce44a2e
    • Fix mempurge crash reported in #8958 (#9671) · 7bed6595
      Baptiste Lemaire committed
      Summary:
      Change the `MemPurge` code to address a failure during a crash test reported in https://github.com/facebook/rocksdb/issues/8958.
      
      ### Details and results of the crash investigation:
      These failures happened in a specific scenario where the list of immutable tables was composed of 2 or more memtables, and the last memtable was the output of a previous `Mempurge` operation. Because the `PickMemtablesToFlush` function included a sorting of the memtables (introduced in a previous PR related to the Mempurge project), and because the `VersionEdit` of the flush class is piggybacked onto a single one of these memtables, the `VersionEdit` was not properly selected and applied to the `VersionSet` of the DB. Since the `VersionSet` was not edited properly, the database lost track of the SST file created during the flush process, which was subsequently deleted (and, as you can expect, caused the tests to crash).
      The following command consistently failed, which was quite convenient to investigate the issue:
      `$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done`
      
      ### Solution proposed
      The memtables are no longer sorted based on their `memtableID` in the `PickMemtablesToFlush` function. Additionally, the `next_log_number` of the memtable created as the output of the `Mempurge` function now takes the correct value (the log number of the first memtable being mempurged). Finally, the `VersionEdit` object of the flush class now takes the maximum `next_log_number` of the stack of memtables being flushed, which doesn't change anything when Mempurge is `off` but becomes necessary when Mempurge is `on`.
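The last point of the solution can be sketched as follows (a hypothetical model, not RocksDB's actual `MemTable`/`VersionEdit` types): the log number recorded for the flush must be the maximum `next_log_number` across the whole stack of memtables, because with mempurge on, the mempurged output memtable may carry a different value than the rest.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a memtable picked for flush.
struct FakeMemTable {
  uint64_t id;
  uint64_t next_log_number;
};

// The flush's VersionEdit should record the max next_log_number of the
// stack; with mempurge off all values are equal, so this is a no-op change.
uint64_t MaxNextLogNumber(const std::vector<FakeMemTable>& to_flush) {
  uint64_t max_log = 0;
  for (const auto& m : to_flush) {
    max_log = std::max(max_log, m.next_log_number);
  }
  return max_log;
}
```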
      
      ### Testing of the solution
      The following command no longer fails:
      ``$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done``
      Additionally, I ran `db_crashtest` (`whitebox` and `blackbox`) for 2.5 hours with MemPurge on and did not observe any crash.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9671
      
      Reviewed By: pdillinger
      
      Differential Revision: D34697424
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: d1ab675b361904351ac81a35c184030e52222874
  4. 10 Mar 2022 — 3 commits
  5. 09 Mar 2022 — 5 commits
    • Support user-defined timestamps in write-committed txns (#9629) · 3b6dc049
      Yanqin Jin committed
      Summary:
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9629
      
      Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are
      locked upon the first operation that writes the key or declares the intention of writing.
      For example, `PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`, and
      `PessimisticTransaction::SingleDelete()` will write to or delete a key, while
      `PessimisticTransaction::GetForUpdate()` is used by the application to indicate
      to RocksDB that the transaction intends to perform a write operation later
      in the same transaction.
      Pessimistic transactions support two-phase commit (2PC). A transaction can be
      `Prepare()`'d and then `Commit()`'d. The prepare phase is similar to a promise: once
      `Prepare()` succeeds, the transaction has acquired the necessary resources to commit.
      The resources include locks, persistence of the WAL, etc.
      Write-committed is the default pessimistic transaction implementation. In a
      RocksDB write-committed transaction, `Prepare()` writes data to the WAL as a prepare
      section. `Commit()` writes a commit marker to the WAL and then writes data to the
      memtables. While writing to the memtables, different keys in the transaction's write batch
      are assigned different sequence numbers in ascending order.
      Until commit/rollback, the transaction holds locks on the keys so that no other transaction
      can write to the same keys. Furthermore, the keys' sequence numbers represent the order
      in which they are committed and should be made visible. This is convenient for us to
      implement support for user-defined timestamps.
      Since column families with and without timestamps can co-exist in the same database,
      a transaction may or may not involve timestamps. Based on this observation, we add two
      optional members to each `PessimisticTransaction`, `read_timestamp_` and
      `commit_timestamp_`. If no key in the transaction's write batch has a timestamp, then
      setting these two variables has no effect. For the rest of this commit message, we discuss
      only the cases when these two variables are meaningful.
      
      `read_timestamp_` is used mainly for validation, and should be set before the first call to
      `GetForUpdate()`; otherwise, the latter will return a non-ok status. `GetForUpdate()` calls
      `TryLock()`, which can verify whether another transaction has written the same key between
      `read_timestamp_` and this call to `GetForUpdate()`. If another transaction has indeed
      written the same key, then validation fails, and RocksDB allows this transaction to
      refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()`
      with a different timestamp to read, but the result of the read should not be used to
      determine data that will be written later.
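The validation rule just described can be condensed into a small, hypothetical sketch (names below are illustrative, not RocksDB's `TryLock()` internals): validation fails if some committed write stamped the key after the transaction's read timestamp, and the transaction may then retry with a refined (larger) read timestamp.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Illustrative model of timestamp-based validation: per key we track the
// newest commit timestamp; a GetForUpdate()-style validation passes only if
// no one committed the key after our read timestamp.
struct TsValidator {
  std::map<std::string, uint64_t> last_commit_ts;  // key -> newest commit ts

  bool ValidateForUpdate(const std::string& key, uint64_t read_ts) const {
    auto it = last_commit_ts.find(key);
    // OK if the key was never written, or was written no later than read_ts.
    return it == last_commit_ts.end() || it->second <= read_ts;
  }
};
```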
      
      `commit_timestamp_` must be set after finishing writing and before transaction commit.
      This applies to both 2PC and non-2PC cases. In the case of 2PC, it is usually set after
      the prepare phase succeeds.
      
      We currently require that the commit timestamp be chosen after all keys are locked. This
      means we disallow the `TransactionDB`-level APIs if user-defined timestamps are used
      by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`,
      `PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`,
      etc. will return a non-ok status because they specify timestamps before locking the keys.
      Users are also prompted to use the `Transaction` APIs when they receive the non-ok status.
      
      Reviewed By: ltamasi
      
      Differential Revision: D31822445
      
      fbshipit-source-id: b82abf8e230216dc89cc519564a588224a88fd43
    • Rate-limit automatic WAL flush after each user write (#9607) · ca0ef54f
      Hui Xiao committed
      Summary:
      **Context:**
      WAL flush is currently not rate-limited by `Options::rate_limiter`. This PR provides rate limiting for the auto WAL flush, the one that automatically happens after each user write operation (i.e., `Options::manual_wal_flush == false`), by adding `WriteOptions::rate_limiter_priority`.
      
      Note that we are NOT rate-limiting WAL flushes that do NOT automatically happen after each user write, such as `Options::manual_wal_flush == true` + manual `FlushWAL()` (rate-limiting multiple WAL flushes), for the benefits of:
      - being consistent with [ReadOptions::rate_limiter_priority](https://github.com/facebook/rocksdb/blob/7.0.fb/include/rocksdb/options.h#L515)
      - being able to turn off rate-limiting for some WAL flushes but not all (e.g., the WAL flush of a critical user write like a service's heartbeat)
      
      `WriteOptions::rate_limiter_priority` currently only accepts `Env::IO_USER` and `Env::IO_TOTAL` due to an implementation constraint.
      - The constraint is that we currently queue parallel writes (including WAL writes) based on a FIFO policy that does not factor rate-limiter priority into this layer's scheduling. If we allowed lower priorities such as `Env::IO_HIGH/MID/LOW`, and writes specified with lower priorities arrived before ones specified with higher priorities (even just by a tiny bit in arrival time), the former would block the latter, leading to a "priority inversion" issue that contradicts what we promise for rate-limiting priority. Therefore we only allow `Env::IO_USER` and `Env::IO_TOTAL` for now, until that scheduling is improved.
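The restriction above amounts to a simple admission check, sketched here with a hypothetical helper and enum (not the actual RocksDB code): only the "no rate limiting" and "user" priorities are accepted for WAL-write rate limiting, to sidestep the priority-inversion problem in the FIFO write queue.

```cpp
// Illustrative priority enum mirroring Env::IOPriority's levels.
enum class IOPriority { kIOLow, kIOMid, kIOHigh, kIOUser, kIOTotal };

// Only IO_USER and IO_TOTAL (rate limiting disabled) are accepted for the
// automatic WAL flush; lower priorities could block higher-priority writes
// queued behind them in the FIFO write queue.
bool IsAllowedForAutoWalFlush(IOPriority pri) {
  return pri == IOPriority::kIOUser || pri == IOPriority::kIOTotal;
}
```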
      
      A pre-requisite to this feature is to support operation-level rate limiting in `WritableFileWriter`, which is also included in this PR.
      
      **Summary:**
      - Renamed test suite `DBRateLimiterTest` to `DBRateLimiterOnReadTest` to make room for a new test suite
      - Accept `rate_limiter_priority` in `WritableFileWriter`'s private and public write functions
      - Passed `WriteOptions::rate_limiter_priority` to `WritableFileWriter` in the path of the automatic WAL flush.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9607
      
      Test Plan:
      - Added new unit test to verify existing flush/compaction rate-limiting does not break, since `DBTest, RateLimitingTest` is disabled and current db-level rate-limiting tests focus on read only (e.g, `db_rate_limiter_test`, `DBTest2, RateLimitedCompactionReads`).
      - Added new unit test `DBRateLimiterOnWriteWALTest, AutoWalFlush`
      - `strace -ftt -e trace=write ./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -rate_limit_auto_wal_flush=1 -rate_limiter_bytes_per_sec=15 -rate_limiter_refill_period_us=1000000 -write_buffer_size=100000000 -disable_auto_compactions=1 -num=100`
         - verified that WAL flushes (i.e., the _write_ system call) were chunked into 15 bytes and each _write_ was roughly 1 second apart
         - verified that the chunking disappeared when `-rate_limit_auto_wal_flush=0`
      - crash test: `python3 tools/db_crashtest.py blackbox --disable_wal=0  --rate_limit_auto_wal_flush=1 --rate_limiter_bytes_per_sec=10485760 --interval=10` killed as normal
      
      **Benchmarked on flush/compaction to ensure no performance regression:**
      - compaction with rate-limiting  (see table 1, avg over 1280-run):  pre-change: **915635 micros/op**; post-change:
         **907350 micros/op (improved by 0.106%)**
      ```
      #!/bin/bash
      TEST_TMPDIR=/dev/shm/testdb
      START=1
      NUM_DATA_ENTRY=8
      N=10
      
      rm -f compact_bmk_output.txt compact_bmk_output_2.txt dont_care_output.txt
      for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
      do
          NUM_RUN=$(($N*(2**($i-1))))
          for j in $(eval echo "{$START..$NUM_RUN}")
          do
             ./db_bench --benchmarks=fillrandom -db=$TEST_TMPDIR -disable_auto_compactions=1 -write_buffer_size=6710886 > dont_care_output.txt && ./db_bench --benchmarks=compact -use_existing_db=1 -db=$TEST_TMPDIR -level0_file_num_compaction_trigger=1 -rate_limiter_bytes_per_sec=100000000 | egrep 'compact'
          done > compact_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' compact_bmk_output.txt >> compact_bmk_output_2.txt
      done
      ```
      - compaction w/o rate-limiting  (see table 2, avg over 640-run):  pre-change: **822197 micros/op**; post-change: **823148 micros/op (regressed by 0.12%)**
      ```
      Same as above script, except that -rate_limiter_bytes_per_sec=0
      ```
      - flush with rate-limiting (see table 3, avg over 320-run, run on the [patch](https://github.com/hx235/rocksdb/commit/ee5c6023a9f6533fab9afdc681568daa21da4953) to augment current db_bench ): pre-change: **745752 micros/op**; post-change: **745331 micros/op (regressed by 0.06 %)**
      ```
       #!/bin/bash
      TEST_TMPDIR=/dev/shm/testdb
      START=1
      NUM_DATA_ENTRY=8
      N=10
      
      rm -f flush_bmk_output.txt flush_bmk_output_2.txt
      
      for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
      do
          NUM_RUN=$(($N*(2**($i-1))))
          for j in $(eval echo "{$START..$NUM_RUN}")
          do
             ./db_bench -db=$TEST_TMPDIR -write_buffer_size=1048576000 -num=1000000 -rate_limiter_bytes_per_sec=100000000 -benchmarks=fillseq,flush | egrep 'flush'
          done > flush_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' flush_bmk_output.txt >> flush_bmk_output_2.txt
      done
      
      ```
      - flush w/o rate-limiting (see table 4, avg over 320-run, run on the [patch](https://github.com/hx235/rocksdb/commit/ee5c6023a9f6533fab9afdc681568daa21da4953) to augment current db_bench): pre-change: **487512 micros/op**, post-change: **485856 micros/op (improved by 0.34%)**
      ```
      Same as above script, except that -rate_limiter_bytes_per_sec=0
      ```
      
      | table 1 - compact with rate-limiting|
      #-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
      -- | -- | -- | -- | -- | --
      10 | 896978 | 16046.9 | 901242 | 15670.9 | 0.475373978
      20 | 893718 | 15813 | 886505 | 17544.7 | -0.8070778478
      40 | 900426 | 23882.2 | 894958 | 15104.5 | -0.6072681153
      80 | 906635 | 21761.5 | 903332 | 23948.3 | -0.3643141948
      160 | 898632 | 21098.9 | 907583 | 21145 | 0.9960695813
      3.20E+02 | 905252 | 22785.5 | 908106 | 25325.5 | 0.3152713278
      6.40E+02 | 905213 | 23598.6 | 906741 | 21370.5 | 0.1688000504
      **1.28E+03** | **908316** | **23533.1** | **907350** | **24626.8** | **-0.1063506533**
      average over #-run | 901896.25 | 21064.9625 | 901977.125 | 20592.025 | 0.008967217682
      
      | table 2 - compact w/o rate-limiting|
      #-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
      -- | -- | -- | -- | -- | --
      10 | 811211 | 26996.7 | 807586 | 28456.4 | -0.4468627768
      20 | 815465 | 14803.7 | 814608 | 28719.7 | -0.105093413
      40 | 809203 | 26187.1 | 797835 | 25492.1 | -1.404839082
      80 | 822088 | 28765.3 | 822192 | 32840.4 | 0.01265071379
      160 | 821719 | 36344.7 | 821664 | 29544.9 | -0.006693285661
      3.20E+02 | 820921 | 27756.4 | 821403 | 28347.7 | 0.05871454135
      **6.40E+02** | **822197** | **28960.6** | **823148** | **30055.1** | **0.1156657103**
      average over #-run | 8.18E+05 | 2.71E+04 | 8.15E+05 | 2.91E+04 |  -0.25
      
      | table 3 - flush with rate-limiting|
      #-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
      -- | -- | -- | -- | -- | --
      10 | 741721 | 11770.8 | 740345 | 5949.76 | -0.1855144994
      20 | 735169 | 3561.83 | 743199 | 9755.77 | 1.09226586
      40 | 743368 | 8891.03 | 742102 | 8683.22 | -0.1703059588
      80 | 742129 | 8148.51 | 743417 | 9631.58| 0.1735547324
      160 | 749045 | 9757.21 | 746256 | 9191.86 | -0.3723407806
      **3.20E+02** | **745752** | **9819.65** | **745331** | **9840.62** | **-0.0564530836**
      6.40E+02 | 749006 | 11080.5 | 748173 | 10578.7 | -0.1112140624
      average over #-run | 743741.4286 | 9004.218571 | 744117.5714 | 9090.215714 | 0.05057441238
      
      | table 4 - flush w/o rate-limiting|
      #-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op (%)
      -- | -- | -- | -- | -- | --
      10 | 477283 | 24719.6 | 473864 | 12379 | -0.7163464863
      20 | 486743 | 20175.2 | 502296 | 23931.3 | 3.195320734
      40 | 482846 | 15309.2 | 489820 | 22259.5 | 1.444352858
      80 | 491490 | 21883.1 | 490071 | 23085.7 | -0.2887139108
      160 | 493347 | 28074.3 | 483609 | 21211.7 | -1.973864238
      **3.20E+02** | **487512** | **21401.5** | **485856** | **22195.2** | **-0.3396839462**
      6.40E+02 | 490307 | 25418.6 | 485435 | 22405.2 | -0.9936631539
      average over #-run | 4.87E+05 | 2.24E+04 | 4.87E+05 | 2.11E+04 | 0.00E+00
      
      Reviewed By: ajkr
      
      Differential Revision: D34442441
      
      Pulled By: hx235
      
      fbshipit-source-id: 4790f13e1e5c0a95ae1d1cc93ffcf69dc6e78bdd
    • Rename mutable_cf_options to signify explicit copy (#9666) · 27d6ef8e
      Ezgi Çiçek committed
      Summary:
      Signify the explicit copy with a comment and a better name for the variable `mutable_cf_options`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9666
      
      Reviewed By: riversand963
      
      Differential Revision: D34680934
      
      Pulled By: ezgicicek
      
      fbshipit-source-id: b64ef18725fe523835d14ceb4b29bcdfe493f8ed
    • remove redundant assignment code for member state (#9665) · c9674364
      GuKaifeng committed
      Summary:
      Remove a redundant assignment to the member `stats` in the constructor of `ImmutableDBOptions`.
      There were two identical and redundant statements `stats = statistics.get();`, at lines 740 and 748 of the code.
      This commit removes the one at line 740.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9665
      
      Reviewed By: ajkr
      
      Differential Revision: D34686649
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8f246ece382b6845528f4e2c843ce09bb66b2b0f
    • Avoid .trash handling race in db_stress Checkpoint (#9673) · 4a9ae4f7
      Peter Dillinger committed
      Summary:
      The shared SstFileManager in db_stress can create background
      work that races with TestCheckpoint such that DestroyDir fails because
      of a file rename while it is running. This is analogous to the change
      already made for TestBackupRestore.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9673
      
      Test Plan:
      make blackbox_crash_test for a while with
      checkpoint_one_in=100
      
      Reviewed By: ajkr
      
      Differential Revision: D34702215
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ac3e166efa28cba6c6f4b9b391e799394603ebfd
  6. 08 Mar 2022 — 5 commits
  7. 05 Mar 2022 — 5 commits
    • Adding Social Banner in Support of Ukraine (#9652) · f20b6747
      Dmitry Vinnik committed
      Summary:
      Our mission at [Meta Open Source](https://opensource.facebook.com/) is to empower communities through open source, and we believe that it means building a welcoming and safe environment for all. As a part of this work, we are adding this banner in support for Ukraine during this crisis.
      
      ## Testing
      <img width="1080" alt="image" src="https://user-images.githubusercontent.com/12485205/156454047-9c153135-f3a6-41f7-adbe-8139759565ae.png">
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9652
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D34647211
      
      Pulled By: dmitryvinn-fb
      
      fbshipit-source-id: b89cdc7eafcc58b1f503ee8e1939e43bffcb3b3f
    • Test refactoring for Backups+Temperatures (#9655) · ce60d0cb
      Peter Dillinger committed
      Summary:
      In preparation for more support for file Temperatures in BackupEngine,
      this change does some test refactoring:
      * Move DBTest2::BackupFileTemperature test to
      BackupEngineTest::FileTemperatures, with some updates to make it work
      in the new home. This test will soon be expanded for deeper backup work.
      * Move FileTemperatureTestFS from db_test2.cc to db_test_util.h, to
      support sharing because of above moved test, but split off the "no link"
      part to the test needing it.
      * Use custom FileSystems in backupable_db_test rather than custom Envs,
      because going through Env file interfaces doesn't support temperatures.
      * Fix RemapFileSystem to map DirFsyncOptions::renamed_new_name
      parameter to FsyncWithDirOptions, which was required because this
      limitation caused a crash only after moving to higher fidelity of
      FileSystem interface (vs. LegacyDirectoryWrapper throwing away some
      parameter details)
      * `backupable_options_` -> `engine_options_` as part of the ongoing
      work to get rid of the obsolete "backupable" naming.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9655
      
      Test Plan: test code updates only
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D34622183
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f24b7a596a89b9e089e960f4e5d772575513e93f
    • Attempt to deflake DBLogicalBlockSizeCacheTest.CreateColumnFamilies (#9516) · fc61e98a
      Hui Xiao committed
      Summary:
      **Context:**
      `DBLogicalBlockSizeCacheTest.CreateColumnFamilies` is flaky due to a rare occurrence of the assertion failure below
      ```
      db/db_logical_block_size_cache_test.cc:210
      Expected equality of these values:
        1
        cache_->GetRefCount(cf_path_0_)
          Which is: 2
      ```
      
      Root cause: `ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[0]));` in the test may not successfully decrease the ref count of `cf_path_0_`, since the decrease only happens during the clean-up of `ColumnFamilyData` once nothing references it anymore, which may not be the case when `db->DestroyColumnFamilyHandle(cfs[0])` is called: background work such as `DumpStats()` can hold a reference to that `ColumnFamilyData` (suggested and repro'd by ajkr). The same applies to `ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[1]));`.
      
      See following for a deterministic repro:
      ```
       diff --git a/db/db_impl/db_impl.cc b/db/db_impl/db_impl.cc
      index 196b428a3..4e7a834c4 100644
       --- a/db/db_impl/db_impl.cc
      +++ b/db/db_impl/db_impl.cc
      @@ -956,10 +956,16 @@ void DBImpl::DumpStats() {
               // near-atomically.
               // Get a ref before unlocking
               cfd->Ref();
      +        if (cfd->GetName() == "cf1" || cfd->GetName() == "cf2") {
      +          TEST_SYNC_POINT("DBImpl::DumpStats:PostCFDRef");
      +        }
               {
                 InstrumentedMutexUnlock u(&mutex_);
                 cfd->internal_stats()->CollectCacheEntryStats(/*foreground=*/false);
               }
      +        if (cfd->GetName() == "cf1" || cfd->GetName() == "cf2") {
      +          TEST_SYNC_POINT("DBImpl::DumpStats::PreCFDUnrefAndTryDelete");
      +        }
               cfd->UnrefAndTryDelete();
             }
           }
       diff --git a/db/db_logical_block_size_cache_test.cc b/db/db_logical_block_size_cache_test.cc
      index 1057871c9..c3872c036 100644
       --- a/db/db_logical_block_size_cache_test.cc
      +++ b/db/db_logical_block_size_cache_test.cc
      @@ -9,6 +9,7 @@
       #include "env/io_posix.h"
       #include "rocksdb/db.h"
       #include "rocksdb/env.h"
      +#include "test_util/sync_point.h"
      
       namespace ROCKSDB_NAMESPACE {
       class EnvWithCustomLogicalBlockSizeCache : public EnvWrapper {
      @@ -183,6 +184,15 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
         ASSERT_EQ(1, cache_->GetRefCount(dbname_));
      
         std::vector<ColumnFamilyHandle*> cfs;
      +  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
      +  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
      +      {{"DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PostSetupTwoCFH",
      +        "DBImpl::DumpStats:StartRunning"},
      +       {"DBImpl::DumpStats:PostCFDRef",
      +        "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PreDeleteTwoCFH"},
      +       {"DBLogicalBlockSizeCacheTest::CreateColumnFamilies::"
      +        "PostFinishCheckingRef",
      +        "DBImpl::DumpStats::PreCFDUnrefAndTryDelete"}});
         ASSERT_OK(db->CreateColumnFamilies(cf_options, {"cf1", "cf2"}, &cfs));
         ASSERT_EQ(2, cache_->Size());
         ASSERT_TRUE(cache_->Contains(dbname_));
      @@ -190,7 +200,7 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
         ASSERT_TRUE(cache_->Contains(cf_path_0_));
         ASSERT_EQ(2, cache_->GetRefCount(cf_path_0_));
         }
      
          // Delete one handle will not drop cache because another handle is still
         // referencing cf_path_0_.
      +  TEST_SYNC_POINT(
      +      "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PostSetupTwoCFH");
      +  TEST_SYNC_POINT(
      +      "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PreDeleteTwoCFH");
         ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[0]));
         ASSERT_EQ(2, cache_->Size());
         ASSERT_TRUE(cache_->Contains(dbname_));
      @@ -209,16 +221,20 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
         ASSERT_TRUE(cache_->Contains(cf_path_0_));
          // Will fail
         ASSERT_EQ(1, cache_->GetRefCount(cf_path_0_));
      
         // Delete the last handle will drop cache.
         ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[1]));
         ASSERT_EQ(1, cache_->Size());
         ASSERT_TRUE(cache_->Contains(dbname_));
         // Will fail
         ASSERT_EQ(1, cache_->GetRefCount(dbname_));
      
      +  TEST_SYNC_POINT(
      +      "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::"
      +      "PostFinishCheckingRef");
         delete db;
         ASSERT_EQ(0, cache_->Size());
         ASSERT_OK(DestroyDB(dbname_, options,
             {{"cf1", cf_options}, {"cf2", cf_options}}));
      +  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
       }
      ```
      
      **Summary**
      - Removed the flaky assertion
      - Clarified the comments for the test
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9516
      
      Test Plan:
      - CI
      - Monitor for future flakiness
      
      Reviewed By: ajkr
      
      Differential Revision: D34055232
      
      Pulled By: hx235
      
      fbshipit-source-id: 9bf83ae5fa88bf6fc829876494d4692082e4c357
    • Dynamic toggling of BlockBasedTableOptions::detect_filter_construct_corruption (#9654) · 4a776d81
      Hui Xiao committed
      Summary:
      **Context/Summary:**
      As requested, `BlockBasedTableOptions::detect_filter_construct_corruption` can now be dynamically configured using `DB::SetOptions` after this PR
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9654
      
      Test Plan: - New unit test
      
      Reviewed By: pdillinger
      
      Differential Revision: D34622609
      
      Pulled By: hx235
      
      fbshipit-source-id: c06773ef3d029e6bf1724d3a72dffd37a8ec66d9
    • Avoid usage of ReopenWritableFile in db_stress (#9649) · 3362a730
      anand76 committed
      Summary:
      The UniqueIdVerifier constructor currently calls ReopenWritableFile on
      the FileSystem, which might not be supported. Instead of relying on
      reopening the unique IDs file for writing, create a new file and copy
      the original contents.
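The portable pattern described above can be sketched as follows (file names and the helper are hypothetical, and plain `fstream` stands in for the `FileSystem` interface): instead of reopening an existing file for append, create a fresh file, copy the original contents into it, and continue writing there.

```cpp
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Sketch: copy src into a newly created dst, then append new data to dst.
// This avoids relying on an append-reopen operation the underlying
// FileSystem might not support.
bool CopyThenAppend(const std::string& src, const std::string& dst,
                    const std::string& extra) {
  std::ifstream in(src, std::ios::binary);
  std::ofstream out(dst, std::ios::binary | std::ios::trunc);
  if (!in.is_open() || !out.is_open()) return false;
  out << in.rdbuf();  // copy original contents
  out << extra;       // then append the new data to the new file
  return out.good();
}
```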
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9649
      
      Test Plan: Run db_stress
      
      Reviewed By: pdillinger
      
      Differential Revision: D34572307
      
      Pulled By: anand1976
      
      fbshipit-source-id: 3a777908582d79dae57488d4278bad126774f698
  8. 04 Mar 2022 — 1 commit
    • Improve build speed (#9605) · 67542bfa
      Jay Zhuang committed
      Summary:
      Improve the CI build speed:
      - split the macos tests to 2 parallel jobs
      - split tsan tests to 2 parallel jobs
      - move non-shm tests to nightly build
      - slow jobs use larger machines
      - fast jobs use smaller machines
      - add microbench to no-test jobs
      - add run-microbench to nightly build
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9605
      
      Reviewed By: riversand963
      
      Differential Revision: D34358982
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: d5091b3f4ef6d25c5c37920fb614f3342ee60e4a
  9. 03 Mar 2022 — 4 commits
    • Fix bug causing incorrect data returned by snapshot read (#9648) · 659a16d5
      Yanqin Jin committed
      Summary:
      This bug affects use cases that meet the following conditions:
      - the DB has only the default column family, or disables the WAL, and
      - it has at least one event listener.
      Atomic flush is NOT affected.
      
      If the above conditions are met, then RocksDB can release the db mutex before picking all the
      existing memtables to flush. In the meantime, a snapshot can be created and the db's sequence
      number can still be incremented. The upcoming flush will ignore this snapshot.
      A later read using this snapshot can return an incorrect result.
      
      To fix this issue, we call the listeners' callbacks after picking the memtables so that we avoid
      creating snapshots during this interval.
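The ordering argument can be made concrete with a single-threaded toy model (all names hypothetical; this is not RocksDB code): if the flush boundary is picked *before* the listener callbacks run, no snapshot created during the callbacks can fall below the boundary and be ignored by the flush; picking *after* the callbacks admits exactly the buggy interleaving.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy DB: a sequence counter and a list of snapshot sequence numbers.
struct MiniDB {
  uint64_t seq = 0;
  std::vector<uint64_t> snapshots;

  uint64_t Put() { return ++seq; }
  uint64_t GetSnapshot() { snapshots.push_back(seq); return seq; }

  // Simulate one flush. The listener callback may Put() and GetSnapshot(),
  // standing in for concurrent activity while the db mutex is released.
  // Returns true iff no snapshot created during the callback falls strictly
  // below the picked flush boundary (i.e. none would be ignored).
  template <typename Listener>
  bool FlushRespectsSnapshots(bool pick_before_notify, Listener listener) {
    size_t before = snapshots.size();
    uint64_t picked = 0;
    if (pick_before_notify) picked = seq;  // fixed order: pick first
    listener(*this);                        // callbacks (mutex released)
    if (!pick_before_notify) picked = seq;  // buggy order: pick afterwards
    for (size_t i = before; i < snapshots.size(); ++i) {
      if (snapshots[i] < picked) return false;  // snapshot ignored by flush
    }
    return true;
  }
};
```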
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9648
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D34555456
      
      Pulled By: riversand963
      
      fbshipit-source-id: 1438981e9f069a5916686b1a0ad7627f734cf0ee
    • Do not rely on ADL when invoking std::max_element (#9608) · 73fd589b
      Yuriy Chernyshov committed
      Summary:
      Certain STLs use raw pointers and ADL does not work for them.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9608
      
      Reviewed By: ajkr
      
      Differential Revision: D34583012
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7de6bbc8a080c3e7243ce0d758fe83f1663168aa
      73fd589b
    • J
      Fix corruption error when compressing blob data with zlib. (#9572) · 926ee138
      Committed by jingkai.yuan
      Summary:
      The plain data length may not be a big enough output buffer size if compression actually expands the data. So use deflateBound() to get an upper limit on the compressed output size before calling deflate().
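      The sizing pattern can be sketched with zlib's one-shot helpers, compressBound() and compress2() (a simplified stand-in for the deflateBound()/deflate() stream code the fix actually touches):
      
      ```cpp
      #include <zlib.h>
      #include <cassert>
      #include <vector>
      
      // Compress `src`, sizing the output buffer with compressBound() so the call
      // cannot fail with Z_BUF_ERROR even when compression expands the data
      // (e.g. for short or incompressible inputs).
      std::vector<Bytef> CompressWithBound(const std::vector<Bytef>& src) {
        uLong bound = compressBound(static_cast<uLong>(src.size()));
        std::vector<Bytef> out(bound);
        uLongf out_len = bound;
        int rc = compress2(out.data(), &out_len, src.data(),
                           static_cast<uLong>(src.size()), Z_BEST_COMPRESSION);
        assert(rc == Z_OK);
        out.resize(out_len);  // shrink to the actual compressed size
        return out;
      }
      
      int main() {
        std::vector<Bytef> tiny = {'a', 'b', 'c'};  // short input: likely to expand
        std::vector<Bytef> compressed = CompressWithBound(tiny);
        assert(!compressed.empty());
        return 0;
      }
      ```
      
      Sizing the buffer by the input length instead of the bound is exactly what risks corruption when the compressed output is larger than the input.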
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9572
      
      Reviewed By: riversand963
      
      Differential Revision: D34326475
      
      Pulled By: ajkr
      
      fbshipit-source-id: 4b679cb7a83a62782a127785b4d5eb9aa4646449
      926ee138
    • J
      Unschedule manual compaction from thread-pool queue (#9625) · db864796
      Committed by Jay Zhuang
      Summary:
      PR https://github.com/facebook/rocksdb/issues/9557 introduced a race condition between the
      manual compaction foreground thread and the background compaction thread.
      This PR adds the ability to truly unschedule a manual compaction from the
      thread-pool queue by using a distinct tag name for manual compactions versus
      other tasks.
      It also fixes an issue where db `close()` didn't cancel the manual compaction thread.
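      A hypothetical sketch of tag-based unscheduling (the queue shape and names are made up, not the actual RocksDB ThreadPool API):
      
      ```cpp
      #include <cassert>
      #include <deque>
      #include <functional>
      #include <utility>
      
      // Each queued task carries a tag. Giving manual compactions their own tag
      // lets UnSchedule() remove exactly those entries while leaving tasks queued
      // under other tags untouched.
      struct TaskQueue {
        std::deque<std::pair<const void*, std::function<void()>>> queue_;
      
        void Schedule(const void* tag, std::function<void()> fn) {
          queue_.emplace_back(tag, std::move(fn));
        }
      
        // Remove every pending task queued under `tag`; return how many.
        int UnSchedule(const void* tag) {
          int removed = 0;
          for (auto it = queue_.begin(); it != queue_.end();) {
            if (it->first == tag) {
              it = queue_.erase(it);
              ++removed;
            } else {
              ++it;
            }
          }
          return removed;
        }
      };
      
      int main() {
        TaskQueue tq;
        int manual_tag, auto_tag;  // distinct addresses serve as distinct tags
        tq.Schedule(&manual_tag, [] {});
        tq.Schedule(&auto_tag, [] {});
        tq.Schedule(&manual_tag, [] {});
        // Cancelling the manual compaction removes only its own entries.
        assert(tq.UnSchedule(&manual_tag) == 2);
        assert(tq.queue_.size() == 1);
        return 0;
      }
      ```
      
      With a single shared tag, unscheduling would have no way to distinguish a cancelled manual compaction from still-wanted background work.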
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9625
      
      Test Plan: unit tests no longer hang
      
      Reviewed By: ajkr
      
      Differential Revision: D34410811
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: cb14065eabb8cf1345fa042b5652d4f788c0c40c
      db864796
  10. 02 Mar 2022, 9 commits
  11. 01 Mar 2022, 2 commits
    • A
      Improve build detect for RISCV (#9366) · 7d7e88c7
      Committed by Adam Retter
      Summary:
      Related to: https://github.com/facebook/rocksdb/pull/9215
      
      * Adds build_detect_platform support for RISCV on Linux (at least on SiFive Unmatched platforms)
      
      This still leaves some linking issues remaining on RISCV (e.g. when building `db_test`):
      ```
      /usr/bin/ld: ./librocksdb_debug.a(memtable.o): in function `__gnu_cxx::new_allocator<char>::deallocate(char*, unsigned long)':
      /usr/include/c++/10/ext/new_allocator.h:133: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: ./librocksdb_debug.a(memtable.o): in function `std::__atomic_base<bool>::compare_exchange_weak(bool&, bool, std::memory_order, std::memory_order)':
      /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: ./librocksdb_debug.a(memtable.o):/usr/include/c++/10/bits/atomic_base.h:464: more undefined references to `__atomic_compare_exchange_1' follow
      /usr/bin/ld: ./librocksdb_debug.a(db_impl.o): in function `rocksdb::DBImpl::NewIteratorImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyData*, unsigned long, rocksdb::ReadCallback*, bool, bool)':
      /home/adamretter/rocksdb/db/db_impl/db_impl.cc:3019: undefined reference to `__atomic_exchange_1'
      /usr/bin/ld: ./librocksdb_debug.a(write_thread.o): in function `rocksdb::WriteThread::Writer::CreateMutex()':
      /home/adamretter/rocksdb/./db/write_thread.h:205: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: ./librocksdb_debug.a(write_thread.o): in function `rocksdb::WriteThread::SetState(rocksdb::WriteThread::Writer*, unsigned char)':
      /home/adamretter/rocksdb/db/write_thread.cc:222: undefined reference to `__atomic_compare_exchange_1'
      collect2: error: ld returned 1 exit status
      make: *** [Makefile:1449: db_test] Error 1
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9366
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D34377664
      
      Pulled By: mrambacher
      
      fbshipit-source-id: c86f9d0cd1cb0c18de72b06f1bf5847f23f51118
      7d7e88c7
    • A
      Handle failures in block-based table size/offset approximation (#9615) · 0a89cea5
      Committed by Andrew Kryczka
      Summary:
      In crash test with fault injection, we were seeing stack traces like the following:
      
      ```
      #3  0x00007f75f763c533 in __GI___assert_fail (assertion=assertion@entry=0x1c5b2a0 "end_offset >= start_offset", file=file@entry=0x1c580a0 "table/block_based/block_based_table_reader.cc", line=line@entry=3245,
      function=function@entry=0x1c60e60 "virtual uint64_t rocksdb::BlockBasedTable::ApproximateSize(const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::TableReaderCaller)") at assert.c:101
      #4  0x00000000010ea9b4 in rocksdb::BlockBasedTable::ApproximateSize (this=<optimized out>, start=..., end=..., caller=<optimized out>) at table/block_based/block_based_table_reader.cc:3224
      #5  0x0000000000be61fb in rocksdb::TableCache::ApproximateSize (this=0x60f0000161b0, start=..., end=..., fd=..., caller=caller@entry=rocksdb::kCompaction, internal_comparator=..., prefix_extractor=...) at db/table_cache.cc:719
      #6  0x0000000000c3eaec in rocksdb::VersionSet::ApproximateSize (this=<optimized out>, v=<optimized out>, f=..., start=..., end=..., caller=<optimized out>) at ./db/version_set.h:850
      #7  0x0000000000c6ebc3 in rocksdb::VersionSet::ApproximateSize (this=<optimized out>, options=..., v=v@entry=0x621000047500, start=..., end=..., start_level=start_level@entry=0, end_level=<optimized out>, caller=<optimized out>)
      at db/version_set.cc:5657
      #8  0x000000000166e894 in rocksdb::CompactionJob::GenSubcompactionBoundaries (this=<optimized out>) at ./include/rocksdb/options.h:1869
      #9  0x000000000168c526 in rocksdb::CompactionJob::Prepare (this=this@entry=0x7f75f3ffcf00) at db/compaction/compaction_job.cc:546
      ```
      
      The problem occurred in `ApproximateSize()` when the index `Seek()` for the first `ApproximateDataOffsetOf()` encountered an I/O error while the second `Seek()` did not. In the old code that scenario caused `start_offset == data_size`, making it easy to trip the assertion that `end_offset >= start_offset`.
      
      The fix is to set `start_offset == 0` when the first index `Seek()` fails, and `end_offset == data_size` when the second index `Seek()` fails. I doubt these give an "on average correct" answer for how this function is used, but I/O errors in index seeks are hopefully rare, it looked consistent with what was already there, and it was easier to calculate.
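      The fallback logic can be sketched as follows (a simplified model with hypothetical names, not the actual block_based_table_reader code):
      
      ```cpp
      #include <cassert>
      #include <cstdint>
      
      // On an I/O error during the first index Seek(), fall back to offset 0; on
      // an error during the second, fall back to the total data size. Both
      // fallbacks keep end_offset >= start_offset, which the old behavior
      // (start_offset falling back to data_size) could violate.
      uint64_t ApproximateSpan(bool start_seek_ok, uint64_t start_seek_offset,
                               bool end_seek_ok, uint64_t end_seek_offset,
                               uint64_t data_size) {
        uint64_t start_offset = start_seek_ok ? start_seek_offset : 0;
        uint64_t end_offset = end_seek_ok ? end_seek_offset : data_size;
        assert(end_offset >= start_offset);  // the assertion that used to trip
        return end_offset - start_offset;
      }
      
      int main() {
        // First seek fails, second succeeds: span measured from the file start.
        assert(ApproximateSpan(false, 0, true, 500, 1000) == 500);
        // Second seek fails, first succeeds: span extends to the end of the data.
        assert(ApproximateSpan(true, 100, false, 0, 1000) == 900);
        return 0;
      }
      ```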
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9615
      
      Test Plan:
      Ran the repro command for a while and stopped seeing core dumps:
      
      ```
      $ while !  ./db_stress --block_size=128 --cache_size=32768 --clear_column_family_one_in=0 --column_families=1 --continuous_verification_interval=0 --db=/dev/shm/rocksdb_crashtest --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --index_type=2 --iterpercent=10  --kill_random_test=18887 --max_key=1000000 --max_bytes_for_level_base=2048576 --nooverwritepercent=1 --open_files=-1 --open_read_fault_one_in=32 --ops_per_thread=1000000 --prefixpercent=5 --read_fault_one_in=0 --readpercent=45 --reopen=0 --skip_verifydb=1 --subcompactions=2 --target_file_size_base=524288 --test_batches_snapshots=0 --value_size_mult=32 --write_buffer_size=524288 --writepercent=35  ; do : ; done
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D34383069
      
      Pulled By: ajkr
      
      fbshipit-source-id: fac26c3b20ea962e75387515ba5f2724dc48719f
      0a89cea5