1. 12 May 2023 (1 commit)
    • Support compacting files to different temperatures in FIFO compaction (#11428) · 8827cd06
      Committed by Changyu Bi
      Summary:
      - Add a new option `CompactionOptionsFIFO::file_temperature_age_thresholds` that allows users to specify age thresholds for compacting files to different temperatures. File temperature can be used to store files on different storage media. The new option allows specifying multiple temperature-age pairs, and it uses a struct for each temperature-age pair so that the existing parsing functionality can make the option dynamically settable (a configuration sketch follows this list).
      - Deprecate the old option `age_for_warm` that was added for a similar purpose.
      - Compaction score calculation logic is updated to check if a file needs to be compacted to change its temperature.
      - Some refactoring is done in `FIFOCompactionPicker::PickTemperatureChangeCompaction`.
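      For illustration, here is a minimal sketch of how the new option might be configured. The struct and field names follow this PR's description; that ages are expressed in seconds (like `age_for_warm`) is an assumption:
      
      ```
      #include "rocksdb/advanced_options.h"
      #include "rocksdb/options.h"
      
      rocksdb::Options MakeFifoOptions() {
        rocksdb::Options options;
        options.compaction_style = rocksdb::kCompactionStyleFIFO;
        // Hypothetical policy: files older than 1 hour move to warm storage,
        // files older than 1 day move to cold storage (ages assumed in seconds).
        options.compaction_options_fifo.file_temperature_age_thresholds = {
            {rocksdb::Temperature::kWarm, 60 * 60},
            {rocksdb::Temperature::kCold, 24 * 60 * 60}};
        return options;
      }
      ```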
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11428
      
      Test Plan: adapted the unit tests that covered `age_for_warm` to this new option.
      
      Reviewed By: ajkr
      
      Differential Revision: D45611412
      
      Pulled By: cbi42
      
      fbshipit-source-id: 2dc384841f61cc04abb9681e31aa2de0f0b06106
  2. 26 April 2023 (1 commit)
    • Block per key-value checksum (#11287) · 62fc15f0
      Committed by Changyu Bi
      Summary:
      Add the option `block_protection_bytes_per_key` and an implementation of block per key-value checksums. The main changes are (a usage sketch follows the list):
      1. checksum construction and verification in block.cc/h
      2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
      3. unit tests/crash test updates
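      A minimal sketch of enabling the feature, assuming the option lands on `ColumnFamilyOptions` as described (treating 0 as off and small byte counts such as 1 as valid, per the `--block_protection_bytes_per_key=[0|1]` usage below):
      
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      
      rocksdb::Status OpenWithBlockProtection(const std::string& path,
                                              rocksdb::DB** db) {
        rocksdb::Options options;
        options.create_if_missing = true;
        // One byte of per key-value protection inside each block; 0 disables it.
        options.block_protection_bytes_per_key = 1;
        return rocksdb::DB::Open(options, path, db);
      }
      ```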
      
      Tests:
      * Added unit tests
      * Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`
      
      Follow-up (maybe as a separate PR): make sure the corruption status returned from BlockIters is correctly handled.
      
      Performance:
      Turning on block per-KV protection has a non-trivial negative impact on read performance and costs additional memory.
      For memory, each block includes an additional 24 bytes of checksum-related state besides the checksums themselves. For CPU, I set up a DB of size ~1.2GB with 5M keys (32-byte keys and 200-byte values) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache sizes (to mimic various cache hit rates):
      
      ```
      SETUP
      make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
      ./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none
      
      BENCHMARK
      ./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE
      
      The readrandom ops/sec looks like the following:
      Block cache size:   2GB       1.2GB*0.9   1.2GB*0.8   1.2GB*0.5   8MB
      Main                240805    223604      198176      161653      139040
      PR prot_bytes=0     238691    226693      200127      161082      141153
      PR prot_bytes=1     214983    193199      178532      137013      108211
      prot_bytes=1 vs 0   -10%      -15%        -10.8%      -15%        -23%
      ```
      
      The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287
      
      Reviewed By: ajkr
      
      Differential Revision: D43970708
      
      Pulled By: cbi42
      
      fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
  3. 22 April 2023 (1 commit)
    • Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) · 151242ce
      Committed by Hui Xiao
      Summary:
      **Context:**
      The existing stat rocksdb.sst.read.micros does not distinguish between the compaction and flush cases but aggregates them, which is not very helpful for understanding the IO read behavior of each.
      
      **Summary**
      - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
         - Fixed the default histogram in `RandomAccessFileReader`
      - New field `ReadOptions/IOOptions::io_activity`; pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare an `IOOptions` and pass it to `RandomAccessFileReader`
      - Use `thread_status_util` for assertions in `DbStressFSWrapper` for continuous verification that we pass the correct `io_activity` under db open, flush and compaction (a read-out sketch follows this list)
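      A minimal sketch of reading the new histograms; the enum names (`FILE_READ_FLUSH_MICROS`, `FILE_READ_COMPACTION_MICROS`) are assumed to mirror the new stat names:
      
      ```
      #include <cinttypes>
      #include <cstdio>
      
      #include "rocksdb/options.h"
      #include "rocksdb/statistics.h"
      
      // Assumes options.statistics was set via rocksdb::CreateDBStatistics().
      void DumpSstReadHistograms(const rocksdb::Options& options) {
        rocksdb::HistogramData all_reads, flush_reads, compaction_reads;
        options.statistics->histogramData(rocksdb::SST_READ_MICROS, &all_reads);
        options.statistics->histogramData(rocksdb::FILE_READ_FLUSH_MICROS,
                                          &flush_reads);
        options.statistics->histogramData(rocksdb::FILE_READ_COMPACTION_MICROS,
                                          &compaction_reads);
        // Without blobs, all_reads.count should roughly equal
        // flush_reads.count + compaction_reads.count (see Test Plan below).
        printf("sst=%" PRIu64 " flush=%" PRIu64 " compaction=%" PRIu64 "\n",
               all_reads.count, flush_reads.count, compaction_reads.count);
      }
      ```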
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288
      
      Test Plan:
      - **Stress test**
      - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's** (without blobs)
           - They may not match exactly because `HistogramStat::Add` only guarantees atomicity, not accuracy, across threads.
      ```
      ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
      ```
      ```
      // BlockBasedTable
      rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
      rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
      rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689
      
      // PlainTable
      Does not apply
      ```
      - **Db bench 2: performance**
      
      **Read**
      
      SETUP: db with 900 files
      ```
      ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
      ```
      Run till convergence:
      ```
      ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
      ```
      Pre-change
      `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
      Post-change (no regression, -0.3%)
      `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`
      
      **Compaction/Flush** (run till convergence)
      ```
      ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none
      
      rocksdb.sst.read.micros  COUNT : 33820
      rocksdb.sst.read.flush.micros COUNT : 1800
      rocksdb.sst.read.compaction.micros COUNT : 32020
      ```
      Pre-change
      `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec;    0.7 (± 0.1) MB/sec`
      
      Post-change (no regression, ~-0.4%)
      `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec;    0.7 (± 0.1) MB/sec`
      
      Reviewed By: ajkr
      
      Differential Revision: D44007011
      
      Pulled By: hx235
      
      fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
  4. 19 March 2023 (1 commit)
    • New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) · cb584771
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      Users are interested in figuring out what has caused a write stall.
      - Refactor write-stall-related stats out of property `kCFStats` into their own db property `rocksdb.cf-write-stall-stats`, returned as a map or string. For now, this only contains counts of the different combinations of (CF-scope `WriteStallCause`) + (`WriteStallCondition`)
      - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stalls caused by the write buffer manager
      - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped`
      - Expose functions in the new class `WriteStallStatsMapKeys` for examining the above two properties returned as maps (a consumption sketch follows this list)
      - Misc: rename/comment some write stall InternalStats for clarity
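      A minimal sketch of consuming the new property as a map; the property name string comes from this PR, while iterating the raw map here sidesteps the exact `WriteStallStatsMapKeys` accessors:
      
      ```
      #include <iostream>
      #include <map>
      #include <string>
      
      #include "rocksdb/db.h"
      
      void PrintCfWriteStallStats(rocksdb::DB* db) {
        std::map<std::string, std::string> stats;
        // CF-scope write stall stats, e.g. counts per (cause, condition) pair.
        if (db->GetMapProperty(db->DefaultColumnFamily(),
                               "rocksdb.cf-write-stall-stats", &stats)) {
          for (const auto& kv : stats) {
            std::cout << kv.first << " = " << kv.second << "\n";
          }
        }
      }
      ```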
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300
      
      Test Plan:
      - New UT
      - Stress test
      `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1`
      - Perf test: Both converge very slowly at similar rates, but post-change shows higher average ops/sec than pre-change even though they were run at the same time.
      ```
      ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3
      ```
      pre-change:
      ```
      fillseq [AVG 15 runs] : 1176 (± 732) ops/sec;    0.6 (± 0.4) MB/sec
      fillseq      :    1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations;    0.5 MB/s
      fillseq [AVG 16 runs] : 1162 (± 685) ops/sec;    0.6 (± 0.4) MB/sec
      fillseq      :    1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 17 runs] : 1136 (± 646) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 18 runs] : 1118 (± 610) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 19 runs] : 1100 (± 578) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations;    0.3 MB/s
      fillseq [AVG 20 runs] : 1071 (± 551) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 21 runs] : 1059 (± 525) ops/sec;    0.5 (± 0.3) MB/sec
      ```
      post-change:
      ```
      fillseq [AVG 15 runs] : 1226 (± 732) ops/sec;    0.6 (± 0.4) MB/sec
      fillseq      :    1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 16 runs] : 1196 (± 687) ops/sec;    0.6 (± 0.4) MB/sec
      fillseq      :    1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 17 runs] : 1174 (± 647) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 18 runs] : 1156 (± 611) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 19 runs] : 1134 (± 579) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 20 runs] : 1119 (± 550) ops/sec;    0.6 (± 0.3) MB/sec
      fillseq      :    1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations;    0.4 MB/s
      fillseq [AVG 21 runs] : 1106 (± 524) ops/sec;    0.6 (± 0.3) MB/sec
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D44159541
      
      Pulled By: hx235
      
      fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
  5. 28 January 2023 (1 commit)
    • Remove RocksDB LITE (#11147) · 4720ba43
      Committed by sdong
      Summary:
      We haven't been actively maintaining RocksDB LITE recently, and its size has likely grown significantly. We are removing the support.
      
      Most of the changes were made with the following command:
      
      ```
      unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
      ```
      
      by Peter Dillinger. Other changes were applied manually to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and the file db_stress_test_base.cc.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
      
      Test Plan: See CI
      
      Reviewed By: pdillinger
      
      Differential Revision: D42796341
      
      fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
  6. 25 January 2023 (1 commit)
    • Fix data race on `ColumnFamilyData::flush_reason` by letting FlushRequest/Job own flush_reason instead of CFD (#11111) · 86fa2592
      Committed by Hui Xiao
      
      Summary:
      **Context:**
      Concurrent flushes on the same CF can overwrite `ColumnFamilyData::flush_reason` before each other's flush finishes. A symptom is that one CF has a different flush_reason from the others even though all of them are part of the same atomic flush: `db_stress: db/db_impl/db_impl_compaction_flush.cc:423: rocksdb::Status rocksdb::DBImpl::AtomicFlushMemTablesToOutputFiles(const rocksdb::autovector<rocksdb::DBImpl::BGFlushArg>&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority): Assertion cfd->GetFlushReason() == cfds[0]->GetFlushReason() failed.`
      
      **Summary:**
      As suggested by ltamasi, we now refactor so that FlushRequest/Job owns flush_reason, as there is no good way to define `ColumnFamilyData::flush_reason` in the face of concurrent flushes on the same CF (which wasn't the case long ago when `ColumnFamilyData::flush_reason` was first introduced).
      
      **Tests:**
      - new unit test
      - make check
      - aggressive crash test rehearsal
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11111
      
      Reviewed By: ajkr
      
      Differential Revision: D42644600
      
      Pulled By: hx235
      
      fbshipit-source-id: 8589c8184869d3415e5b780c887f877818a5ebaf
  7. 30 December 2022 (1 commit)
  8. 14 December 2022 (1 commit)
    • Sort L0 files by newly introduced epoch_num (#10922) · 98d5db5c
      Committed by Hui Xiao
      Summary:
      **Context:**
      Sorting L0 files by `largest_seqno` has at least two inconveniences:
      -  File ingestion and compaction involving ingested files can create files whose seqno ranges overlap with existing files'. `force_consistency_check=true` will catch such overlapping seqno ranges, even harmless ones.
          - For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
             - insert k1@1 to memtable m1
             - ingest file s1 with k2@2, ingest file s2 with k3@3
              - insert k4@4 to m1
             - compact files s1, s2 and  result in new file s3 of seqno range [2, 3]
             - flush m1 and result in new file s4 of seqno range [1, 4]. `force_consistency_check=true` will think s4 and s3 have a file reordering corruption that might cause returning an old value of k1
          - However, such caught corruption is a false positive, since s1 and s2 will not have keys overlapping with k1 or anything else inserted into m1 before ingesting file s1, by the requirements of file ingestion (otherwise m1 would be flushed before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
      - Single delete can decrease a file's largest seqno, and ordering by `largest_seqno` can then introduce a wrong ordering and hence file reordering corruption
          - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr  for this example)
              - an existing SST s1 contains only k1@1
              - insert k1@2 to memtable m1
              - ingest file s2 with k3@3, ingest file s3 with k4@4
              - insert single delete k5@5 in m1
              - flush m1 and result in new file s4 of seqno range [2, 5]
              - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
              - compact s4 and result in new file s6 of seqno range [2] due to single delete
          - By the last step, we have file ordering by largest seqno (">" means "newer"): s5 > s6, while s6 contains a newer version of k1's value (i.e., k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent it from happening when ordering by `largest_seqno`.
      
      Therefore, we are redesigning the sorting criteria of L0 files to avoid the above inconveniences. Credit to ajkr, we now introduce `epoch_num`, which describes the order of a file being flushed or ingested/imported (a compaction output file will have the minimum `epoch_num` among its input files'). This avoids the above inconveniences in the following ways:
      - In the first case above, the overlapping-seqno-range check under `force_consistency_check=true` is replaced by an `epoch_number` ordering check. This results in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction), which won't trigger a false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
      - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
      
      **Summary:**
      - Introduce `epoch_number` stored per `ColumnFamilyData` and sort a CF's L0 files by their assigned `epoch_number` instead of by `largest_seqno` (see the schematic after this list).
        - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (and similarly for WriteLevel0TableForRecovery) and file ingestion (except for ingest-behind files, which always get assigned `kReservedEpochNumberForFileIngestedBehind`)
        - A compaction output file is assigned the minimum `epoch_number` among its input files'
            - Refit level: reuse the refitted file's epoch_number
        - Other paths needing `epoch_number` treatment:
           - Import column families: reuse the file's epoch_number if it exists. If not, assign one based on `NewestFirstBySeqNo`
           - Repair: reuse the file's epoch_number if it exists. If not, assign one based on `NewestFirstBySeqNo`.
        - Assigning a new epoch_number to a file and adding this file to the LSM tree should be atomic. This is guaranteed by assigning the epoch_number right upon `VersionEdit::AddFile()`, where the version edit is applied to the LSM tree shape right after, either under the db mutex (e.g., flush, file ingestion, import column family) or because there is only one ongoing edit per CF (e.g., WriteLevel0TableForRecovery, Repair).
        - Assigning the minimum input epoch number to a compaction output file won't misorder L0 files (even through a later `Refit(target_level=0)`). This is because, for every key "k" in the input range, a legitimate compaction covers a contiguous epoch number range of that key. As long as we assign key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction, hence no misorder.
      - Persist the `epoch_number` of each file in the manifest and recover `epoch_number` on db recovery
         - Backward compatibility with old dbs without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files in `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
         - Forward compatibility with the manifest is guaranteed by the flexibility of `NewFileCustomTag`
      - Replace the `force_consistency_check` on L0 with an `epoch_number` ordering check and remove false-positive checks like case 1 with `largest_seqno` above
         - Due to the backward compatibility issue, we might encounter files with missing epoch numbers at the beginning of db recovery. We still use the old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them until we can infer their epoch numbers. See usages of `EpochNumberRequirement`.
      - Remove the fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and its outdated tests for file reordering corruption, because that fix is superseded by this PR.
      - Misc:
         - update existing tests with `epoch_number` so make check will pass
         - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
         - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
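      To make the new ordering concrete, a schematic (not RocksDB's actual code) of how L0 files compare once every file carries an `epoch_number`; the seqno tie-break shown here is an illustrative assumption:
      
      ```
      #include <cstdint>
      
      // Schematic file metadata; RocksDB's real FileMetaData carries much more.
      struct L0File {
        uint64_t epoch_number;   // flush/ingestion order; min over compaction inputs
        uint64_t largest_seqno;
      };
      
      // Newest-first L0 ordering: a larger epoch_number means the file came from
      // a later flush/ingestion, regardless of any seqno-range overlap.
      bool NewerL0File(const L0File& a, const L0File& b) {
        if (a.epoch_number != b.epoch_number) {
          return a.epoch_number > b.epoch_number;
        }
        return a.largest_seqno > b.largest_seqno;  // assumed tie-break
      }
      ```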
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
      
      Test Plan:
      - `make check`
      - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
      - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
      - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
      - [Ongoing] normal db stress test
      - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761
      
      Reviewed By: ajkr
      
      Differential Revision: D41063187
      
      Pulled By: hx235
      
      fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
  9. 30 November 2022 (1 commit)
  10. 03 November 2022 (1 commit)
    • Ran clang-format on db/ directory (#10910) · 5cf6ab6f
      Committed by Andrew Kryczka
      Summary:
      Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under the db/ directory. Included minor manual touchups mentioned in the PR commit history.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910
      
      Reviewed By: riversand963
      
      Differential Revision: D40880683
      
      Pulled By: ajkr
      
      fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
  11. 26 October 2022 (1 commit)
    • Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and the memtable's (#10777) · fc74abb4
      Committed by Hui Xiao
      
      Summary:
      **Context:**
      Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but applying the fix to the FIFO compaction case
      Repro:
      ```
      COERCE_CONTEXT_SWICH=1 make -j56 db_stress
      
      ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35
      
      put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file #479 with seqno 23711 29070 vs. file #482 with seqno 27138 29049
      ```
      
      **Summary:**
      FIFO only does intra-L0 compaction in the following four cases. In other cases, FIFO drops data instead of compacting it, which is irrelevant to the overlapping seqno issue we are solving.
      -  [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
         - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
         - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in the stress test to surface the overlapping seqno issue we are fixing here.
      - [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
        - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
         - This path was not stress-tested at all. However, covering the `age_for_warm` option warrants a separate PR to deal with db_stress compatibility. Therefore we manually tested this path for this PR
      - [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
      - [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
          - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles`, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify the implementation, we return an abort status (`Status::Aborted()`) upon encountering a seqno-overlapped file when compacting to L0, instead of skipping the file and proceeding with the compaction (a schematic of the shared guard follows this list).
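      To illustrate the idea shared across these paths, a schematic (an assumed helper shape, not the literal `FindIntraL0Compaction` code) of the seqno-overlap guard:
      
      ```
      #include <cstddef>
      #include <cstdint>
      #include <vector>
      
      // Pick a prefix of the L0 candidates that is safe for intra-L0 compaction:
      // stop at the first file whose seqno range overlaps what is still in the
      // memtable, since compacting it could produce a file that interleaves with
      // the memtable's eventual flush output.
      size_t NumSafeIntraL0Inputs(const std::vector<uint64_t>& file_largest_seqnos,
                                  uint64_t earliest_mem_seqno) {
        size_t n = 0;
        for (uint64_t largest_seqno : file_largest_seqnos) {
          if (largest_seqno >= earliest_mem_seqno) {
            break;  // overlapped with the memtable's seqnos; unsafe to include
          }
          ++n;
        }
        return n;
      }
      ```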
      
      Some additional clean-up included in this PR:
      - Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
      - Added comment about `earliest_memtable_seqno` in related APIs
      - Made parameter `earliest_memtable_seqno` constant and required
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777
      
      Test Plan:
      - make check
      - New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
      - Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761  and on FIFO compaction only
      
      Reviewed By: ajkr
      
      Differential Revision: D40090485
      
      Pulled By: hx235
      
      fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
  12. 19 October 2022 (1 commit)
    • Enable a multi-level db to smoothly migrate to FIFO via DB::Open (#10348) · e267909e
      Committed by Yueh-Hsuan Chiang
      Summary:
      In theory, FIFO compaction can open a DB created with any compaction style.
      However, the current code only allows FIFO compaction to open a DB with
      a single level.
      
      This PR relaxes the limitation of FIFO compaction and allows it to open a
      DB with multiple levels.  Below is the read / write / compaction behavior:
      
      * The read behavior is untouched, and it works like a regular rocksdb instance.
      * The write behavior is untouched as well.  When a FIFO compacted DB
      is opened with multiple levels, all new files will still be in level 0, and no files
      will be moved to a different level.
      * Compaction logic is extended.  It will first identify the bottom-most non-empty level.
      Then, it will delete the oldest file in that level (see the sketch after this list).
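      A schematic of the extended deletion logic (the container shape is an assumption for illustration):
      
      ```
      #include <string>
      #include <vector>
      
      // Given per-level file lists (index 0 = L0), return the level FIFO should
      // delete from: the bottom-most non-empty level holds the oldest data.
      int PickFifoDeletionLevel(
          const std::vector<std::vector<std::string>>& files_per_level) {
        for (int level = static_cast<int>(files_per_level.size()) - 1; level >= 0;
             --level) {
          if (!files_per_level[level].empty()) {
            return level;
          }
        }
        return -1;  // the DB is empty; nothing to delete
      }
      ```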
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348
      
      Test Plan:
      Added a new test to verify the migration from level to FIFO where the db has multiple levels.
      Extended existing test cases in db_test and db_basic_test to also verify
      all entries of a key after reopening the DB with FIFO compaction.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D40233744
      
      fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1
  13. 06 October 2022 (1 commit)
    • Sanitize min_write_buffer_number_to_merge to 1 with atomic_flush (#10773) · 4d82b948
      Committed by Yanqin Jin
      Summary:
      With the current implementation, if atomic flush is enabled, all column families with non-empty memtables within the same RocksDB instance will be scheduled for flush whenever RocksDB determines that any column family needs to be flushed (e.g., memtable full, write buffer manager, etc.). Not doing so can lead to data loss and inconsistency when the WAL is disabled, which is a common setting when atomic flush is enabled. Therefore, setting the per-column-family knob min_write_buffer_number_to_merge to a value greater than 1 is not compatible with atomic flush and should be sanitized during column family creation and db open (a sanitization sketch follows).
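      Schematically (the function shape here is an assumption; the rule itself comes from this commit):
      
      ```
      #include "rocksdb/options.h"
      
      // With atomic_flush on, a CF with fewer immutable memtables than the merge
      // threshold might not flush together with the others, so force the knob to 1.
      void SanitizeForAtomicFlush(bool atomic_flush,
                                  rocksdb::ColumnFamilyOptions* cf_opts) {
        if (atomic_flush && cf_opts->min_write_buffer_number_to_merge > 1) {
          cf_opts->min_write_buffer_number_to_merge = 1;
        }
      }
      ```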
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10773
      
      Test Plan:
      Reproduce: D39993203 has detailed steps.
      Run the test with and without the fix.
      
      Reviewed By: cbi42
      
      Differential Revision: D40077955
      
      Pulled By: cbi42
      
      fbshipit-source-id: 451a9179eb531ac42eaccf40b451b9dec4085240
  14. 22 September 2022 (1 commit)
    • Fix memtable-only iterator regression (#10705) · 749b849a
      Committed by Changyu Bi
      Summary:
      When there is a single memtable without range tombstones and no SST files in the database, DBIter should wrap the memtable iterator directly. Previously we created a merging iterator on top of the memtable iterator and had DBIter wrap around that, which caused an iterator regression; this PR fixes the issue (schematically, see the sketch below).
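      The fast-path condition, as a schematic predicate (assumed names, not the actual code):
      
      ```
      #include <cstddef>
      
      // DBIter may wrap the memtable iterator directly, skipping MergingIterator,
      // only when that iterator is the sole child and contributes no range
      // tombstones.
      bool CanBypassMergingIterator(size_t num_memtables, size_t num_sst_files,
                                    bool memtable_has_range_tombstones) {
        return num_memtables == 1 && num_sst_files == 0 &&
               !memtable_has_range_tombstones;
      }
      ```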
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10705
      
      Test Plan:
      - `make check`
      - Performance:
        - Set up: `./db_bench -benchmarks=filluniquerandom -write_buffer_size=$((1 << 30)) -num=10000`
        - Benchmark: `./db_bench -benchmarks=seekrandom -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=$((1 << 30)) -num=10000 -threads=16 -duration=60 -seek_nexts=$seek_nexts`
      ```
      seek_nexts    main op/sec    #10705      RocksDB v7.6
      0             5746568        5749033     5786180
      30            2411690        3006466     2837699
      1000          102556         128902      124667
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D39644221
      
      Pulled By: cbi42
      
      fbshipit-source-id: 8063ff611ba31b0e5670041da3927c8c54b2097d
  15. 01 September 2022 (1 commit)
  16. 06 August 2022 (1 commit)
    • Fragment memtable range tombstone in the write path (#10380) · 9d77bf8f
      Committed by Changyu Bi
      Summary:
      - Right now each read fragments the memtable range tombstones (https://github.com/facebook/rocksdb/issues/4808). This PR explores the idea of fragmenting memtable range tombstones in the write path, so reads can use the cached fragmented tombstones without any fragmenting cost. This PR only does the caching for immutable memtables, and does so right before a memtable is added to the immutable memtable list. The fragmentation is done without holding the mutex to minimize its performance impact (see the sketch after this list).
      - db_bench is updated to print out the number of range deletions executed, if there are any.
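      Schematically (the type and method placement here are assumptions for illustration):
      
      ```
      // Fragment a memtable's range tombstones once, right before it joins the
      // immutable list, so every subsequent read reuses the cached fragmented
      // list instead of re-fragmenting per read.
      struct MemTableSketch {
        bool has_range_tombstones = false;
        bool fragmented_cache_built = false;
        void ConstructFragmentedRangeTombstones() { fragmented_cache_built = true; }
      };
      
      void MakeImmutable(MemTableSketch* m) {
        if (m->has_range_tombstones && !m->fragmented_cache_built) {
          // Done without holding the mutex to minimize the performance impact.
          m->ConstructFragmentedRangeTombstones();
        }
        // ... then add m to the immutable memtable list ...
      }
      ```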
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380
      
      Test Plan:
      - CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
      - Benchmark: as this PR only optimizes the immutable memtable path, the number of writes in the benchmark is chosen such that an immutable memtable is created and range tombstones are in that memtable.
      
      ```
      single thread:
      ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100
      
      multi_thread
      ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
      ```
      Commit 99cdf16464a057ca44de2f747541dedf651bae9e is included in the benchmark results. It was an earlier attempt where tombstones were fragmented for each write operation; reader threads share the result via a shared_ptr, which slows down multi-threaded read performance, as seen in the benchmark results.
      Results are averaged over 5 runs.
      
      Single thread result:
      | Max # tombstones  | main fillrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR | main readrandom micros/op |  99cdf16464a057ca44de2f747541dedf651bae9e | Post PR |
      | ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
      | 0    |6.68     |6.57     |6.72     |4.72     |4.79     |4.54     |
      | 1    |6.67     |6.58     |6.62     |5.41     |4.74     |4.72     |
      | 10   |6.59     |6.5      |6.56     |7.83     |4.69     |4.59     |
      | 100  |6.62     |6.75     |6.58     |29.57    |5.04     |5.09     |
      | 1000 |6.54     |6.82     |6.61     |320.33   |5.22     |5.21     |
      
      32-thread result: note that "Max # tombstones" is per thread.
      | Max # tombstones  | main fillrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR | main readrandom micros/op |  99cdf16464a057ca44de2f747541dedf651bae9e | Post PR |
      | ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
      | 0    |234.52   |260.25   |239.42   |5.06     |5.38     |5.09     |
      | 1    |236.46   |262.0    |231.1    |19.57    |22.14    |5.45     |
      | 10   |236.95   |263.84   |251.49   |151.73   |21.61    |5.73     |
      | 100  |268.16   |296.8    |280.13   |2308.52  |22.27    |6.57     |
      
      Reviewed By: ajkr
      
      Differential Revision: D37916564
      
      Pulled By: cbi42
      
      fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
  17. 03 August 2022 (1 commit)
    • Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460) · 27f3af59
      Committed by Peter Dillinger
      Summary:
      TL;DR: due to a recent change, if you drop a column family,
      often that DB will no longer fsync after writing new SST files
      to remaining or new column families, which could lead to data
      loss on power loss.
      
      More bug detail:
      The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at
      DB::Close time rather than waiting for DB object destruction.
      Unfortunately, it also closes shared FSDirectory objects on
      DropColumnFamily (& destroy remaining handles), which can lead
      to use-after-Close on FSDirectory shared with remaining column
      families. Those "uses" are only Fsyncs (or redundant Closes). In
      the default Posix filesystem, an Fsync on a closed FSDirectory is a
      quiet no-op. Consequently, under most configurations, if you drop
      a column family, that DB will no longer fsync after writing new SST
      files to column families sharing the same directory.
      
      More fix detail:
      Basically, this removes unnecessary Close ops on destroying
      ColumnFamilyData. We let `shared_ptr` take care of calling the
      destructor at the right time. If the intent was to require Close be
      called before destroying FSDirectory, that was not made clear by the
      author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which
      could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did
      not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow
      timely destruction of FSDirectory to suffice as Close (in
      CountedFileSystem). Added a TODO to revisit that.
      
      Also in this PR:
      * Added a TODO to share FSDirectory instances between DB and its column
      families. (Already shared among column families.)
      * Made DB::Close attempt to close all its open FSDirectory objects even
      if there is a failure in closing one. Also code clean-up around this
      logic.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460
      
      Test Plan:
      add an assert to check for use-after-Close. With that,
      existing tests can detect the misuse. With the fix, tests pass (except the noted
      relaxing of the unit test for https://github.com/facebook/rocksdb/issues/10049)
      
      Reviewed By: ajkr
      
      Differential Revision: D38357922
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137
  18. 23 July 2022 (1 commit)
  19. 30 June 2022 (1 commit)
    • Fix A Bug Where Concurrent Compactions Cause Further Slowing Down (#10270) · 61152544
      Committed by sdong
      Summary:
      Currently, when installing a new super version, when the stalling condition triggers, we compare the estimated compaction bytes to the previous value, and if the new value is greater than or equal to the previous one, we reduce the slowdown write rate. However, if concurrent compactions happen, the same value might be used for comparison. The result is that, although some compactions reduce the estimated compaction bytes, we treat them as a signal for further slowing down. In some cases, this causes the slowdown rate to drop all the way to the minimum, far lower than needed.
      
      Fix the bug by not triggering a re-calculation if a new super version doesn't have a Version or a memtable change. With this fix, the number of compaction finishes is still undercounted in this algorithm, but that is still better than the current bug where they are counted negatively.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10270
      
      Test Plan: Run a benchmark where the slowdown rate previously dropped to the minimum unnecessarily, and see that it is back to a normal value.
      
      Reviewed By: ajkr
      
      Differential Revision: D37497327
      
      fbshipit-source-id: 9bca961cc38fed965c3af0fa6c9ca0efaa7637c4
  20. 24 June 2022 (1 commit)
    • Dynamically changeable `MemPurge` option (#10011) · 5879053f
      Committed by Baptiste Lemaire
      Summary:
      **Summary**
      Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled.
      
      **Motivation**
      RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not currently seem possible.
      Moreover, with this new change, the MemPurge feature can be toggled on and off independently between column families, which we see as a major improvement.
      
      **Content of this PR**
      This PR includes the removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes (a toggling sketch follows).
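      Since the flag is now a mutable CF option, it can be toggled at runtime through `DB::SetOptions()`. A minimal sketch; the two threshold values mirror the ones added to db_stress:
      
      ```
      #include "rocksdb/db.h"
      
      rocksdb::Status ToggleMemPurge(rocksdb::DB* db, bool enable) {
        // Takes effect without closing and reopening the DB, because
        // experimental_mempurge_threshold is now a MutableCFOption.
        return db->SetOptions(db->DefaultColumnFamily(),
                              {{"experimental_mempurge_threshold",
                                enable ? "1.0" : "0.0"}});
      }
      ```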
      
      **Benchmarking**
      I will add numbers to prove that there is no performance impact within the next 12 hours.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011
      
      Reviewed By: pdillinger
      
      Differential Revision: D36462357
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802
  21. 21 June 2022 (1 commit)
    • Add blob source to retrieve blobs in RocksDB (#10198) · deff48bc
      Committed by Gang Liao
      Summary:
      There is currently no caching mechanism for blobs, which is not ideal, especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
      In this task, we formally introduce the blob source to RocksDB.  BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetches blobs from the multi-tier cache and storage with minimal cost.
      
      Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!
      
      This PR is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198
      
      Reviewed By: ltamasi
      
      Differential Revision: D37294735
      
      Pulled By: gangliao
      
      fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e
  22. 15 June 2022 (1 commit)
    • Account memory of FileMetaData in global memory limit (#9924) · d665afdb
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      As revealed by heap profiling, allocation of `FileMetaData` for a [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR accounts that toward our global memory limit based on block cache capacity (an opt-in sketch follows).
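      A minimal sketch of opting in, assuming the charging is wired through the same `CacheUsageOptions` override mechanism used for other charged cache entry roles (these names are assumptions, not confirmed by this commit message):
      
      ```
      #include "rocksdb/cache.h"
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"
      
      rocksdb::Options MakeOptionsChargingFileMetadata() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.block_cache = rocksdb::NewLRUCache(1 << 30);  // 1GB limit
        // Assumed wiring: charge FileMetaData allocations against the block
        // cache capacity, like other charged CacheEntryRoles.
        table_options.cache_usage_options.options_overrides.insert(
            {rocksdb::CacheEntryRole::kFileMetadata,
             {rocksdb::CacheEntryRoleOptions::Decision::kEnabled}});
        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```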
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924
      
      Test Plan:
      - Previous `make check` verified there are only 2 places where the memory of  the allocated `FileMetaData` can be released
      - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)`
      - db bench (CPU cost of `charge_file_metadata` in write and compact)
         - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
         - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'`
      
      table 1 - write
      
      #-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
      -- | -- | -- | -- | -- | --
      10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
      20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465
      40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078
      80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734**
      160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978**
      
      table 2 - compact
      
      #-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
      -- | -- | -- | -- | -- | --
      10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67
      20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96
      40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96**
      80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78**
      
      - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1  --cache_size=1` killed as normal
      
      Reviewed By: ajkr
      
      Differential Revision: D36055583
      
      Pulled By: hx235
      
      fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
  23. 08 June 2022 (1 commit)
    • Handle "NotSupported" status by default implementation of Close() in … (#10127) · b6de139d
      Committed by zczhu
      Summary:
      The default implementation of the Close() function in the Directory/FSDirectory classes returns a `NotSupported` status. However, we don't want operations that worked in older versions to begin failing after an upgrade when run on FileSystems that have not implemented Directory::Close() yet. So we require that the upper level calling the Close() function handle the "NotSupported" status properly instead of treating it as an error status (a handling sketch follows).
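      Schematically, on the caller side (the Close() signature shown is an assumption about the FSDirectory API of this era):
      
      ```
      #include "rocksdb/file_system.h"
      
      // A FileSystem that has not implemented Directory::Close() yet must not
      // surface an error to the caller.
      rocksdb::IOStatus CloseDirectory(rocksdb::FSDirectory* dir) {
        rocksdb::IOStatus s = dir->Close(rocksdb::IOOptions(), /*dbg=*/nullptr);
        if (s.IsNotSupported()) {
          s = rocksdb::IOStatus::OK();
        }
        return s;
      }
      ```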
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10127
      
      Reviewed By: ajkr
      
      Differential Revision: D36971112
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 100f0e6ad1191e1acc1ba6458c566a11724cf466
  24. 03 June 2022 (1 commit)
  25. 02 June 2022 (1 commit)
    • Explicitly closing all directory file descriptors (#10049) · 65893ad9
      Committed by Zichen Zhu
      Summary:
      Currently, the DB directory file descriptor is left open until the destruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (lines 512-515 in ldb_cmd.cc) to leak the `db_` object, build the `ldb` tool and run
      ```
      strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
      ```
      There is one directory file descriptor that is not closed in the strace log.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049
      
      Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with a separate WAL directory and three different data directories; all directory file descriptors should be closed after calling Close(). Explicitly call Close() once a directory file descriptor is no longer used, so that the counts of directory opens and closes match.
      
      Reviewed By: ajkr, hx235
      
      Differential Revision: D36722135
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
  26. 21 May 2022 (1 commit)
    • Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857) · cc23b46d
      Committed by Changyu Bi
      Summary:
      An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API ZDICT_finalizeDictionary() can improve such a dictionary's effectiveness at low cost. This PR changes how the dictionary is created: it calls the ZSTD ZDICT_finalizeDictionary() API instead of creating a raw content dictionary (when max_dict_buffer_bytes > 0), passing in all buffered uncompressed data blocks as samples (a configuration sketch follows).
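      For reference, a minimal sketch of selecting the new mode via `CompressionOptions`; the `use_zstd_dict_trainer` flag name matches the db_bench flag used in the Test Plan below:
      
      ```
      #include "rocksdb/options.h"
      
      rocksdb::Options MakeZstdFinalizeDictOptions() {
        rocksdb::Options options;
        options.compression = rocksdb::kZSTD;
        options.compression_opts.max_dict_bytes = 16384;  // enable dictionary
        options.compression_opts.zstd_max_train_bytes = 1 << 20;  // sample buffer
        // false selects ZDICT_finalizeDictionary() over the full trainer
        options.compression_opts.use_zstd_dict_trainer = false;
        return options;
      }
      ```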
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857
      
      Test Plan:
      #### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data:
      Set up: change the parameter [here](https://github.com/facebook/rocksdb/blob/fb9a167a55e0970b1ef6f67c1600c8d9c4c6114f/tools/db_bench_tool.cc#L1766) to 16384 to make synthetic data more compressible.
      ```
      # linked local ZSTD with version 1.5.2
      # DEBUG_LEVEL=0 ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_ZSTD=1  EXTRA_CXXFLAGS="-DZSTD_STATIC_LINKING_ONLY -DZSTD -I/data/users/changyubi/install/include/" EXTRA_LDFLAGS="-L/data/users/changyubi/install/lib/ -l:libzstd.a" make -j32 db_bench
      
      dict_bytes=16384
      train_bytes=1048576
      echo "========== No Dictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== Raw Content Dictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench_main -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== FinalizeDictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== TrainDictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      # Result: TrainDictionary is much better on space saving, but FinalizeDictionary seems to use less memory.
      # before compression data size: 1.2GB
      dict_bytes=16384
      max_dict_buffer_bytes =  1048576
                          space   cpu/memory
      No Dictionary       468M    14.93user 1.00system 0:15.92elapsed 100%CPU (0avgtext+0avgdata 23904maxresident)k
      Raw Dictionary      251M    15.81user 0.80system 0:16.56elapsed 100%CPU (0avgtext+0avgdata 156808maxresident)k
      FinalizeDictionary  236M    11.93user 0.64system 0:12.56elapsed 100%CPU (0avgtext+0avgdata 89548maxresident)k
      TrainDictionary     84M     7.29user 0.45system 0:07.75elapsed 100%CPU (0avgtext+0avgdata 97288maxresident)k
      ```
      
      #### Benchmark on 10 sample SST files for spacing saving and CPU time on compression:
      FinalizeDictionary is comparable to TrainDictionary in terms of space saving, and takes less time in compression.
      ```
      dict_bytes=16384
      train_bytes=1048576
      
      for sst_file in `ls ../temp/myrock-sst/`
      do
        echo "********** $sst_file **********"
        echo "========== No Dictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD
      
        echo "========== Raw Content Dictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes
      
        echo "========== FinalizeDictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes --compression_use_zstd_finalize_dict
      
        echo "========== TrainDictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes
      done
      
                               010240.sst (Size/Time) 011029.sst              013184.sst              021552.sst              185054.sst              185137.sst              191666.sst              7560381.sst             7604174.sst             7635312.sst
      No Dictionary           28165569 / 2614419      32899411 / 2976832      32977848 / 3055542      31966329 / 2004590      33614351 / 1755877      33429029 / 1717042      33611933 / 1776936      33634045 / 2771417      33789721 / 2205414      33592194 / 388254
      Raw Content Dictionary  28019950 / 2697961      33748665 / 3572422      33896373 / 3534701      26418431 / 2259658      28560825 / 1839168      28455030 / 1846039      28494319 / 1861349      32391599 / 3095649      33772142 / 2407843      33592230 / 474523
      FinalizeDictionary      27896012 / 2650029      33763886 / 3719427      33904283 / 3552793      26008225 / 2198033      28111872 / 1869530      28014374 / 1789771      28047706 / 1848300      32296254 / 3204027      33698698 / 2381468      33592344 / 517433
      TrainDictionary         28046089 / 2740037      33706480 / 3679019      33885741 / 3629351      25087123 / 2204558      27194353 / 1970207      27234229 / 1896811      27166710 / 1903119      32011041 / 3322315      32730692 / 2406146      33608631 / 570593
      ```
      
      #### Decompression/Read test:
      With FinalizeDictionary/TrainDictionary, some of the data structures used for decompression are stored in the dictionary, so decompression/reads are expected to be faster.
      ```
      dict_bytes=16384
      train_bytes=1048576
      echo "No Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=0 > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=0 2>&1 | grep MB/s
      
      echo "Raw Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd  -compression_max_dict_bytes=$dict_bytes 2>&1 | grep MB/s
      
      echo "FinalizeDict"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false  > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false 2>&1 | grep MB/s
      
      echo "Train Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes 2>&1 | grep MB/s
      
      No Dictionary
      readrandom   :      12.183 micros/op 82082 ops/sec 12.183 seconds 1000000 operations;    9.1 MB/s (1000000 of 1000000 found)
      Raw Dictionary
      readrandom   :      12.314 micros/op 81205 ops/sec 12.314 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
      FinalizeDict
      readrandom   :       9.787 micros/op 102180 ops/sec 9.787 seconds 1000000 operations;   11.3 MB/s (1000000 of 1000000 found)
      Train Dictionary
      readrandom   :       9.698 micros/op 103108 ops/sec 9.699 seconds 1000000 operations;   11.4 MB/s (1000000 of 1000000 found)
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D35720026
      
      Pulled By: cbi42
      
      fbshipit-source-id: 24d230fdff0fd28a1bb650658798f00dfcfb2a1f
      cc23b46d
  27. 06 May, 2022 1 commit
    • S
      Use std::numeric_limits<> (#9954) · 49628c9a
      Committed by sdong
      Summary:
      We still don't fully use std::numeric_limits<> but rely on a macro instead, originally to support VS 2013. We now only support VS 2017 and up, so that is no longer a problem. The code comment claims that MinGW still needs the macro; we have no CI running MinGW, so this is hard to validate, but since we now require C++17, it is hard to imagine a MinGW toolchain that builds RocksDB yet lacks std::numeric_limits<>.
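
      A minimal sketch of the substitution; `kMaxUint64` here is a representative
      stand-in for the macro-style constants being retired, not necessarily the
      exact identifier:
      ```
      #include <cstdint>
      #include <limits>

      // Old style: a constant defined in the port layer so that older
      // toolchains (e.g. VS 2013) could build without relying on <limits>.
      constexpr uint64_t kMaxUint64 = UINT64_MAX;

      // New style: with C++17 required, use the standard facility directly.
      constexpr uint64_t kMaxUint64Std = std::numeric_limits<uint64_t>::max();

      static_assert(kMaxUint64 == kMaxUint64Std, "both spellings agree");
      ```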
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954
      
      Test Plan: See CI Runs.
      
      Reviewed By: riversand963
      
      Differential Revision: D36173954
      
      fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
      49628c9a
  28. 25 Mar, 2022 1 commit
    • P
      Fix heap use-after-free race with DropColumnFamily (#9730) · cad80997
      Committed by Peter Dillinger
      Summary:
      Although the ColumnFamilySet comments say that the DB mutex can be
      released during iteration as long as you hold a ref while releasing it,
      this is not quite true: UnrefAndTryDelete might delete the cfd right
      before it is needed to get ->next_ for the next iteration of the loop.
      
      This change solves the problem with a wrapper class that makes such
      iteration easier while handling the tricky detail of calling
      UnrefAndTryDelete on the previous cfd only after fetching next_ in
      operator++ (sketched below).
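
      A self-contained model of that pattern, using illustrative types rather
      than the real ColumnFamilySet/ColumnFamilyData:
      ```
      // Refcounted list node: dropping the last reference deletes the node.
      struct Node {
        Node* next = nullptr;
        int refs = 1;  // one reference held by the owning list
        void Ref() { ++refs; }
        void UnrefAndTryDelete() {
          if (--refs == 0) {
            delete this;
          }
        }
      };

      // Iteration wrapper: next is read while the current node is still
      // referenced, and only then is the previous node unreferenced (which
      // may delete it) -- the ordering that the real fix enforces.
      class RefedIterator {
       public:
        explicit RefedIterator(Node* n) : cur_(n) {
          if (cur_ != nullptr) cur_->Ref();
        }
        RefedIterator& operator++() {
          Node* prev = cur_;
          cur_ = prev->next;  // safe: prev is still alive here
          if (cur_ != nullptr) cur_->Ref();
          prev->UnrefAndTryDelete();  // prev is not touched after this
          return *this;
        }
        Node* operator*() const { return cur_; }
        ~RefedIterator() {
          if (cur_ != nullptr) cur_->UnrefAndTryDelete();
        }

       private:
        Node* cur_;
      };
      ```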
      
      FreeDeadColumnFamilies should already have been obsolete; this removes
      it for good. Similarly, ColumnFamilySet::iterator doesn't need to check
      for cfd with 0 refs, because those are immediately deleted.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9730
      
      Test Plan:
      was reported with ASAN on unit tests like
      DBLogicalBlockSizeCacheTest.CreateColumnFamily (very rare); keep watching
      
      Reviewed By: ltamasi
      
      Differential Revision: D35038143
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 0a5478d5be96c135343a00603711b7df43ae19c9
      cad80997
  29. 13 Mar, 2022 2 commits
  30. 12 Mar, 2022 1 commit
    • S
      Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410) · 95305c44
      Committed by slk
      Summary:
      As discussed in https://github.com/facebook/rocksdb/issues/9223, this adds a new API named DB::OpenAndTrimHistory. The API opens the DB and trims data to the timestamp specified by **trim_ts** (data with a timestamp newer than the specified trim bound is removed). It should only be used when recovering a timestamp-enabled db instance.

      This PR also implements a new iterator named HistoryTrimmingIterator to support the trimming. HistoryTrimmingIterator wraps the underlying InternalIterator so that keys whose timestamps are newer than **trim_ts** are not returned to the compaction iterator while **trim_ts** is not null (see the sketch below).
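
      A minimal model of the trimming predicate (illustrative types; the real
      iterator wraps an InternalIterator rather than a vector):
      ```
      #include <cstdint>
      #include <string>
      #include <vector>

      // One version of a key, tagged with a user-defined timestamp
      // (larger == newer).
      struct Version {
        std::string key;
        uint64_t ts;
      };

      // Keep only versions whose timestamp is at or below the trim bound;
      // anything newer than trim_ts is dropped, as OpenAndTrimHistory does.
      std::vector<Version> TrimHistory(const std::vector<Version>& input,
                                       uint64_t trim_ts) {
        std::vector<Version> kept;
        for (const Version& v : input) {
          if (v.ts <= trim_ts) {
            kept.push_back(v);
          }
        }
        return kept;
      }
      ```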
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9410
      
      Reviewed By: ltamasi
      
      Differential Revision: D34410207
      
      Pulled By: riversand963
      
      fbshipit-source-id: e54049dc234eccd673244c566b15df58df5a6236
      95305c44
  31. 25 Nov, 2021 1 commit
    • P
      Fix bug affecting GetSortedWalFiles, Backups, Checkpoint (#9208) · 2a67d475
      Committed by Peter Dillinger
      Summary:
      Saw error like this:
      `Backup failed -- IO error: No such file or directory: While opening a
      file for sequentially reading:
      /dev/shm/rocksdb/rocksdb_crashtest_blackbox/004426.log: No such file or
      directory`
      
      Unfortunately, GetSortedWalFiles (used by Backups, Checkpoint, etc.)
      relies on no file deletions happening while it's operating, which
      means not only disabling (more) deletions, but also ensuring any
      pending deletions are completed. Two fixes related to this:
      
      * There was a gap in several places between decrementing
      pending_purge_obsolete_files_ and incrementing bg_purge_scheduled_ where
      the db mutex would be released and GetSortedWalFiles (and others) could
      get false information that no deletions are pending.
      
      * The fix to https://github.com/facebook/rocksdb/issues/8591 (disabling deletions in GetSortedWalFiles) was
      incomplete because it didn't prevent pending deletions from occurring
      during the operation (when deletions were not already disabled, the very
      case that change was meant to fix); see the sketch below.
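
      A schematic of the required discipline, with illustrative names rather
      than the actual RocksDB internals:
      ```
      #include <condition_variable>
      #include <mutex>

      // Both counters must be tracked together; watching only one of them
      // leaves exactly the gap described above.
      struct PurgeState {
        std::mutex mu;
        std::condition_variable cv;
        int pending_purges = 0;    // queued but not yet picked up
        int scheduled_purges = 0;  // running on a background thread
      };

      // Called with file deletions already disabled; returns once no purge
      // is queued or running, so a subsequent WAL listing cannot race with
      // a deletion that was decided before deletions were disabled.
      void WaitForPurgesToDrain(PurgeState& s) {
        std::unique_lock<std::mutex> lock(s.mu);
        s.cv.wait(lock, [&s] {
          return s.pending_purges == 0 && s.scheduled_purges == 0;
        });
      }
      ```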
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9208
      
      Test Plan:
      existing tests (it's hard to write a test for interleavings
      that are now excluded - this is what stress test is for)
      
      Reviewed By: ajkr
      
      Differential Revision: D32630675
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a121e3da648de130cd24d44c524232f4eb22f178
      2a67d475
  32. 16 Nov, 2021 1 commit
    • Y
      Update TransactionUtil::CheckKeyForConflict to also use timestamps (#9162) · 20357988
      Committed by Yanqin Jin
      Summary:
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162
      
      Existing TransactionUtil::CheckKeyForConflict() performs only seq-based
      conflict checking. If user-defined timestamp is enabled, it should perform
      conflict checking based on timestamps too.
      
      Update the TransactionUtil::CheckKey-related methods to verify that the
      timestamp of the latest version of a key is smaller than the read
      timestamp (see the sketch below). Note that CheckKeysForConflict() is not
      updated since it's used only by optimistic transactions, and we do not
      plan to update it in this upcoming batch of diffs.
      
      Existing GetLatestSequenceForKey() returns the sequence of the latest
      version of a specific user key. Since we support user-defined timestamp, we
      need to update this method to also return the timestamp (if enabled) of the
      latest version of the key. This will be needed for snapshot validation.
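
      An illustrative combination of the two checks (not the actual
      TransactionUtil signature):
      ```
      #include <cstdint>
      #include <optional>

      // A write conflicts if the key has a committed version newer than the
      // snapshot sequence, or (when user-defined timestamps are enabled) a
      // version whose timestamp is not smaller than the transaction's read
      // timestamp.
      bool HasConflict(uint64_t latest_seq, uint64_t snapshot_seq,
                       std::optional<uint64_t> latest_ts,
                       std::optional<uint64_t> read_ts) {
        if (latest_seq > snapshot_seq) {
          return true;  // sequence-based conflict
        }
        if (latest_ts.has_value() && read_ts.has_value() &&
            *latest_ts >= *read_ts) {
          return true;  // timestamp-based conflict: latest version too new
        }
        return false;
      }
      ```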
      
      Reviewed By: ltamasi
      
      Differential Revision: D31567960
      
      fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44
      20357988
  33. 12 Oct, 2021 1 commit
    • L
      Make it possible to force the garbage collection of the oldest blob files (#8994) · 3e1bf771
      Committed by Levi Tamasi
      Summary:
      The current BlobDB garbage collection logic works by relocating the valid
      blobs from the oldest blob files as they are encountered during compaction,
      and cleaning up blob files once they contain nothing but garbage. However,
      with sufficiently skewed workloads, it is theoretically possible to end up in a
      situation when few or no compactions get scheduled for the SST files that contain
      references to the oldest blob files, which can lead to increased space amp due
      to the lack of GC.
      
      In order to efficiently handle such workloads, the patch adds a new BlobDB
      configuration option called `blob_garbage_collection_force_threshold`,
      which signals to BlobDB to schedule targeted compactions for the SST files
      that keep alive the oldest batch of blob files if the overall ratio of garbage in
      the given blob files meets the threshold *and* all the given blob files are
      eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
      if the new option is set to 0.9, targeted compactions will get scheduled if the
      sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
      oldest blob files, assuming all affected blob files are below the age-based cutoff.)
      The net result of these targeted compactions is that the valid blobs in the oldest
      blob files are relocated and the oldest blob files themselves cleaned up (since
      *all* SST files that rely on them get compacted away).
      
      These targeted compactions are similar to periodic compactions in the sense
      that they force certain SST files that otherwise would not get picked up to undergo
      compaction and also in the sense that instead of merging files from multiple levels,
      they target a single file. (Note: such compactions might still include neighboring files
      from the same level due to the need of having a "clean cut" boundary but they never
      include any files from any other level.)
      
      This functionality is currently only supported with the leveled compaction style
      and is inactive by default (since the default value is set to 1.0, i.e. 100%).
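
      A minimal sketch of the trigger condition described above (illustrative
      helper, not the internal compaction-picker code):
      ```
      #include <cstdint>

      // Forced GC targets the oldest batch of blob files when their
      // aggregate garbage ratio meets the threshold, provided all of them
      // are already eligible under blob_garbage_collection_age_cutoff.
      bool ShouldForceBlobGc(uint64_t garbage_bytes, uint64_t total_bytes,
                             bool all_eligible_by_age,
                             double force_threshold) {
        if (!all_eligible_by_age || total_bytes == 0) {
          return false;
        }
        // E.g. with force_threshold = 0.9, trigger once garbage makes up
        // at least 90% of the bytes in the oldest blob files.
        return static_cast<double>(garbage_bytes) >=
               force_threshold * static_cast<double>(total_bytes);
      }
      ```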
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994
      
      Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.
      
      Reviewed By: riversand963
      
      Differential Revision: D31489850
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
      3e1bf771
  34. 08 Oct, 2021 1 commit
  35. 29 Sep, 2021 1 commit
    • M
      Cleanup includes in dbformat.h (#8930) · 13ae16c3
      Committed by mrambacher
      Summary:
      This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead.
      
      Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.
      
      Hopefully, this sort of code hygiene cleanup will speed up the builds...
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930
      
      Reviewed By: pdillinger
      
      Differential Revision: D31142788
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
      13ae16c3
  36. 09 Sep, 2021 1 commit
    • Z
      Add DB properties for BlobDB (#8734) · 0cb0fc6f
      Committed by Zhiyi Zhang
      Summary:
      RocksDB exposes certain internal statistics via the DB property interface.
      However, there are currently no properties related to BlobDB.
      
      For starters, we would like to add the following BlobDB properties:
      `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree).
      `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version.
      `rocksdb.total-blob-file-size`: the total size of all blob files over all Versions (a blob counterpart for `total-sst-file-size`).
      `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version.
      `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version.
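
      For reference, the new properties can be read through the existing
      DB::GetProperty interface; a small sketch (DB setup and error handling
      elided):
      ```
      #include <iostream>
      #include <string>

      #include "rocksdb/db.h"

      // Print each BlobDB property; GetProperty returns false if a property
      // is unknown or unavailable.
      void PrintBlobProperties(rocksdb::DB* db) {
        std::string value;
        for (const char* prop :
             {"rocksdb.num-blob-files", "rocksdb.blob-stats",
              "rocksdb.total-blob-file-size",
              "rocksdb.live-blob-file-size"}) {
          if (db->GetProperty(prop, &value)) {
            std::cout << prop << ": " << value << "\n";
          }
        }
      }
      ```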
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734
      
      Test Plan:
      ```
      ➜  rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test
      [==========] Running 16 tests from 2 test cases.
      [----------] Global test environment set-up.
      [----------] 10 tests from DBBlobBasicTest
      [ RUN      ] DBBlobBasicTest.GetBlob
      [       OK ] DBBlobBasicTest.GetBlob (12 ms)
      [ RUN      ] DBBlobBasicTest.MultiGetBlobs
      [       OK ] DBBlobBasicTest.MultiGetBlobs (11 ms)
      [ RUN      ] DBBlobBasicTest.GetBlob_CorruptIndex
      [       OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms)
      [ RUN      ] DBBlobBasicTest.GetBlob_InlinedTTLIndex
      [       OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms)
      [ RUN      ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber
      [       OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms)
      [ RUN      ] DBBlobBasicTest.GenerateIOTracing
      [       OK ] DBBlobBasicTest.GenerateIOTracing (11 ms)
      [ RUN      ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile
      [       OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms)
      [ RUN      ] DBBlobBasicTest.GetMergeBlobWithPut
      [       OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms)
      [ RUN      ] DBBlobBasicTest.MultiGetMergeBlobWithPut
      [       OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms)
      [ RUN      ] DBBlobBasicTest.BlobDBProperties
      [       OK ] DBBlobBasicTest.BlobDBProperties (21 ms)
      [----------] 10 tests from DBBlobBasicTest (124 ms total)
      
      [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms)
      [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total)
      
      [----------] Global test environment tear-down
      [==========] 16 tests from 2 test cases ran. (2190 ms total)
      [  PASSED  ] 16 tests.
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D30690849
      
      Pulled By: Zhiyi-Zhang
      
      fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
      0cb0fc6f
  37. 08 Sep, 2021 1 commit
    • M
      Make MemTableRepFactory into a Customizable class (#8419) · beed8647
      Committed by mrambacher
      Summary:
      This PR does the following:
      -> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
      -> Makes the existing implementations compatible with configurations
      -> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
      
      New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
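
      A sketch of the new creation path, assuming the Customizable-style
      CreateFromString signature and the "skip_list" id (consult memtablerep.h
      for the exact API):
      ```
      #include <memory>
      #include <string>

      #include "rocksdb/convenience.h"
      #include "rocksdb/memtablerep.h"

      // Build a memtable factory from its string id instead of constructing
      // a concrete class directly.
      rocksdb::Status MakeFactory(
          std::unique_ptr<rocksdb::MemTableRepFactory>* factory) {
        rocksdb::ConfigOptions config_options;
        return rocksdb::MemTableRepFactory::CreateFromString(
            config_options, "skip_list", factory);
      }
      ```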
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D29558961
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
      beed8647
  38. 02 Sep, 2021 1 commit
    • P
      Remove some unneeded code (#8736) · c9cd5d25
      Committed by Peter Dillinger
      Summary:
      * FullKey and ParseFullKey appear to serve no purpose in the public API
      (or anything else), so they were removed; their only use, in one test,
      was updated.
      * NumberToString served no purpose vs. ToString, so it was removed and
      the numerous calls updated.
      * Remove unnecessary forward declarations in metadata.h by re-arranging
      class definitions.
      * Remove some unneeded semicolons
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736
      
      Test Plan: existing tests
      
      Reviewed By: mrambacher
      
      Differential Revision: D30700039
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
      c9cd5d25
  39. 03 Aug, 2021 1 commit
    • L
      Fix a race in ColumnFamilyData::UnrefAndTryDelete (#8605) · 3f7e9298
      Committed by Levi Tamasi
      Summary:
      The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
      unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
      the per-thread `SuperVersion` pointers when the only remaining reference
      is the back reference from `super_version_`. The idea behind this was to
      break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
      when the penultimate reference goes away, `ColumnFamilyData` can clean up
      the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
      is a `SuperVersion` and it is not referenced by anything else.) However,
      unlocking the mutex throws a wrench in this plan by making it possible for another thread
      to jump in and take another reference to the `ColumnFamilyData`, keeping the
      object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
      https://github.com/facebook/rocksdb/issues/8440 ,
      https://github.com/facebook/rocksdb/issues/8382 ,
      and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
      destructor we sometimes observe during close in our stress tests.
      
      Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
      rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
      `SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
      https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the
      unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced
      the circular dependency and associated cleanup logic described above (in order
      to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
      and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
      Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but
      apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
      in `SuperVersion::Cleanup`.
      
      The patch simply eliminates the unlocking and relocking, which has been unnecessary
      ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free.
      This closes the window during which another thread could increase the reference count,
      and hopefully fixes the issues above.
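
      A self-contained model of the window being closed (illustrative types,
      not the real ColumnFamilyData):
      ```
      #include <mutex>

      struct Refed {
        int refs = 1;
        bool torn_down = false;
      };

      std::mutex db_mutex;

      // The whole "observe last ref -> tear down" sequence happens under one
      // continuous hold of the mutex, so no other thread can Ref() a
      // half-destroyed object in between.
      void UnrefAndTryDelete(Refed& obj) {
        std::lock_guard<std::mutex> lock(db_mutex);
        if (--obj.refs == 0) {
          obj.torn_down = true;  // stands in for the actual cleanup/delete
        }
      }
      ```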
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605
      
      Test Plan: Ran `make check` and stress tests locally.
      
      Reviewed By: pdillinger
      
      Differential Revision: D30051035
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528
      3f7e9298