1. 27 June 2023, 2 commits
  2. 24 June 2023, 2 commits
    • Add CreateColumnFamilyWithImport to `StackableDB` and `DBImplReadOnly` (#11556) · ca50ccc7
      Committed by Changyu Bi
      Summary:
      https://github.com/facebook/rocksdb/issues/11378 added a new overloaded `CreateColumnFamilyWithImport` API and updated the virtual function in `StackableDB` and `DBImplReadOnly` to the newly overloaded one. This caused an internal build error for derived classes that override the original `CreateColumnFamilyWithImport` function. This PR adds the original `CreateColumnFamilyWithImport` function back to `StackableDB` and `DBImplReadOnly`, as sketched below.
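
      A minimal sketch of what the restored declaration allows; the signatures are abbreviated from the public DB interface and should be treated as assumptions:

      ```
      #include "rocksdb/utilities/stackable_db.h"

      class MyStackableDB : public rocksdb::StackableDB {
       public:
        using rocksdb::StackableDB::StackableDB;

        // Overriding the original (single-metadata) overload compiles again
        // now that StackableDB declares it alongside the vector overload.
        rocksdb::Status CreateColumnFamilyWithImport(
            const rocksdb::ColumnFamilyOptions& options,
            const std::string& column_family_name,
            const rocksdb::ImportColumnFamilyOptions& import_options,
            const rocksdb::ExportImportFilesMetaData& metadata,
            rocksdb::ColumnFamilyHandle** handle) override {
          // Delegate to the wrapped DB.
          return rocksdb::StackableDB::CreateColumnFamilyWithImport(
              options, column_family_name, import_options, metadata, handle);
        }
      };
      ```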
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11556
      
      Test Plan: check if this fixes an internal build
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D46980506
      
      Pulled By: cbi42
      
      fbshipit-source-id: 975a6c5748bf9481499a62ee5997ca59e542e3bc
    • Add an interface to provide support for underlying FS to pass their own buffer... · fbd2f563
      Committed by akankshamahajan
      Add an interface to provide support for underlying FS to pass their own buffer during reads (#11324)
      
      Summary:
      1. Public API change: replace the `use_async_io` API in file_system with a `SupportedOps` API, which the underlying FileSystem uses to indicate to upper layers which of the operations in `enum FSSupportedOps` it supports. Right now the operations are `async_io` and whether the FS will provide its own buffer during reads. The API is designed to cover various FileSystem operations in one call rather than adding a separate API for each operation.
      
      2. Provide support for the underlying FS to pass its own buffer during reads (async and sync) instead of using the RocksDB-provided `scratch` buffer in `FSReadRequest`. Currently only MultiRead supports it; it will later be extended to other reads as well (point lookup, scan, etc.). More details on enabling this are in file_system.h; see the sketch below.
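
      A minimal sketch of how a FileSystem might advertise these capabilities; the enum/bit usage follows the `FSSupportedOps` description above and should be treated as an assumption:

      ```
      #include "rocksdb/file_system.h"

      // Advertise supported operations by setting bits keyed by FSSupportedOps.
      class MyFileSystem : public rocksdb::FileSystemWrapper {
       public:
        explicit MyFileSystem(const std::shared_ptr<rocksdb::FileSystem>& target)
            : rocksdb::FileSystemWrapper(target) {}
        const char* Name() const override { return "MyFileSystem"; }

        void SupportedOps(int64_t& supported_ops) override {
          supported_ops = 0;
          supported_ops |= (1 << rocksdb::FSSupportedOps::kAsyncIO);   // async reads
          supported_ops |= (1 << rocksdb::FSSupportedOps::kFSBuffer);  // FS-owned read buffer
        }
      };
      ```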
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11324
      
      Test Plan: Tested locally
      
      Reviewed By: anand1976
      
      Differential Revision: D44465322
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 9ec9e08f839b5cc815e75d5dade6cd549998d0ec
  3. 23 June 2023, 1 commit
  4. 22 June 2023, 3 commits
    • Use FaultInjectionTestFS in transaction_test, clarify Close() APIs (#11499) · 05a1d52e
      Committed by Peter Dillinger
      Summary:
      ... instead of race-condition-laden FaultInjectionTestEnv. See https://app.circleci.com/pipelines/github/facebook/rocksdb/27912/workflows/4c63e5a8-597e-439d-8c7e-82308056af02/jobs/609648 and similar PR https://github.com/facebook/rocksdb/issues/11271
      
      Had to fix the semantics of FaultInjectionTestFS Close() operations to allow a non-OK Close() to fulfill the obligation to close before destruction. To me, this is the obvious choice of Close contract, because what is the caller supposed to do if Close() fails and they still have an obligation to successfully close before object destruction? Call Close() in an infinite loop? Leak the object? I have added API comments to the Env and FileSystem Close() functions to clarify the contracts.
      
      Note that `DB::Close()` has one exception to this kind of Close contract, but it is clearly described in API comments and it is really only for catching programming mistakes, not for dealing with exogenous errors.
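
      A minimal sketch of the clarified contract, using the `Env`-level `WritableFile` API:

      ```
      #include <cstdio>
      #include <memory>
      #include "rocksdb/env.h"

      // Even a failing Close() fulfills the close-before-destruction obligation:
      // report the error, then destroying the object is allowed.
      void CloseAndDestroy(std::unique_ptr<rocksdb::WritableFile> file) {
        rocksdb::Status s = file->Close();
        if (!s.ok()) {
          fprintf(stderr, "Close failed: %s\n", s.ToString().c_str());
        }
        // No infinite retry loop, no leak: destruction is safe here.
      }
      ```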
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11499
      
      Test Plan: watch CI
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D46375708
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 03d4d8251e5df50a82ecd139f7e83f613015fe40
    • Record the `persist_user_defined_timestamps` flag in manifest (#11515) · 7521478b
      Committed by Yu Zhang
      Summary:
      Start to record the value of the flag `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` in the manifest and in the table properties of an SST file when it is created, and use the recorded flag when creating a table reader for the SST file. This flag's default value is true; it is only explicitly recorded if it's false.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11515
      
      Test Plan:
      ```
      make all check
      ./version_edit_test
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D46920386
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 075c20363d3d2cc1368422ecc805617ed135cc26
    • Internal API for generating semi-random salt (#11331) · 98c6d7fd
      Committed by Peter Dillinger
      Summary:
      ... so that a non-cryptographic whole file checksum would be highly resistant
      to manipulation by a user able to manipulate key-value data (e.g. a user whose data is
      stored in RocksDB) and able to predict SST metadata such as DB session id and file
      number based on read access to logs or DB files. The adversary would also need to predict
      the salt in order to influence the checksum result toward collision with another file's
      checksum.
      
      This change is just internal code to support such a future feature. I think this should be a passive feature, not option-controlled, because you probably won't think about needing it until you discover you do need it, and it should be low cost in both space (16 bytes per SST file) and CPU.
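
      An illustrative sketch (not the internal RocksDB code) of the idea: seed a non-cryptographic checksum with a salt the adversary cannot predict, so crafting a collision from key-value data alone is not enough:

      ```
      #include <cstddef>
      #include <cstdint>

      // FNV-1a style whole-file checksum, seeded with a per-file salt.
      uint64_t SaltedChecksum(uint64_t salt, const unsigned char* data, size_t n) {
        uint64_t h = 0xcbf29ce484222325ULL ^ salt;  // fold the salt into the seed
        for (size_t i = 0; i < n; ++i) {
          h = (h ^ data[i]) * 0x100000001b3ULL;  // standard FNV-1a mix step
        }
        return h;
      }
      ```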
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11331
      
      Test Plan: Unit tests added to verify at least pseudorandom behavior. (Actually caught a bug in first draft!) The new "stress" style tests run in ~3ms each on my system.
      
      Reviewed By: ajkr
      
      Differential Revision: D46129415
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7972dc74487e062b29b1fd9c227425e922c98796
  5. 21 June 2023, 1 commit
  6. 20 June 2023, 2 commits
    • Attempt to deflake DBWALTestWithEnrichedEnv.SkipDeletedWALs (#11537) · 022d8954
      Committed by Levi Tamasi
      Summary:
      Calling `Flush` (even with `wait==true`) does not guarantee that obsolete WAL files are physically deleted before the call returns. The patch attempts to fix the resulting flakiness by using `SyncPoint`s to make sure `PurgeObsoleteFiles` finishes before checking for WAL deletions.
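
      A minimal sketch of the `SyncPoint` dependency pattern used for this kind of deflaking; the marker names below are hypothetical stand-ins, not the ones in the patch:

      ```
      #include "test_util/sync_point.h"

      void WaitForPurgeBeforeWalCheck() {
        // The first marker must fire before the second may proceed.
        rocksdb::SyncPoint::GetInstance()->LoadDependency(
            {{"DBImpl::PurgeObsoleteFiles:End",     // hypothetical producer marker
              "SkipDeletedWALs:BeforeWALCheck"}});  // hypothetical test marker
        rocksdb::SyncPoint::GetInstance()->EnableProcessing();
        // ... Flush() here ...
        TEST_SYNC_POINT("SkipDeletedWALs:BeforeWALCheck");
        // ... now it is safe to assert the obsolete WAL is gone ...
        rocksdb::SyncPoint::GetInstance()->DisableProcessing();
      }
      ```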
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11537
      
      Test Plan:
      ```
      gtest-parallel --repeat=1000 ./db_wal_test --gtest_filter="*SkipDeletedWALs*"
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D46736050
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 47a931b7a3a03ef681fbf4adb5a0b223d452703e
    • Initialize StressTest::optimistic_txn_db_ in ctor (#11547) · b3edb873
      Committed by Levi Tamasi
      Summary:
      `StressTest::optimistic_txn_db_` is currently not initialized by the constructor, which
      can lead to assertion failures down the line in `StressTest::Open`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11547
      
      Reviewed By: cbi42
      
      Differential Revision: D46845658
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 578b0f24fc00e3e97f24221fcdd003cc529439c2
  7. 18 June 2023, 1 commit
    • Stress/Crash Test for OptimisticTransactionDB (#11513) · 17d52005
      Committed by Jay Huh
      Summary:
      Context:
      OptimisticTransactionDB has not been covered by db_stress (including the crash test) the way TransactionDB is.
      1. Added the following gflag options to test OptimisticTransactionDB:
      - `use_optimistic_txn`: When true, open OptimisticTransactionDB to test
      - `occ_validation_policy`: `OccValidationPolicy::kValidateParallel = 1` by default.
      - `share_occ_lock_buckets`: Use shared occ locks
      - `occ_lock_bucket_count`: 500 by default. Number of buckets to use for shared occ lock.
      2. Added opening of OptimisticTransactionDB and NewTxn/Commit calls, gated on the `use_optimistic_txn` flag, in `db_stress_test_base.cc`
      3. OptimisticTransactionDB blackbox/whitebox test added in crash_test.mk
      
      Please note that the existing flag `use_txn` is still used here. When `use_txn == true` and `use_optimistic_txn == false`, we use `TransactionDB` (a.k.a. the pessimistic transaction DB). When both `use_txn` and `use_optimistic_txn` are true, we use `OptimisticTransactionDB`. If `use_txn == false` but `use_optimistic_txn == true`, an error is thrown with the message _"You cannot set use_optimistic_txn true while use_txn is false. Please set use_txn true if you want to use OptimisticTransactionDB"_. A sketch of the flag-to-API mapping follows.
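
      A minimal sketch of how these flags plausibly map onto the public API; the exact wiring is an assumption based on this description:

      ```
      #include "rocksdb/utilities/optimistic_transaction_db.h"

      rocksdb::Status OpenOccDb(const rocksdb::Options& options,
                                const std::string& path,
                                std::vector<rocksdb::ColumnFamilyHandle*>* handles,
                                rocksdb::OptimisticTransactionDB** db) {
        rocksdb::OptimisticTransactionDBOptions occ_opts;
        // occ_validation_policy=1 (the default):
        occ_opts.validate_policy = rocksdb::OccValidationPolicy::kValidateParallel;
        // share_occ_lock_buckets=1 with occ_lock_bucket_count=500:
        occ_opts.shared_lock_buckets =
            rocksdb::MakeSharedOccLockBuckets(/*bucket_count=*/500);
        std::vector<rocksdb::ColumnFamilyDescriptor> cfs{
            {rocksdb::kDefaultColumnFamilyName, options}};
        return rocksdb::OptimisticTransactionDB::Open(
            rocksdb::DBOptions(options), occ_opts, path, cfs, handles, db);
      }
      ```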
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11513
      
      Test Plan:
      **Crash Test**
      Serial Validation
      ```
      export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=0"
      make crash_test -j
      ```
      Parallel Validation (no share bucket)
      ```
      export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=0"
      make crash_test -j
      ```
      Parallel Validation (share bucket)
      ```
      export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=1 --occ_lock_bucket_count=500"
      make crash_test -j
      ```
      
      **Stress Test**
      ```
      ./db_stress -use_optimistic_txn -threads=32
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D46547387
      
      Pulled By: jaykorean
      
      fbshipit-source-id: ca19819ca6e0281694966998014b40d95d4e5960
  8. 17 June 2023, 3 commits
    • Add UT to test BG read qps behavior during upgrade for pr11406 (#11522) · 1da9ac23
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      When db is upgrading to adopt [pr11406](https://github.com/facebook/rocksdb/pull/11406/), it's possible for RocksDB to infer a small tail size to prefetch for pre-upgrade files. Such a small tail size would have caused one file read per index or filter partition if a partitioned index or filter is used. This PR provides a UT to show this does not happen.
      
      Misc: refactor the related UTs a bit to make this new UT more readable.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11522
      
      Test Plan:
      - New UT
      If the upgrade logic is wrong, e.g.,
      ```
       --- a/table/block_based/partitioned_index_reader.cc
      +++ b/table/block_based/partitioned_index_reader.cc
      @@ -166,7 +166,8 @@ Status PartitionIndexReader::CacheDependencies(
         uint64_t prefetch_len = last_off - prefetch_off;
         std::unique_ptr<FilePrefetchBuffer> prefetch_buffer;
         if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled() ||
      -      tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off) {
      +      (false && tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off)) {
      ```
      then the UT will fail like below:
      ```
      [ RUN      ] PrefetchTailTest/PrefetchTailTest.UpgradeToTailSizeInManifest/0
      file/prefetch_test.cc:461: Failure
      Expected: (db_open_file_read.count) < (num_index_partition), actual: 38 vs 33
      Received signal 11 (Segmentation fault)
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D46546707
      
      Pulled By: hx235
      
      fbshipit-source-id: 9897b0a975e9055963edac5451fd1cd9d6c45d0e
    • Fix error case memory bug in GetHostName() (#11544) · 66499780
      Committed by Yu Zhang
      Summary:
      Fix the error handling in `GetHostName` for errors other than EFAULT and EINVAL. The current handling can cause a stack overflow when a non-null-terminated C-style string is left in `name`, e.g. on ENAMETOOLONG, when the `name` buffer is not big enough and the host name is truncated. See the sketch below.
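
      A minimal sketch of the defensive pattern, assuming POSIX `gethostname()` semantics (it may truncate without null-terminating):

      ```
      #include <unistd.h>
      #include <cerrno>
      #include <cstddef>

      // Returns false on hard errors; on success `name` is always
      // null-terminated, even if the host name was truncated (ENAMETOOLONG).
      bool SafeGetHostName(char* name, size_t len) {
        if (gethostname(name, len) != 0 && errno != ENAMETOOLONG) {
          return false;  // e.g. EFAULT or EINVAL
        }
        name[len - 1] = '\0';  // guarantee termination before use as a C string
        return true;
      }
      ```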
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11544
      
      Test Plan:
      ```
      COMPILE_WITH_ASAN=1 make all check
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D46775799
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: e0fc9400c50fe38bc1fd888b4fea5fe8706165bf
    • Add a ticker to track number of trash files deleted in background thread (#11540) · b421a8c2
      Committed by Yu Zhang
      Summary:
      This ticker, combined with `rocksdb.files.marked.trash`, can help give a better picture of how well DeleteScheduler is keeping up.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11540
      
      Test Plan:
      ```
      ./delete_scheduler_test
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D46746401
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: f3daa622aa3ddefe7d673e0cc257a47699d506df
  9. 16 June 2023, 4 commits
  10. 15 June 2023, 2 commits
    • Avoid destroying default PosixEnv, safely (#11538) · 70bf5ef0
      Committed by Peter Dillinger
      Summary:
      Use another static object to join threads instead.
      
      This change is motivated by a case in which some code using NewLRUCache() -> ShardedCacheBase -> SemiStructuredUniqueIdGen -> GenerateRawUniqueId() -> Env::Default() was happening
      during static destruction.
      
      I didn't see anything else in PosixEnv or base classes that would cause a problem by not
      destroying. (WinEnv is already not destroyed; see env_default.cc)
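
      A generic sketch of the underlying technique (not the exact code in this patch): keep the default Env in storage that is never destructed, so callers running during static destruction still see a valid object:

      ```
      // Function-local static pointer: constructed on first use, never deleted,
      // so no destructor can run during static destruction.
      template <typename T>
      T& LeakyStatic() {
        static T* instance = new T();  // intentionally leaked
        return *instance;
      }
      // A hypothetical Env::Default() could return &LeakyStatic<PlatformEnv>(),
      // where PlatformEnv stands in for the concrete Env implementation.
      ```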
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11538
      
      Test Plan:
      test added, which would previously fail with UBSAN:
      
      ```
      $ ./env_test --gtest_filter=*Destruct*
      Note: Google Test filter = *Destruct*
      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from EnvTestMisc
      [ RUN      ] EnvTestMisc.StaticDestruction
      [       OK ] EnvTestMisc.StaticDestruction (0 ms)
      [----------] 1 test from EnvTestMisc (0 ms total)
      
      [----------] Global test environment tear-down
      [==========] 1 test from 1 test case ran. (0 ms total)
      [  PASSED  ] 1 test.
      env/env_test.cc:3561:23: runtime error: member call on address 0x7f7b96671ca8 which does not point to an object of type 'rocksdb::Env'
      0x7f7b96671ca8: note: object is of type 'N7rocksdb12ConfigurableE'
       00 00 00 00  90 a7 f7 95 7b 7f 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
                    ^~~~~~~~~~~~~~~~~~~~~~~
                    vptr for 'N7rocksdb12ConfigurableE'
      UndefinedBehaviorSanitizer: undefined-behavior env/env_test.cc:3561:23 in
      $
      ```
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D46737389
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 0f80a443bf799ffc5641e898cf3a75f7d10a987b
    • Do not include last level in compaction when `allow_ingest_behind=true` (#11489) · 15e8a843
      Committed by Changyu Bi
      Summary:
      When a DB is configured with `allow_ingest_behind = true`, the last level should be reserved for ingested files, and these files should not be included in any compaction. Currently, a major compaction can compact these files to smaller levels, which can cause future files to be rejected for ingest behind (see `ExternalSstFileIngestionJob::CheckLevelForIngestedBehindFile()`). This PR fixes the issue so that files in the last level are not included in any compaction. See the sketch below.
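
      A minimal sketch of the configuration this fix protects, using the public options involved:

      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      rocksdb::Status IngestBehindExample(rocksdb::DB* db) {
        // The DB must be opened with options.allow_ingest_behind = true, which
        // reserves the last level for ingested files; with this fix, compactions
        // no longer pull those files up to smaller levels.
        rocksdb::IngestExternalFileOptions ifo;
        ifo.ingest_behind = true;  // place the file in the reserved last level
        return db->IngestExternalFile({"/path/to/file.sst"}, ifo);
      }
      ```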
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11489
      
      Test Plan: * Updated unit test `ExternalSSTFileTest.IngestBehind` to test that the last level is not included in manual and auto compaction.
      
      Reviewed By: ajkr
      
      Differential Revision: D46455711
      
      Pulled By: cbi42
      
      fbshipit-source-id: 5e2142c2a709ef932ad797897795021c06c4ac8c
  11. 14 June 2023, 3 commits
  12. 13 June 2023, 1 commit
  13. 10 June 2023, 1 commit
    • statistics.cc: fix mistype (#11509) · 7c67aee4
      Committed by Ignat Loskutov
      Summary:
      Add new tickers: `rocksdb.error.handler.bg.error.count`, `rocksdb.error.handler.bg.io.error.count`, `rocksdb.error.handler.bg.retryable.io.error.count` to replace the misspelled ones: `rocksdb.error.handler.bg.errro.count`, `rocksdb.error.handler.bg.io.errro.count`, `rocksdb.error.handler.bg.retryable.io.errro.count` ('error' instead of 'errro'). Users should switch to the new tickers before the 9.0 release, when the misspelled old tickers will be removed completely. See the sketch below.
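
      A minimal sketch of reading the new tickers by their string names via `TickersNameMap`, which avoids hardcoding enum spellings (not given here):

      ```
      #include <memory>
      #include <string>
      #include "rocksdb/statistics.h"

      // Resolve a ticker count by its registered name, e.g. the corrected
      // "rocksdb.error.handler.bg.error.count".
      uint64_t TickerByName(const std::shared_ptr<rocksdb::Statistics>& stats,
                            const std::string& name) {
        for (const auto& entry : rocksdb::TickersNameMap) {
          if (entry.second == name) {
            return stats->getTickerCount(entry.first);
          }
        }
        return 0;  // unknown ticker name
      }
      ```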
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11509
      
      Reviewed By: ltamasi
      
      Differential Revision: D46542809
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: a2a6d8354af46a060de81d40ef6f5336a80bd32e
  14. 09 June 2023, 2 commits
  15. 08 June 2023, 1 commit
  16. 07 June 2023, 3 commits
    • Fix higher read qps during db open caused by pr 11406 (#11516) · 3093d98c
      Committed by Hui Xiao
      Summary:
      **Context:**
      [PR11406](https://github.com/facebook/rocksdb/pull/11406/) caused more frequent reads during db open when reading files with no `tail_size` in the manifest, as part of the upgrade to 11406. This is because that PR introduced
      - a [smaller](https://github.com/facebook/rocksdb/pull/11406/files#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fR833) prefetch tail buffer size compared to pre-11406 for small files (< 52 MB) when `tail_prefetch_stats` infers the tail size to be 0 (which usually happens early on, when the stats do not yet have much historical data to infer from)
      - more reads (up to the number of partitioned filters/indexes) when such a small prefetch tail buffer does not contain all the partitioned filters/indexes needed in CacheDependencies(), since the [fallback logic](https://github.com/facebook/rocksdb/pull/11406/files#diff-d98f1a83de24412ad7f3527725dae7e28851c7222622c3cdb832d3cdf24bbf9fR165-R179) that prefetches all partitions at once is skipped when such a small buffer is passed in
      
      **Summary:**
      - Revert the fallback prefetch buffer size change to preserve the existing behavior fully during upgrade in `BlockBasedTable::PrefetchTail()`
      - Use the passed-in prefetch tail buffer in `CacheDependencies()` only if it has a smaller offset than the offset of the first partitioned filter/index, that is, only when it is at least as good as the existing prefetching behavior
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11516
      
      Test Plan:
      - db bench
      
      Create db with small files prior to PR 11406
      ```
      ./db_bench -db=/tmp/testdb/ --partition_index_and_filters=1 --statistics=1 -benchmarks=fillseq -key_size=3200 -value_size=5 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=true -compression_type=zstd
      ```
      Read the db to see if post-PR has lower read qps (i.e., rocksdb.file.read.db.open.micros count) during db open.
      ```
      ./db_bench -use_direct_reads=1 --file_opening_threads=1 --threads=1 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 --db=/tmp/testdb/ --benchmarks=readrandom --key_size=3200 --value_size=5 --num=100 --disable_auto_compactions=true --compression_type=zstd
      ```
      Pre-PR:
      ```
      rocksdb.file.read.db.open.micros P50 : 3.399023 P95 : 5.924468 P99 : 12.408333 P100 : 29.000000 COUNT : 611 SUM : 2539
      ```
      
      Post-PR:
      ```
      rocksdb.file.read.db.open.micros P50 : 593.736842 P95 : 861.605263 P99 : 1212.868421 P100 : 2663.000000 COUNT : 585 SUM : 345349
      ```
      
      _Note: To control the starting offset of the prefetch tail buffer more easily, I manually overrode the following to eliminate the effect of alignment:_
      ```
      class PosixRandomAccessFile : public FSRandomAccessFile {
      virtual size_t GetRequiredBufferAlignment() const override {
      -    return logical_sector_size_;
      +    return 1;
        }
      ```
      
      - CI
      
      Reviewed By: pdillinger
      
      Differential Revision: D46472566
      
      Pulled By: hx235
      
      fbshipit-source-id: 2fe14ac8d489d15b0e08e6f8fe4f46d5f110978e
    • Fix subcompaction bug to allow running two subcompactions (#11501) · 2e8cc98a
      Committed by Changyu Bi
      Summary:
      As reported in https://github.com/facebook/rocksdb/issues/11476, RocksDB currently does not execute compactions as two subcompactions even when they qualify. This PR fixes the issue; see the sketch below.
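
      A minimal sketch of the option involved; with this fix, a qualifying compaction can actually be split when it is set:

      ```
      #include "rocksdb/options.h"

      rocksdb::Options SubcompactionOptions() {
        rocksdb::Options options;
        // Allow a single compaction job to run as up to 2 subcompactions.
        options.max_subcompactions = 2;
        return options;
      }
      ```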
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11501
      
      Test Plan:
      * Add a new unit test.
      * Run crash test with max_subcompactions=2: `python3 tools/db_crashtest.py blackbox --simple  --subcompactions=2 --target_file_size_base=1048576 --compaction_style=0`
        * saw logs showing compactions being executed as 2 subcompactions
      ```
      2023/06/01-17:28:44.028470 3147486 (Original Log Time 2023/06/01-17:28:44.025972) EVENT_LOG_v1 {"time_micros": 1685665724025939, "job": 6, "event": "compaction_finished", "compaction_time_micros": 34539, "compaction_time_cpu_micros": 26404, "output_level": 6, "num_output_files": 2, "total_output_size": 1109796, "num_input_records": 13188, "num_output_records": 13021, "num_subcompactions": 2, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 13]}
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D46411497
      
      Pulled By: cbi42
      
      fbshipit-source-id: 3ebfc02e19f78f782e114a9546dc3d481d496258
    • IterKey: change space_[32] to 39 to utilize padding space (#10633) · ddfcbea3
      Committed by leipeng
      Summary:
      This PR utilizes padding space that would otherwise be wasted; see the sketch below.
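
      A sketch of the layout argument with a simplified stand-in struct; the member set and the 72-byte total assume a typical 64-bit (LP64) target:

      ```
      #include <cstddef>

      struct IterKeyLayout {   // simplified stand-in, not the real class
        const char* buf_;      // 8 bytes
        const char* key_;      // 8 bytes
        size_t key_size_;      // 8 bytes
        size_t buf_size_;      // 8 bytes
        char space_[39];       // was 32; with the bool below this reaches 72
        bool is_user_key_;     // 1 byte -> 32 + 39 + 1 = 72, a multiple of 8
      };
      // With space_[32], the 65 content bytes would be padded up to 72 anyway,
      // so the 7 extra inline bytes are free.
      static_assert(sizeof(IterKeyLayout) == 72, "inline buffer fills the padding");
      ```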
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10633
      
      Reviewed By: pdillinger
      
      Differential Revision: D46410978
      
      Pulled By: ajkr
      
      fbshipit-source-id: 23ec757b1eea9221c1390971e39d341c6b7f2003
  17. 06 June 2023, 3 commits
  18. 03 June 2023, 2 commits
    • Small improvements to DBGet microbenchmark (#11498) · 687a2a0d
      Committed by Andrew Kryczka
      Summary:
      Follow a couple of best practices:

      - Allowed Google Benchmark to decide the number of iterations. Previously we hardcoded a value, which circumvented the benchmark's heuristic of iterating until the result is stable.
      - Made each iteration do similar work. Previously, an iteration could do different work depending on whether the key was found in the first, second, third, or no L0 file. A sketch of both practices follows.
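
      A minimal sketch of both practices with Google Benchmark; the body is a placeholder, not the actual DBGet benchmark:

      ```
      #include <benchmark/benchmark.h>

      static void BM_DBGet(benchmark::State& state) {
        // Setup outside the timed loop: open the DB and choose keys so every
        // lookup traverses the same number of L0 files (uniform work).
        for (auto _ : state) {  // iteration count chosen by the framework
          int result = 0;       // placeholder for one db->Get(...) call
          benchmark::DoNotOptimize(result);
        }
      }
      BENCHMARK(BM_DBGet);  // note: no hardcoded ->Iterations(N)
      BENCHMARK_MAIN();
      ```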
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11498
      
      Test Plan: none as I am unable to prove it is better
      
      Reviewed By: hx235
      
      Differential Revision: D46339050
      
      Pulled By: ajkr
      
      fbshipit-source-id: fcfc6da4111c5b3ae86d79d908afc5f61f96675b
    • Some fixes to unreleased_history/ (#11504) · 7a9b264f
      Committed by Peter Dillinger
      Summary:
      * Add a "Performance Improvements" section
      * Add required copyright headers
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11504
      
      Test Plan: manual
      
      Reviewed By: hx235
      
      Differential Revision: D46405128
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 4f878dfd0170d381d3051a44c13479c860e812c0
  19. 02 June 2023, 3 commits
    • Log correct compaction score for Universal Compaction (#11487) · 71ca9a1d
      Committed by Changyu Bi
      Summary:
      Currently, 0 is incorrectly logged as the compaction score for L0 when num_levels > 1. This PR fixes the issue so that the correct score is logged.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11487
      
      Test Plan:
      ```
      ./db_bench --benchmarks=fillrandom --max_background_jobs=8 --num=1000000  --compaction_style=1 --stats_dump_period_sec=20 --num_levels=7 --write_buffer_size=1048576
      
      grep "L0   " /tmp/rocksdbtest-543376/dbbench/LOG
      
      before:
      ** Compaction Stats [default] **
      Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
      ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0      9.9      0.42              0.33         9    0.046       0      0       0.0       0.0
      L0      3/1    1.37 MB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.1      0.6      9.6      3.76              3.03        76    0.050     34K    140       0.0       0.0
      L0      2/0    2.26 MB   0.0      0.0     0.0      0.0       0.1      0.1       0.0   1.6      3.2      8.2     12.59             11.17       163    0.077    619K   5499       0.0       0.0
      
      after: compaction scores are non-zero
      L0      0/0    0.00 KB   0.8      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0      9.6      0.43              0.34         9    0.048       0      0       0.0       0.0
      L0      2/1   937.08 KB   1.0      0.0     0.0      0.0       0.0      0.0       0.0   1.1      0.6      9.3      3.85              3.07        75    0.051     34K    165       0.0       0.0
      L0      2/2    1.82 MB   1.0      0.0     0.0      0.0       0.1      0.1       0.0   1.6      3.0      8.0     12.45             10.99       160    0.078    577K   5399       0.0       0.0
      
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D46293993
      
      Pulled By: cbi42
      
      fbshipit-source-id: 19753f7df68c5f54a84c4ed52794f83e510c9721
    • `CompactRange()` always compacts to bottommost level for leveled compaction (#11468) · e95cc121
      Committed by Changyu Bi
      Summary:
      Currently for leveled compaction, the max output level of a call to `CompactRange()` is pre-computed before compacting each level. This max output level is the highest level whose key range overlaps with the manual compaction key range. However, during the manual compaction, files in the max output level may be compacted down further by some background compaction. When this background compaction is a trivial move, there is a race condition, and the manual compaction may not be able to compact all keys in the specified key range. This PR updates `CompactRange()` to always compact to the bottommost level to make this race condition more unlikely (it can still happen; see more in the comment here: https://github.com/cbi42/rocksdb/blob/796f58f42ad1bdbf49e5fcf480763f11583b790e/db/db_impl/db_impl_compaction_flush.cc#L1180C29-L1184).
      
      This PR also changes the behavior of CompactRange() when `bottommost_level_compaction=kIfHaveCompactionFilter` (the default option). The old behavior was that, if a compaction filter is provided, CompactRange() always does an intra-level compaction at the final output level for all files in the manual compaction key range; the only exception is when `first_overlapped_level = 0` and `max_overlapped_level = 0`. It's awkward to maintain the same behavior after this PR since we no longer compute max_overlapped_level. So the new behavior is similar to kForceOptimized: always do an intra-level compaction at the bottommost level, but exclude new files generated during this manual compaction (see the sketch below).
      
      Several unit tests are updated to work with this new manual compaction behavior.
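
      A minimal sketch of the call whose behavior changes here, using the public `CompactRangeOptions`:

      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      rocksdb::Status FullManualCompaction(rocksdb::DB* db) {
        rocksdb::CompactRangeOptions cro;
        // The default kIfHaveCompactionFilter now behaves like kForceOptimized:
        // intra-level compaction at the bottommost level, excluding files newly
        // generated by this manual compaction.
        cro.bottommost_level_compaction =
            rocksdb::BottommostLevelCompaction::kIfHaveCompactionFilter;
        // nullptr/nullptr compacts the whole key range, now always down to the
        // bottommost level under leveled compaction.
        return db->CompactRange(cro, nullptr, nullptr);
      }
      ```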
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11468
      
      Test Plan: Add new unit tests `DBCompactionTest.ManualCompactionCompactAllKeysInRange*`
      
      Reviewed By: ajkr
      
      Differential Revision: D46079619
      
      Pulled By: cbi42
      
      fbshipit-source-id: 19d844ba4ec8dc1a0b8af5d2f36ff15820c6e76f
    • Add support to strip / pad timestamp when creating / reading a block based table (#11495) · 9f7877f2
      Committed by Yu Zhang
      Summary:
      Add support to strip timestamp in block based table builder and pad timestamp in block based table reader.
      
      On the write path, use the per column family option `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` to indicate whether user-defined timestamps should be stripped for all block based tables created for the column family.
      
      On the read path, added a per-table `TableReadOption.user_defined_timestamps_persisted` to flag whether the user keys in the table contain user-defined timestamps.
      
      This patch mostly passes the related flags down to the block building/parsing level, with the exception of handling the `first_internal_key` in `IndexValue`, which is handled at the `IndexBuilder` level. The value part of range deletion entries should get similar handling; I haven't decided where this piece of logic best fits, so I will do it in a follow-up. A sketch of the write-path option follows.
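
      A minimal sketch of the write-path knob (the read-path flag lives in internal table-reader options and is not set directly by users); `BytewiseComparatorWithU64Ts()` is used here as an assumed timestamp-aware comparator:

      ```
      #include "rocksdb/comparator.h"
      #include "rocksdb/options.h"

      rocksdb::ColumnFamilyOptions UdtStrippingCfOptions() {
        rocksdb::ColumnFamilyOptions cf_opts;
        // Keys carry a uint64 user-defined timestamp in memory...
        cf_opts.comparator = rocksdb::BytewiseComparatorWithU64Ts();
        // ...but timestamps are stripped when SST files are built; the reader
        // pads a minimum timestamp back when the table is read.
        cf_opts.persist_user_defined_timestamps = false;
        return cf_opts;
      }
      ```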
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11495
      
      Test Plan:
      Existing test `BlockBasedTableReaderTest` is parameterized to run with:
      1) different UDT test modes: kNone, kNormal, kStripUserDefinedTimestamp
      2) all four index types, when index type is `kTwoLevelIndexSearch`, also enables partitioned filters
      3) parallel vs non-parallel compression
      4) enable/disable compression dictionary.
      
      Also added tests for API `BlockBasedTableReader::NewIterator`.
      
      `PartitionedFilterBlockTest` is parameterized to run with different UDT test modes:kNone, kNormal, kStripUserDefinedTimestamp.
      
      ```
      make all check
      ./block_based_table_reader_test
      ./partitioned_filter_block_test
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D46344577
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 93ac8542b19319d1298712b8bed908c8831ba675