1. 27 June 2023, 2 commits
  2. 24 June 2023, 2 commits
    • Add CreateColumnFamilyWithImport to `StackableDB` and `DBImplReadOnly` (#11556) · ca50ccc7
      Committed by Changyu Bi
      Summary:
      https://github.com/facebook/rocksdb/issues/11378 added a new overloaded `CreateColumnFamilyWithImport` API and updated the virtual function in `StackableDB` and `DBImplReadOnly` to the newly overloaded one. This caused an internal build error for derived classes that override the original `CreateColumnFamilyWithImport` function. This PR adds the original `CreateColumnFamilyWithImport` function back to `StackableDB` and `DBImplReadOnly`, as sketched below.
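
      A minimal sketch of what the restored declaration allows; the signatures are abbreviated from the public DB interface and should be treated as assumptions:

      ```
      #include "rocksdb/utilities/stackable_db.h"

      class MyStackableDB : public rocksdb::StackableDB {
       public:
        using rocksdb::StackableDB::StackableDB;

        // Overriding the original (single-metadata) overload compiles again
        // now that StackableDB declares it alongside the vector overload.
        rocksdb::Status CreateColumnFamilyWithImport(
            const rocksdb::ColumnFamilyOptions& options,
            const std::string& column_family_name,
            const rocksdb::ImportColumnFamilyOptions& import_options,
            const rocksdb::ExportImportFilesMetaData& metadata,
            rocksdb::ColumnFamilyHandle** handle) override {
          // Delegate to the wrapped DB.
          return rocksdb::StackableDB::CreateColumnFamilyWithImport(
              options, column_family_name, import_options, metadata, handle);
        }
      };
      ```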
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11556
      
      Test Plan: check if this fixes an internal build
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D46980506
      
      Pulled By: cbi42
      
      fbshipit-source-id: 975a6c5748bf9481499a62ee5997ca59e542e3bc
    • Add an interface to provide support for underlying FS to pass their own buffer... · fbd2f563
      Committed by akankshamahajan
      Add an interface to provide support for underlying FS to pass their own buffer during reads (#11324)
      
      Summary:
      1. Public API change: replace the `use_async_io` API in file_system with a `SupportedOps` API, which the underlying FileSystem uses to indicate to upper layers which of the operations in `enum FSSupportedOps` it supports. Right now the operations are `async_io` and whether the FS will provide its own buffer during reads. The API is designed to cover various FileSystem operations in one call rather than adding a separate API for each operation.
      
      2. Provide support for the underlying FS to pass its own buffer during reads (async and sync) instead of using the RocksDB-provided `scratch` buffer in `FSReadRequest`. Currently only MultiRead supports it; it will later be extended to other reads as well (point lookup, scan, etc.). More details on enabling this are in file_system.h; see the sketch below.
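
      A minimal sketch of how a FileSystem might advertise these capabilities; the enum/bit usage follows the `FSSupportedOps` description above and should be treated as an assumption:

      ```
      #include "rocksdb/file_system.h"

      // Advertise supported operations by setting bits keyed by FSSupportedOps.
      class MyFileSystem : public rocksdb::FileSystemWrapper {
       public:
        explicit MyFileSystem(const std::shared_ptr<rocksdb::FileSystem>& target)
            : rocksdb::FileSystemWrapper(target) {}
        const char* Name() const override { return "MyFileSystem"; }

        void SupportedOps(int64_t& supported_ops) override {
          supported_ops = 0;
          supported_ops |= (1 << rocksdb::FSSupportedOps::kAsyncIO);   // async reads
          supported_ops |= (1 << rocksdb::FSSupportedOps::kFSBuffer);  // FS-owned read buffer
        }
      };
      ```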
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11324
      
      Test Plan: Tested locally
      
      Reviewed By: anand1976
      
      Differential Revision: D44465322
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 9ec9e08f839b5cc815e75d5dade6cd549998d0ec
  3. 23 June 2023, 1 commit
  4. 22 June 2023, 3 commits
    • Use FaultInjectionTestFS in transaction_test, clarify Close() APIs (#11499) · 05a1d52e
      Committed by Peter Dillinger
      Summary:
      ... instead of race-condition-laden FaultInjectionTestEnv. See https://app.circleci.com/pipelines/github/facebook/rocksdb/27912/workflows/4c63e5a8-597e-439d-8c7e-82308056af02/jobs/609648 and similar PR https://github.com/facebook/rocksdb/issues/11271
      
      Had to fix the semantics of FaultInjectionTestFS Close() operations to allow a non-OK Close() to fulfill the obligation to close before destruction. To me, this is the obvious choice of Close contract, because what is the caller supposed to do if Close() fails and they still have an obligation to successfully close before object destruction? Call Close() in an infinite loop? Leak the object? I have added API comments to the Env and FileSystem Close() functions to clarify the contracts.
      
      Note that `DB::Close()` has one exception to this kind of Close contract, but it is clearly described in API comments and it is really only for catching programming mistakes, not for dealing with exogenous errors.
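
      A minimal sketch of the clarified contract, using the `Env`-level `WritableFile` API:

      ```
      #include <cstdio>
      #include <memory>
      #include "rocksdb/env.h"

      // Even a failing Close() fulfills the close-before-destruction obligation:
      // report the error, then destroying the object is allowed.
      void CloseAndDestroy(std::unique_ptr<rocksdb::WritableFile> file) {
        rocksdb::Status s = file->Close();
        if (!s.ok()) {
          fprintf(stderr, "Close failed: %s\n", s.ToString().c_str());
        }
        // No infinite retry loop, no leak: destruction is safe here.
      }
      ```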
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11499
      
      Test Plan: watch CI
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D46375708
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 03d4d8251e5df50a82ecd139f7e83f613015fe40
    • Record the `persist_user_defined_timestamps` flag in manifest (#11515) · 7521478b
      Committed by Yu Zhang
      Summary:
      Start to record the value of the flag `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` in the manifest and in the table properties of an SST file when it is created, and use the recorded flag when creating a table reader for the SST file. This flag's default value is true; it is only explicitly recorded if it's false.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11515
      
      Test Plan:
      ```
      make all check
      ./version_edit_test
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D46920386
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 075c20363d3d2cc1368422ecc805617ed135cc26
    • Internal API for generating semi-random salt (#11331) · 98c6d7fd
      Committed by Peter Dillinger
      Summary:
      ... so that a non-cryptographic whole file checksum would be highly resistant
      to manipulation by a user able to manipulate key-value data (e.g. a user whose data is
      stored in RocksDB) and able to predict SST metadata such as DB session id and file
      number based on read access to logs or DB files. The adversary would also need to predict
      the salt in order to influence the checksum result toward collision with another file's
      checksum.
      
      This change is just internal code to support such a future feature. I think this should be a passive feature, not option-controlled, because you probably won't think about needing it until you discover you do need it, and it should be low cost in both space (16 bytes per SST file) and CPU.
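
      An illustrative sketch (not the internal RocksDB code) of the idea: seed a non-cryptographic checksum with a salt the adversary cannot predict, so crafting a collision from key-value data alone is not enough:

      ```
      #include <cstddef>
      #include <cstdint>

      // FNV-1a style whole-file checksum, seeded with a per-file salt.
      uint64_t SaltedChecksum(uint64_t salt, const unsigned char* data, size_t n) {
        uint64_t h = 0xcbf29ce484222325ULL ^ salt;  // fold the salt into the seed
        for (size_t i = 0; i < n; ++i) {
          h = (h ^ data[i]) * 0x100000001b3ULL;  // standard FNV-1a mix step
        }
        return h;
      }
      ```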
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11331
      
      Test Plan: Unit tests added to verify at least pseudorandom behavior. (Actually caught a bug in first draft!) The new "stress" style tests run in ~3ms each on my system.
      
      Reviewed By: ajkr
      
      Differential Revision: D46129415
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7972dc74487e062b29b1fd9c227425e922c98796
  5. 21 June 2023, 1 commit
  6. 20 June 2023, 2 commits
    • Attempt to deflake DBWALTestWithEnrichedEnv.SkipDeletedWALs (#11537) · 022d8954
      Committed by Levi Tamasi
      Summary:
      Calling `Flush` (even with `wait==true`) does not guarantee that obsolete WAL files are physically deleted before the call returns. The patch attempts to fix the resulting flakiness by using `SyncPoint`s to make sure `PurgeObsoleteFiles` finishes before checking for WAL deletions.
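
      A minimal sketch of the `SyncPoint` dependency pattern used for this kind of deflaking; the marker names below are hypothetical stand-ins, not the ones in the patch:

      ```
      #include "test_util/sync_point.h"

      void WaitForPurgeBeforeWalCheck() {
        // The first marker must fire before the second may proceed.
        rocksdb::SyncPoint::GetInstance()->LoadDependency(
            {{"DBImpl::PurgeObsoleteFiles:End",     // hypothetical producer marker
              "SkipDeletedWALs:BeforeWALCheck"}});  // hypothetical test marker
        rocksdb::SyncPoint::GetInstance()->EnableProcessing();
        // ... Flush() here ...
        TEST_SYNC_POINT("SkipDeletedWALs:BeforeWALCheck");
        // ... now it is safe to assert the obsolete WAL is gone ...
        rocksdb::SyncPoint::GetInstance()->DisableProcessing();
      }
      ```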
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11537
      
      Test Plan:
      ```
      gtest-parallel --repeat=1000 ./db_wal_test --gtest_filter="*SkipDeletedWALs*"
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D46736050
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 47a931b7a3a03ef681fbf4adb5a0b223d452703e
    • Initialize StressTest::optimistic_txn_db_ in ctor (#11547) · b3edb873
      Committed by Levi Tamasi
      Summary:
      `StressTest::optimistic_txn_db_` is currently not initialized by the constructor, which
      can lead to assertion failures down the line in `StressTest::Open`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11547
      
      Reviewed By: cbi42
      
      Differential Revision: D46845658
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 578b0f24fc00e3e97f24221fcdd003cc529439c2
  7. 18 June 2023, 1 commit
    • Stress/Crash Test for OptimisticTransactionDB (#11513) · 17d52005
      Committed by Jay Huh
      Summary:
      Context:
      OptimisticTransactionDB has not been covered by db_stress (including the crash test) the way TransactionDB is.
      1. Added the following gflag options to test OptimisticTransactionDB:
      - `use_optimistic_txn`: When true, open OptimisticTransactionDB to test
      - `occ_validation_policy`: `OccValidationPolicy::kValidateParallel = 1` by default.
      - `share_occ_lock_buckets`: Use shared occ locks
      - `occ_lock_bucket_count`: 500 by default. Number of buckets to use for shared occ lock.
      2. Added opening of OptimisticTransactionDB and NewTxn/Commit calls, gated on the `use_optimistic_txn` flag, in `db_stress_test_base.cc`
      3. OptimisticTransactionDB blackbox/whitebox test added in crash_test.mk
      
      Please note that the existing flag `use_txn` is still used here. When `use_txn == true` and `use_optimistic_txn == false`, we use `TransactionDB` (a.k.a. the pessimistic transaction DB). When both `use_txn` and `use_optimistic_txn` are true, we use `OptimisticTransactionDB`. If `use_txn == false` but `use_optimistic_txn == true`, an error is thrown with the message _"You cannot set use_optimistic_txn true while use_txn is false. Please set use_txn true if you want to use OptimisticTransactionDB"_. A sketch of the flag-to-API mapping follows.
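
      A minimal sketch of how these flags plausibly map onto the public API; the exact wiring is an assumption based on this description:

      ```
      #include "rocksdb/utilities/optimistic_transaction_db.h"

      rocksdb::Status OpenOccDb(const rocksdb::Options& options,
                                const std::string& path,
                                std::vector<rocksdb::ColumnFamilyHandle*>* handles,
                                rocksdb::OptimisticTransactionDB** db) {
        rocksdb::OptimisticTransactionDBOptions occ_opts;
        // occ_validation_policy=1 (the default):
        occ_opts.validate_policy = rocksdb::OccValidationPolicy::kValidateParallel;
        // share_occ_lock_buckets=1 with occ_lock_bucket_count=500:
        occ_opts.shared_lock_buckets =
            rocksdb::MakeSharedOccLockBuckets(/*bucket_count=*/500);
        std::vector<rocksdb::ColumnFamilyDescriptor> cfs{
            {rocksdb::kDefaultColumnFamilyName, options}};
        return rocksdb::OptimisticTransactionDB::Open(
            rocksdb::DBOptions(options), occ_opts, path, cfs, handles, db);
      }
      ```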
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11513
      
      Test Plan:
      **Crash Test**
      Serial Validation
      ```
      export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=0"
      make crash_test -j
      ```
      Parallel Validation (no share bucket)
      ```
      export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=0"
      make crash_test -j
      ```
      Parallel Validation (share bucket)
      ```
      export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=1 --occ_lock_bucket_count=500"
      make crash_test -j
      ```
      
      **Stress Test**
      ```
      ./db_stress -use_optimistic_txn -threads=32
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D46547387
      
      Pulled By: jaykorean
      
      fbshipit-source-id: ca19819ca6e0281694966998014b40d95d4e5960
  8. 17 June 2023, 3 commits
    • Add UT to test BG read qps behavior during upgrade for pr11406 (#11522) · 1da9ac23
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      When db is upgrading to adopt [pr11406](https://github.com/facebook/rocksdb/pull/11406/), it's possible for RocksDB to infer a small tail size to prefetch for pre-upgrade files. Such a small tail size would have caused one file read per index or filter partition if a partitioned index or filter is used. This PR provides a UT to show this does not happen.
      
      Misc: refactor the related UTs a bit to make this new UT more readable.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11522
      
      Test Plan:
      - New UT
      If the upgrade logic is wrong, e.g.,
      ```
       --- a/table/block_based/partitioned_index_reader.cc
      +++ b/table/block_based/partitioned_index_reader.cc
      @@ -166,7 +166,8 @@ Status PartitionIndexReader::CacheDependencies(
         uint64_t prefetch_len = last_off - prefetch_off;
         std::unique_ptr<FilePrefetchBuffer> prefetch_buffer;
         if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled() ||
      -      tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off) {
      +      (false && tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off)) {
      ```
      then the UT will fail like below:
      ```
      [ RUN      ] PrefetchTailTest/PrefetchTailTest.UpgradeToTailSizeInManifest/0
      file/prefetch_test.cc:461: Failure
      Expected: (db_open_file_read.count) < (num_index_partition), actual: 38 vs 33
      Received signal 11 (Segmentation fault)
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D46546707
      
      Pulled By: hx235
      
      fbshipit-source-id: 9897b0a975e9055963edac5451fd1cd9d6c45d0e
    • Fix error case memory bug in GetHostName() (#11544) · 66499780
      Committed by Yu Zhang
      Summary:
      Fix the error handling in `GetHostName` for errors other than EFAULT and EINVAL. The current handling can cause a stack overflow when a non-null-terminated C-style string is left in `name`, e.g. on ENAMETOOLONG, when the `name` buffer is not big enough and the host name is truncated. See the sketch below.
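
      A minimal sketch of the defensive pattern, assuming POSIX `gethostname()` semantics (it may truncate without null-terminating):

      ```
      #include <unistd.h>
      #include <cerrno>
      #include <cstddef>

      // Returns false on hard errors; on success `name` is always
      // null-terminated, even if the host name was truncated (ENAMETOOLONG).
      bool SafeGetHostName(char* name, size_t len) {
        if (gethostname(name, len) != 0 && errno != ENAMETOOLONG) {
          return false;  // e.g. EFAULT or EINVAL
        }
        name[len - 1] = '\0';  // guarantee termination before use as a C string
        return true;
      }
      ```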
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11544
      
      Test Plan:
      ```
      COMPILE_WITH_ASAN=1 make all check
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D46775799
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: e0fc9400c50fe38bc1fd888b4fea5fe8706165bf
    • Add a ticker to track number of trash files deleted in background thread (#11540) · b421a8c2
      Committed by Yu Zhang
      Summary:
      This ticker, combined with `rocksdb.files.marked.trash`, can help give a better picture of how well DeleteScheduler is keeping up.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11540
      
      Test Plan:
      ```
      ./delete_scheduler_test
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D46746401
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: f3daa622aa3ddefe7d673e0cc257a47699d506df
  9. 16 June 2023, 4 commits
  10. 15 June 2023, 2 commits
    • Avoid destroying default PosixEnv, safely (#11538) · 70bf5ef0
      Committed by Peter Dillinger
      Summary:
      Use another static object to join threads instead.
      
      This change is motivated by a case in which some code using NewLRUCache() -> ShardedCacheBase -> SemiStructuredUniqueIdGen -> GenerateRawUniqueId() -> Env::Default() was happening
      during static destruction.
      
      I didn't see anything else in PosixEnv or base classes that would cause a problem by not
      destroying. (WinEnv is already not destroyed; see env_default.cc)
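
      A generic sketch of the underlying technique (not the exact code in this patch): keep the default Env in storage that is never destructed, so callers running during static destruction still see a valid object:

      ```
      // Function-local static pointer: constructed on first use, never deleted,
      // so no destructor can run during static destruction.
      template <typename T>
      T& LeakyStatic() {
        static T* instance = new T();  // intentionally leaked
        return *instance;
      }
      // A hypothetical Env::Default() could return &LeakyStatic<PlatformEnv>(),
      // where PlatformEnv stands in for the concrete Env implementation.
      ```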
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11538
      
      Test Plan:
      test added, which would previously fail with UBSAN:
      
      ```
      $ ./env_test --gtest_filter=*Destruct*
      Note: Google Test filter = *Destruct*
      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from EnvTestMisc
      [ RUN      ] EnvTestMisc.StaticDestruction
      [       OK ] EnvTestMisc.StaticDestruction (0 ms)
      [----------] 1 test from EnvTestMisc (0 ms total)
      
      [----------] Global test environment tear-down
      [==========] 1 test from 1 test case ran. (0 ms total)
      [  PASSED  ] 1 test.
      env/env_test.cc:3561:23: runtime error: member call on address 0x7f7b96671ca8 which does not point to an object of type 'rocksdb::Env'
      0x7f7b96671ca8: note: object is of type 'N7rocksdb12ConfigurableE'
       00 00 00 00  90 a7 f7 95 7b 7f 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
                    ^~~~~~~~~~~~~~~~~~~~~~~
                    vptr for 'N7rocksdb12ConfigurableE'
      UndefinedBehaviorSanitizer: undefined-behavior env/env_test.cc:3561:23 in
      $
      ```
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D46737389
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 0f80a443bf799ffc5641e898cf3a75f7d10a987b
    • Do not include last level in compaction when `allow_ingest_behind=true` (#11489) · 15e8a843
      Committed by Changyu Bi
      Summary:
      When a DB is configured with `allow_ingest_behind = true`, the last level should be reserved for ingested files, and these files should not be included in any compaction. Currently, a major compaction can compact these files to smaller levels, which can cause future files to be rejected for ingest behind (see `ExternalSstFileIngestionJob::CheckLevelForIngestedBehindFile()`). This PR fixes the issue so that files in the last level are not included in any compaction. See the sketch below.
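
      A minimal sketch of the configuration this fix protects, using the public options involved:

      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      rocksdb::Status IngestBehindExample(rocksdb::DB* db) {
        // The DB must be opened with options.allow_ingest_behind = true, which
        // reserves the last level for ingested files; with this fix, compactions
        // no longer pull those files up to smaller levels.
        rocksdb::IngestExternalFileOptions ifo;
        ifo.ingest_behind = true;  // place the file in the reserved last level
        return db->IngestExternalFile({"/path/to/file.sst"}, ifo);
      }
      ```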
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11489
      
      Test Plan: * Updated unit test `ExternalSSTFileTest.IngestBehind` to test that the last level is not included in manual and auto compaction.
      
      Reviewed By: ajkr
      
      Differential Revision: D46455711
      
      Pulled By: cbi42
      
      fbshipit-source-id: 5e2142c2a709ef932ad797897795021c06c4ac8c
  11. 14 June 2023, 3 commits
  12. 13 June 2023, 1 commit
  13. 10 June 2023, 1 commit
    • statistics.cc: fix mistype (#11509) · 7c67aee4
      Committed by Ignat Loskutov
      Summary:
      Add new tickers: `rocksdb.error.handler.bg.error.count`, `rocksdb.error.handler.bg.io.error.count`, `rocksdb.error.handler.bg.retryable.io.error.count` to replace the misspelled ones: `rocksdb.error.handler.bg.errro.count`, `rocksdb.error.handler.bg.io.errro.count`, `rocksdb.error.handler.bg.retryable.io.errro.count` ('error' instead of 'errro'). Users should switch to the new tickers before the 9.0 release, when the misspelled old tickers will be removed completely. See the sketch below.
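
      A minimal sketch of reading the new tickers by their string names via `TickersNameMap`, which avoids hardcoding enum spellings (not given here):

      ```
      #include <memory>
      #include <string>
      #include "rocksdb/statistics.h"

      // Resolve a ticker count by its registered name, e.g. the corrected
      // "rocksdb.error.handler.bg.error.count".
      uint64_t TickerByName(const std::shared_ptr<rocksdb::Statistics>& stats,
                            const std::string& name) {
        for (const auto& entry : rocksdb::TickersNameMap) {
          if (entry.second == name) {
            return stats->getTickerCount(entry.first);
          }
        }
        return 0;  // unknown ticker name
      }
      ```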
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11509
      
      Reviewed By: ltamasi
      
      Differential Revision: D46542809
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: a2a6d8354af46a060de81d40ef6f5336a80bd32e
  14. 09 June 2023, 2 commits
  15. 08 June 2023, 1 commit
  16. 07 June 2023, 3 commits
    • Fix higher read qps during db open caused by pr 11406 (#11516) · 3093d98c
      Committed by Hui Xiao
      Summary:
      **Context:**
      [PR11406](https://github.com/facebook/rocksdb/pull/11406/) caused more frequent reads during db open when reading files with no `tail_size` in the manifest, as part of the upgrade to 11406. This is because that PR introduced
      - a [smaller](https://github.com/facebook/rocksdb/pull/11406/files#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fR833) prefetch tail buffer size compared to pre-11406 for small files (< 52 MB) when `tail_prefetch_stats` infers the tail size to be 0 (which usually happens early on, when the stats do not yet have much historical data to infer from)
      - more reads (up to the number of partitioned filters/indexes) when such a small prefetch tail buffer does not contain all the partitioned filters/indexes needed in CacheDependencies(), since the [fallback logic](https://github.com/facebook/rocksdb/pull/11406/files#diff-d98f1a83de24412ad7f3527725dae7e28851c7222622c3cdb832d3cdf24bbf9fR165-R179) that prefetches all partitions at once is skipped when such a small buffer is passed in
      
      **Summary:**
      - Revert the fallback prefetch buffer size change to preserve the existing behavior fully during upgrade in `BlockBasedTable::PrefetchTail()`
      - Use the passed-in prefetch tail buffer in `CacheDependencies()` only if it has a smaller offset than the offset of the first partitioned filter/index, that is, only when it is at least as good as the existing prefetching behavior
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11516
      
      Test Plan:
      - db bench
      
      Create db with small files prior to PR 11406
      ```
      ./db_bench -db=/tmp/testdb/ --partition_index_and_filters=1 --statistics=1 -benchmarks=fillseq -key_size=3200 -value_size=5 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=true -compression_type=zstd
      ```
      Read the db to see if post-PR has lower read qps (i.e., rocksdb.file.read.db.open.micros count) during db open.
      ```
      ./db_bench -use_direct_reads=1 --file_opening_threads=1 --threads=1 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 --db=/tmp/testdb/ --benchmarks=readrandom --key_size=3200 --value_size=5 --num=100 --disable_auto_compactions=true --compression_type=zstd
      ```
      Pre-PR:
      ```
      rocksdb.file.read.db.open.micros P50 : 3.399023 P95 : 5.924468 P99 : 12.408333 P100 : 29.000000 COUNT : 611 SUM : 2539
      ```
      
      Post-PR:
      ```
      rocksdb.file.read.db.open.micros P50 : 593.736842 P95 : 861.605263 P99 : 1212.868421 P100 : 2663.000000 COUNT : 585 SUM : 345349
      ```
      
      _Note: To control the starting offset of the prefetch tail buffer more easily, I manually overrode the following to eliminate the effect of alignment:_
      ```
      class PosixRandomAccessFile : public FSRandomAccessFile {
      virtual size_t GetRequiredBufferAlignment() const override {
      -    return logical_sector_size_;
      +    return 1;
        }
      ```
      
      - CI
      
      Reviewed By: pdillinger
      
      Differential Revision: D46472566
      
      Pulled By: hx235
      
      fbshipit-source-id: 2fe14ac8d489d15b0e08e6f8fe4f46d5f110978e
    • Fix subcompaction bug to allow running two subcompactions (#11501) · 2e8cc98a
      Committed by Changyu Bi
      Summary:
      As reported in https://github.com/facebook/rocksdb/issues/11476, RocksDB currently does not execute compactions as two subcompactions even when they qualify. This PR fixes the issue; see the sketch below.
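
      A minimal sketch of the option involved; with this fix, a qualifying compaction can actually be split when it is set:

      ```
      #include "rocksdb/options.h"

      rocksdb::Options SubcompactionOptions() {
        rocksdb::Options options;
        // Allow a single compaction job to run as up to 2 subcompactions.
        options.max_subcompactions = 2;
        return options;
      }
      ```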
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11501
      
      Test Plan:
      * Add a new unit test.
      * Run crash test with max_subcompactions=2: `python3 tools/db_crashtest.py blackbox --simple  --subcompactions=2 --target_file_size_base=1048576 --compaction_style=0`
        * saw logs showing compactions being executed as 2 subcompactions
      ```
      2023/06/01-17:28:44.028470 3147486 (Original Log Time 2023/06/01-17:28:44.025972) EVENT_LOG_v1 {"time_micros": 1685665724025939, "job": 6, "event": "compaction_finished", "compaction_time_micros": 34539, "compaction_time_cpu_micros": 26404, "output_level": 6, "num_output_files": 2, "total_output_size": 1109796, "num_input_records": 13188, "num_output_records": 13021, "num_subcompactions": 2, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 13]}
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D46411497
      
      Pulled By: cbi42
      
      fbshipit-source-id: 3ebfc02e19f78f782e114a9546dc3d481d496258
    • IterKey: change space_[32] to 39 to utilize padding space (#10633) · ddfcbea3
      Committed by leipeng
      Summary:
      This PR utilizes padding space that would otherwise be wasted; see the sketch below.
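
      A sketch of the layout argument with a simplified stand-in struct; the member set and the 72-byte total assume a typical 64-bit (LP64) target:

      ```
      #include <cstddef>

      struct IterKeyLayout {   // simplified stand-in, not the real class
        const char* buf_;      // 8 bytes
        const char* key_;      // 8 bytes
        size_t key_size_;      // 8 bytes
        size_t buf_size_;      // 8 bytes
        char space_[39];       // was 32; with the bool below this reaches 72
        bool is_user_key_;     // 1 byte -> 32 + 39 + 1 = 72, a multiple of 8
      };
      // With space_[32], the 65 content bytes would be padded up to 72 anyway,
      // so the 7 extra inline bytes are free.
      static_assert(sizeof(IterKeyLayout) == 72, "inline buffer fills the padding");
      ```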
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10633
      
      Reviewed By: pdillinger
      
      Differential Revision: D46410978
      
      Pulled By: ajkr
      
      fbshipit-source-id: 23ec757b1eea9221c1390971e39d341c6b7f2003
  17. 06 June 2023, 3 commits
  18. 03 June 2023, 2 commits
    • Small improvements to DBGet microbenchmark (#11498) · 687a2a0d
      Committed by Andrew Kryczka
      Summary:
      Follow a couple of best practices:

      - Allowed Google Benchmark to decide the number of iterations. Previously we hardcoded a value, which circumvented the benchmark's heuristic of iterating until the result is stable.
      - Made each iteration do similar work. Previously, an iteration could do different work depending on whether the key was found in the first, second, third, or no L0 file. A sketch of both practices follows.
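
      A minimal sketch of both practices with Google Benchmark; the body is a placeholder, not the actual DBGet benchmark:

      ```
      #include <benchmark/benchmark.h>

      static void BM_DBGet(benchmark::State& state) {
        // Setup outside the timed loop: open the DB and choose keys so every
        // lookup traverses the same number of L0 files (uniform work).
        for (auto _ : state) {  // iteration count chosen by the framework
          int result = 0;       // placeholder for one db->Get(...) call
          benchmark::DoNotOptimize(result);
        }
      }
      BENCHMARK(BM_DBGet);  // note: no hardcoded ->Iterations(N)
      BENCHMARK_MAIN();
      ```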
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11498
      
      Test Plan: none as I am unable to prove it is better
      
      Reviewed By: hx235
      
      Differential Revision: D46339050
      
      Pulled By: ajkr
      
      fbshipit-source-id: fcfc6da4111c5b3ae86d79d908afc5f61f96675b
    • Some fixes to unreleased_history/ (#11504) · 7a9b264f
      Committed by Peter Dillinger
      Summary:
      * Add a "Performance Improvements" section
      * Add required copyright headers
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11504
      
      Test Plan: manual
      
      Reviewed By: hx235
      
      Differential Revision: D46405128
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 4f878dfd0170d381d3051a44c13479c860e812c0
  19. 02 June 2023, 3 commits
    • Log correct compaction score for Universal Compaction (#11487) · 71ca9a1d
      Committed by Changyu Bi
      Summary:
      Currently, 0 is incorrectly logged as the compaction score for L0 when num_levels > 1. This PR fixes the issue so that the correct score is logged.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11487
      
      Test Plan:
      ```
      ./db_bench --benchmarks=fillrandom --max_background_jobs=8 --num=1000000  --compaction_style=1 --stats_dump_period_sec=20 --num_levels=7 --write_buffer_size=1048576
      
      grep "L0   " /tmp/rocksdbtest-543376/dbbench/LOG
      
      before:
      ** Compaction Stats [default] **
      Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
      ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0      9.9      0.42              0.33         9    0.046       0      0       0.0       0.0
      L0      3/1    1.37 MB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.1      0.6      9.6      3.76              3.03        76    0.050     34K    140       0.0       0.0
      L0      2/0    2.26 MB   0.0      0.0     0.0      0.0       0.1      0.1       0.0   1.6      3.2      8.2     12.59             11.17       163    0.077    619K   5499       0.0       0.0
      
      after: compaction scores are non-zero
      L0      0/0    0.00 KB   0.8      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0      9.6      0.43              0.34         9    0.048       0      0       0.0       0.0
      L0      2/1   937.08 KB   1.0      0.0     0.0      0.0       0.0      0.0       0.0   1.1      0.6      9.3      3.85              3.07        75    0.051     34K    165       0.0       0.0
      L0      2/2    1.82 MB   1.0      0.0     0.0      0.0       0.1      0.1       0.0   1.6      3.0      8.0     12.45             10.99       160    0.078    577K   5399       0.0       0.0
      
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D46293993
      
      Pulled By: cbi42
      
      fbshipit-source-id: 19753f7df68c5f54a84c4ed52794f83e510c9721
    • `CompactRange()` always compacts to bottommost level for leveled compaction (#11468) · e95cc121
      Committed by Changyu Bi
      Summary:
      Currently for leveled compaction, the max output level of a call to `CompactRange()` is pre-computed before compacting each level. This max output level is the highest level whose key range overlaps with the manual compaction key range. However, during the manual compaction, files in the max output level may be compacted down further by some background compaction. When this background compaction is a trivial move, there is a race condition, and the manual compaction may not be able to compact all keys in the specified key range. This PR updates `CompactRange()` to always compact to the bottommost level to make this race condition more unlikely (it can still happen; see more in the comment here: https://github.com/cbi42/rocksdb/blob/796f58f42ad1bdbf49e5fcf480763f11583b790e/db/db_impl/db_impl_compaction_flush.cc#L1180C29-L1184).
      
      This PR also changes the behavior of CompactRange() when `bottommost_level_compaction=kIfHaveCompactionFilter` (the default option). The old behavior was that, if a compaction filter is provided, CompactRange() always does an intra-level compaction at the final output level for all files in the manual compaction key range; the only exception is when `first_overlapped_level = 0` and `max_overlapped_level = 0`. It's awkward to maintain the same behavior after this PR since we no longer compute max_overlapped_level. So the new behavior is similar to kForceOptimized: always do an intra-level compaction at the bottommost level, but exclude new files generated during this manual compaction (see the sketch below).
      
      Several unit tests are updated to work with this new manual compaction behavior.
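
      A minimal sketch of the call whose behavior changes here, using the public `CompactRangeOptions`:

      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      rocksdb::Status FullManualCompaction(rocksdb::DB* db) {
        rocksdb::CompactRangeOptions cro;
        // The default kIfHaveCompactionFilter now behaves like kForceOptimized:
        // intra-level compaction at the bottommost level, excluding files newly
        // generated by this manual compaction.
        cro.bottommost_level_compaction =
            rocksdb::BottommostLevelCompaction::kIfHaveCompactionFilter;
        // nullptr/nullptr compacts the whole key range, now always down to the
        // bottommost level under leveled compaction.
        return db->CompactRange(cro, nullptr, nullptr);
      }
      ```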
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11468
      
      Test Plan: Add new unit tests `DBCompactionTest.ManualCompactionCompactAllKeysInRange*`
      
      Reviewed By: ajkr
      
      Differential Revision: D46079619
      
      Pulled By: cbi42
      
      fbshipit-source-id: 19d844ba4ec8dc1a0b8af5d2f36ff15820c6e76f
    • Add support to strip / pad timestamp when creating / reading a block based table (#11495) · 9f7877f2
      Committed by Yu Zhang
      Summary:
      Add support to strip timestamp in block based table builder and pad timestamp in block based table reader.
      
      On the write path, use the per column family option `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` to indicate whether user-defined timestamps should be stripped for all block based tables created for the column family.
      
      On the read path, added a per-table `TableReadOption.user_defined_timestamps_persisted` to flag whether the user keys in the table contain user-defined timestamps.
      
      This patch mostly passes the related flags down to the block building/parsing level, with the exception of handling the `first_internal_key` in `IndexValue`, which is handled at the `IndexBuilder` level. The value part of range deletion entries should get similar handling; I haven't decided where this piece of logic best fits, so I will do it in a follow-up. A sketch of the write-path option follows.
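
      A minimal sketch of the write-path knob (the read-path flag lives in internal table-reader options and is not set directly by users); `BytewiseComparatorWithU64Ts()` is used here as an assumed timestamp-aware comparator:

      ```
      #include "rocksdb/comparator.h"
      #include "rocksdb/options.h"

      rocksdb::ColumnFamilyOptions UdtStrippingCfOptions() {
        rocksdb::ColumnFamilyOptions cf_opts;
        // Keys carry a uint64 user-defined timestamp in memory...
        cf_opts.comparator = rocksdb::BytewiseComparatorWithU64Ts();
        // ...but timestamps are stripped when SST files are built; the reader
        // pads a minimum timestamp back when the table is read.
        cf_opts.persist_user_defined_timestamps = false;
        return cf_opts;
      }
      ```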
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11495
      
      Test Plan:
      Existing test `BlockBasedTableReaderTest` is parameterized to run with:
      1) different UDT test modes: kNone, kNormal, kStripUserDefinedTimestamp
      2) all four index types, when index type is `kTwoLevelIndexSearch`, also enables partitioned filters
      3) parallel vs non-parallel compression
      4) enable/disable compression dictionary.
      
      Also added tests for API `BlockBasedTableReader::NewIterator`.
      
      `PartitionedFilterBlockTest` is parameterized to run with different UDT test modes:kNone, kNormal, kStripUserDefinedTimestamp.
      
      ```
      make all check
      ./block_based_table_reader_test
      ./partitioned_filter_block_test
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D46344577
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 93ac8542b19319d1298712b8bed908c8831ba675