提交 · 229297d1b83c1885e7db2573b9b44736a7be23a5 · kvdb / rocksdb

23 2月, 2023 1 次提交

Refactor AddRangeDels() + consider range tombstone during compaction file cutting (#11113) · 229297d1

由 Changyu Bi 提交于 2月 22, 2023

Summary:
A second attempt after https://github.com/facebook/rocksdb/issues/10802, with bug fixes and refactoring. This PR updates compaction logic to take range tombstones into account when determining whether to cut the current compaction output file (https://github.com/facebook/rocksdb/issues/4811). Before this change, only point keys were considered, and range tombstones could cause large compactions. For example, if the current compaction outputs is a range tombstone [a, b) and 2 point keys y, z, they would be added to the same file, and may overlap with too many files in the next level and cause a large compaction in the future. This PR also includes ajkr's effort to simplify the logic to add range tombstones to compaction output files in `AddRangeDels()` ([https://github.com/facebook/rocksdb/issues/11078](https://github.com/facebook/rocksdb/pull/11078#issuecomment-1386078861)).

The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new class `CompactionMergingIterator` is introduced to replace `MergingIterator` under `CompactionIterator` to enable emitting of range tombstone start keys. Further improvement after this PR include cutting compaction output at some grandparent boundary key (instead of the next output key) when cutting within a range tombstone to reduce overlap with grandparents.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11113

Test Plan:
* added unit test in db_range_del_test
* crash test with a small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10`

Reviewed By: ajkr

Differential Revision: D42655709

Pulled By: cbi42

fbshipit-source-id: 8367e36ef5640e8f21c14a3855d4a8d6e360a34c

229297d1

22 2月, 2023 1 次提交

Update HISTORY.md and version.h for 8.0 release (#11238) · 28608045

由 Andrew Kryczka 提交于 2月 21, 2023

Summary:
8.0.fb branch is cut so changes going forward will be part of 8.1. Updated version.h and HISTORY.md accordingly

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11238

Reviewed By: cbi42

Differential Revision: D43428345

Pulled By: ajkr

fbshipit-source-id: d344b6e504c81a85563ae9d3705b11c533b1cd43

28608045

18 2月, 2023 2 次提交

Remove FactoryFunc from LoadXXXObject (#11203) · b6640c31

由 mrambacher 提交于 2月 17, 2023

Summary:
The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available. With the removal of LITE mode, the function was no longer required.

Note that the MergeOperator had some private classes defined in header files. To gain access to their constructors (and name methods), the class definitions were moved into header files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203

Reviewed By: cbi42

Differential Revision: D43160255

Pulled By: pdillinger

fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c

b6640c31

Merge operator failed subcode (#11231) · 25e13652

由 Andrew Kryczka 提交于 2月 17, 2023

Summary:
From HISTORY.md: Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions.

This is a followup to https://github.com/facebook/rocksdb/issues/11092, where we gave users the ability to keep running a DB despite merge operator failing. Now that the DB keeps running despite such failures, they want to be able to distinguish such failures from real corruptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11231

Test Plan: updated unit test

Reviewed By: akankshamahajan15

Differential Revision: D43396607

Pulled By: ajkr

fbshipit-source-id: 17fbcc779ad724dafada8abd73efd38e1c5208b9

25e13652

17 2月, 2023 2 次提交

Return NotSupported in scan if IOUring not supported and enable IOUring in... · 68e4581c

由 akankshamahajan 提交于 2月 16, 2023

Return NotSupported in scan if IOUring not supported and enable IOUring in db_stress for async_io testing (#11197)

Summary:
- Return NotSupported in scan if IOUring not supported if async_io is enabled
- Enable IOUring in db_stress for async_io testing
- Disable async_io in circleci crash testing as circleci doesn't support IOUring

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11197

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D43096313

Pulled By: akankshamahajan15

fbshipit-source-id: c2c53a87636950c0243038b9f5bd0d91608e4fda

68e4581c

Customize CompressedSecondaryCache by block kind (#11204) · 64a1f767

由 Peter Dillinger 提交于 2月 16, 2023

Summary:
Added `do_not_compress_roles` to `CompressedSecondaryCacheOptions` to disable compression on certain kinds of block. Filter blocks are now not compressed by CompressedSecondaryCache by default.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11204

Test Plan: unit test added

Reviewed By: anand1976

Differential Revision: D43147698

Pulled By: pdillinger

fbshipit-source-id: db496975ae975fa18f157f93fe131a16315ac875

64a1f767

16 2月, 2023 1 次提交

Mention the new MultiGetEntity API in HISTORY.md (#11226) · 94ec4338

由 Levi Tamasi 提交于 2月 15, 2023

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11226

Reviewed By: akankshamahajan15

Differential Revision: D43317602

Pulled By: ltamasi

fbshipit-source-id: 4b7e063848d3cfbdb9f0c0f54d68aeab8a82595c

94ec4338

10 2月, 2023 1 次提交

Put Cache and CacheWrapper in new public header (#11192) · 3cacd4b4

由 Peter Dillinger 提交于 2月 09, 2023

Summary:
The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment.

In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544

Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it.

Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192

Test Plan: `make check`

Reviewed By: anand1976

Differential Revision: D43055487

Pulled By: pdillinger

fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110

3cacd4b4

09 2月, 2023 1 次提交

Fix bug in WAL streaming uncompression (#11198) · 77b61abc

由 anand76 提交于 2月 08, 2023

Summary:
Fix a bug in the calculation of the input buffer address/offset in log_reader.cc. The bug is when consecutive fragments of a compressed record are located at the same offset in the log reader buffer, the second fragment input buffer is treated as a leftover from the previous input buffer. As a result, the offset in the `ZSTD_inBuffer` is not reset.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11198

Test Plan: Add a unit test in log_test.cc that fails without the fix and passes with it.

Reviewed By: ajkr, cbi42

Differential Revision: D43102692

Pulled By: anand1976

fbshipit-source-id: aa2648f4802c33991b76a3233c5a58d4cc9e77fd

77b61abc

08 2月, 2023 2 次提交

Add compaction filter support for wide-column entities (#11196) · 876d2815

由 Levi Tamasi 提交于 2月 07, 2023

Summary:
The patch adds compaction filter support for wide-column entities by introducing
a new `CompactionFilter` API called `FilterV3`. This API is called for regular
key-values, merge operands, and wide-column entities as well. It is passed the
existing value/operand or wide-column structure and it can update the value or
columns or keep/delete/etc. the key-value as usual. For compatibility, the default
implementation of `FilterV3` keeps all wide-column entities and falls back to calling
`FilterV2` for plain old key-values and merge operands.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11196

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D43094147

Pulled By: ltamasi

fbshipit-source-id: 75acabe9a35254f7f404ba6173ee9c2774382ebd

876d2815

Remove a couple deprecated convenience.h APIs (#11120) · 6650ca24

由 Hui Xiao 提交于 2月 07, 2023

Summary:
**Context/Summary:**
As instructed by convenience.h comments, a few deprecated APIs are removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11120

Test Plan:
- make check & CI
- eyeball check on test semantics.

Reviewed By: pdillinger

Differential Revision: D42937507

Pulled By: hx235

fbshipit-source-id: a9e4709387da01b1d0e9148c2e210f02e9746ee1

6650ca24

04 2月, 2023 2 次提交

Use LIB_MODE=shared build by default with make (#11168) · cf756ed9

由 Peter Dillinger 提交于 2月 03, 2023

Summary:
With https://github.com/facebook/rocksdb/issues/11150 this becomes a practical change that I think is overall good for developer efficiency.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11168

Test Plan:
More efficient build of all unit tests and tools:

```
$ git clean -fdx
$ du -sh .
522M    .
$ /usr/bin/time make -j32 LIB_MODE=static
...
14270.63user 1043.33system 11:19.85elapsed 2252%CPU (0avgtext+0avgdata 1929944maxresident)k
...
$ du -sh .
62G     .
$
```
Vs.
```
$ git clean -fdx
$ du -sh .
522M    .
$ /usr/bin/time make -j32 LIB_MODE=shared
...
9479.87user 478.26system 7:20.82elapsed 2258%CPU (0avgtext+0avgdata 1929272maxresident)k
...
$ du -sh .
5.4G    .
$
```

So 1/3 less build time and >90% less space usage.

Individual unit test edit-compile-run is not too different. Modifying an average unit test source file:
```
$ touch db/version_builder_test.cc
$ /usr/bin/time make -j32 LIB_MODE=static version_builder_test
...
34.74user 3.37system 0:38.29elapsed 99%CPU (0avgtext+0avgdata 945520maxresident)k
```
Vs.
```
$ touch db/version_builder_test.cc
$ /usr/bin/time make -j32 LIB_MODE=shared version_builder_test
...
116.26user 43.91system 0:28.65elapsed 559%CPU (0avgtext+0avgdata 675160maxresident)k
```
A little faster with shared.

However, modifying an average DB implementation file has an extra linking step with shared lib:
```
$ touch db/db_impl/db_impl_files.cc
$ /usr/bin/time make -j32 LIB_MODE=static version_builder_test
...
33.17user 5.13system 0:39.70elapsed 96%CPU (0avgtext+0avgdata 945544maxresident)k
```
Vs.
```
$ touch db/db_impl/db_impl_files.cc
$ /usr/bin/time make -j32 LIB_MODE=shared version_builder_test
...
40.80user 4.66system 0:45.54elapsed 99%CPU (0avgtext+0avgdata 1056340maxresident)k
```
A little slower with shared.

On the whole, should be faster and lighter weight because of the many unit test files case

Reviewed By: cbi42

Differential Revision: D42894004

Pulled By: pdillinger

fbshipit-source-id: 9e827e52ace79b86f849b6a24466e318b4b605a7

cf756ed9

Deprecate write_global_seqno and default to false (#11179) · 0cf1008f

由 Peter Dillinger 提交于 2月 03, 2023

Summary:
This option has long been intended to be set to false by default and deprecated. It might never be practical to completely remove the feature, so that we can continue to test for backward compatibility by keeping the ability to generate DBs in the old way.

Also improved API comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11179

Test Plan: existing tests (with one tiny update)

Reviewed By: hx235

Differential Revision: D42973927

Pulled By: pdillinger

fbshipit-source-id: e9bc161cb933266e094aea2dff8cc03753c39dab

0cf1008f

03 2月, 2023 1 次提交

Return any errors returned by ReadAsync to the MultiGet caller (#11171) · 63da9cfa

由 anand76 提交于 2月 02, 2023

Summary:
Currently, we incorrectly return a Status::Corruption to the MultiGet caller if the file system ReadAsync cannot issue a read and returns an error for some reason, such as IOStatus::NotSupported(). In this PR, we copy the ReadAsync error to the request status so it can be returned to the user.

Tests:
Update existing unit tests and add a new one for this scenario

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11171

Reviewed By: akankshamahajan15

Differential Revision: D42950057

Pulled By: anand1976

fbshipit-source-id: 85ffcb015fa6c064c311f8a28488fec78c487869

63da9cfa

02 2月, 2023 1 次提交

Update HISTORY.md for #11136 (#11177) · 6af16ac7

由 Andrew Kryczka 提交于 2月 02, 2023

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11177

Reviewed By: cbi42

Differential Revision: D42948946

Pulled By: ajkr

fbshipit-source-id: 783d3d9007faaa036923a0364cdd0bfbd8e78062

6af16ac7

01 2月, 2023 2 次提交

Allow canceling manual compaction while waiting for conflicting compaction (#11165) · 071c3384

由 Andrew Kryczka 提交于 1月 31, 2023

Summary:
This PR adds logic to the `RunManualCompaction()` loop to check for cancellation before waiting on any conflicting compactions to finish. In case of cancellation, `RunManualCompaction()` no longer waits on conflicting compactions

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11165

Test Plan: repro test case

Reviewed By: cbi42

Differential Revision: D42864058

Pulled By: ajkr

fbshipit-source-id: ea4dd1a8f294abe212905495a8fbe8f07fca3f5a

071c3384

Fix a bug where GetEntity would expose a blob reference (#11162) · a82021c3

由 Levi Tamasi 提交于 1月 31, 2023

Summary:
The patch fixes a feature interaction bug between BlobDB and the `GetEntity` API:
without the patch, `GetEntity` would return the blob reference (wrapped into a
single-column entity) instead of the actual blob value.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11162

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D42854092

Pulled By: ltamasi

fbshipit-source-id: f750d0ff57def107da16f545077ddce9860ff21a

a82021c3

31 1月, 2023 2 次提交

Cleanup, improve, stress test LockWAL() (#11143) · 94e3beec

由 Peter Dillinger 提交于 1月 30, 2023

Summary:
The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change:

* Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use.
* Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions)
* Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability)
* Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other.
* Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads.
* Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143

Test Plan: unit tests added / updated, added to stress/crash test

Reviewed By: ajkr

Differential Revision: D42848627

Pulled By: pdillinger

fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4

94e3beec

Use user key on sst file for blob verification for Get and MultiGet (#11105) · 24ac53d8

由 Yu Zhang 提交于 1月 30, 2023

Summary:
Use the user key on sst file for blob verification for `Get` and `MultiGet` instead of the user key passed from caller.

Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105

Test Plan:
make V=1 db_blob_basic_test
./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*"

Reviewed By: ltamasi

Differential Revision: D42716487

Pulled By: jowlyzhang

fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a

24ac53d8

28 1月, 2023 2 次提交

Remove RocksDB LITE (#11147) · 4720ba43

由 sdong 提交于 1月 27, 2023

Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.

Most of changes were done through following comments:

unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2

4720ba43

Remove deprecated util functions in options_util.h (#11126) · 6943ff6e

由 Yu Zhang 提交于 1月 27, 2023

Summary:
Remove the util functions in options_util.h that have previously been marked deprecated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11126

Test Plan: `make check`

Reviewed By: ltamasi

Differential Revision: D42757496

Pulled By: jowlyzhang

fbshipit-source-id: 2a138a3c207d0e0e0bbb4d99548cf2cadb44bcfb

6943ff6e

27 1月, 2023 3 次提交

Remove deprecated FileSystem::Load() (#11122) · c94c8fcb

由 Changyu Bi 提交于 1月 26, 2023

Summary:
user should use FileSystem::CreateFromString() instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11122

Reviewed By: ajkr

Differential Revision: D42727580

Pulled By: cbi42

fbshipit-source-id: c68b17bb82ba9dee46ba23b677d87ecf0a1e06c8

c94c8fcb

use verify checksum option in block based table reader Open() (#11099) · a1e92bd9

由 Karim TAAM 提交于 1月 26, 2023

Summary:
## Description
In this issue https://github.com/facebook/rocksdb/issues/11002 we found that when we use rocksdb with the `verify checksum` read_option to false the verification is done anyway

By analyzing the code along the stacktrace I saw that at the level of https://github.com/facebook/rocksdb/compare/main...matkt:feature/use-verify-checksum-in-block-based-table-reader?expand=1#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129f we are not keeping all the options and we forget the `verify_checksum`

the comment in this class suggests that it should be managed https://github.com/facebook/rocksdb/compare/main...matkt:feature/use-verify-checksum-in-block-based-table-reader?expand=1#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fL581

<img width="1724" alt="204511641-86ab4b9b-45e5-4a2b-a13d-81fa26435d38" src="https://user-images.githubusercontent.com/26581503/213152802-c46bc1c7-a3a2-4a6f-9bb1-bf92ee93af7a.png">

this PR just adds the line to manage the `verify checksum`

## Tests

- Running unit tests
- Test without setting `verify checksum` and verifying that we are calling the checksum code
- Test by setting `verify checksum` to true and verifying that we are calling the checksum code
- Test by setting `verify checksum` to false and verifying that we are **not** calling the checksum code

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11099

Reviewed By: cbi42

Differential Revision: D42679881

Pulled By: ajkr

fbshipit-source-id: c7dd10768282fd0699f7e1bf397ceb7adbea4ab6

a1e92bd9

Fix the HISTORY.md entry related to the removed statistics (#11140) · a6cfdd4e

由 Levi Tamasi 提交于 1月 26, 2023

Summary:
Some histograms were incorrectly categorized as tickers.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11140

Reviewed By: anand1976

Differential Revision: D42780030

Pulled By: ltamasi

fbshipit-source-id: 5aca8ec5baad8f73676aaa9d6cdbbd2a619c8a89

a6cfdd4e

26 1月, 2023 2 次提交

Remove more obsolete statistics (#11131) · 6da2e20d

由 Levi Tamasi 提交于 1月 25, 2023

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11131

Test Plan: `make check`

Reviewed By: pdillinger

Differential Revision: D42753997

Pulled By: ltamasi

fbshipit-source-id: ce8b84c1e55374257e93ed74fd255c9b759723ce

6da2e20d

Remove deprecated Env::LoadEnv() (#11121) · 9afa0f05

由 Peter Dillinger 提交于 1月 25, 2023

Summary:
Can use Env::CreateFromString() instead

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11121

Test Plan: unit tests updated

Reviewed By: cbi42

Differential Revision: D42723813

Pulled By: pdillinger

fbshipit-source-id: 5d4b5b10225dfdaf662f5f8049ee965a05d3edc9

9afa0f05

25 1月, 2023 3 次提交

Remove some deprecated/obsolete statistics from the API (#11123) · 99e55953

由 Levi Tamasi 提交于 1月 24, 2023

Summary:
These tickers/histograms have been obsolete (and not populated) for a long time.
The patch removes them from the API completely. Note that this means that the
numeric values of the remaining tickers change in the C++ code as they get shifted up.
This should be OK: the values of some existing tickers have changed many times
over the years as items have been added in the middle. (In contrast, the convention
in the Java bindings is to keep the ids, which are not guaranteed to be the same
as the ids on the C++ side, the same across releases.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11123

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D42727793

Pulled By: ltamasi

fbshipit-source-id: e058a155a20b05b45f53e67ee380aece1b43b6c5

99e55953

Remove compressed block cache (#11117) · 2800aa06

由 sdong 提交于 1月 24, 2023

Summary:
Compressed block cache is replaced by compressed secondary cache. Remove the feature.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117

Test Plan: See CI passes

Reviewed By: pdillinger

Differential Revision: D42700164

fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d

2800aa06

Fix data race on `ColumnFamilyData::flush_reason` by letting FlushRequest/Job... · 86fa2592

由 Hui Xiao 提交于 1月 24, 2023

Fix data race on `ColumnFamilyData::flush_reason` by letting FlushRequest/Job owns flush_reason instead of CFD (#11111)

Summary:
**Context:**
Concurrent flushes on the same CF can set on `ColumnFamilyData::flush_reason` before each other flush finishes. An symptom is one CF has different flush_reason with others though all of them are in an atomic flush  `db_stress: db/db_impl/db_impl_compaction_flush.cc:423: rocksdb::Status rocksdb::DBImpl::AtomicFlushMemTablesToOutputFiles(const rocksdb::autovector<rocksdb::DBImpl::BGFlushArg>&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority): Assertion cfd->GetFlushReason() == cfds[0]->GetFlushReason() failed. `

**Summary:**
Suggested by ltamasi, we now refactor and let FlushRequest/Job to own flush_reason as there is no good way to define `ColumnFamilyData::flush_reason` in face of concurrent flushes on the same CF (which wasn't the case a long time ago when `ColumnFamilyData::flush_reason ` first introduced`)

**Tets:**
- new unit test
- make check
- aggressive crash test rehearsal

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11111

Reviewed By: ajkr

Differential Revision: D42644600

Pulled By: hx235

fbshipit-source-id: 8589c8184869d3415e5b780c887f877818a5ebaf

86fa2592

24 1月, 2023 1 次提交

Update HISTORY.md/version.h/format compatiblity test for 7.10 release (#11114) · 7e754847

由 Hui Xiao 提交于 1月 23, 2023

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11114

Reviewed By: ajkr

Differential Revision: D42685234

Pulled By: hx235

fbshipit-source-id: 79908a66ab9052a2552f080049065462ebf2f94c

7e754847

21 1月, 2023 1 次提交

Add API to limit blast radius of merge operator failure (#11092) · b7fbcefd

由 Andrew Kryczka 提交于 1月 20, 2023

Summary:
Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage.

To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092

Reviewed By: siying

Differential Revision: D42525673

Pulled By: ajkr

fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee

b7fbcefd

20 1月, 2023 1 次提交

Upgrade xxhash.h to latest dev (#11098) · fd911f96

由 Peter Dillinger 提交于 1月 19, 2023

Summary:
Upgrading xxhash.h to latest dev version as of 1/17/2023, which is d7197ddea81364a539051f116ca77926100fc77f This should improve performance on some ARM machines.

I allowed some of our RocksDB-specific changes to be made obsolete where it seemed appropriate, for example
* xxhash.h has its own fallthrough marker (which I hope works for us)
* As in https://github.com/Cyan4973/xxHash/pull/549

Merging and resolving conflicts one way or the other was all that went into this diff. Except I had to mix the two sides around `defined(__loongarch64)`

How I did the upgrade (for future reference), so that I could use usual merge conflict resolution:
```
# New branch to help with merging
git checkout -b xxh_merge_base
# Check out RocksDB revision before last xxhash.h upgrade
git reset --hard 22161b75^
# Create a commit with the raw base version from xxHash repo (from xxHash repo)
git show 2c611a76f914828bed675f0f342d6c4199ffee1e:xxhash.h > ../rocksdb/util/xxhash.h
# In RocksDB repo
git commit -a
# Merge in the last xxhash.h upgrade
git merge 22161b75
# Resolve conflict using committed version
git show 22161b75:util/xxhash.h > util/xxhash.h
git commit -a
# Catch up to upstream
git merge upstream/main

# Create a different branch for applying raw upgrade
git checkout -b xxh_upgrade_2023
# Find the RocksDB commit we made for the raw base version from xxHash
git log main..HEAD
# Rewind to it
git reset --hard 2428b727
# Copy in latest raw version (from xxHash repo)
cat xxhash.h > ../rocksdb/util/xxhash.h
# Merge in RocksDB changes, use typical tools for conflict resolution
git merge xxh_merge_base
```

Branch https://github.com/facebook/rocksdb/tree/xxhash_merge_base can be used as a base for future xxhash merges.

Fixes https://github.com/facebook/rocksdb/issues/11073

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11098

Test Plan:
existing tests (e.g. Bloom filter schema stability tests)

Also seems to include a small performance boost on my Intel dev machine, using `./db_bench --benchmarks=xxh3[-X50] 2>&1 | egrep -o 'operations;.*' | sort`

Fastest out of 50 runs, before: 15477.3 MB/s
Fastest out of 50 runs, after: 15850.7 MB/s, and 11 more runs faster than the "before" number

Slowest out of 50 runs, before: 12267.5 MB/s
Slowest out of 50 runs, after: 13897.1 MB/s

More repetitions show the distinction is repeatable

Reviewed By: hx235

Differential Revision: D42560010

Pulled By: pdillinger

fbshipit-source-id: c43ee52f1c5fe0ba3d6d6e4eebb22ded5f5492ea

fd911f96

18 1月, 2023 1 次提交

Consider TTL compaction file cutting earlier to prevent small output file (#11075) · 4d0f9a99

由 Changyu Bi 提交于 1月 17, 2023

Summary:
in `CompactionOutputs::ShouldStopBefore()`, TTL-related states, `cur_files_to_cut_for_ttl_` and `next_files_to_cut_for_ttl_`, are not updated if the function returns early. This can cause unnecessary compaction output file cuttings and hence produce smaller output files, which may hurt write amp. See the example in the unit test for how this "unnecessary file cutting" can happen. This PR fixes this issue by moving the code for updating TTL states earlier in `CompactionOutputs::ShouldStopBefore()` so that the states are updated for each key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11075

Test Plan: - Added new unit test.

Reviewed By: hx235

Differential Revision: D42398739

Pulled By: cbi42

fbshipit-source-id: 09fab66679c1a734abcfc31bcea33dd9aeb9dbc7

4d0f9a99

14 1月, 2023 1 次提交

Revert #10802 Consider range tombstone in compaction output file cutting (#11089) · f515d9d2

由 Changyu Bi 提交于 1月 13, 2023

Summary:
This reverts commit f02c708a since it introduced several bugs (see https://github.com/facebook/rocksdb/issues/11078 and https://github.com/facebook/rocksdb/issues/11067 for attempts to fix them) and that I do not have a high confidence to fix all of them and ensure no further ones before the next release branch cut. There are also come existing issue found during bug fixing. We will work on it and try to merge it to the release after.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11089

Test Plan: existing CI.

Reviewed By: ajkr

Differential Revision: D42505972

Pulled By: cbi42

fbshipit-source-id: 2f66dcde6b85dc94977b317c2ce513872cfbc153

f515d9d2

12 1月, 2023 1 次提交

Major Cache refactoring, CPU efficiency improvement (#10975) · 9f7801c5

由 Peter Dillinger 提交于 1月 11, 2023

Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).

The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.

* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)

## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
* Simpler for implementations to deal with just one Insert and one Lookup.
* Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
* Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
* Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
* It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
* I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.

## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).

The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.

These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)

Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.

## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.

## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.

The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.

## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)

## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975

Test Plan:
tests updated

Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):

34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83

Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.

Reviewed By: anand1976

Differential Revision: D42417818

Pulled By: pdillinger

fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432

9f7801c5

30 12月, 2022 2 次提交

Add missing range conflict check between file ingestion and RefitLevel() (#10988) · 9502856e

由 Hui Xiao 提交于 12月 29, 2022

Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.

RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.

Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](https://github.com/facebook/rocksdb/blame/c62f3221698fd273b673d4f7e54eabb8329a4369/db/db_iter.cc#L342-L343) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.

Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.

**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them.  File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988

Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761

Reviewed By: cbi42

Differential Revision: D41535685

Pulled By: hx235

fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258

9502856e

Add BackupEngine feature to exclude files (#11030) · 02f2b208

由 Peter Dillinger 提交于 12月 29, 2022

Summary:
We have a request for RocksDB to essentially support
disconnected incremental backup. In other words, if there is limited
or no connectivity to the primary backup dir, we should still be able to
take an incremental backup relative to that primary backup dir,
assuming some metadata about that primary dir is available (and
obviously anticipating primary backup dir will be fully available if
restore is needed).

To support that, this feature allows the API user to "exclude" DB
files from backup. This only applies to files that can be shared
between backups (sst and blob files), and excluded files are
tracked in the backup metadata sufficiently to ensure they are
restored at restore time. At restore time, the user provides
a set of alternate backup directories (as open BackupEngines, which
can be read-only), and excluded files must be found in one of the
backup directories ("included" in some backup).

This feature depends on backup schema version 2 features, though
schema version 2.0 support is not sufficient to read / restore a
backup with exclusions. This change updates the schema version to
2.1 because of this feature, so that it's easy to recognize whether
a RocksDB release supports this feature, while backups not using the
feature are fully compatible with 2.0.

Also in this PR:
* Stacked on https://github.com/facebook/rocksdb/pull/11029
* Allow progress_callback to be empty, not just no-op function, and
recover from exceptions thrown by BackupEngine callbacks.
* The internal-only `AsBackupEngine()` function is working around the
diamond hierarchy of `BackupEngineImplThreadSafe` to get to the
internals, without using confusing features like virtual inheritance.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11030

Test Plan: unit tests added / updated

Reviewed By: ajkr

Differential Revision: D42004388

Pulled By: pdillinger

fbshipit-source-id: 31b6e533d308a5462e528d9012d650482d974077

02f2b208

22 12月, 2022 3 次提交

Avoid mixing sync and async prefetch (#11050) · bec42648

由 anand76 提交于 12月 21, 2022

Summary:
Reading uncompression dict block always uses sync reads, while data blocks may use async reads and prefetching. This causes problems in FilePrefetchBuffer. So avoid mixing the two by reading the uncompression dict straight from the file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11050

Test Plan: Crash test

Reviewed By: akankshamahajan15

Differential Revision: D42194682

Pulled By: anand1976

fbshipit-source-id: aaa8b396fdfe966b157e210f5ef8501c45b7b69e

bec42648

Make CompactRange() more aware of SstPartitionerFactory (#11032) · e6b6e741

由 Peter Dillinger 提交于 12月 21, 2022

Summary:
Some users are at least considering using SstPartitioner to support efficient physical migration of specific key ranges between RocksDB instances. One might expect manual `CompactRange()` over a narrow key range across some partition to enforce partitioning of any SST files crossing that partition boundary, but that currently only works if there are keys within that range.

This change makes the overlap logic in CompactRange more aware of the partitioner to automatically select relevant files crossing a partition boundary, even when they otherwise would not be selected due to the compaction range falling in a gap between entries.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11032

Test Plan: unit test included

Reviewed By: hx235

Differential Revision: D41981380

Pulled By: pdillinger

fbshipit-source-id: 2fe445bdddc73c00276c20f295cc1fa33d15b05a

e6b6e741

Fix async prefetch heap use after free (#11049) · dbf37c29

由 anand76 提交于 12月 21, 2022

Summary:
This PR fixes a heap use after free bug in the async prefetch code that happens in the following scenario -
1. Scan thread starts 2 async reads for Seek, one for the seek block and one for prefetching
2. Before the first read in https://github.com/facebook/rocksdb/issues/1 completes, another thread reads and loads the block in cache
3. The first scan thread finds the block in cache, continues and the next block cache miss is for a block that spans the boundary of the 2 prefetch buffers, and the 1st read is complete but the 2nd one is not complete yet
4. The scan thread will reallocate (i.e free the old buffer and allocate a new one) the 2nd prefetch buffer, and the in-progress prefetch is orphaned
5. The orphaned prefetch finally completes, resulting in a use after free

Also add a few asserts to surface bugs earlier in the crash tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11049

Test Plan: Repro with db_stress and verify the fix

Reviewed By: akankshamahajan15

Differential Revision: D42181118

Pulled By: anand1976

fbshipit-source-id: 1ac55d2f64a89ce128c1c574262b8aa7d82eb8cc

dbf37c29

kvdb / rocksdb 12 个月 前同步成功

kvdb / rocksdb
12 个月前同步成功