提交 · 226ee25d300c25c6ba4ec4050a69b8d48a6d21f3 · kvdb / rocksdb

16 4月, 2023 2 次提交

Block fetch CPU time counters in perf context (#11342) · 226ee25d

由 Murali Vilayannur 提交于 4月 15, 2023

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11342

Reviewed By: ajkr

Differential Revision: D45026838

Pulled By: mnv104

fbshipit-source-id: 099ed9579922b8fa6e7d3332bbb829d50ec47d91

226ee25d

Fix the wrong calculation of largest_key in import_column_family_job (#11381) · 4d72f48e

由 mayue.fight 提交于 4月 15, 2023

Summary:
When calculating the largest_key in ImportColumnFamilyJob::GetIngestedFileInfo, only the first element of range_del_iter is calculated. If range_del_iter has multiple elements, the largest_key will be wrong

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11381

Reviewed By: cbi42

Differential Revision: D44981450

Pulled By: ajkr

fbshipit-source-id: 584bc7da86295568a96984d2951644f289e578c7

4d72f48e

15 4月, 2023 2 次提交

Try to pick more files in `LevelCompactionBuilder::TryExtendNonL0TrivialMove()` (#11347) · ba16e8ee

由 Changyu Bi 提交于 4月 14, 2023

Summary:
Before this PR, in `LevelCompactionBuilder::TryExtendNonL0TrivialMove(index)`, we start from a file at index and expand the compaction input towards right to find files to trivial move. This PR adds the logic to also expand towards left.

Another major change made in this PR is to not expand L0 files through `TryExtendNonL0TrivialMove()`. This happens currently when compacting L0 files to an empty output level. The condition for expanding files in `TryExtendNonL0TrivialMove()` is to check atomic boundary, which does not take into account that L0 files can overlap in key range and are not sorted in key order. So it may include more L0 files than needed and disallow a trivial move. This change is included in this PR so that we don't make it worse by always expanding L0 in both direction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11347

Test Plan:
* new unit test
* Benchmark does not show obvious improvement or regression:
```
Write sequentially
./db_bench --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216

Main:
fillseq      :       4.726 micros/op 211592 ops/sec 472.607 seconds 100000000 operations;   23.4 MB/s
This PR:
fillseq      :       4.755 micros/op 210289 ops/sec 475.534 seconds 100000000 operations;   23.3 MB/s

Write randomly
./db_bench --benchmarks=fillrandom --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216

Main:
fillrandom   :      16.351 micros/op 61159 ops/sec 1635.066 seconds 100000000 operations;    6.8 MB/s
This PR:
fillrandom   :      15.798 micros/op 63298 ops/sec 1579.817 seconds 100000000 operations;    7.0 MB/s
```

Reviewed By: ajkr

Differential Revision: D44645650

Pulled By: cbi42

fbshipit-source-id: 8631f3a6b3f01decbbf18c34f2b62833cb4f9733

ba16e8ee

Fix serval bugs in ImportColumnFamilyTest (#11372) · 9500d90d

由 mayue.fight 提交于 4月 14, 2023

Summary:
**Context/Summary:**
ASSERT_EQ will only verify the code of Status, but will not check the state message of Status.

- Assert by checking Status state in `ImportColumnFamilyTest`
- Forgot to set db_comparator_name when creating ExportImportFilesMetaData in `ImportColumnFamilyNegativeTest`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11372

Reviewed By: ajkr

Differential Revision: D45004343

Pulled By: cbi42

fbshipit-source-id: a13d45521df17ead3d6d4c1c1fe1e4c95397ce8b

9500d90d

13 4月, 2023 1 次提交

util/ribbon_test.cc: avoid ambiguous reversed operator error in c++20 (#11371) · 6b67b561

由 Jeff Palm 提交于 4月 12, 2023

Summary:
util/ribbon_test.cc: avoid ambiguous reversed operator error in c++20 (and enable checking for the error)

Code would produce errors like this, when compiled with -Wambiguous-reversed-operator under c++20.
```
util/ribbon_test.cc:695:20: error: ISO C++20 considers use of overloaded operator '!=' (with operand types 'KeyGen' (aka '(anonymous namespace)::StandardKeyGen') and 'KeyGen') to be ambiguou
s despite there being a unique best viable function with non-reversed arguments [-Werror,-Wambiguous-reversed-operator]
        while (cur != batch_end) {
               ~~~ ^  ~~~~~~~~~
util/ribbon_test.cc:111:8: note: candidate function with non-reversed arguments
  bool operator!=(const StandardKeyGen& other) {
       ^
util/ribbon_test.cc:107:8: note: ambiguous candidate function with reversed arguments
  bool operator==(const StandardKeyGen& other) {
       ^
```

This will become a hard error in future standards.

Confirmed that no errors were generated when building using clang and c++20:
```
USE_CLANG=1 USE_COROUTINES=1 make
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11371

Reviewed By: meyering

Differential Revision: D44921027

Pulled By: cbi42

fbshipit-source-id: ef25b78260920a4d75a718310688d3a2487ffa87

6b67b561

12 4月, 2023 1 次提交

Initial add UDT in memtable only option (#11362) · 647cd736

由 Yu Zhang 提交于 4月 11, 2023

Summary:
This option is immutable through the life time of the DB open. For now, updating its value between different DB open sessions is also a non compatible change. When I work on support for updating comparator, the type of updates accepted for this option will be supported then.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11362

Test Plan: `make check`

Reviewed By: ltamasi

Differential Revision: D44873870

Pulled By: jowlyzhang

fbshipit-source-id: aa02094754b58d99abf9af4c9a8108c1350254cb

647cd736

11 4月, 2023 2 次提交

fix optimization-disabled test builds with platform010 (#11361) · 760b773f

由 Andrew Kryczka 提交于 4月 10, 2023

Summary:
Fixed the following failure:

```
third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc: In function ‘bool testing::internal::StackGrowsDown()’:
third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:8681:24: error: ‘dummy’ may be used uninitialized [-Werror=maybe-uninitialized]
 8681 |   StackLowerThanAddress(&dummy, &result);
      |   ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:8671:13: note: by argument 1 of type ‘const void*’ to ‘void testing::internal::StackLowerThanAddress(const void*, bool*)’ declared here
 8671 | static void StackLowerThanAddress(const void* ptr, bool* result) {
      |             ^~~~~~~~~~~~~~~~~~~~~
third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:8679:7: note: ‘dummy’ declared here
 8679 |   int dummy;
      |       ^~~~~
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11361

Reviewed By: cbi42

Differential Revision: D44838033

Pulled By: ajkr

fbshipit-source-id: 27d68b5a24a15723bbaaa7de45ccd70a60fe259e

760b773f

C-API: Constify cache functions where possible (#11243) · d5a9c0c9

由 Niklas Fiekas 提交于 4月 10, 2023

Summary:
Makes it easier to use generated Rust bindings. Constness of these is already part of the C++ API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11243

Reviewed By: hx235

Differential Revision: D44840394

Pulled By: ajkr

fbshipit-source-id: bcd1aeb8c959c304148d25b00043bb8c4cd3e0a4

d5a9c0c9

08 4月, 2023 7 次提交

fix bad implementation of ShardedCache::GetOccupancyCount (#11325) · c8552d8c

由 Zdenek Korcak 提交于 4月 07, 2023

Summary:
copy paste typo

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11325

Reviewed By: hx235

Differential Revision: D44378512

Pulled By: ajkr

fbshipit-source-id: 509ed2697c06eed975914359ece0459a0ea40312

c8552d8c

Add PaxosStore to USERS (#11357) · d30bb3d1

由 nccx 提交于 4月 07, 2023

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11357

Reviewed By: hx235

Differential Revision: D44774454

Pulled By: ajkr

fbshipit-source-id: f3912316b6cd4e0b41310590c93f914f1d943044

d30bb3d1

Makefile: fix a typo: PLATFORM_CFLAGS to PLATFORM_CCFLAGS (#11348) · b2c4bc5f

由 leipeng 提交于 4月 07, 2023

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11348

Reviewed By: hx235

Differential Revision: D44774863

Pulled By: ajkr

fbshipit-source-id: ba4bd959650228a71fca6bf62840ae9d7373d6f0

b2c4bc5f

Remove deprecated integration tests from README.md (#11354) · 140dd93b

由 nccx 提交于 4月 07, 2023

Summary:
The CI systems other than CircleCI are almost always in a failing state. Since CircleCI covers linux, macos, and windows, we can remove the others.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11354

Reviewed By: hx235

Differential Revision: D44774627

Pulled By: ajkr

fbshipit-source-id: c83b298ec5afe4ea410744eda6cc98fc6a3365f1

140dd93b

Initialize `lowest_unnecessary_level_` in `VersionStorageInfo` constructor (#11359) · 64cead91

由 Changyu Bi 提交于 4月 07, 2023

Summary:
valgrind complains "Conditional jump or move depends on uninitialised value(s)". A sample error message:

```
[ RUN      ] DBCompactionTest.DrainUnnecessaryLevelsAfterDBBecomesSmall
==3353864== Conditional jump or move depends on uninitialised value(s)
==3353864==    at 0x8647B4: rocksdb::VersionStorageInfo::ComputeCompactionScore(rocksdb::ImmutableOptions const&, rocksdb::MutableCFOptions const&) (version_set.cc:3414)
==3353864==    by 0x86B340: rocksdb::VersionSet::AppendVersion(rocksdb::ColumnFamilyData*, rocksdb::Version*) (version_set.cc:4946)
==3353864==    by 0x876B88: rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const*) (version_set.cc:6876)
==3353864==    by 0xBA66FE: rocksdb::VersionEditHandler::CreateCfAndInit(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const&) (version_edit_handler.cc:483)
==3353864==    by 0xBA4A81: rocksdb::VersionEditHandler::Initialize() (version_edit_handler.cc:187)
==3353864==    by 0xBA3927: rocksdb::VersionEditHandlerBase::Iterate(rocksdb::log::Reader&, rocksdb::Status*) (version_edit_handler.cc:31)
==3353864==    by 0x870173: rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool) (version_set.cc:5729)
==3353864==    by 0x7538FA: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*) (db_impl_open.cc:522)
==3353864==    by 0x75BA0F: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (db_impl_open.cc:1928)
==3353864==    by 0x75A735: rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (db_impl_open.cc:1743)
==3353864==    by 0x75A510: rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**) (db_impl_open.cc:1720)
==3353864==    by 0x5925FD: rocksdb::DBTestBase::TryReopen(rocksdb::Options const&) (db_test_util.cc:710)
==3353864==  Uninitialised value was created by a heap allocation
==3353864==    at 0x4842F0F: operator new(unsigned long) (vg_replace_malloc.c:422)
==3353864==    by 0x876AF4: rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const*) (version_set.cc:6870)
==3353864==    by 0xBA66FE: rocksdb::VersionEditHandler::CreateCfAndInit(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const&) (version_edit_handler.cc:483)
==3353864==    by 0xBA4A81: rocksdb::VersionEditHandler::Initialize() (version_edit_handler.cc:187)
==3353864==    by 0xBA3927: rocksdb::VersionEditHandlerBase::Iterate(rocksdb::log::Reader&, rocksdb::Status*) (version_edit_handler.cc:31)
==3353864==    by 0x870173: rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool) (version_set.cc:5729)
==3353864==    by 0x7538FA: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*) (db_impl_open.cc:522)
==3353864==    by 0x75BA0F: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (db_impl_open.cc:1928)
==3353864==    by 0x75A735: rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (db_impl_open.cc:1743)
==3353864==    by 0x75A510: rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**) (db_impl_open.cc:1720)
==3353864==    by 0x5925FD: rocksdb::DBTestBase::TryReopen(rocksdb::Options const&) (db_test_util.cc:710)
==3353864==    by 0x591F73: rocksdb::DBTestBase::Reopen(rocksdb::Options const&) (db_test_util.cc:662)
```

This is likely about `lowest_unnecessary_level_` even though it would be initialized in `CalculateBaseBytes()` before being used in `ComputeCompactionScore()`. Initialize it also in VersionStorageInfo constructor to prevent valgrind from  complaining.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11359

Test Plan: - ran a test with valgrind which gave the error message above before this PR: `valgrind --track-origins=yes ./db_compaction_test  --gtest_filter="*DrainUnnecessaryLevelsAfterDBBecomesSmall*"`

Reviewed By: hx235

Differential Revision: D44799112

Pulled By: cbi42

fbshipit-source-id: 557208a66f04a2163b418b2a651bdb7e777c4511

64cead91

Refactor block cache tracing w/improved MultiGet (#11339) · f9db0c6e

由 Peter Dillinger 提交于 4月 07, 2023

Summary:
After https://github.com/facebook/rocksdb/issues/11301, I wasn't sure whether I had regressed block cache tracing with MultiGet. Demo PR https://github.com/facebook/rocksdb/issues/11330 shows the flawed state of tracing MultiGet before my change, and based on the unit test, there was essentially no change in tracing behavior with https://github.com/facebook/rocksdb/issues/11301. This change is to leave that code and behavior better than I found it.

This change is not intended to change any production behaviors except when block cache tracing is active, though might improve general read path efficiency by disabling some related tracking when such tracing is disabled.

More detail on production code:
* Refactoring to consolidate the construction of BlockCacheTraceRecord, and other related functionality, in block-based table reader, though it's somewhat awkward to preserve an optimization to avoid copying Slices into temporary strings in BlockCacheLookupContext.
* Accurately track cache hits and misses (etc.) for each data block accessed by a MultiGet(). (Previously reported hits as misses.)
* Reduced repeated checking of `block_cache_tracer_` state (by creating lookup_context only when active) for efficiency and to reduce the risk of corner case bugs where tracing is enabled or disabled for different parts of a read op. (See a TODO below)
* Improved estimate calculation for num_keys_in_block (see code comment)

Possible follow-up:
* `XXX:` use_cache=true means double cache query? (possible double-query of block cache when allow_mmap_reads=true)
* `TODO:` need more than one lookup_context here to track individual filter and index partition hits and misses
* `TODO:` optimize more state checks of `block_cache_tracer_` down to `lookup_context != nullptr`
* Pre-existing `XXX:` There appear to be 'break' statements above that bypass this writing of the block cache trace record
* Expand test coverage (see below)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11339

Test Plan:
* Added a basic unit test for block cache tracing MultiGet, for now just covering one data block with two keys.
* Added HitMissCountingCache to independently verify that the actual block cache trace and expected block cache trace also agree with the actual number of cache hits / misses (nothing missing or mislabeled). For now only used with MultiGet test.
* Better testing of num_keys_in_block, for now just with MultiGet
* Misc improvements to table_test to improve clarity, such as making it clear that certain keys are auto-inserted at the start of every test.

Performance test:
Testing multireadrandom as in https://github.com/facebook/rocksdb/issues/11301, except averaging over distinct runs rather than [-X30] which doesn't seem to sufficiently reset after each run to work as an independent test run.

Base with revert of 11301: 3148926 ops/sec
Base: 3019146 ops/sec
New: 2999529 ops/sec

Possibly a tiny MultiGet CPU regression with this change. We are now always allocating an additional vector for the LookupContexts. I'm still contemplating options to try to correct the regression in https://github.com/facebook/rocksdb/issues/11301.

Testing readrandom:
Base with revert of 11301: 2311988
Base: 2281726
New: 2299722

Possibly a tiny Get CPU improvement with this change. We are now avoiding some unnecessary LookupContext population.

Reviewed By: akankshamahajan15

Differential Revision: D44557845

Pulled By: pdillinger

fbshipit-source-id: b841691799d2a48fb59cc8880dc7cbb1e107ae3d

f9db0c6e

Better support for merge operation with data block hash index (#11356) · f631138e

由 Changyu Bi 提交于 4月 07, 2023

Summary:
when data block hash index finds a key of op_type `kTypeMerge`, do not redo data block seek.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11356

Test Plan:
- added new unit test
- crash test: `python3 tools/db_crashtest.py whitebox --simple --use_merge=1 --data_block_index_type=1`
- benchmark see slight improvement in read throughput:
```
TEST_TMPDIR=/dev/shm/hashindex ./db_bench -benchmarks=mergerandom -use_existing_db=false -num=10000000 -compression_type=none -level_compaction_dynamic_level_bytes=1 -merge_operator=PutOperator -write_buffer_size=1000000 --use_data_block_hash_index=1

TEST_TMPDIR=/dev/shm/hashindex ./db_bench -benchmarks=readrandom[-X10] -use_existing_db=true -num=10000000 -merge_operator=PutOperator -readonly=1 -disable_auto_compactions=1 -reads=100000

Main: readrandom [AVG 10 runs] : 29526 (± 1118) ops/sec;    2.1 (± 0.1) MB/sec
Post-PR: readrandom [AVG 10 runs] : 31095 (± 662) ops/sec;    2.2 (± 0.0) MB/sec
```

Reviewed By: pdillinger

Differential Revision: D44759895

Pulled By: cbi42

fbshipit-source-id: 387f0c35938c7e0e96b810ca3babf1967fc68191

f631138e

07 4月, 2023 2 次提交

Filter table files by timestamp: Get operator (#11332) · 0578d9f9

由 Wentian Guo 提交于 4月 06, 2023

Summary:
If RocksDB enables user-defined timestamp, then RocksDB read path can filter table files by the min/max timestamps of each file. If application wants to lookup a key that is the most recent and visible to a certain timestamp ts, then we can compare ts with the min_ts of each file. If ts < min_ts, then we know all keys in the file is not visible at time ts, then we do not have to open the file. This can also save an IO.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11332

Reviewed By: pdillinger

Differential Revision: D44763497

Pulled By: guowentian

fbshipit-source-id: abde346b9f18480fe03c04e4006e7d62aa9c22a8

0578d9f9

Drain unnecessary levels when `level_compaction_dynamic_level_bytes=true` (#11340) · b3c43a5b

由 Changyu Bi 提交于 4月 06, 2023

Summary:
When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, this is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amp. This PR boosts compaction score for these levels to allow RocksDB to automatically drain these levels. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic without needing user to do a one time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340

Test Plan:
- New unit tests
- `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run.

Reviewed By: ajkr

Differential Revision: D44563884

Pulled By: cbi42

fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8

b3c43a5b

06 4月, 2023 2 次提交

Ensure VerifyFileChecksums reads don't exceed readahead_size (#11328) · 0623c5b9

由 anand76 提交于 4月 05, 2023

Summary:
VerifyFileChecksums currently interprets the readahead_size as a payload of readahead_size for calculating the checksum, plus a prefetch of an additional readahead_size. Hence each read is readahead_size * 2. This change treats it as chunks of readahead_size for checksum calculation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11328

Test Plan: Add a unit test

Reviewed By: pdillinger

Differential Revision: D44718781

Pulled By: anand1976

fbshipit-source-id: 79bae1ebaa27de2a13bc86f5910bf09356936e63

0623c5b9

Fix initialization-order-fiasco in write_stall_stats.cc (#11355) · 7f5b9f40

由 Hui Xiao 提交于 4月 05, 2023

Summary:
**Context/Summary:**
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11355

Test Plan:
- Ran previously failed tests and they succeed
- Perf
`./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3`

Reviewed By: ajkr

Differential Revision: D44719333

Pulled By: hx235

fbshipit-source-id: 23d22f314144071d97f7106ff1241c31c0bdf08b

7f5b9f40

05 4月, 2023 3 次提交

Use user-provided ReadOptions for metadata block reads more often (#11208) · b4573862

由 Andrew Kryczka 提交于 4月 04, 2023

Summary:
This is mostly taken from https://github.com/facebook/rocksdb/issues/10427 with my own comments addressed. This PR plumbs the user’s `ReadOptions` down to `GetOrReadIndexBlock()`, `GetOrReadFilterBlock()`, and `GetFilterPartitionBlock()`. Now those functions no longer have to make up a `ReadOptions` with incomplete information.

I also let `PartitionIndexReader::NewIterator()` pass through its caller's `ReadOptions::verify_checksums`, which was inexplicably dropped previously.

Fixes https://github.com/facebook/rocksdb/issues/10463

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11208

Test Plan:
Functional:
- Measured `-verify_checksum=false` applies to metadata blocks read outside of table open
  - setup command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56`
  - run command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=readrandom -use_existing_db=true -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56 -duration=10 -threads=32 -cache_size=131072 -statistics=true -verify_checksum=false -open_files=20 -cache_index_and_filter_blocks=true`
  - before: `rocksdb.block.checksum.compute.count COUNT : 384353`
  - after: `rocksdb.block.checksum.compute.count COUNT : 22`

Performance:
- Setup command (tmpfs, 128MB logical data size, cache indexes/filters without pinning so index/filter lookups go through table reader): `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=131072 -target_file_size_base=131072 -max_bytes_for_level_base=524288 -compression_type=none -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1`
- Measured point lookup performance. Database is fully cached to emphasize any new callstack overheads
  - Command: `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=readrandom[-W1][-X20] -use_existing_db=true -cache_index_and_filter_blocks=true -disable_auto_compactions=true -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1 -duration=10 -cache_size=1048576000`
  - Before: `readrandom [AVG    20 runs] : 274848 (± 3717) ops/sec;    8.4 (± 0.1) MB/sec`
  - After: `readrandom [AVG    20 runs] : 277904 (± 4474) ops/sec;    8.5 (± 0.1) MB/sec`

Reviewed By: hx235

Differential Revision: D43145366

Pulled By: ajkr

fbshipit-source-id: 75ec062ece86a82cd788783de9de2c72df57f994

b4573862

Re-clarify SecondaryCache API (#11316) · 03ccb1cd

由 Peter Dillinger 提交于 4月 04, 2023

Summary:
I previously misread or misinterpreted API contracts for SecondaryCache and this should correct the record. (Follow-up item from https://github.com/facebook/rocksdb/issues/11301)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11316

Test Plan: comments only

Reviewed By: anand1976

Differential Revision: D44245107

Pulled By: pdillinger

fbshipit-source-id: 3f8ddec150674b75728f1730f99b963bbf7b76e7

03ccb1cd

Change default block cache from 8MB to 32MB (#11350) · 3c17930e

由 Peter Dillinger 提交于 4月 04, 2023

Summary:
... which increases default number of shards from 16 to 64. Although the default block cache size is only recommended for applications where RocksDB is not performance-critical, under stress conditions, block cache mutex contention could become a performance bottleneck. This change of default should alleviate that.

Note that reducing the size of cache shards (recommended minimum 512MB) could cause thrashing, e.g. on filter blocks, so capacity needs to increase to safely increase number of shards.

The 8MB default dates back to 2011 or earlier (f779e7a5), when the most simultaneous threads you could get from a single CPU socket was 20 (e.g. Intel Xeon E7-8870). Now more than 100 is available.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11350

Test Plan: unit tests updated

Reviewed By: cbi42

Differential Revision: D44674873

Pulled By: pdillinger

fbshipit-source-id: 91ed3070789b42679283c7e6dc97c41a6a97bdf4

3c17930e

04 4月, 2023 2 次提交

Expose cache occupancy via C API (#11327) · e5a560ec

由 Niklas Fiekas 提交于 4月 03, 2023

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11327

Reviewed By: cbi42

Differential Revision: D44422225

Pulled By: ajkr

fbshipit-source-id: 3bfcf47290b3133c151bdfdd181896ba2e6be520

e5a560ec

Fix gflags_compat.h (#11346) · b4d78189

由 Peter Dillinger 提交于 4月 03, 2023

Summary:
Was getting compilation failure with old verison of gflags, examples in https://github.com/facebook/rocksdb/issues/11344. Perhaps this is new since enabling C++17. Getting rid of std::reference_wrapper from https://github.com/facebook/rocksdb/issues/10729 seems to fix it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11346

Test Plan: manual, CI

Reviewed By: guowentian

Differential Revision: D44632776

Pulled By: pdillinger

fbshipit-source-id: 5c1f3f79a055698574538b6342c912a627b6d061

b4d78189

31 3月, 2023 2 次提交

Remove platform009 and default to platform010 (#11333) · 891ced8b

由 anand76 提交于 3月 30, 2023

Summary:
Platform009 is no longer supported in fbcode.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11333

Reviewed By: pdillinger, ltamasi

Differential Revision: D44486431

Pulled By: anand1976

fbshipit-source-id: 99e19a70ebbb04ae750d39c33a110518bb25487e

891ced8b

Add `SetAllowStall()` (#11335) · 39c29372

由 Hui Xiao 提交于 3月 30, 2023

Summary:
**Context/Summary:**
- Allow runtime changes to whether `WriteBufferManager` allows stall or not by calling `SetAllowStall()`
- Misc: some clean up - see PR conversation

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11335

Test Plan: - New UT

Reviewed By: akankshamahajan15

Differential Revision: D44502555

Pulled By: hx235

fbshipit-source-id: 24b5cc57df7734b11d42e4870c06c87b95312b5e

39c29372

30 3月, 2023 1 次提交

Extend the stress test coverage of MultiGetEntity (#11336) · 0efd7b4b

由 Levi Tamasi 提交于 3月 29, 2023

Summary:
Similarly to `GetEntity` prior to https://github.com/facebook/rocksdb/issues/11303, the `MultiGetEntity` API is currently
only used in the DB verification logic of the stress tests. The patch introduces
a new mode where all point lookups are performed using `MultiGetEntity`,
and implements the corresponding logic in the non-batched, batched, and
CF consistency tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11336

Test Plan: Ran simple blackbox tests for the various stress test flavors.

Reviewed By: akankshamahajan15

Differential Revision: D44513285

Pulled By: ltamasi

fbshipit-source-id: c3db098501bf875b6a356b09fc676a0268d92c35

0efd7b4b

29 3月, 2023 1 次提交

Add experimental PerfContext counters for db iterator Prev/Next/Seek* APIs (#11320) · c14eb134

由 Hui Xiao 提交于 3月 28, 2023

Summary:
**Context/Summary:**
Motived by user need of investigating db iterator behavior during an interval of any time length of a certain thread, we decide to collect and expose related counters in `PerfContext` as an experimental feature, in addition to the existing db-scope ones (i.e, tickers)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11320

Test Plan:
- new UT
- db bench

Setup
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -compression_type=none -bloom_bits=3
```
Test till converges
```
./db_bench -seed=1679526311157283 -use_existing_db=1 -perf_level=2 -db=/dev/shm/testdb/ -benchmarks="seekrandom[-X60]"
```
pre-change
`seekrandom [AVG 33 runs] : 7545 (± 100) ops/sec`
post-change (no regression)
`seekrandom [AVG 33 runs] : 7688 (± 67) ops/sec`

Reviewed By: cbi42

Differential Revision: D44321931

Pulled By: hx235

fbshipit-source-id: f98a254ba3e3ced95eb5928884e33f1b99dca401

c14eb134

28 3月, 2023 2 次提交

Trivially move files down when opening db with level_compaction_dynamic_l… (#11321) · 60132016

由 Changyu Bi 提交于 3月 27, 2023

Summary:
…evel_bytes

During DB open, if a column family uses level compaction with level_compaction_dynamic_level_bytes=true, trivially move its files down in the LSM such that the bottommost files are in Lmax, the second from bottommost level files are in Lmax-1 and so on. This is aimed to make it easier to migrate level_compaction_dynamic_level_bytes from false to true. Before this change, a full manual compaction is suggested for such migration. After this change, user can just restart DB to turn on this option. db_crashtest.py is updated to randomly choose value for level_compaction_dynamic_level_bytes.

Note that there may still be too many unnecessary levels if a user is migrating from universal compaction or level compaction with a smaller level multiplier. A full manual compaction may still be needed in that case before some PR that automatically drain unnecessary levels like https://github.com/facebook/rocksdb/issues/3921 lands. Eventually we may want to change the default value of option level_compaction_dynamic_level_bytes to true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11321

Test Plan:
1. Added unit tests.
2. Crash test: ran a variation of db_crashtest.py (like 32516507e77521ae887e45091b69139e32e8efb7) that turns level_compaction_dynamic_level_bytes on and off and switches between LC and UC for the same DB.

TODO: Update `OptionChangeMigration`, either after this PR or https://github.com/facebook/rocksdb/issues/3921.

Reviewed By: ajkr

Differential Revision: D44341930

Pulled By: cbi42

fbshipit-source-id: 013de19a915c6a0502be569f07c4cc8f1c3c6be2

60132016

Add in-transaction multi-get-for-update to the C interface (#11107) · 40c2ec6d

由 karemta-orday 提交于 3月 27, 2023

Summary:
Hi, this is basically a part of https://github.com/facebook/rocksdb/pull/6488 that only adds `multi_get_for_update` functionality to C API (I'd like to call it from Rust), since `multi_get` was already added here https://github.com/facebook/rocksdb/pull/9252

https://github.com/facebook/rocksdb/pull/6488 has conflicts, so I guess it might be easier to get this one in

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11107

Reviewed By: pdillinger

Differential Revision: D42680764

Pulled By: ajkr

fbshipit-source-id: a50f96e1c7f3d470b4ab07e9ff5a283e5cf44865

40c2ec6d

23 3月, 2023 3 次提交

validate SstFileWriter range tombstones cover positive ranges (#11322) · 9f8cdc8a

由 Andrew Kryczka 提交于 3月 22, 2023

Summary:
As titled. This is the same as https://github.com/facebook/rocksdb/issues/6788 but for range tombstones written through `SstFileWriter` rather than through `DB`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11322

Reviewed By: cbi42

Differential Revision: D44317733

Pulled By: ajkr

fbshipit-source-id: f6eb8791ae2c09c169b6bfe0d047449d924b377e

9f8cdc8a

Backport an internal change to regression_build_test.sh (#11319) · 57abdea3

由 Levi Tamasi 提交于 3月 22, 2023

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11319

Reviewed By: cbi42

Differential Revision: D44308743

Pulled By: ltamasi

fbshipit-source-id: ffd054e9f4162797cfe1ef78240ad2501f78bbbd

57abdea3

Remove unused `#include <stdarg.h>` in include/rocksdb/c.h (#11302) · 8f6c2a2c

由 Tobias Ruck 提交于 3月 22, 2023

Summary:
This include is unused in the header. In one build environment of ours, stdarg.h is actually not present, and this include prevents us from building rocksdb dependencies.

We're currently monkey-patching this line out in our build script (still WIP), which of course is not good. https://github.com/raipay/rust-rocksdb/commit/ec2852caa3074a3309881acf26284a60672e0b1b

Note that removing this include might break builds in unexpected ways that include rocksdb/c.h and then use `va_start`, `va_end`, etc. However, if you're using these functions, you really should include stdarg.h yourself, so I don't think this should prevent this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11302

Reviewed By: ajkr

Differential Revision: D44139819

Pulled By: cbi42

fbshipit-source-id: 10c40b0b0260b23ccb7dc84e55a993c7dfbdc4cf

8f6c2a2c

22 3月, 2023 1 次提交

Deflake DBCompactionTest.CancelCompactionWaitingOnConflict (#11318) · b92bc04a

由 sdong 提交于 3月 21, 2023

Summary:
In DBCompactionTest::CancelCompactionWaitingOnConflict, when generating SST files to trigger a compaction, we don't wait after each file, which may cause multiple memtables going to the same SST file, causing insufficient files to trigger the compaction. We do the waiting instead, except the last one, which would trigger compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11318

Test Plan: Run DBCompactionTest.CancelCompactionWaitingOnConflict multiple times.

Reviewed By: ajkr

Differential Revision: D44267273

fbshipit-source-id: 86af49b05fc67ea3335312f0f5f3d22df1520bf8

b92bc04a

21 3月, 2023 1 次提交

Disabling some IO error assertion in EnvLogger (#11314) · cea81cad

由 sdong 提交于 3月 20, 2023

Summary:
Right now, EnvLogger has the same IO error assertion as most other places: if we are writing to the file after we've seen an IO error, the assertion would trigger. This is too strict for info logger: we would not fail DB if info logger fails and we would try the best to continue logging. For now, we simplify the problem by disabling the assertion for EnvLogger.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11314

Test Plan: Run env_logger_test to make sure at least it doesn't fail in normal cases.

Reviewed By: anand1976

Differential Revision: D44227732

fbshipit-source-id: e3d31a221a5757f018a67ccaa96dcf89eb981f66

cea81cad

19 3月, 2023 3 次提交

Specify precedence in `SstFileWriter::DeleteRange()` API contract (#11309) · 8c445407

由 Andrew Kryczka 提交于 3月 18, 2023

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11309

Reviewed By: cbi42

Differential Revision: D44198501

Pulled By: ajkr

fbshipit-source-id: d603aca37b56aac5df255833793a3300807d63cf

8c445407

Updates for the 8.1 release (HISTORY, version.h, compatibility tests) (#11307) · 87de4fee

由 Levi Tamasi 提交于 3月 18, 2023

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11307

Reviewed By: hx235

Differential Revision: D44196571

Pulled By: ltamasi

fbshipit-source-id: 52489d6f8bd3c79cd33c87e9e1f719ea5e8bd382

87de4fee

New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) · cb584771

由 Hui Xiao 提交于 3月 18, 2023

Summary:
**Context/Summary:**
Users are interested in figuring out what has caused write stall.
- Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`)
- Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager
- Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped`

- Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map
- Misc: rename/comment some write stall InternalStats for clarity

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300

Test Plan:
- New UT
- Stress test
`python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1`
- Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time.
```
./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3
```
pre-change:
```
fillseq [AVG 15 runs] : 1176 (± 732) ops/sec;    0.6 (± 0.4) MB/sec
fillseq      :    1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations;    0.5 MB/s
fillseq [AVG 16 runs] : 1162 (± 685) ops/sec;    0.6 (± 0.4) MB/sec
fillseq      :    1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 17 runs] : 1136 (± 646) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 18 runs] : 1118 (± 610) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 19 runs] : 1100 (± 578) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations;    0.3 MB/s
fillseq [AVG 20 runs] : 1071 (± 551) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 21 runs] : 1059 (± 525) ops/sec;    0.5 (± 0.3) MB/sec
```
post-change:
```
fillseq [AVG 15 runs] : 1226 (± 732) ops/sec;    0.6 (± 0.4) MB/sec
fillseq      :    1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 16 runs] : 1196 (± 687) ops/sec;    0.6 (± 0.4) MB/sec
fillseq      :    1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 17 runs] : 1174 (± 647) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 18 runs] : 1156 (± 611) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 19 runs] : 1134 (± 579) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 20 runs] : 1119 (± 550) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 21 runs] : 1106 (± 524) ops/sec;    0.6 (± 0.3) MB/sec
```

Reviewed By: ajkr

Differential Revision: D44159541

Pulled By: hx235

fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b

cb584771

18 3月, 2023 2 次提交

HyperClockCache support for SecondaryCache, with refactoring (#11301) · 204fcff7

由 Peter Dillinger 提交于 3月 17, 2023

Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.

This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.

It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.

## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.

## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.

These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.

* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)

I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.

## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.

## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.

## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.

## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.

## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.

## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.

## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.

Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).

Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.

## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301

Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.

## Crash/stress test
Updated to use the new combination.

## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.

```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```

**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127

**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)

This is an acceptable difference IMHO.

Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.

Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```

**Before**:
multireadrandom [AVG    30 runs] : 3444202 (± 57049) ops/sec;  240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec;  245.8 MB/sec
**After**:
multireadrandom [AVG    30 runs] : 3291022 (± 58851) ops/sec;  230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec;  235.4 MB/sec

So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:

**Before**:
multireadrandom [AVG    30 runs] : 3933777 (± 41840) ops/sec;  275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec;  277.7 MB/sec
**After**:
multireadrandom [AVG    30 runs] : 3755338 (± 30391) ops/sec;  262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec;  264.8 MB/sec

Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.

Let's also look at Get() in db_bench:

```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```

**Before**:
readrandom [AVG    30 runs] : 2198685 (± 13412) ops/sec;  153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec;  154.5 MB/sec
**After**:
readrandom [AVG    30 runs] : 2292814 (± 43508) ops/sec;  160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec;  165.4 MB/sec

That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:

**Before**:
readrandom [AVG    30 runs] : 2272333 (± 9992) ops/sec;  158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec;  159.0 MB/sec
**After**:
readrandom [AVG    30 runs] : 2332407 (± 11252) ops/sec;  163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec;  163.3 MB/sec

Reviewed By: ltamasi

Differential Revision: D44177044

Pulled By: pdillinger

fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5

204fcff7

Ignore async_io ReadOption if FileSystem doesn't support it (#11296) · eac6b6d0

由 anand76 提交于 3月 17, 2023

Summary:
In PosixFileSystem, IO uring support is opt-in. If the support is not enabled by the user, then ignore the async_io ReadOption in MultiGet and iteration at the top, rather than follow the async_io codepath and transparently switch to sync IO at the FileSystem layer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11296

Test Plan: Add new unit tests

Reviewed By: akankshamahajan15

Differential Revision: D44045776

Pulled By: anand1976

fbshipit-source-id: a0881bf763ca2fde50b84063d0068bb521edd8b9

eac6b6d0

kvdb / rocksdb 12 个月 前同步成功

kvdb / rocksdb
12 个月前同步成功