1. 27 April 2022, 4 commits
    • Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) · 9d0cae71
      Committed by Peter Dillinger
      Summary:
      When MultiGet() determines that multiple query keys can be
      served by examining the same data block in block cache (one Lookup()),
      each PinnableSlice referring to data in that data block needs to hold
      on to the block in cache so that they can be released at arbitrary
      times by the API user. Historically this is accomplished with extra
      calls to Ref() on the Handle from Lookup(), with each PinnableSlice
      cleanup calling Release() on the Handle, but this creates extra
      contention on the block cache for the extra Ref()s and Release()es,
      especially because they hit the same cache shard repeatedly.
      
      In the case of merge operands (possibly more cases?), the problem was
      compounded by doing an extra Ref()+eventual Release() for each merge
      operand for a key reusing a block (which could be the same key!), rather
      than one Ref() per key. (Note: the non-shared case with `biter` was
      already one per key.)
      
      This change optimizes MultiGet not to rely on these extra, contentious
      Ref()+Release() calls by instead, in the shared block case, wrapping
      the cache Release() cleanup in a refcounted object referenced by the
      PinnableSlices, such that after the last wrapped reference is released,
      the cache entry is Release()ed. Relaxed atomic refcounts should be
      much faster than mutex-guarded Ref() and Release(), and much less prone
      to a performance cliff when MultiGet() does a lot of block sharing.
      
      Note that I did not use std::shared_ptr, because that would require an
      extra indirection object (shared_ptr itself new/delete) in order to
      associate a ref increment/decrement with a Cleanable cleanup entry. (If
      I assumed it was the size of two pointers, I could do some hackery to
      make it work without the extra indirection, but that's too fragile.)
      
      Some details:
      * Fixed (removed) extra block cache tracing entries in cases of cache
      entry reuse in MultiGet, but it's likely that in some other cases traces
      are missing (XXX comment inserted)
      * Moved existing implementations for cleanable.h from iterator.cc to
      new cleanable.cc
      * Improved API comments on Cleanable
      * Added a public SharedCleanablePtr class to cleanable.h in case others
      could benefit from the same pattern (potentially many Cleanables and/or
      smart pointers referencing a shared Cleanable)
      * Add a typedef for MultiGetContext::Mask
      * Some variable renaming for clarity
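      The shared-cleanup idea can be sketched as a minimal refcounted wrapper
      (hypothetical names and simplified details; RocksDB's actual
      SharedCleanablePtr/Cleanable machinery differs):

      ```cpp
      #include <atomic>
      #include <cassert>
      #include <functional>

      // Sketch: a refcounted wrapper that runs a single cleanup action
      // (e.g. cache->Release(handle)) after the last reference is dropped.
      class SharedCleanup {
       public:
        explicit SharedCleanup(std::function<void()> cleanup)
            : refs_(1), cleanup_(std::move(cleanup)) {}

        void Ref() {
          // Increment can be relaxed: it only needs atomicity, no ordering.
          refs_.fetch_add(1, std::memory_order_relaxed);
        }

        void Unref() {
          // Decrement uses acq_rel so the cleanup observes all prior writes.
          if (refs_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            cleanup_();
            delete this;
          }
        }

       private:
        std::atomic<int> refs_;
        std::function<void()> cleanup_;
      };
      ```

      Each PinnableSlice sharing the block would Ref()/Unref() this wrapper
      instead of the cache Handle, so only the final Unref() touches the
      contended, sharded cache.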
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9899
      
      Test Plan:
      Added unit tests for SharedCleanablePtr.
      
      Greatly enhanced ability of existing tests to detect cache use-after-free.
      * Release PinnableSlices from MultiGet as they are read rather than in
      bulk (in db_test_util wrapper).
      * In ASAN build, default to using a trivially small LRUCache for block_cache
      so that entries are immediately erased when unreferenced. (Updated two
      tests that depend on caching.) New ASAN testsuite running time seems
      OK to me.
      
      If I introduce a bug into my implementation where we skip the shared
      cleanups on block reuse, ASAN detects the bug in
      `db_basic_test *MultiGet*`. If I remove either of the above testing
      enhancements, the bug is not detected.
      
      Consider for follow-up work: manipulate or randomize ordering of
      PinnableSlice use and release from MultiGet db_test_util wrapper. But in
      typical cases, natural ordering gives pretty good functional coverage.
      
      Performance test:
      In the extreme (but possible) case of MultiGetting the same or adjacent keys
      in a batch, throughput can improve by an order of magnitude.
      `./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb -readonly -num=5 -duration=10 -threads=20 -multiread_batched -batch_size=200`
      Before ops/sec, num=5: 1,384,394
      Before ops/sec, num=500: 6,423,720
      After ops/sec, num=500: 10,658,794
      After ops/sec, num=5: 16,027,257
      
      Also note that previously, with high parallelism, having query keys
      concentrated in a single block was worse than spreading them out a bit. Now
      concentrated in a single block is faster than spread out, which is hopefully
      consistent with natural expectation.
      
      Random query performance: with num=1000000, over 999 x 10s runs running before & after simultaneously (each -threads=12):
      Before: multireadrandom [AVG    999 runs] : 1088699 (± 7344) ops/sec;  120.4 (± 0.8 ) MB/sec
      After: multireadrandom [AVG    999 runs] : 1090402 (± 7230) ops/sec;  120.6 (± 0.8 ) MB/sec
      Possibly better, possibly in the noise.
      
      Reviewed By: anand1976
      
      Differential Revision: D35907003
      
      Pulled By: pdillinger
      
      fbshipit-source-id: bbd244d703649a8ca12d476f2d03853ed9d1a17e
      9d0cae71
    • fix clang-analyze in corruption_test (#9908) · ce2d8a42
      Committed by Andrew Kryczka
      Summary:
      This PR fixes a clang-analyze error that I introduced in https://github.com/facebook/rocksdb/issues/9906:
      
      ```
      db/corruption_test.cc:358:15: warning: Called C++ object pointer is null
          ASSERT_OK(db_->Put(WriteOptions(), cfhs[0], "k", "v"));
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ./test_util/testharness.h:76:62: note: expanded from macro 'ASSERT_OK'
        ASSERT_PRED_FORMAT1(ROCKSDB_NAMESPACE::test::AssertStatus, s)
                                                                   ^
      third-party/gtest-1.8.1/fused-src/gtest/gtest.h:19909:36: note: expanded
      from macro 'ASSERT_PRED_FORMAT1'
        GTEST_PRED_FORMAT1_(pred_format, v1, GTEST_FATAL_FAILURE_)
                                         ^~
      third-party/gtest-1.8.1/fused-src/gtest/gtest.h:19892:34: note: expanded
      from macro 'GTEST_PRED_FORMAT1_'
        GTEST_ASSERT_(pred_format(#v1, v1), \
                                       ^~
      third-party/gtest-1.8.1/fused-src/gtest/gtest.h:19868:52: note: expanded
      from macro 'GTEST_ASSERT_'
        if (const ::testing::AssertionResult gtest_ar = (expression)) \
                                                         ^~~~~~~~~~
      1 warning generated.
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9908
      
      Reviewed By: riversand963
      
      Differential Revision: D35953147
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9b837bd7581c6e1e2cdbc961c099652256eb9d4b
      ce2d8a42
    • Add mmap DBGet microbench parameters (#9903) · 1eb279dc
      Committed by Andrew Kryczka
      Summary:
      I tried evaluating https://github.com/facebook/rocksdb/issues/9611 using DBGet microbenchmarks but mostly found the change is well within the noise even for hundreds of repetitions; meanwhile, the InternalKeyComparator CPU it saves is 1-2% according to perf, so it should be measurable. In this PR I tried adding a mmap mode that will bypass compression/checksum/block cache/file read to focus more on the block lookup paths, and also increased the Get() count.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9903
      
      Reviewed By: jay-zhuang, riversand963
      
      Differential Revision: D35907375
      
      Pulled By: ajkr
      
      fbshipit-source-id: 69490d5040ef0863e1ce296724104d0aa7667215
      1eb279dc
    • Revert open logic changes in #9634 (#9906) · c5d367f4
      Committed by Andrew Kryczka
      Summary:
      This reverts the open logic changes of #9634 but keeps its HISTORY.md entry and unit tests.
      Added a new unit test to repro the corruption scenario that this PR fixes, and a HISTORY.md line for that.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9906
      
      Reviewed By: riversand963
      
      Differential Revision: D35940093
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9816f99e1ce405ba36f316beb4f6378c37c8c86b
      c5d367f4
  2. 26 April 2022, 4 commits
    • Add stats related to async prefetching (#9845) · 3653029d
      Committed by Akanksha Mahajan
      Summary:
      Add stats PREFETCHED_BYTES_DISCARDED and POLL_WAIT_MICROS.
      PREFETCHED_BYTES_DISCARDED records the number of prefetched bytes discarded by
      FilePrefetchBuffer. POLL_WAIT_MICROS records the time taken by the underlying
      file system's Poll API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9845
      
      Test Plan: Update existing tests
      
      Reviewed By: anand1976
      
      Differential Revision: D35909694
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: e009ef940bb9ed72c9446f5529095caabb8a1e36
      3653029d
    • Bugfix/fix manual flush blocking bug (#9893) · 6d2577e5
      Committed by RoeyMaor
      Summary:
      Fix https://github.com/facebook/rocksdb/issues/9892
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9893
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D35880959
      
      Pulled By: ajkr
      
      fbshipit-source-id: dad1139ad0983cfbd5c5cd6fa6b71022f889735a
      6d2577e5
    • Add 95% confidence intervals to db_bench output (#9882) · fb9a167a
      Committed by Jaromir Vanek
      Summary:
      Enhancing `db_bench` output with 95% statistical confidence intervals for better performance evaluation. The goal is to unambiguously separate random variance when running benchmark over multiple iterations.
      
      Output enhanced with confidence intervals exposed in brackets:
      
      ```
      $ ./db_bench --benchmarks=fillseq[-X10]
      
      Running benchmark for 10 times
      fillseq      :       4.961 micros/op 201578 ops/sec;   22.3 MB/s
      fillseq      :       5.030 micros/op 198824 ops/sec;   22.0 MB/s
      fillseq [AVG 2 runs] : 200201 (± 2698) ops/sec;   22.1 (± 0.3) MB/sec
      fillseq      :       4.963 micros/op 201471 ops/sec;   22.3 MB/s
      fillseq [AVG 3 runs] : 200624 (± 1765) ops/sec;   22.2 (± 0.2) MB/sec
      fillseq      :       5.035 micros/op 198625 ops/sec;   22.0 MB/s
      fillseq [AVG 4 runs] : 200124 (± 1586) ops/sec;   22.1 (± 0.2) MB/sec
      fillseq      :       4.979 micros/op 200861 ops/sec;   22.2 MB/s
      fillseq [AVG 5 runs] : 200272 (± 1262) ops/sec;   22.2 (± 0.1) MB/sec
      fillseq      :       4.893 micros/op 204367 ops/sec;   22.6 MB/s
      fillseq [AVG 6 runs] : 200954 (± 1688) ops/sec;   22.2 (± 0.2) MB/sec
      fillseq      :       4.914 micros/op 203502 ops/sec;   22.5 MB/s
      fillseq [AVG 7 runs] : 201318 (± 1595) ops/sec;   22.3 (± 0.2) MB/sec
      fillseq      :       4.998 micros/op 200074 ops/sec;   22.1 MB/s
      fillseq [AVG 8 runs] : 201163 (± 1415) ops/sec;   22.3 (± 0.2) MB/sec
      fillseq      :       4.946 micros/op 202188 ops/sec;   22.4 MB/s
      fillseq [AVG 9 runs] : 201277 (± 1267) ops/sec;   22.3 (± 0.1) MB/sec
      fillseq      :       5.093 micros/op 196331 ops/sec;   21.7 MB/s
      fillseq [AVG 10 runs] : 200782 (± 1491) ops/sec;   22.2 (± 0.2) MB/sec
      fillseq [AVG    10 runs] : 200782 (± 1491) ops/sec;   22.2 (± 0.2) MB/sec
      fillseq [MEDIAN 10 runs] : 201166 ops/sec;   22.3 MB/s
      ```
      
      For more explicit interval representation, use `--confidence_interval_only` flag:
      
      ```
      $ ./db_bench --benchmarks=fillseq[-X10] --confidence_interval_only
      
      Running benchmark for 10 times
      fillseq      :       4.935 micros/op 202648 ops/sec;   22.4 MB/s
      fillseq      :       5.078 micros/op 196943 ops/sec;   21.8 MB/s
      fillseq [CI95 2 runs] : (194205, 205385) ops/sec; (21.5, 22.7) MB/sec
      fillseq      :       5.159 micros/op 193816 ops/sec;   21.4 MB/s
      fillseq [CI95 3 runs] : (192735, 202869) ops/sec; (21.3, 22.4) MB/sec
      fillseq      :       4.947 micros/op 202158 ops/sec;   22.4 MB/s
      fillseq [CI95 4 runs] : (194721, 203061) ops/sec; (21.5, 22.5) MB/sec
      fillseq      :       4.908 micros/op 203756 ops/sec;   22.5 MB/s
      fillseq [CI95 5 runs] : (196113, 203615) ops/sec; (21.7, 22.5) MB/sec
      fillseq      :       5.063 micros/op 197528 ops/sec;   21.9 MB/s
      fillseq [CI95 6 runs] : (196319, 202631) ops/sec; (21.7, 22.4) MB/sec
      fillseq      :       5.214 micros/op 191799 ops/sec;   21.2 MB/s
      fillseq [CI95 7 runs] : (194953, 201803) ops/sec; (21.6, 22.3) MB/sec
      fillseq      :       5.260 micros/op 190095 ops/sec;   21.0 MB/s
      fillseq [CI95 8 runs] : (193749, 200937) ops/sec; (21.4, 22.2) MB/sec
      fillseq      :       5.076 micros/op 196992 ops/sec;   21.8 MB/s
      fillseq [CI95 9 runs] : (194134, 200474) ops/sec; (21.5, 22.2) MB/sec
      fillseq      :       5.388 micros/op 185603 ops/sec;   20.5 MB/s
      fillseq [CI95 10 runs] : (192487, 199781) ops/sec; (21.3, 22.1) MB/sec
      fillseq [AVG    10 runs] : 196134 (± 3647) ops/sec;   21.7 (± 0.4) MB/sec
      fillseq [MEDIAN 10 runs] : 196968 ops/sec;   21.8 MB/sec
      ```
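      A minimal sketch of the interval computation (not db_bench's actual
      implementation): the bracketed bounds are consistent with a standard
      normal-approximation 95% CI, mean ± 1.96 · s/√n with sample standard
      deviation s. Fed the first two runs above, this approximately reproduces
      the first `[CI95 2 runs]` line.

      ```cpp
      #include <cassert>
      #include <cmath>
      #include <vector>

      struct ConfidenceInterval {
        double lower;
        double upper;
      };

      // 95% confidence interval for the mean of repeated benchmark runs,
      // using the normal approximation (z = 1.96).
      ConfidenceInterval CI95(const std::vector<double>& runs) {
        const double n = static_cast<double>(runs.size());
        double sum = 0.0;
        for (double r : runs) sum += r;
        const double mean = sum / n;
        double sq = 0.0;
        for (double r : runs) sq += (r - mean) * (r - mean);
        const double stddev = std::sqrt(sq / (n - 1));  // sample stddev
        const double margin = 1.96 * stddev / std::sqrt(n);
        return {mean - margin, mean + margin};
      }
      ```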
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9882
      
      Reviewed By: pdillinger
      
      Differential Revision: D35796148
      
      Pulled By: vanekjar
      
      fbshipit-source-id: 8313712d16728ff982b8aff28195ee56622385b8
      fb9a167a
    • Add experimental new FS API AbortIO to cancel read request (#9901) · 5bd374b3
      Committed by Akanksha Mahajan
      Summary:
      Add an experimental new API, AbortIO, in FileSystem to abort
      read requests submitted asynchronously through the ReadAsync API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9901
      
      Test Plan: Existing tests
      
      Reviewed By: anand1976
      
      Differential Revision: D35885591
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: df3944e6e9e6e487af1fa688376b4abb6837fb02
      5bd374b3
  3. 23 April 2022, 1 commit
  4. 22 April 2022, 1 commit
  5. 21 April 2022, 4 commits
    • Add rollback_deletion_type_callback to TxnDBOptions (#9873) · d13825e5
      Committed by Yanqin Jin
      Summary:
      This PR does not affect write-committed.
      
      Add a member, `rollback_deletion_type_callback` to TransactionDBOptions
      so that a write-prepared transaction, when rolling back, can call this
      callback to decide if a `Delete` or `SingleDelete` should be used to
      cancel a prior `Put` written to the database during prepare phase.
      
      The purpose of this PR is to prevent mixing `Delete` and `SingleDelete`
      for the same key, causing undefined behaviors. Without this PR, the
      following can happen:
      
      ```
      // The application always issues SingleDelete when deleting keys.
      
      txn1->Put('a');
      txn1->Prepare(); // writes to memtable and potentially gets flushed/compacted to Lmax
      txn1->Rollback();  // inserts DELETE('a')
      
      txn2->Put('a');
      txn2->Commit();  // writes to memtable and potentially gets flushed/compacted
      ```
      
      In the database, we may have
      ```
      L0:   [PUT('a', s=100)]
      L1:   [DELETE('a', s=90)]
      Lmax: [PUT('a', s=0)]
      ```
      
      If a compaction compacts L0 and L1, then we have
      ```
      L1:    [PUT('a', s=100)]
      Lmax:  [PUT('a', s=0)]
      ```
      
      If a future transaction issues a SingleDelete, we have
      ```
      L0:    [SD('a', s=110)]
      L1:    [PUT('a', s=100)]
      Lmax:  [PUT('a', s=0)]
      ```
      
      Then, a compaction including L0, L1 and Lmax leads to
      ```
      Lmax:  [PUT('a', s=0)]
      ```
      
      which is incorrect.
      
      Similar bugs were reported and addressed in
      https://github.com/cockroachdb/pebble/issues/1255. Based on our team's
      current priorities, we have decided to take this approach for now. We may
      come back and revisit in the future.
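
      The decision the callback enables can be sketched as follows. The names
      and the callback signature here are illustrative stand-ins, not RocksDB's
      exact API: during rollback of a prepared `Put`, the transaction layer asks
      the application whether the key is always deleted with `SingleDelete`.

      ```cpp
      #include <cassert>
      #include <functional>
      #include <string>

      enum class DeletionType { kDelete, kSingleDelete };

      // Hypothetical callback: returns true if the application always uses
      // SingleDelete for this key, so rollback should also use SingleDelete.
      using RollbackDeletionTypeCallback =
          std::function<bool(const std::string& /*key*/)>;

      DeletionType RollbackDeletionFor(const std::string& key,
                                       const RollbackDeletionTypeCallback& cb) {
        // Without a callback, fall back to the old behavior: a regular Delete.
        if (cb && cb(key)) {
          return DeletionType::kSingleDelete;
        }
        return DeletionType::kDelete;
      }
      ```

      An application that always issues SingleDelete would supply a callback
      returning true, so rolling back txn1's Put('a') writes SD('a') rather
      than DELETE('a'), avoiding the mixed Delete/SingleDelete sequence above.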
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9873
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D35762170
      
      Pulled By: riversand963
      
      fbshipit-source-id: b28d56eefc786b53c9844b9ef4a7807acdd82c8d
      d13825e5
    • Mark GetLiveFilesStorageInfo ready for production use (#9868) · 1bac873f
      Committed by Peter Dillinger
      Summary:
      ... by filling out the remaining testing hole: handling of
      db_paths + cf_paths. (Note that while GetLiveFilesStorageInfo works
      with db_paths / cf_paths, Checkpoint and BackupEngine do not and
      are marked appropriately.)
      
      Also improved comments for "live files" APIs, and grouped them
      together in db.h.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9868
      
      Test Plan: Adding to existing unit tests
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D35752254
      
      Pulled By: pdillinger
      
      fbshipit-source-id: c70eb67748fad61826e2f554b674638700abefb2
      1bac873f
    • Add 7.2 to compatible check (#9858) · 2ea4205a
      Committed by Jay Zhuang
      Summary:
      Add 7.2 to the compatibility check (this should be changed with each version update).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9858
      
      Reviewed By: riversand963
      
      Differential Revision: D35722897
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 08c782b9344599d7296543eb0c61afcd9a869a1a
      2ea4205a
    • Add --decode_blob_index option to idump and dump commands (#9870) · 9b5790f0
      Committed by yuzhangyu
      Summary:
      This patch completes the first part of the task: "Extend all three commands so they can decode and print blob references if a new option --decode_blob_index is specified"
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9870
      
      Reviewed By: ltamasi
      
      Differential Revision: D35753932
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 9d2bbba0eef2ed86b982767eba9de1b4881f35c9
      9b5790f0
  6. 20 April 2022, 4 commits
  7. 19 April 2022, 5 commits
    • Avoid overwriting OPTIONS file settings in db_bench (#9862) · 690f1edf
      Committed by Andrew Kryczka
      Summary:
      `InitializeOptionsGeneral()` was overwriting many options that were already configured by the OPTIONS file, potentially with the flags' default values. This PR changes that function to only overwrite options in limited scenarios, as described at the top of its definition. The block cache setting is still a violation.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9862
      
      Test Plan: ran under various scenarios (multi-DB, single DB, OPTIONS file, flags) and verified options are set as expected
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D35736960
      
      Pulled By: ajkr
      
      fbshipit-source-id: 75b77740af37e6f5741618f8a8f5685df2417d03
      690f1edf
    • Misc CI improvements / additions (#9859) · 1601433b
      Committed by Peter Dillinger
      Summary:
      * Add valgrind test to nightly CircleCI (in case it can catch something that
      ASAN/UBSAN does not)
      * Add clang13+asan+ubsan+folly test to nightly CircleCI, for broader testing
      * Consolidate many copies of ASAN_OPTIONS= while also allowing it to be
      inherited from parent environment rather than always overridden.
      * Move UBSAN exclusion from Makefile into options_settable_test.cc
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9859
      
      Test Plan: CI
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D35730903
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6f5464034e8115f9a07f6f7aec1de9219ec2837c
      1601433b
    • Conditionally declare and define variable that is unused in LITE mode (#9854) · e83c5543
      Committed by Hui Xiao
      Summary:
      Context:
      As mentioned in https://github.com/facebook/rocksdb/issues/9701, running `LITE=1 make static_lib` on v7.0.2 produces the following:
      ```
        CC       file/sequence_file_reader.o
        CC       file/sst_file_manager_impl.o
        CC       file/writable_file_writer.o
      In file included from file/writable_file_writer.cc:10:
      ./file/writable_file_writer.h:163:15: error: private field 'temperature_' is not used [-Werror,-Wunused-private-field]
        Temperature temperature_;
                    ^
      1 error generated.
      make: *** [file/writable_file_writer.o] Error 1
      ```
      
      The fix is as described in the title.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9854
      
      Test Plan:
      - Local `LITE=1 make static_lib` reveals the same error and error is gone after this fix
      - CI
      
      Reviewed By: ajkr, jay-zhuang
      
      Differential Revision: D35706585
      
      Pulled By: hx235
      
      fbshipit-source-id: 7743310298231ad6866304ffa2225c8abdc91d9a
      e83c5543
    • Add "no compression" job to CircleCI (#9850) · 41237dd3
      Committed by Peter Dillinger
      Summary:
      Since they operate at distinct abstraction layers, I thought it
      was prudent to combine this with the EncryptedEnv CI test for each PR, for
      efficiency in testing. Also added the supported compressions to the sst_dump --help
      output so that the CI job can verify no compiled-in compression support.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9850
      
      Test Plan: CI, some manual stuff
      
      Reviewed By: riversand963
      
      Differential Revision: D35682346
      
      Pulled By: pdillinger
      
      fbshipit-source-id: be9879c1533fed304ee32c89fd9ba4b07c2b90cc
      41237dd3
    • Update main version.h to NEXT release (7.3) (#9852) · 3d473235
      Committed by Jay Zhuang
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9852
      
      Reviewed By: ajkr
      
      Differential Revision: D35694753
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 729d416afc588e5db2367e899589bbb5419820d6
      3d473235
  8. 17 April 2022, 1 commit
  9. 16 April 2022, 7 commits
    • Add Aggregation Merge Operator (#9780) · 4f9c0fd0
      Committed by sdong
      Summary:
      Add a merge operator that allows users to register a specific aggregation function so that they can do per-key aggregation using different aggregation types.
      See comments of function CreateAggMergeOperator() for actual usage.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9780
      
      Test Plan: Add a unit test to coverage various cases.
      
      Reviewed By: ltamasi
      
      Differential Revision: D35267444
      
      fbshipit-source-id: 5b02f31c4f3e17e96dd4025cdc49fca8c2868628
      4f9c0fd0
    • Propagate errors from UpdateBoundaries (#9851) · db536ee0
      Committed by Levi Tamasi
      Summary:
      In `FileMetaData`, we keep track of the lowest-numbered blob file
      referenced by the SST file in question for the purposes of BlobDB's
      garbage collection in the `oldest_blob_file_number` field, which is
      updated in `UpdateBoundaries`. However, with the current code,
      `BlobIndex` decoding errors (or invalid blob file numbers) are swallowed
      in this method. The patch changes this by propagating these errors
      and failing the corresponding flush/compaction. (Note that since blob
      references are generated by the BlobDB code and also parsed by
      `CompactionIterator`, in reality this can only happen in the case of
      memory corruption.)
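
      The pattern of the fix can be sketched as follows. The types and names
      here are simplified stand-ins for RocksDB's `Status`/`BlobIndex`, not the
      actual implementation: the boundary update returns a status instead of
      swallowing a decode failure.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <string>

      // Simplified stand-in for rocksdb::Status.
      struct Status {
        bool ok;
        std::string msg;
        static Status OK() { return {true, ""}; }
        static Status Corruption(std::string m) { return {false, std::move(m)}; }
      };

      constexpr uint64_t kInvalidBlobFileNumber = 0;

      Status UpdateOldestBlobFile(uint64_t decoded_blob_file_number,
                                  uint64_t& oldest_blob_file_number) {
        if (decoded_blob_file_number == kInvalidBlobFileNumber) {
          // Previously a case like this was silently ignored; now it fails
          // the flush/compaction that encountered the corrupt blob reference.
          return Status::Corruption("invalid blob file number in blob index");
        }
        if (oldest_blob_file_number == kInvalidBlobFileNumber ||
            decoded_blob_file_number < oldest_blob_file_number) {
          oldest_blob_file_number = decoded_blob_file_number;
        }
        return Status::OK();
      }
      ```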
      
      This change necessitated updating some unit tests that involved
      fake/corrupt `BlobIndex` objects. Some of these just used a dummy string like
      `"blob_index"` as a placeholder; these were replaced with real `BlobIndex`es.
      Some were relying on the earlier behavior to simulate corruption; these
      were replaced with `SyncPoint`-based test code that corrupts a valid
      blob reference at read time.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9851
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D35683671
      
      Pulled By: ltamasi
      
      fbshipit-source-id: f7387af9945c48e4d5c4cd864f1ba425c7ad51f6
      db536ee0
    • Add a `fail_if_not_bottommost_level` to IngestExternalFileOptions (#9849) · be81609b
      Committed by Yanqin Jin
      Summary:
      This new option allows the application to specify that files must be
      ingested to the bottommost level; otherwise the ingestion fails instead
      of silently ingesting to a non-bottommost level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9849
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D35680307
      
      Pulled By: riversand963
      
      fbshipit-source-id: 01cf54ef6c76198f7654dc06b5544631dea1be1e
      be81609b
    • Make initial auto readahead_size configurable (#9836) · 0c7f455f
      Committed by Akanksha Mahajan
      Summary:
      Make initial auto readahead_size configurable
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9836
      
      Test Plan:
      Added new unit test
      Ran regression:
      Without change:
      
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.0
      Date:       Thu Mar 17 13:11:34 2022
      CPU:        24 * Intel Core Processor (Broadwell)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
      ```
      
      With this change:
      ```
       ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Set seed to 1649895440554504 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.2
      Date:       Wed Apr 13 17:17:20 2022
      CPU:        24 * Intel Core Processor (Broadwell)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      ... finished 100 ops
      seekrandom   :  476892.488 micros/op 2 ops/sec;  344.6 MB/s (252 of 252 found)
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D35632815
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: c8057a88f9294c9d03b1d434b03affe02f74d796
      0c7f455f
    • Upgrade development environment. (#9843) · d5dfa8c6
      Committed by sdong
      Summary:
      This upgrades the development environment to support Meta's internal platform010. GCC still doesn't work, but USE_CLANG=1 should.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9843
      
      Test Plan: Run `make`, and `ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_CLANG=1 make`
      
      Reviewed By: pdillinger
      
      Differential Revision: D35652507
      
      fbshipit-source-id: a4a14b2fa4a2d6ca6fbf1b65060e81c39f079363
      d5dfa8c6
    • Remove flaky servicelab metrics DBPut P95/P99 (#9844) · e91ec64c
      Committed by Jay Zhuang
      Summary:
      The P95 and P99 metrics are flaky, similar to the DBGet ones that were
      removed in https://github.com/facebook/rocksdb/issues/9742 .
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9844
      
      Test Plan: `$ ./buckifier/buckify_rocksdb.py`
      
      Reviewed By: ajkr
      
      Differential Revision: D35655531
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: c1409f0fba4e23d461a65f988c27ac5e2ae85d13
      e91ec64c
    • Add option --decode_blob_index to dump_live_files command (#9842) · 082eb042
      Committed by yuzhangyu
      Summary:
      This change only adds --decode_blob_index support to the dump_live_files command, which is part of a task to add blob support to a few commands.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9842
      
      Reviewed By: ltamasi
      
      Differential Revision: D35650167
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: a78151b98bc38ac6f52c6e01ca6927a3429ddd14
      082eb042
  10. 15 April 2022, 4 commits
    • Add checks to GetUpdatesSince (#9459) · fe63899d
      Committed by Yanqin Jin
      Summary:
      Make `DB::GetUpdatesSince` return early if told to scan WALs generated by transactions
      with write-prepared or write-unprepared policies (`seq_per_batch` is true), as indicated by
      the API comment.
      
      Also add checks to `TransactionLogIterator` to clarify some conditions.
      
      No API change.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9459
      
      Test Plan:
      make check
      
      Closing https://github.com/facebook/rocksdb/issues/1565
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D33821243
      
      Pulled By: riversand963
      
      fbshipit-source-id: c8b155d020ce0980e2d3b3b1da40b96e65b48d79
      fe63899d
    • CompactionIterator sees consistent view of which keys are committed (#9830) · 0bd4dcde
      Committed by Yanqin Jin
      Summary:
      **This PR does not affect the functionality of `DB` and write-committed transactions.**
      
      `CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed.
      As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if
      it is committed. In fact, the implementation of `KeyCommitted()` is as follows:
      
      ```
      inline bool KeyCommitted(SequenceNumber seq) {
        // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
        return snapshot_checker_ == nullptr ||
               snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot;
      }
      ```
      
      With that being said, we focus on write-prepared/write-unprepared transactions.
      
      A few notes:
      - A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database.
      - `CompactionIterator` outputs a key as long as the key is uncommitted.
      
      Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without
      doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, it may have
      become committed, because `snapshot_checker_->CheckInSnapshot(seq, kMaxSequence)` in `KeyCommitted()` now returns
      `kInSnapshot`. `CompactionIterator` will then try to zero the key's sequence number and hit an assertion error if the key is a tombstone.
      
      To fix this issue, we should make `CompactionIterator` see a consistent view of the input keys. Note that
      for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before they start
      processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot
      to determine whether a key is committed with a minor change to `KeyCommitted()`:
      
      ```
      inline bool KeyCommitted(SequenceNumber sequence) {
        // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
        return snapshot_checker_ == nullptr ||
               snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) ==
                   SnapshotCheckerResult::kInSnapshot;
      }
      ```
      
      As a result, whether a key is committed remains constant throughout the compaction, causing no trouble
      for `CompactionIterator`'s assertions.
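      A toy sketch of why the snapshot bound makes the answer stable. All names here are
      hypothetical simplifications (the real `SnapshotChecker` maps a key's sequence number
      to commit state internally; this mock takes the commit sequence directly): a key
      committed *after* the job snapshot was taken is treated as uncommitted for the whole
      job, even if the commit lands mid-compaction.

      ```cpp
      #include <cstdint>

      using SequenceNumber = uint64_t;
      constexpr SequenceNumber kMaxSeq = UINT64_MAX;

      // Illustrative stand-in for snapshot_checker_->CheckInSnapshot():
      // a key is "in snapshot" if it was committed (commit_seq != 0)
      // no later than the snapshot bound.
      bool CheckInSnapshot(SequenceNumber commit_seq, SequenceNumber snapshot) {
        return commit_seq != 0 && commit_seq <= snapshot;
      }
      ```

      With a key committed at sequence 150 while the job runs: bounding by `kMaxSeq` flips
      the answer to "committed" mid-job, whereas bounding by a job snapshot taken at 100
      keeps it "uncommitted" for the entire compaction.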
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D35561162
      
      Pulled By: riversand963
      
      fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f
    • J
      Fix minimum libzstd version that supports ZSTD_STREAMING (#9841) · 844a3510
      Committed by Jonathan Albrecht
      Summary:
      The minimum libzstd version that provides `ZSTD_compressStream2` is
      1.4.0, so define `ZSTD_STREAMING` only in that case.
      
      Fixes building on Ubuntu 18.04, which ships libzstd 1.3.3 as its
      repository version.
      
      Fixes https://github.com/facebook/rocksdb/issues/9795
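      The version gate rests on how libzstd encodes its version: `ZSTD_VERSION_NUMBER` is
      `MAJOR*100*100 + MINOR*100 + RELEASE`, so 1.4.0 (the first release with
      `ZSTD_compressStream2`) is 10400 and Ubuntu 18.04's 1.3.3 is 10303. A real build would
      test the macro from `<zstd.h>`; this sketch just reproduces the arithmetic (the helper
      function names are illustrative).

      ```cpp
      // Reproduces libzstd's ZSTD_VERSION_NUMBER encoding:
      // MAJOR*100*100 + MINOR*100 + RELEASE.
      constexpr int ZstdVersionNumber(int major, int minor, int release) {
        return major * 100 * 100 + minor * 100 + release;
      }

      // ZSTD_compressStream2 first appeared in libzstd 1.4.0 (10400),
      // so streaming support should only be enabled at or above it.
      constexpr bool SupportsZstdStreaming(int version_number) {
        return version_number >= 10400;
      }
      ```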
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9841
      
      Test Plan:
      Build and test on Ubuntu 18.04 with:
        apt-get install libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev \
          libzstd-dev libgflags-dev g++ make curl
      
      Reviewed By: ajkr
      
      Differential Revision: D35648738
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 2a9e969bcc17a7dc10172f3817283409de885811
    • A
      Expose `CacheEntryRole` and map keys for block cache stat collections (#9838) · d6e016be
      Committed by Andrew Kryczka
      Summary:
      This gives users the ability to examine the map populated by `GetMapProperty()` with property `kBlockCacheEntryStats`. It also sets us up for a possible future where cache reservations are configured according to `CacheEntryRole`s rather than flags coupled to roles.
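      A toy illustration of the role-to-map-key idea. This is a hedged mock, not RocksDB's
      API: the enum, function, and string values below are illustrative stand-ins for how an
      exposed `CacheEntryRole` might map to the string keys users would look up in a
      `GetMapProperty()` result.

      ```cpp
      #include <string>

      // Illustrative stand-in for an exposed cache-entry-role enum.
      enum class CacheEntryRoleSketch { kDataBlock, kFilterBlock, kIndexBlock, kMisc };

      // Maps each role to the (hypothetical) string key a user would
      // look up in the stats map returned by a map-property query.
      std::string RoleToMapKey(CacheEntryRoleSketch role) {
        switch (role) {
          case CacheEntryRoleSketch::kDataBlock:   return "DataBlock";
          case CacheEntryRoleSketch::kFilterBlock: return "FilterBlock";
          case CacheEntryRoleSketch::kIndexBlock:  return "IndexBlock";
          default:                                 return "Misc";
        }
      }
      ```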
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9838
      
      Test Plan:
      - Migrated DBBlockCacheTest.CacheEntryRoleStats to use this API; that test verifies some of the contents are as expected.
      - Added a DBPropertiesTest to verify that the public map keys are present, and nothing else.
      
      Reviewed By: hx235
      
      Differential Revision: D35629493
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5c4356b8560e85d1f881fd32c44c15960b02fc68
  11. 14 Apr 2022, 5 commits