1. 02 Mar 2022, 2 commits
  2. 01 Mar 2022, 3 commits
    • Improve build detect for RISCV (#9366) · 7d7e88c7
      Adam Retter authored
      Summary:
      Related to: https://github.com/facebook/rocksdb/pull/9215
      
      * Adds build_detect_platform support for RISCV on Linux (at least on SiFive Unmatched platforms)
      
      This still leaves some linking issues on RISCV remaining (e.g. when building `db_test`):
      ```
      /usr/bin/ld: ./librocksdb_debug.a(memtable.o): in function `__gnu_cxx::new_allocator<char>::deallocate(char*, unsigned long)':
      /usr/include/c++/10/ext/new_allocator.h:133: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: ./librocksdb_debug.a(memtable.o): in function `std::__atomic_base<bool>::compare_exchange_weak(bool&, bool, std::memory_order, std::memory_order)':
      /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: /usr/include/c++/10/bits/atomic_base.h:464: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: ./librocksdb_debug.a(memtable.o):/usr/include/c++/10/bits/atomic_base.h:464: more undefined references to `__atomic_compare_exchange_1' follow
      /usr/bin/ld: ./librocksdb_debug.a(db_impl.o): in function `rocksdb::DBImpl::NewIteratorImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyData*, unsigned long, rocksdb::ReadCallback*, bool, bool)':
      /home/adamretter/rocksdb/db/db_impl/db_impl.cc:3019: undefined reference to `__atomic_exchange_1'
      /usr/bin/ld: ./librocksdb_debug.a(write_thread.o): in function `rocksdb::WriteThread::Writer::CreateMutex()':
      /home/adamretter/rocksdb/./db/write_thread.h:205: undefined reference to `__atomic_compare_exchange_1'
      /usr/bin/ld: ./librocksdb_debug.a(write_thread.o): in function `rocksdb::WriteThread::SetState(rocksdb::WriteThread::Writer*, unsigned char)':
      /home/adamretter/rocksdb/db/write_thread.cc:222: undefined reference to `__atomic_compare_exchange_1'
      collect2: error: ld returned 1 exit status
      make: *** [Makefile:1449: db_test] Error 1
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9366
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D34377664
      
      Pulled By: mrambacher
      
      fbshipit-source-id: c86f9d0cd1cb0c18de72b06f1bf5847f23f51118
    • Handle failures in block-based table size/offset approximation (#9615) · 0a89cea5
      Andrew Kryczka authored
      Summary:
      In crash test with fault injection, we were seeing stack traces like the following:
      
      ```
      https://github.com/facebook/rocksdb/issues/3 0x00007f75f763c533 in __GI___assert_fail (assertion=assertion@entry=0x1c5b2a0 "end_offset >= start_offset", file=file@entry=0x1c580a0 "table/block_based/block_based_table_reader.cc", line=line@entry=3245,
      function=function@entry=0x1c60e60 "virtual uint64_t rocksdb::BlockBasedTable::ApproximateSize(const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::TableReaderCaller)") at assert.c:101
      https://github.com/facebook/rocksdb/issues/4 0x00000000010ea9b4 in rocksdb::BlockBasedTable::ApproximateSize (this=<optimized out>, start=..., end=..., caller=<optimized out>) at table/block_based/block_based_table_reader.cc:3224
      https://github.com/facebook/rocksdb/issues/5 0x0000000000be61fb in rocksdb::TableCache::ApproximateSize (this=0x60f0000161b0, start=..., end=..., fd=..., caller=caller@entry=rocksdb::kCompaction, internal_comparator=..., prefix_extractor=...) at db/table_cache.cc:719
      https://github.com/facebook/rocksdb/issues/6 0x0000000000c3eaec in rocksdb::VersionSet::ApproximateSize (this=<optimized out>, v=<optimized out>, f=..., start=..., end=..., caller=<optimized out>) at ./db/version_set.h:850
      https://github.com/facebook/rocksdb/issues/7 0x0000000000c6ebc3 in rocksdb::VersionSet::ApproximateSize (this=<optimized out>, options=..., v=v@entry=0x621000047500, start=..., end=..., start_level=start_level@entry=0, end_level=<optimized out>, caller=<optimized out>)
      at db/version_set.cc:5657
      https://github.com/facebook/rocksdb/issues/8 0x000000000166e894 in rocksdb::CompactionJob::GenSubcompactionBoundaries (this=<optimized out>) at ./include/rocksdb/options.h:1869
      https://github.com/facebook/rocksdb/issues/9 0x000000000168c526 in rocksdb::CompactionJob::Prepare (this=this@entry=0x7f75f3ffcf00) at db/compaction/compaction_job.cc:546
      ```
      
      The problem occurred in `ApproximateSize()` when the index `Seek()` for the first `ApproximateDataOffsetOf()` encountered an I/O error, while the second `Seek()` did not. In the old code that scenario caused `start_offset == data_size` , thus it was easy to trip the assertion that `end_offset >= start_offset`.
      
      The fix is to set `start_offset == 0` when the first index `Seek()` fails, and `end_offset == data_size` when the second index `Seek()` fails. I doubt these give an "on average correct" answer for how this function is used, but I/O errors in index seeks are hopefully rare; this behavior is consistent with what was already there and easy to calculate.
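      The fallback logic can be sketched as a standalone function (`ApproximateSizeWithFallback` is a made-up name, not RocksDB's code), modeling a failed index `Seek()` as an empty optional:

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <optional>

      // Illustrative sketch of the described fix: if the first index seek
      // fails, fall back to offset 0; if the second fails, fall back to the
      // total data size.
      uint64_t ApproximateSizeWithFallback(std::optional<uint64_t> start_seek,
                                           std::optional<uint64_t> end_seek,
                                           uint64_t data_size) {
        // First seek failed -> assume the range starts at the beginning.
        uint64_t start_offset = start_seek.value_or(0);
        // Second seek failed -> assume the range extends to the end of the data.
        uint64_t end_offset = end_seek.value_or(data_size);
        assert(end_offset >= start_offset);  // the invariant that was tripping
        return end_offset - start_offset;
      }
      ```

      With both seeks failing, the whole data size is returned, so the `end_offset >= start_offset` invariant always holds.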
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9615
      
      Test Plan:
      run the repro command for a while and stopped seeing coredumps -
      
      ```
      $ while !  ./db_stress --block_size=128 --cache_size=32768 --clear_column_family_one_in=0 --column_families=1 --continuous_verification_interval=0 --db=/dev/shm/rocksdb_crashtest --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --index_type=2 --iterpercent=10  --kill_random_test=18887 --max_key=1000000 --max_bytes_for_level_base=2048576 --nooverwritepercent=1 --open_files=-1 --open_read_fault_one_in=32 --ops_per_thread=1000000 --prefixpercent=5 --read_fault_one_in=0 --readpercent=45 --reopen=0 --skip_verifydb=1 --subcompactions=2 --target_file_size_base=524288 --test_batches_snapshots=0 --value_size_mult=32 --write_buffer_size=524288 --writepercent=35  ; do : ; done
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D34383069
      
      Pulled By: ajkr
      
      fbshipit-source-id: fac26c3b20ea962e75387515ba5f2724dc48719f
    • Fix trivial Javadoc omissions (#9534) · ddb7620a
      stefan-zobel authored
      Summary:
      - fix spelling of `valueSizeSofLimit` and add "param" description in ReadOptions
      - add 3 missing "return" in RocksDB
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9534
      
      Reviewed By: riversand963
      
      Differential Revision: D34131186
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 7eb7ec177906052837180b291d67fb1c838c49e1
  3. 28 Feb 2022, 1 commit
    • Dedicate cacheline for DB mutex (#9637) · 9983eecd
      Andrew Kryczka authored
      Summary:
      We found a case of cacheline bouncing due to writers locking/unlocking `mutex_` and readers accessing `block_cache_tracer_`. We discovered it only after the issue was fixed by https://github.com/facebook/rocksdb/issues/9462 shifting the `DBImpl` members such that `mutex_` and `block_cache_tracer_` were naturally placed in separate cachelines in our regression testing setup. This PR forces the cacheline alignment of `mutex_` so we don't accidentally reintroduce the problem.
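      The alignment technique can be illustrated with a generic sketch (not RocksDB's actual `DBImpl` layout; 64 bytes is an assumed cacheline size):

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <mutex>

      // Hypothetical struct showing cacheline alignment of a mutex. Aligning
      // the following member as well guarantees the mutex shares its line
      // with nothing else in the struct.
      constexpr std::size_t kCacheLineSize = 64;  // assumption; typical on x86-64

      struct DBImplSketch {
        alignas(kCacheLineSize) std::mutex mutex_;  // gets a cacheline to itself
        alignas(kCacheLineSize) const void* block_cache_tracer_ = nullptr;
      };

      static_assert(alignof(DBImplSketch) == kCacheLineSize,
                    "mutex_ starts on a cacheline boundary");
      ```

      Writers locking/unlocking `mutex_` then invalidate only its own line, not the one holding the read-mostly neighbor.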
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9637
      
      Reviewed By: riversand963
      
      Differential Revision: D34502233
      
      Pulled By: ajkr
      
      fbshipit-source-id: 46aa313b7fe83e80c3de254e332b6fb242434c07
  4. 26 Feb 2022, 2 commits
  5. 24 Feb 2022, 2 commits
    • Streaming Compression API for WAL compression. (#9619) · 21345d28
      Siddhartha Roychowdhury authored
      Summary:
      Implement a streaming compression API (compress/uncompress) to use for WAL compression. The log_writer would use the compress class/API to compress a record before writing it out in chunks. The log_reader would use the uncompress class/API to uncompress the chunks and combine them into a single record.
      
      Added unit test to verify the API for different sizes/compression types.
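      The chunked write/read flow can be sketched with a stand-in codec; the class names below are illustrative only, and an identity transform stands in for a real streaming compressor such as LZ4 or ZSTD:

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <string>

      // Toy "streaming codec": the writer compresses a record chunk by chunk,
      // the reader uncompresses the chunks and concatenates them back into one
      // record. Not RocksDB's actual classes.
      class StreamingCompress {
       public:
        // Returns the "compressed" form of one input chunk (identity here).
        std::string Compress(const std::string& chunk) { return chunk; }
      };

      class StreamingUncompress {
       public:
        // Feeds one compressed chunk; appends the uncompressed bytes to record.
        void Uncompress(const std::string& chunk, std::string* record) {
          record->append(chunk);
        }
      };

      // Round-trips a record through chunked compress/uncompress.
      std::string RoundTrip(const std::string& record, std::size_t chunk_size) {
        StreamingCompress compress;
        StreamingUncompress uncompress;
        std::string out;
        for (std::size_t i = 0; i < record.size(); i += chunk_size) {
          std::string chunk = record.substr(i, chunk_size);
          uncompress.Uncompress(compress.Compress(chunk), &out);
        }
        return out;
      }
      ```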
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9619
      
      Test Plan: make -j24 check
      
      Reviewed By: anand1976
      
      Differential Revision: D34437346
      
      Pulled By: sidroyc
      
      fbshipit-source-id: b180569ad2ddcf3106380f8758b556cc0ad18382
    • Add a secondary cache implementation based on LRUCache 1 (#9518) · f706a9c1
      Bo Wang authored
      Summary:
      RocksDB uses a block cache to reduce IO and make queries more efficient. The block cache is based on the LRU algorithm (LRUCache) and keeps objects containing uncompressed data, such as Block, ParsedFullFilterBlock etc. It allows the user to configure a second level cache (rocksdb::SecondaryCache) to extend the primary block cache by holding items evicted from it. Some of the major RocksDB users, like MyRocks, use direct IO and would like to use a primary block cache for uncompressed data and a secondary cache for compressed data. The latter allows us to mitigate the loss of the Linux page cache due to direct IO.
      
      This PR includes a concrete implementation of rocksdb::SecondaryCache that integrates with compression libraries such as LZ4 and implements an LRU cache to hold compressed blocks.
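      The two-tier flow can be modeled with a toy sketch (purely illustrative, not RocksDB's `SecondaryCache` interface; compression is stubbed out): entries evicted from the primary tier drop into the secondary tier, and a primary miss consults the secondary tier and promotes the entry back.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <optional>
      #include <string>
      #include <unordered_map>

      class TwoTierCache {
       public:
        explicit TwoTierCache(std::size_t primary_capacity)
            : primary_capacity_(primary_capacity) {}

        void Insert(const std::string& key, const std::string& value) {
          if (primary_.size() >= primary_capacity_ && !primary_.count(key)) {
            // Evict an arbitrary primary entry into the secondary tier (a real
            // implementation would evict the LRU entry and compress it).
            auto victim = primary_.begin();
            secondary_[victim->first] = victim->second;
            primary_.erase(victim);
          }
          primary_[key] = value;
        }

        std::optional<std::string> Lookup(const std::string& key) {
          if (auto it = primary_.find(key); it != primary_.end()) return it->second;
          if (auto it = secondary_.find(key); it != secondary_.end()) {
            std::string value = it->second;  // would uncompress here
            secondary_.erase(it);
            Insert(key, value);  // promote back into primary
            return value;
          }
          return std::nullopt;
        }

       private:
        std::size_t primary_capacity_;
        std::unordered_map<std::string, std::string> primary_;
        std::unordered_map<std::string, std::string> secondary_;
      };
      ```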
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9518
      
      Test Plan:
      In this PR, lru_secondary_cache_test.cc includes the following tests:
      1. Unit tests for the secondary cache, with and without compression, including basic tests and failure tests.
      2. Integration tests with both the primary cache and this secondary cache.
      
      **Follow Up:**
      
      1. Statistics (e.g. compression ratio) will be added in another PR.
      2. Once this implementation is ready, I will do some shadow testing and benchmarking with UDB to measure the impact.
      
      Reviewed By: anand1976
      
      Differential Revision: D34430930
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 218d78b672a2f914856d8a90ff32f2f5b5043ded
  6. 23 Feb 2022, 5 commits
    • Support WBWI for keys having timestamps (#9603) · 6f125998
      Yanqin Jin authored
      Summary:
      This PR supports inserting keys to a `WriteBatchWithIndex` for column families that enable user-defined timestamps
      and reading the keys back. **The index does not have timestamps.**
      
      Writing a key to WBWI is unchanged, because the underlying WriteBatch already supports it.
      When reading the keys back, we need to make sure to distinguish between keys with and without timestamps before
      comparison.
      
      When user calls `GetFromBatchAndDB()`, no timestamp is needed to query the batch, but a timestamp has to be
      provided to query the db. The assumption is that data in the batch must be newer than data from the db.
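      The lookup order can be modeled with a toy sketch (hypothetical names, not the WBWI API; the DB side is reduced to a map keyed by `key@timestamp`): the batch, which is keyed without a timestamp, is consulted first and wins because its data is assumed newer.

      ```cpp
      #include <cassert>
      #include <map>
      #include <optional>
      #include <string>

      std::optional<std::string> GetFromBatchAndDbSketch(
          const std::map<std::string, std::string>& batch,  // no timestamps
          const std::map<std::string, std::string>& db,     // key@ts -> value
          const std::string& key, const std::string& read_ts) {
        // The batch needs no timestamp and takes precedence over the DB.
        if (auto it = batch.find(key); it != batch.end()) return it->second;
        // The DB lookup needs the timestamp to select a version.
        if (auto it = db.find(key + "@" + read_ts); it != db.end()) return it->second;
        return std::nullopt;
      }
      ```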
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9603
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D34354849
      
      Pulled By: riversand963
      
      fbshipit-source-id: d25d1f84e2240ce543e521fa30595082fb8db9a0
    • Fix test race conditions with OnFlushCompleted() (#9617) · 8ca433f9
      Andrew Kryczka authored
      Summary:
      We often see flaky tests due to `DB::Flush()` or `DBImpl::TEST_WaitForFlushMemTable()` not waiting until event listeners complete. For example, https://github.com/facebook/rocksdb/issues/9084, https://github.com/facebook/rocksdb/issues/9400, https://github.com/facebook/rocksdb/issues/9528, plus two new ones this week: "EventListenerTest.OnSingleDBFlushTest" and "DBFlushTest.FireOnFlushCompletedAfterCommittedResult". I ran a `make check` with the below race condition-coercing patch and fixed  issues it found besides old BlobDB.
      
      ```
       diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
      index 0e1864788..aaba68c4a 100644
       --- a/db/db_impl/db_impl_compaction_flush.cc
      +++ b/db/db_impl/db_impl_compaction_flush.cc
      @@ -861,6 +861,8 @@ void DBImpl::NotifyOnFlushCompleted(
              mutable_cf_options.level0_stop_writes_trigger);
         // release lock while notifying events
         mutex_.Unlock();
      +  bg_cv_.SignalAll();
      +  sleep(1);
         {
           for (auto& info : *flush_jobs_info) {
             info->triggered_writes_slowdown = triggered_writes_slowdown;
      ```
      
      The reason I did not fix old BlobDB issues is because it appears to have a fundamental (non-test) issue. In particular, it uses an EventListener to keep track of the files. OnFlushCompleted() could be delayed until even after a compaction involving that flushed file completes, causing the compaction to unexpectedly delete an untracked file.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9617
      
      Test Plan: `make check` including the race condition coercing patch
      
      Reviewed By: hx235
      
      Differential Revision: D34384022
      
      Pulled By: ajkr
      
      fbshipit-source-id: 2652ded39b415277c5d6a628414345223930514e
    • Enable core dumps in TSAN/UBSAN crash tests (#9616) · 96978e4d
      Andrew Kryczka authored
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9616
      
      Reviewed By: hx235
      
      Differential Revision: D34383489
      
      Pulled By: ajkr
      
      fbshipit-source-id: e4299000ef38073ec57e6ab5836150fdf8ce43d4
    • Combine data members of IOStatus with Status (#9549) · d795a730
      anand76 authored
      Summary:
      Combine the data members retryable_, data_loss_ and scope_ of IOStatus
      with Status, as protected members. IOStatus is now defined as a derived class of Status with
      no new data, but additional methods. This will allow us to eventually
      track the result of FileSystem calls in RocksDB with one variable
      instead of two.
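      The resulting layout can be sketched as follows (a simplified model, not the real `Status`/`IOStatus` definitions):

      ```cpp
      #include <cassert>

      // Base class carries all the data, including the fields that used to
      // live in IOStatus, as protected members.
      class Status {
       public:
        bool ok() const { return code_ == 0; }

       protected:
        unsigned char code_ = 0;
        bool retryable_ = false;
        bool data_loss_ = false;
        unsigned char scope_ = 0;
      };

      // Derived class adds methods but no new data members.
      class IOStatus : public Status {
       public:
        bool IsRetryable() const { return retryable_; }
      };

      static_assert(sizeof(IOStatus) == sizeof(Status),
                    "IOStatus adds no data of its own");
      ```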
      
      Benchmark commands and results are below. The performance after changes seems slightly better.
      
      ```./db_bench -db=/data/mysql/rocksdb/prefix_scan -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216```
      
      ```./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,seekrandom,readseq" -key_size=32 -value_size=512 -num=5000000 -seek_nexts=10000 -use_direct_reads=true -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=false -threads=1 -cache_size=10485760000```
      
      Before -
      seekrandom   :    3715.432 micros/op 269 ops/sec; 1394.9 MB/s (16149 of 16149 found)
      seekrandom   :    3687.177 micros/op 271 ops/sec; 1405.6 MB/s (16273 of 16273 found)
      seekrandom   :    3709.646 micros/op 269 ops/sec; 1397.1 MB/s (16175 of 16175 found)
      
      readseq      :       0.369 micros/op 2711321 ops/sec; 1406.6 MB/s
      readseq      :       0.363 micros/op 2754092 ops/sec; 1428.8 MB/s
      readseq      :       0.372 micros/op 2688046 ops/sec; 1394.6 MB/s
      
      After -
      seekrandom   :    3606.830 micros/op 277 ops/sec; 1436.9 MB/s (16636 of 16636 found)
      seekrandom   :    3594.467 micros/op 278 ops/sec; 1441.9 MB/s (16693 of 16693 found)
      seekrandom   :    3597.919 micros/op 277 ops/sec; 1440.5 MB/s (16677 of 16677 found)
      
      readseq      :       0.354 micros/op 2822809 ops/sec; 1464.5 MB/s
      readseq      :       0.358 micros/op 2795080 ops/sec; 1450.1 MB/s
      readseq      :       0.354 micros/op 2822889 ops/sec; 1464.5 MB/s
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9549
      
      Reviewed By: pdillinger
      
      Differential Revision: D34310362
      
      Pulled By: anand1976
      
      fbshipit-source-id: 54b27756edf9c9ecfe730a2dce542a7a46743096
    • configure microbenchmarks, regenerate targets (#9599) · ba65cfff
      Patrick Somaru authored
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9599
      
      Reviewed By: jay-zhuang, hodgesds
      
      Differential Revision: D34214408
      
      fbshipit-source-id: 6932200772f52ce77e550646ee3d1a928295844a
  7. 22 Feb 2022, 1 commit
    • Fix DBTest2.BackupFileTemperature memory leak (#9610) · 3379d146
      Andrew Kryczka authored
      Summary:
      Valgrind was failing with the below error because we forgot to destroy
      the `BackupEngine` object:
      
      ```
      ==421173== Command: ./db_test2 --gtest_filter=DBTest2.BackupFileTemperature
      ==421173==
      Note: Google Test filter = DBTest2.BackupFileTemperature
      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from DBTest2
      [ RUN      ] DBTest2.BackupFileTemperature
      --421173-- WARNING: unhandled amd64-linux syscall: 425
      --421173-- You may be able to write your own handler.
      --421173-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
      --421173-- Nevertheless we consider this a bug.  Please report
      --421173-- it at http://valgrind.org/support/bug_reports.html.
      [       OK ] DBTest2.BackupFileTemperature (3366 ms)
      [----------] 1 test from DBTest2 (3371 ms total)
      
      [----------] Global test environment tear-down
      [==========] 1 test from 1 test case ran. (3413 ms total)
      [  PASSED  ] 1 test.
      ==421173==
      ==421173== HEAP SUMMARY:
      ==421173==     in use at exit: 13,042 bytes in 195 blocks
      ==421173==   total heap usage: 26,022 allocs, 25,827 frees, 27,555,265 bytes allocated
      ==421173==
      ==421173== 8 bytes in 1 blocks are possibly lost in loss record 6 of 167
      ==421173==    at 0x4838DBF: operator new(unsigned long) (vg_replace_malloc.c:344)
      ==421173==    by 0x8D4606: allocate (new_allocator.h:114)
      ==421173==    by 0x8D4606: allocate (alloc_traits.h:445)
      ==421173==    by 0x8D4606: _M_allocate (stl_vector.h:343)
      ==421173==    by 0x8D4606: reserve (vector.tcc:78)
      ==421173==    by 0x8D4606: rocksdb::BackupEngineImpl::Initialize() (backupable_db.cc:1174)
      ==421173==    by 0x8D5473: Initialize (backupable_db.cc:918)
      ==421173==    by 0x8D5473: rocksdb::BackupEngine::Open(rocksdb::BackupEngineOptions const&, rocksdb::Env*, rocksdb::BackupEngine**) (backupable_db.cc:937)
      ==421173==    by 0x50AC8F: Open (backup_engine.h:585)
      ==421173==    by 0x50AC8F: rocksdb::DBTest2_BackupFileTemperature_Test::TestBody() (db_test2.cc:6996)
      ...
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9610
      
      Test Plan:
      ```
      $ make -j24 ROCKSDBTESTS_SUBSET=db_test2 valgrind_check_some
      ```
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D34371210
      
      Pulled By: ajkr
      
      fbshipit-source-id: 68154fcb0c51b28222efa23fa4ee02df8d925a18
  8. 21 Feb 2022, 1 commit
  9. 19 Feb 2022, 9 commits
    • Change enum SizeApproximationFlags to enum class (#9604) · 3699b171
      Akanksha Mahajan authored
      Summary:
      Change enum SizeApproximationFlags to an enum class, and add
      overloaded operators for conversion between the enum class and uint8_t.
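      A minimal illustration of the pattern (the flag names here are illustrative, not necessarily RocksDB's exact enumerators):

      ```cpp
      #include <cassert>
      #include <cstdint>

      enum class SizeApproximationFlags : uint8_t {
        NONE = 0,
        INCLUDE_MEMTABLES = 1 << 0,
        INCLUDE_FILES = 1 << 1,
      };

      // enum class has no implicit integer conversions, so combining and
      // testing flag bits requires explicit overloaded operators.
      inline SizeApproximationFlags operator|(SizeApproximationFlags a,
                                              SizeApproximationFlags b) {
        return static_cast<SizeApproximationFlags>(static_cast<uint8_t>(a) |
                                                   static_cast<uint8_t>(b));
      }

      inline bool operator&(SizeApproximationFlags a, SizeApproximationFlags b) {
        return (static_cast<uint8_t>(a) & static_cast<uint8_t>(b)) != 0;
      }
      ```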
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9604
      
      Test Plan: Circle CI jobs
      
      Reviewed By: riversand963
      
      Differential Revision: D34360281
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 6351dfdb717ae3c4530d324c3d37a8ecb01dd1ef
    • Add Temperature info in `NewSequentialFile()` (#9499) · d3a2f284
      Jay Zhuang authored
      Summary:
      Add Temperature hint information from RocksDB to the
      `NewSequentialFile()` API. Backup and checkpoint operations need to open the
      source files with `NewSequentialFile()`, which will now receive the temperature
      hints. Other operations are not covered.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9499
      
      Test Plan: Added unittest
      
      Reviewed By: pdillinger
      
      Differential Revision: D34006115
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 568b34602b76520e53128672bd07e9d886786a2f
    • Add Async Read and Poll APIs in FileSystem (#9564) · 559525dc
      Akanksha Mahajan authored
      Summary:
      This PR adds support for new APIs Async Read that reads the data
      asynchronously and Poll API that checks if requested read request has
      completed or not.
      
      Usage: In RocksDB, we are currently planning to prefetch data
      asynchronously during sequential scanning, and RocksDB will call these
      APIs to prefetch more data in advance.
      
      Design:
      - ReadAsync API submits the read request to underlying FileSystem in
      order to read data asynchronously. When read request is completed,
      callback function will be called. cb_arg is used by RocksDB to track the
      original request submitted and IOHandle is used by FileSystem to keep track
      of IO requests at their level.
      
      - The Poll API  is added in FileSystem because the call could end up handling
      completions for multiple different files which is not specific to a
      FSRandomAccessFile instance. There could be multiple outstanding file reads
      from different files in future and they can complete in any order.
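      A single-threaded toy model of the submit/poll contract (all names hypothetical; a real implementation would issue the I/O in the background):

      ```cpp
      #include <cassert>
      #include <functional>
      #include <queue>
      #include <string>
      #include <utility>

      struct ReadResult {
        std::string data;
        void* cb_arg;  // lets the caller match the completion to its request
      };

      class FileSystemSketch {
       public:
        using Callback = std::function<void(const ReadResult&)>;

        // "Submits" a read; completion is deferred until Poll().
        void ReadAsync(const std::string& file, Callback cb, void* cb_arg) {
          pending_.push(Pending{file, std::move(cb), cb_arg});
        }

        // Drives completions for all outstanding reads, regardless of which
        // file they targeted, invoking each callback once.
        void Poll() {
          while (!pending_.empty()) {
            Pending p = std::move(pending_.front());
            pending_.pop();
            p.cb(ReadResult{"contents of " + p.file, p.cb_arg});
          }
        }

       private:
        struct Pending {
          std::string file;
          Callback cb;
          void* cb_arg;
        };
        std::queue<Pending> pending_;
      };
      ```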
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9564
      
      Test Plan: Test will be added in separate PR.
      
      Reviewed By: anand1976
      
      Differential Revision: D34226216
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 95e64edafb17f543f7232421d51e2665a3267f69
    • Fixes #9565 (#9586) · 67f071fa
      Bo Wang authored
      Summary:
      [Compaction::IsTrivialMove](https://github.com/facebook/rocksdb/blob/a2b9be42b6d5ac4d44bcc6a9451a825440000769/db/compaction/compaction.cc#L318) checks whether allow_trivial_move is set, and if so it returns the value of is_trivial_move_. The allow_trivial_move option is there for universal compaction. So when this is set and leveled compaction is enabled, then useful code that follows this block never gets a chance to run.
      
      A check that [compaction_style == kCompactionStyleUniversal](https://github.com/facebook/rocksdb/blob/320d9a8e8a1b6998f92934f87fc71ad8bd6d4596/db/db_impl/db_impl_compaction_flush.cc#L1030) should be added to avoid doing the wrong thing for leveled.
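      Condensed as a predicate, the corrected guard looks like this (a sketch, not the actual `Compaction::IsTrivialMove()` code):

      ```cpp
      #include <cassert>

      enum class CompactionStyle { kLevel, kUniversal };

      // The early return driven by allow_trivial_move should only apply under
      // universal compaction, so that leveled compaction still reaches the
      // trivial-move heuristics further down in IsTrivialMove().
      bool TakesUniversalEarlyReturn(CompactionStyle style,
                                     bool allow_trivial_move,
                                     int output_level) {
        return style == CompactionStyle::kUniversal && allow_trivial_move &&
               output_level != 0;
      }
      ```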
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9586
      
      Test Plan:
      To reproduce this:
      First edit db/compaction/compaction.cc with
      ```
       diff --git a/db/compaction/compaction.cc b/db/compaction/compaction.cc
      index 7ae50b91e..52dd489b1 100644
       --- a/db/compaction/compaction.cc
      +++ b/db/compaction/compaction.cc
      @@ -319,6 +319,8 @@ bool Compaction::IsTrivialMove() const {
         // input files are non overlapping
         if ((mutable_cf_options_.compaction_options_universal.allow_trivial_move) &&
             (output_level_ != 0)) {
      +    printf("IsTrivialMove:: return %d because universal allow_trivial_move\n", (int) is_trivial_move_);
      +    // abort();
           return is_trivial_move_;
         }
      ```
      
      And then run
      ```
      ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1641328309 --universal_allow_trivial_move=1
      ```
      Example output with the debug code added
      
      ```
      IsTrivialMove:: return 0 because universal allow_trivial_move
      IsTrivialMove:: return 0 because universal allow_trivial_move
      ```
      
      After this PR, the bug is fixed.
      
      Reviewed By: ajkr
      
      Differential Revision: D34350451
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 3232005cc47c40a7e75d316cfc7960beb5bdff3a
    • fix issue with buckifier update (#9602) · 736bc832
      pat somaru authored
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9602
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D34350406
      
      Pulled By: likewhatevs
      
      fbshipit-source-id: caa81f272a429fbf7293f0588ea24cc53b29ee98
    • Add last level and non-last level read statistics (#9519) · f4b2500e
      Jay Zhuang authored
      Summary:
      Add last level and non-last level read statistics:
      ```
      LAST_LEVEL_READ_BYTES,
      LAST_LEVEL_READ_COUNT,
      NON_LAST_LEVEL_READ_BYTES,
      NON_LAST_LEVEL_READ_COUNT,
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9519
      
      Test Plan: added unittest
      
      Reviewed By: siying
      
      Differential Revision: D34062539
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 908644c3050878b4234febdc72e3e19d89af38cd
    • Make FilterPolicy Customizable (#9590) · 30b08878
      mrambacher authored
      Summary:
      Make FilterPolicy into a Customizable class.  Allow new FilterPolicy to be discovered through the ObjectRegistry
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9590
      
      Reviewed By: pdillinger
      
      Differential Revision: D34327367
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 37e7edac90ec9457422b72f359ab8ef48829c190
      30b08878
    • P
      update buckifier, add support for microbenchmarks (#9598) · f066b5ce
      Patrick Somaru 提交于
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9598
      
      Reviewed By: jay-zhuang, hodgesds
      
      Differential Revision: D34130191
      
      fbshipit-source-id: e5413f7d6af70a66940022d153b64a3383eccff1
    • Add temperature information to the event listener callbacks (#9591) · 2fbc6727
      Jay Zhuang authored
      Summary:
      RocksDB tries to provide temperature information in the event
      listener callbacks. The information is not guaranteed, as some operations,
      like backup, won't have this information.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9591
      
      Test Plan: Added unittest
      
      Reviewed By: siying, pdillinger
      
      Differential Revision: D34309339
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 4aca4f270f99fa49186d85d300da42594663d6d7
  10. 18 Feb 2022, 12 commits
    • Change type of cache buffer passed to `Cache::CreateCallback()` to `const void*` (#9595) · 54fb2a89
      Andrew Kryczka authored
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9595
      
      Reviewed By: riversand963
      
      Differential Revision: D34329906
      
      Pulled By: ajkr
      
      fbshipit-source-id: 508601856fa9bee4d40f4a68d14d333ef2143d40
    • Mark more OldDefaults as deprecated (#9594) · 48b9de4a
      Peter Dillinger authored
      Summary:
      `ColumnFamilyOptions::OldDefaults` and `DBOptions::OldDefaults` are
      now deprecated. They were previously overlooked alongside `Options::OldDefaults` in https://github.com/facebook/rocksdb/issues/9363
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9594
      
      Test Plan: comments only
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D34318592
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 773c97a61e2a8290ae154f363dd61c1f35a9dd16
    • Plugin java jni support (#9575) · ce84e502
      Alan Paxton authored
      Summary:
      Extend the plugin architecture to allow for the inclusion, building and testing of Java and JNI components of a plugin. This will cause the JAR built by `$ make rocksdbjava` to include the extra functionality provided by the plugin, and will cause `$ make jtest` to add the java tests provided by the plugin to the tests built and run by Java testing.
      
      The plugin's `<plugin>.mk` file can define:
      ```
      <plugin>_JNI_NATIVE_SOURCES
      <plugin>_NATIVE_JAVA_CLASSES
      <plugin>_JAVA_TESTS
      ```
      The plugin should provide java/src, java/test and java/rocksjni directories. When a plugin is required to be built, it must be named in the ROCKSDB_PLUGINS environment variable (as per the plugin architecture). This now has the effect of adding the files specified by the above definitions to the appropriate parts of the build.
      
      An example of a plugin with a Java component can be found as part of the hdfs plugin in https://github.com/riversand963/rocksdb-hdfs-env - at the time of writing the Java part of this fails tests, and needs a little work to complete, but it builds correctly under the plugin model.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9575
      
      Reviewed By: hx235
      
      Differential Revision: D34253948
      
      Pulled By: riversand963
      
      fbshipit-source-id: b3dde5da06f3d3c25c54246892097ae2a369b42d
    • Some better API and other comments (#9533) · 561be005
      Peter Dillinger authored
      Summary:
      Various comments, mostly about SliceTransform + prefix extractors.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9533
      
      Test Plan: comments only
      
      Reviewed By: ajkr
      
      Differential Revision: D34094367
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 9742ce3b89ef7fd5c5e748fec862e6361ed44e95
    • Remove previously deprecated Java where RocksDB also removed it, or where no direct equivalent existed. (#9576) · 8d9c203f
      Alan Paxton authored
      
      Summary:
      For RocksDB v7 major release. Remove previously deprecated Java API methods and associated tests
      - where equivalent/alternative functionality exists and is already tested AND
      - where the core RocksDB function/feature has also been removed
      - OR the functionality exists only in Java so the previous deprecation only affected Java methods
      
      RETAIN deprecated Java APIs that reflect functionality which is deprecated by, but still supported by, the core of RocksDB.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9576
      
      Reviewed By: ajkr
      
      Differential Revision: D34314983
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 7cf9c17e3e07be9d289beb99f81b71e8e09ac403
      8d9c203f
    • P
      Hide FilterBits{Builder,Reader} from public API (#9592) · 725833a4
      Peter Dillinger authored
      Summary:
      We don't have any evidence of people using these to build custom
      filters. The recommended way of customizing filter handling is to
      defer to various built-in policies based on FilterBuildingContext
      (e.g. to build a Monkey filtering policy). With the old API, we have
      evidence of people modifying keys going into the filter, but most cases
      of that can be handled with prefix_extractor.
      
      Having FilterBitsBuilder+Reader in the public API is an ongoing
      hindrance to code evolution (e.g. recent new Finish and
      MaybePostVerify), so this change removes them from the public API
      for 7.0. Maybe they will come back in some form later, but lacking
      evidence of them providing value in the public API, we want to take back
      more freedom to evolve them.
      
      With this moved to internal-only, there is no rush to clean up the
      complex Finish signatures, or add memory allocator support, but doing so
      is much easier with them out of the public API, for example to use
      CacheAllocationPtr without exposing it in the public API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9592
      
      Test Plan: cosmetic changes only
      
      Reviewed By: hx235
      
      Differential Revision: D34315470
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 03e03bb66a72c73df2c464d2dbbbae906dd8f99b
      725833a4
    • A
      Fix some MultiGet batching stats (#9583) · 627deb7c
      anand76 authored
      Summary:
      The NUM_INDEX_AND_FILTER_BLOCKS_READ_PER_LEVEL, NUM_DATA_BLOCKS_READ_PER_LEVEL, and NUM_SST_READ_PER_LEVEL stats were being recorded only when the last file in a level happened to have hits. They are supposed to be updated for every level. Also, there was some overcounting of GetContextStats. This PR fixes both the problems.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9583
      
      Test Plan: Update the unit test in db_basic_test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D34308044
      
      Pulled By: anand1976
      
      fbshipit-source-id: b3b36020fda26ba91bc6e0e47d52d58f4d7f656e
      627deb7c
    • S
      Add record to set WAL compression type if enabled (#9556) · 39b0d921
      Siddhartha Roychowdhury authored
      Summary:
      When WAL compression is enabled, add a record (a new record type) storing the compression type, to indicate that all subsequent records are compressed. The log reader stores the compression type when this record is encountered and uses it to decompress the subsequent records. Compression and decompression will be implemented in subsequent diffs.
      Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled.
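      As a rough sketch of the mechanism described above (the record-type values, payload layout, and the choice of zlib are invented for illustration; the actual RocksDB log record format differs):
      ```python
      import zlib

      # Hypothetical record types -- the real RocksDB record type values differ.
      SET_COMPRESSION_TYPE = 0  # announces the algorithm for all subsequent records
      DATA = 1

      def write_log(records):
          """Emit a compression-type record first, then compressed data records."""
          log = [(SET_COMPRESSION_TYPE, b"zlib")]
          for payload in records:
              log.append((DATA, zlib.compress(payload)))
          return log

      def read_log(log):
          """Store the compression type when its record is seen; use it to
          decompress every subsequent data record."""
          compression = None
          out = []
          for rec_type, payload in log:
              if rec_type == SET_COMPRESSION_TYPE:
                  compression = payload.decode()
              elif rec_type == DATA:
                  assert compression == "zlib"  # only type in this sketch
                  out.append(zlib.decompress(payload))
          return out
      ```
      
      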
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556
      
      Reviewed By: anand1976
      
      Differential Revision: D34308216
      
      Pulled By: sidroyc
      
      fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705
      39b0d921
    • J
      Add subcompaction event API (#9311) · f092f0fa
      Jay Zhuang authored
      Summary:
      Add an event callback for subcompactions, along with a sub_job_id to identify each one.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9311
      
      Reviewed By: ajkr
      
      Differential Revision: D33892707
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 57b5e5e594d61b2112d480c18a79a36751f65a4e
      f092f0fa
    • P
      Clarify compiler support release note (#9593) · a86ee02d
      Peter Dillinger authored
      Summary:
      in HISTORY.md
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9593
      
      Test Plan: release note only
      
      Reviewed By: siying
      
      Differential Revision: D34318189
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ba2eca8bede2d42a3fefd10b954b92cb54f831f2
      a86ee02d
    • A
      Update build files for java8 build (#9541) · 36ce2e2a
      Alan Paxton authored
      Summary:
      For RocksJava 7 we will move from requiring Java 7 to Java 8.
      
      * This simplifies the `Makefile` as we no longer need to deal with Java 7, so we no longer use `javah`.
      * Added a `java-version` target, invoked by the `java` target, which exits if the version of Java in use is not 8 or greater.
      * Enforces Java 8 as a minimum.
      * Fixed CMake build.
      
      * Fixed the broken Java event listener test: the assertions in the callbacks were not causing assertions in the tests. The callbacks now queue up assertion errors for the main thread of the tests to check.
      * Fixed C++ dangling pointers in the test code.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9541
      
      Reviewed By: pdillinger
      
      Differential Revision: D34214929
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: fdff348758d0a23a742e83c87d5f54073ce16ca6
      36ce2e2a
    • A
      Support C++17 Docker build environments for RocksJava (#9500) · 5e644079
      Adam Retter authored
      Summary:
      See https://github.com/facebook/rocksdb/issues/9388#issuecomment-1029583789
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9500
      
      Reviewed By: pdillinger
      
      Differential Revision: D34114687
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 22129d99ccd0dba7e8f1b263ddc5520d939641bf
      5e644079
  11. 17 February 2022, 2 commits
    • A
      Add rate limiter priority to ReadOptions (#9424) · babe56dd
      Andrew Kryczka authored
      Summary:
      Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.
      
      `RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.
      
      There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases where I believe it would be good to let the user control the priority some day (e.g., file footer reads), and no TODO in cases where I believe it doesn't matter (e.g., trace file reads).
      
      The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.
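      A minimal sketch of how such priority-aware, chunked reads can work (the limiter and all names below are invented stand-ins for illustration, not the RocksDB implementation):
      ```python
      # Stand-ins for Env::IOPriority; IO_TOTAL means "do not rate limit".
      IO_TOTAL, IO_LOW, IO_USER = "total", "low", "user"

      class FakeRateLimiter:
          """Grants at most burst_bytes per request and records each grant."""
          def __init__(self, burst_bytes):
              self.burst_bytes = burst_bytes
              self.grants = []

          def request(self, n, priority):
              granted = min(n, self.burst_bytes)
              self.grants.append((granted, priority))
              return granted

      def rate_limited_read(data, offset, n, limiter, priority=IO_TOTAL):
          """Read n bytes starting at offset, chunked by the limiter's quota.
          With priority == IO_TOTAL the read bypasses the limiter entirely."""
          if priority == IO_TOTAL or limiter is None:
              return data[offset:offset + n]
          out = bytearray()
          while n > 0:
              granted = limiter.request(n, priority)
              out += data[offset:offset + granted]
              offset += granted
              n -= granted
          return bytes(out)
      ```
      This mirrors the chunking observed in the benchmark above: a large read is split into limiter-sized pieces, each of which waits for quota at the caller's priority.
      
      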
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424
      
      Test Plan:
      - new unit tests
      - new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
        - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
        - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
      - crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`
      
      Reviewed By: hx235
      
      Differential Revision: D33747386
      
      Pulled By: ajkr
      
      fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
      babe56dd
    • Y
      Fix a silent data loss for write-committed txn (#9571) · 1cda273d
      Yanqin Jin authored
      Summary:
      The following sequence of events can cause silent data loss for write-committed
      transactions.
      ```
      Time    thread 1                                       bg flush
       |   db->Put("a")
       |   txn = NewTxn()
       |   txn->Put("b", "v")
       |   txn->Prepare()       // writes only to 5.log
       |   db->SwitchMemtable() // memtable 1 has "a"
       |                        // close 5.log,
       |                        // creates 8.log
       |   trigger flush
       |                                                  pick memtable 1
       |                                                  unlock db mutex
       |                                                  write new sst
       |   txn->ctwb->Put("gtid", "1") // writes 8.log
       |   txn->Commit() // writes to 8.log
       |                 // writes to memtable 2
       |                                               compute min_log_number_to_keep_2pc, this
       |                                               will be 8 (incorrect).
       |
       |                                             Purge obsolete wals, including 5.log
       |
       V
      ```
      
      At this point, the writes of the txn exist only in the memtable. The db is closed without flush because
      the db thinks the data in the memtable are backed by the log. On reopen, the writes are lost except for the
      key-value pair {"gtid"->"1"}; only the commit marker of the txn is in 8.log.
      
      The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`.
      In the above example, when the bg flush thread tries to find obsolete WALs, it uses the information
      computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()`
      depends on three components
      - `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
      - `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section but the txn hasn't committed.
      - `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.
      
      The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skipped the column families
      that are being flushed, but the unit test added in this PR shows that they should not be skipped. In this unit test, there is
      only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
      We should return this information via `FindMinPrepLogReferencedByMemTable()`.
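      The interplay of the three components can be sketched as a simple minimum (a simplified stand-in: the function and parameter names below only mirror the RocksDB internals named above, this is not the actual code):
      ```python
      def precompute_min_log_number_to_keep_2pc(
              min_log_with_unflushed_data,           # PrecomputeMinLogNumberToKeepNon2PC()
              min_log_with_outstanding_prep,         # FindMinLogContainingOutstandingPrep()
              min_prep_log_referenced_by_memtable):  # FindMinPrepLogReferencedByMemTable()
          """WALs numbered below the returned value are considered safe to purge.
          A None component means 'no constraint from this source'."""
          candidates = [min_log_with_unflushed_data]
          if min_log_with_outstanding_prep is not None:
              candidates.append(min_log_with_outstanding_prep)
          if min_prep_log_referenced_by_memtable is not None:
              candidates.append(min_prep_log_referenced_by_memtable)
          return min(candidates)
      ```
      In the scenario above, skipping the flushing column family makes the third component empty, so the minimum comes out as 8 and 5.log is purged; once the flushing memtable's dependence on the prepare section in 5.log is reported, the minimum is 5 and 5.log is retained.
      
      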
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571
      
      Test Plan:
      ```
      ./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
      make check
      ```
      
      Reviewed By: siying
      
      Differential Revision: D34235236
      
      Pulled By: riversand963
      
      fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1
      1cda273d