1. 26 May 2022, 2 commits
  2. 25 May 2022, 8 commits
    • Y
      Enable checkpoint and backup in db_stress when timestamp is enabled (#10047) · 9901e7f6
      Committed by Yanqin Jin
      Summary:
      After https://github.com/facebook/rocksdb/issues/10030 and https://github.com/facebook/rocksdb/issues/10004, we can enable checkpoint and backup in stress tests when
      user-defined timestamp is enabled.
      
      This PR has no production risk.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10047
      
      Test Plan:
      ```
      TEST_TMPDIR=/dev/shm make crash_test_with_ts
      ```
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D36641565
      
      Pulled By: riversand963
      
      fbshipit-source-id: d86c9d87efcc34c32d1aa176af691d32b897644a
      9901e7f6
    • L
      Fix potential ambiguities in/around port/sys_time.h (#10045) · af7ae912
      Committed by Levi Tamasi
      Summary:
      There are some time-related POSIX APIs that are not available on Windows
      (e.g. `localtime_r`), which we have worked around by providing our own
      implementations in `port/sys_time.h`. This workaround actually relies on
      some ambiguity: on Windows, a call to `localtime_r` calls
      `ROCKSDB_NAMESPACE::port::localtime_r` (which is pulled into
      `ROCKSDB_NAMESPACE` by a using-declaration), while on other platforms
      it calls the global `localtime_r`. This works fine as long as there is only one
      candidate function; however, it breaks down when there is more than one
      `localtime_r` visible in a scope.
      
      The patch fixes this by introducing `ROCKSDB_NAMESPACE::port::{TimeVal, GetTimeOfDay, LocalTimeR}`
      to eliminate any ambiguity.
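      The wrapper approach can be sketched as follows. This is an illustrative toy, not RocksDB's actual port code; the wrapper name mirrors the one the patch introduces, but the body is a simplified stand-in.

      ```cpp
      // Illustrative sketch (not RocksDB's actual code): give the portability
      // wrapper a distinct name so it can never be ambiguous with a global
      // localtime_r pulled into scope by another header.
      #include <time.h>

      namespace port {

      // One name, one candidate function, no overload ambiguity across platforms.
      inline struct tm* LocalTimeR(const time_t* t, struct tm* result) {
      #ifdef _WIN32
        return localtime_s(result, t) == 0 ? result : nullptr;  // Windows variant
      #else
        return localtime_r(t, result);  // POSIX variant
      #endif
      }

      }  // namespace port
      ```

      Callers then use `port::LocalTimeR` explicitly, so lookup never has to choose between a namespace-local and a global `localtime_r`.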
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10045
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D36639372
      
      Pulled By: ltamasi
      
      fbshipit-source-id: fc13dbfa421b7c8918111a6d9e24ce77e91a7c50
      af7ae912
    • J
      Fix ApproximateOffsetOfCompressed test (#10048) · a96a4a2f
      Committed by Jay Zhuang
      Summary:
      https://github.com/facebook/rocksdb/issues/9857 introduced a new option, `use_zstd_dict_trainer`, which
      is stored in the SST file as text, e.g.:
      ```
      ...  zstd_max_train_bytes=0; enabled=0;...
      ```
      This increased the SST size slightly and caused the
      `ApproximateOffsetOfCompressed` test to fail:
      ```
      Value 7053 is not in range [4000, 7050]
      table/table_test.cc:4019: Failure
      Value of: Between(c.ApproximateOffsetOf("xyz"), 4000, 7050)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10048
      
      Test Plan: verified that the test passes after the change
      
      Reviewed By: cbi42
      
      Differential Revision: D36643688
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: bf12d211f6ae71937259ef21b1226bd06e8da717
      a96a4a2f
    • J
      Skip ZSTD dict tests if the version doesn't support it (#10046) · 23f34c7a
      Committed by Jay Zhuang
      Summary:
      For example, the default ZSTD version on Ubuntu 20.04 is 1.4.4, which
      fails the test `PresetCompressionDict`:
      
      ```
      db/db_test_util.cc:607: Failure
      Invalid argument: zstd finalizeDictionary cannot be used because ZSTD 1.4.5+ is not linked with the binary.
      terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10046
      
      Test Plan: test passes with an old zstd
      
      Reviewed By: cbi42
      
      Differential Revision: D36640067
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: b1c49fb7295f57f4515ce4eb3a52ae7d7e45da86
      23f34c7a
    • S
      Avoid malloc_usable_size() call inside LRU Cache mutex (#10026) · c78a87cd
      Committed by sdong
      Summary:
      Inside the LRU cache mutex, we sometimes call malloc_usable_size() to calculate the memory used by a metadata object. We avoid this by saving the charge plus the metadata size, rather than just the charge, inside the metadata itself. Within the mutex, usually only the total charge is needed, so we don't need to recompute it.
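      The idea can be sketched with a toy handle struct (hypothetical names, not RocksDB's actual LRUHandle): the overhead is computed once, outside the mutex, and stored together with the charge.

      ```cpp
      // Simplified sketch: compute metadata overhead once and store
      // charge + overhead as a single total. Code running under the cache
      // mutex then reads a plain field instead of calling malloc_usable_size().
      #include <cstddef>

      struct Handle {
        size_t total_charge = 0;  // user charge + metadata size, computed once

        // Stand-in for malloc_usable_size(this); called outside the mutex.
        static size_t MetaOverhead() { return sizeof(Handle); }

        void SetCharge(size_t charge) { total_charge = charge + MetaOverhead(); }

        // Safe to call under the cache mutex: no allocator query, just a load.
        size_t TotalCharge() const { return total_charge; }
      };
      ```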
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10026
      
      Test Plan: Run existing tests.
      
      Reviewed By: pdillinger
      
      Differential Revision: D36556253
      
      fbshipit-source-id: f60c96d13cde3af77732e5548e4eac4182fa9801
      c78a87cd
    • Y
      Add timestamp support to CompactedDBImpl (#10030) · d4081bf0
      Committed by Yu Zhang
      Summary:
      This PR is the second and final part of adding user-defined timestamp support to the read-only DB. Specifically, the changes in this PR include:
      
      - `options.timestamp` is respected by `CompactedDBImpl::Get` and `CompactedDBImpl::MultiGet` to return results visible up to that timestamp.
      - `CompactedDBImpl::Get(..., std::string* timestamp)` and `CompactedDBImpl::MultiGet(std::vector<std::string>* timestamps)` return the timestamp(s) associated with the key(s).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10030
      
      Test Plan:
      ```
      COMPILE_WITH_ASAN=1 make -j24 all
      ./db_readonly_with_timestamp_test --gtest_filter="DBReadOnlyTestWithTimestamp.CompactedDB*"
      ./db_basic_test --gtest_filter="DBBasicTest.CompactedDB*"
      make all check
      ```
      
      Reviewed By: riversand963
      
      Differential Revision: D36613926
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 5b7ed7fef822708c12e2caf7a8d2deb6a696f0f0
      d4081bf0
    • C
      Support read rate-limiting in SequentialFileReader (#9973) · 8515bd50
      Committed by Changyu Bi
      Summary:
      Added a rate limiter and read rate-limiting support to SequentialFileReader. I've updated call sites of SequentialFileReader::Read with the appropriate IO priority (or left a TODO and specified IO_TOTAL for now).
      
      The PR is separated into four commits: the first added the rate-limiting support, but with some fixes in the unit test since the number of bytes requested from the rate limiter in SequentialFileReader was not accurate (there is an overcharge at EOF). The second commit fixed this by allowing SequentialFileReader to check the file size and determine how many bytes are left to read. The third commit added benchmark-related code. The fourth commit moved the logic of using the file size to avoid overcharging the rate limiter into the backup engine (the main user of SequentialFileReader).
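      A minimal sketch of the chunked, rate-limited read loop, assuming a toy RateLimiter; the names and interfaces here are illustrative, not RocksDB's actual classes.

      ```cpp
      // Sketch: split a large sequential read into chunks and request each
      // chunk's bytes from a rate limiter before reading.
      #include <algorithm>
      #include <cstddef>

      class RateLimiter {
       public:
        explicit RateLimiter(size_t max_bytes_per_request)
            : max_bytes_per_request_(max_bytes_per_request) {}
        // A real limiter would block until tokens are available; this toy one
        // just caps how much may be read in one chunk.
        size_t Request(size_t bytes) {
          return std::min(bytes, max_bytes_per_request_);
        }
       private:
        size_t max_bytes_per_request_;
      };

      // Reads n bytes in rate-limited chunks; *chunks counts the requests made.
      size_t RateLimitedRead(RateLimiter& limiter, size_t n, size_t* chunks) {
        size_t done = 0;
        *chunks = 0;
        while (done < n) {
          size_t allowed = limiter.Request(n - done);
          done += allowed;  // a real reader would issue file->Read(allowed) here
          ++(*chunks);
        }
        return done;
      }
      ```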
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9973
      
      Test Plan:
      - `make check`, backup_engine_test covers usage of SequentialFileReader with rate limiter.
      - Run db_bench to check if rate limiting is throttling as expected: Verified that reads and writes are together throttled at 2MB/s, and at 0.2MB chunks that are 100ms apart.
        - Set up: `./db_bench --benchmarks=fillrandom -db=/dev/shm/test_rocksdb`
        - Benchmark:
      ```
      strace -ttfe read,write ./db_bench --benchmarks=backup -db=/dev/shm/test_rocksdb --backup_rate_limit=2097152 --use_existing_db
      strace -ttfe read,write ./db_bench --benchmarks=restore -db=/dev/shm/test_rocksdb --restore_rate_limit=2097152 --use_existing_db
      ```
      - db bench on backup and restore to ensure no performance regression.
        - backup (avg over 50 runs): pre-change: 1.90443e+06 micros/op; post-change: 1.8993e+06 micros/op (improve by 0.2%)
        - restore (avg over 50 runs): pre-change: 1.79105e+06 micros/op; post-change: 1.78192e+06 micros/op (improve by 0.5%)
      
      ```
      # Set up
      ./db_bench --benchmarks=fillrandom -db=/tmp/test_rocksdb -num=10000000
      
      # benchmark
      TEST_TMPDIR=/tmp/test_rocksdb
      NUM_RUN=50
      for ((j=0;j<$NUM_RUN;j++))
      do
         ./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=backup -use_existing_db | egrep 'backup'
        # Restore
        #./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=restore -use_existing_db
      done > rate_limit.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' rate_limit.txt >> rate_limit_2.txt
      ```
      
      Reviewed By: hx235
      
      Differential Revision: D36327418
      
      Pulled By: cbi42
      
      fbshipit-source-id: e75d4307cff815945482df5ba630c1e88d064691
      8515bd50
    • J
      Fix failed VerifySstUniqueIds unittests (#10043) · fd24e447
      Committed by Jay Zhuang
      Summary:
      The tests should use `UniqueId64x2` instead of string.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10043
      
      Test Plan: unittest
      
      Reviewed By: pdillinger
      
      Differential Revision: D36620366
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: cf937a1da362018472fa4396848225e48893848b
      fd24e447
  3. 24 May 2022, 7 commits
    • A
      Expose unix time in rocksdb::Snapshot (#9923) · 700d597b
      Committed by Akanksha Mahajan
      Summary:
      A RocksDB snapshot already has a member `unix_time_` that is set when the
      snapshot is taken. It is now exposed through the `GetSnapshotTime()` API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9923
      
      Test Plan: Update unit tests
      
      Reviewed By: riversand963
      
      Differential Revision: D36048275
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 825210ec287deb0bc3aaa9b8e1f079f07ad686fa
      700d597b
    • A
      Fix fbcode internal build failure (#10041) · 8e9d9156
      Committed by anand76
      Summary:
      The build failed due to different namespaces for coroutines (std::experimental vs std) based on compiler version.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10041
      
      Reviewed By: ltamasi
      
      Differential Revision: D36617212
      
      Pulled By: anand1976
      
      fbshipit-source-id: dfb25320788d32969317d5651173059e2cbd8bd5
      8e9d9156
    • L
      Update version on main to 7.4 and add 7.3 to the format compatibility checks (#10038) · 253ae017
      Committed by Levi Tamasi
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10038
      
      Reviewed By: riversand963
      
      Differential Revision: D36604533
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 54ccd0a4b32a320b5640a658ea6846ee897065d1
      253ae017
    • A
      Fix stress test failure "Corruption: checksum mismatch" or "Iterator Diverged"... · a479c2c2
      Committed by Akanksha Mahajan
      Fix stress test failure "Corruption: checksum mismatch" or "Iterator Diverged" with async_io enabled (#10032)
      
      Summary:
      In the case of non-sequential reads with `async_io`, `FilePRefetchBuffer::TryReadFromCacheAsync` can be called for previous blocks with `offset < bufs_[curr_].offset_`, which wasn't handled correctly, resulting in wrong data being returned from the buffer.
      
      Since `FilePRefetchBuffer::PrefetchAsync` can be called for any data block, it sets `prev_len_` to 0, signaling `FilePRefetchBuffer::TryReadFromCacheAsync` to proceed with prefetching even though `offset < bufs_[curr_].offset_`. This is because async prefetching is always done in the second buffer (to avoid the mutex), even when `curr_` is empty, leading to `offset < bufs_[curr_].offset_` in some cases.
      If `prev_len_` is nonzero, `TryReadFromCacheAsync` returns false when `offset < bufs_[curr_].offset_ && prev_len_ != 0`, indicating that the reads are not sequential and the previous call wasn't `PrefetchAsync`.
      
      - This PR also simplifies `FilePRefetchBuffer::TryReadFromCacheAsync`, as it was getting complicated covering different scenarios based on `async_io` being enabled/disabled. If `for_compaction` is true, it now calls `FilePRefetchBuffer::TryReadFromCache`, following the synchronous flow as before. This is decided in BlockFetcher.cc.
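      The sequentiality check described above can be sketched with toy types (hypothetical names; the real FilePrefetchBuffer logic is more involved): a backward jump relative to the current buffer is only allowed right after an async prefetch, signaled here by `prev_len == 0`.

      ```cpp
      // Sketch of the fix: refuse a non-sequential read from the buffer unless
      // the previous call was PrefetchAsync (which sets prev_len to 0).
      #include <cstddef>
      #include <cstdint>

      struct PrefetchBufferSketch {
        uint64_t buf_offset = 0;  // offset of the currently buffered data
        size_t prev_len = 0;      // set to 0 by PrefetchAsync, nonzero otherwise

        bool TryReadFromCacheAsync(uint64_t offset) const {
          if (offset < buf_offset && prev_len != 0) {
            // Non-sequential read and the previous call wasn't PrefetchAsync:
            // bail out instead of serving wrong data from the buffer.
            return false;
          }
          return true;  // proceed (serve from buffer or prefetch)
        }
      };
      ```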
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10032
      
      Test Plan:
      1. export CRASH_TEST_EXT_ARGS=" --async_io=1" followed by make crash_test -j completed successfully locally.
      2. make crash_test -j completed successfully locally.
      3. Reran CircleCi mini crashtest job 4 - 5 times.
      4. Updated prefetch_test for more coverage.
      
      Reviewed By: anand1976
      
      Differential Revision: D36579858
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 0c428d62b45e12e082a83acf533a5e37a584bedf
      a479c2c2
    • S
      Move three info logging within DB Mutex to use log buffer (#10029) · bea5831b
      Committed by sdong
      Summary:
      Info logging while holding the DB mutex could potentially invoke I/O and cause performance issues. Move three of the cases to use the log buffer.
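      The log-buffer pattern can be sketched as follows (toy types, not RocksDB's actual LogBuffer API): messages are appended in memory while holding the mutex, and flushed to the logger only after the lock is released.

      ```cpp
      // Sketch: buffer log lines under the mutex (cheap appends, no syscalls),
      // flush them to the potentially I/O-performing logger afterwards.
      #include <mutex>
      #include <string>
      #include <vector>

      struct LogBuffer {
        std::vector<std::string> lines;
        void Add(std::string s) { lines.push_back(std::move(s)); }  // no I/O
        size_t FlushToLogger() {   // called outside the mutex; real code would
          size_t n = lines.size(); // write each buffered line to the info log
          lines.clear();
          return n;
        }
      };

      size_t DoWorkUnderMutex(std::mutex& mu) {
        LogBuffer buf;
        {
          std::lock_guard<std::mutex> lock(mu);
          buf.Add("flush started");   // cheap append while holding the lock
          buf.Add("flush finished");
        }
        return buf.FlushToLogger();   // I/O happens here, lock-free
      }
      ```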
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10029
      
      Test Plan: Run existing tests.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D36561694
      
      fbshipit-source-id: cabb93fea299001a6b4c2802fcba3fde27fa062c
      bea5831b
    • P
      Java build: finish compiling before testing (etc) (#10034) · 1e4850f6
      Committed by Peter Dillinger
      Summary:
      Lack of ordering dependencies could lead to random
      build-linux-java failures with "Truncated class file" because tests
      started before compilation was finished. (Fix to java/Makefile)
      
      Also:
      * export SHA256_CMD to save copy-paste
      * Actually fail if Java sample build fails--which it was in CircleCI
      * Don't require Snappy for Java sample build (for more compatibility)
      * Remove check_all_python from jtest because it's running in `make
      check` builds in CircleCI
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10034
      
      Test Plan: CI, some manual
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D36596541
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 230d79db4b7ae93a366871ff09d0a88e8e1c8af3
      1e4850f6
    • T
      Add plugin header install in CMakeLists.txt (#10025) · cb858600
      Committed by Tom Blamer
      Summary:
      Fixes https://github.com/facebook/rocksdb/issues/9987.
      - Plugin specific headers can be specified by setting ${PLUGIN_NAME}_HEADERS in ${PLUGIN_NAME}.mk in the plugin directory.
      - This is supported by the Makefile based build, but was missing from CMakeLists.txt.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10025
      
      Test Plan:
      - Add a plugin with ${PLUGIN_NAME}_HEADERS set in both ${PLUGIN_NAME}.mk and CMakeLists.txt
      - Run Makefile based install and CMake based install and verify installed headers match
      
      Reviewed By: riversand963
      
      Differential Revision: D36584908
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5ea0137205ccbf0d36faacf45d712c5604065bb5
      cb858600
  4. 23 May 2022, 1 commit
    • A
      Minimum macOS version needed to build v7.2.2 and up is 10.13 (#9976) · 56ce3aef
      Committed by Adam Retter
      Summary:
      Some C++ code changes between versions 7.1.2 and 7.2.2 now seem to require at least macOS 10.13 (2017) to build successfully; previously we needed 10.12 (2016). I haven't been able to identify the exact commit.
      
      **NOTE**: This needs to be merged to both `main` and `7.2.fb` branches.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9976
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D36303226
      
      Pulled By: ajkr
      
      fbshipit-source-id: 589ce3ecf821db3402b0876e76d37b407896c945
      56ce3aef
  5. 21 May 2022, 7 commits
    • L
      Update HISTORY for 7.3 release (#10031) · bed40e72
      Committed by Levi Tamasi
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10031
      
      Reviewed By: riversand963
      
      Differential Revision: D36567741
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 058f8cc856d276db6e1aed07a89ac0b7118c4435
      bed40e72
    • Y
      Point libprotobuf-mutator to the latest verified commit hash (#10028) · 899db56a
      Committed by Yanqin Jin
      Summary:
      Recent updates to https://github.com/google/libprotobuf-mutator have caused link errors for the RocksDB
      CircleCI job 'build-fuzzers'. This PR points the CI to a specific, recently verified commit hash.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10028
      
      Test Plan: watch for CI to finish.
      
      Reviewed By: pdillinger, jay-zhuang
      
      Differential Revision: D36562517
      
      Pulled By: riversand963
      
      fbshipit-source-id: ba5ef0f9ed6ea6a75aa5dd2768bd5f389ac14f46
      899db56a
    • Y
      Fix a bug of not setting enforce_single_del_contracts (#10027) · f648915b
      Committed by Yanqin Jin
      Summary:
      Before this PR, BuildDBOptions() did not set a newly-added option, i.e.
      enforce_single_del_contracts, causing OPTIONS files to contain incorrect
      information.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10027
      
      Test Plan:
      make check
      Manually check OPTIONS file.
      
      Reviewed By: ltamasi
      
      Differential Revision: D36556125
      
      Pulled By: riversand963
      
      fbshipit-source-id: e1074715b22c328b68c19e9ad89aa5d67d864bb5
      f648915b
    • A
      Seek parallelization (#9994) · 2db6a4a1
      Committed by Akanksha Mahajan
      Summary:
      The RocksDB iterator is a hierarchy of iterators. MergingIterator maintains a heap of LevelIterators, one for each L0 file and for each non-zero level. The Seek() operation naturally lends itself to parallelization, as it involves positioning every LevelIterator on the correct data block in the correct SST file. It looks up a target key in each level to find the first key that's >= the target key. This typically involves reading one data block that is likely to contain the target key, then scanning forward to find the first valid key. The forward scan may read more data blocks. In order to find the right data block, the iterator may read some metadata blocks (required for opening a file and searching the index).
      This flow can be parallelized.
      
      Design: Seek will be called two times under the async_io option. The first seek sends asynchronous requests to prefetch the data blocks at each level; the second seek follows the normal flow, and in FilePrefetchBuffer::TryReadFromCacheAsync it waits for Poll() to get the results and adds the iterator to the min-heap.
      - Status::TryAgain is passed down from FilePrefetchBuffer::PrefetchAsync to block_iter_.Status, indicating an asynchronous request has been submitted.
      - If for some reason the asynchronous request returns an error on submission, it falls back to sequential reading of blocks in one pass.
      - If the data already exists in prefetch_buffer, it is returned without further prefetching and treated as a single-pass seek.
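      The two-pass Seek flow above can be sketched with toy types; the Status enum, the prefetch submission, and the driver loop here are simplified stand-ins, not the real RocksDB interfaces.

      ```cpp
      // Sketch of the two-pass Seek: the first pass submits async prefetches
      // and reports TryAgain; the second pass positions each iterator.
      enum class Status { kOk, kTryAgain };

      struct LevelIterator {
        bool prefetch_submitted = false;
        bool positioned = false;

        Status Seek() {
          if (!prefetch_submitted) {
            prefetch_submitted = true;  // first pass: PrefetchAsync(...) sent
            return Status::kTryAgain;   // bubbled up via block_iter_.Status
          }
          positioned = true;            // second pass: Poll() done, position
          return Status::kOk;
        }
      };

      // MergingIterator-style driver: the first Seek fans out async prefetches,
      // the second waits for the results and rebuilds the min-heap.
      Status SeekAll(LevelIterator* iters, int n) {
        Status result = Status::kOk;
        for (int i = 0; i < n; ++i) {
          if (iters[i].Seek() == Status::kTryAgain) result = Status::kTryAgain;
        }
        return result;
      }
      ```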
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9994
      
      Test Plan:
      - **Run Regressions.**
      ```
      ./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216
      ```
      i) Previous release 7.0 run for normal prefetching with async_io disabled:
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.0
      Date:       Thu Mar 17 13:11:34 2022
      CPU:        24 * Intel Core Processor (Broadwell)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
      ```
      
      ii) normal prefetching after changes with async_io disable:
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Set seed to 1652922591315307 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.3
      Date:       Wed May 18 18:09:51 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483080.466 micros/op 2 ops/sec 120.287 seconds 249 operations;  340.8 MB/s (249 of 249 found)
      ```
      iii) db_bench with async_io enabled completed successfully
      
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 -async_io=1 -adaptive_readahead=1
      Set seed to 1652924062021732 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.3
      Date:       Wed May 18 18:34:22 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  553913.576 micros/op 1 ops/sec 120.199 seconds 217 operations;  293.6 MB/s (217 of 217 found)
      ```
      
      - db_stress with async_io disabled completed successfully
      ```
       export CRASH_TEST_EXT_ARGS=" --async_io=0"
       make crash_test -j
      ```
      
      **In Progress**: db_stress with async_io is failing; debugging/fixing it is in progress.
      
      Reviewed By: anand1976
      
      Differential Revision: D36459323
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: abb1cd944abe712bae3986ae5b16704b3338917c
      2db6a4a1
    • A
      Fix crash due to MultiGet async IO and direct IO (#10024) · e015206d
      Committed by anand76
      Summary:
      MultiGet with async IO is not officially supported with Posix yet. Avoid a crash by using synchronous MultiRead when direct IO is enabled.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10024
      
      Test Plan: Run db_crashtest.py manually
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D36551053
      
      Pulled By: anand1976
      
      fbshipit-source-id: 72190418fa92dd0397e87825df618b12c9bdecda
      e015206d
    • C
      Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857) · cc23b46d
      Committed by Changyu Bi
      Summary:
      An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API ZDICT_finalizeDictionary() can improve such a dictionary's effectiveness at low cost. This PR changes how the dictionary is created: it calls the ZSTD ZDICT_finalizeDictionary() API instead of creating a raw content dictionary (when max_dict_buffer_bytes > 0), passing in all buffered uncompressed data blocks as samples.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857
      
      Test Plan:
      #### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data:
      Set up: change the parameter [here](https://github.com/facebook/rocksdb/blob/fb9a167a55e0970b1ef6f67c1600c8d9c4c6114f/tools/db_bench_tool.cc#L1766) to 16384 to make synthetic data more compressible.
      ```
      # linked local ZSTD with version 1.5.2
      # DEBUG_LEVEL=0 ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_ZSTD=1  EXTRA_CXXFLAGS="-DZSTD_STATIC_LINKING_ONLY -DZSTD -I/data/users/changyubi/install/include/" EXTRA_LDFLAGS="-L/data/users/changyubi/install/lib/ -l:libzstd.a" make -j32 db_bench
      
      dict_bytes=16384
      train_bytes=1048576
      echo "========== No Dictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== Raw Content Dictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench_main -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== FinalizeDictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== TrainDictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      # Result: TrainDictionary is much better on space saving, but FinalizeDictionary seems to use less memory.
      # before compression data size: 1.2GB
      dict_bytes=16384
      max_dict_buffer_bytes =  1048576
                          space   cpu/memory
      No Dictionary       468M    14.93user 1.00system 0:15.92elapsed 100%CPU (0avgtext+0avgdata 23904maxresident)k
      Raw Dictionary      251M    15.81user 0.80system 0:16.56elapsed 100%CPU (0avgtext+0avgdata 156808maxresident)k
      FinalizeDictionary  236M    11.93user 0.64system 0:12.56elapsed 100%CPU (0avgtext+0avgdata 89548maxresident)k
      TrainDictionary     84M     7.29user 0.45system 0:07.75elapsed 100%CPU (0avgtext+0avgdata 97288maxresident)k
      ```
      
      #### Benchmark on 10 sample SST files for spacing saving and CPU time on compression:
      FinalizeDictionary is comparable to TrainDictionary in terms of space saving, and takes less time in compression.
      ```
      dict_bytes=16384
      train_bytes=1048576
      
      for sst_file in `ls ../temp/myrock-sst/`
      do
        echo "********** $sst_file **********"
        echo "========== No Dictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD
      
        echo "========== Raw Content Dictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes
      
        echo "========== FinalizeDictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes --compression_use_zstd_finalize_dict
      
        echo "========== TrainDictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes
      done
      
                               010240.sst (Size/Time) 011029.sst              013184.sst              021552.sst              185054.sst              185137.sst              191666.sst              7560381.sst             7604174.sst             7635312.sst
      No Dictionary           28165569 / 2614419      32899411 / 2976832      32977848 / 3055542      31966329 / 2004590      33614351 / 1755877      33429029 / 1717042      33611933 / 1776936      33634045 / 2771417      33789721 / 2205414      33592194 / 388254
      Raw Content Dictionary  28019950 / 2697961      33748665 / 3572422      33896373 / 3534701      26418431 / 2259658      28560825 / 1839168      28455030 / 1846039      28494319 / 1861349      32391599 / 3095649      33772142 / 2407843      33592230 / 474523
      FinalizeDictionary      27896012 / 2650029      33763886 / 3719427      33904283 / 3552793      26008225 / 2198033      28111872 / 1869530      28014374 / 1789771      28047706 / 1848300      32296254 / 3204027      33698698 / 2381468      33592344 / 517433
      TrainDictionary         28046089 / 2740037      33706480 / 3679019      33885741 / 3629351      25087123 / 2204558      27194353 / 1970207      27234229 / 1896811      27166710 / 1903119      32011041 / 3322315      32730692 / 2406146      33608631 / 570593
      ```
      
      #### Decompression/Read test:
With FinalizeDictionary/TrainDictionary, some data structures used for decompression are stored in the dictionary, so decompression/reads are expected to be faster.
      ```
      dict_bytes=16384
      train_bytes=1048576
      echo "No Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=0 > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=0 2>&1 | grep MB/s
      
      echo "Raw Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd  -compression_max_dict_bytes=$dict_bytes 2>&1 | grep MB/s
      
      echo "FinalizeDict"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false  > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false 2>&1 | grep MB/s
      
      echo "Train Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes 2>&1 | grep MB/s
      
      No Dictionary
      readrandom   :      12.183 micros/op 82082 ops/sec 12.183 seconds 1000000 operations;    9.1 MB/s (1000000 of 1000000 found)
      Raw Dictionary
      readrandom   :      12.314 micros/op 81205 ops/sec 12.314 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
      FinalizeDict
      readrandom   :       9.787 micros/op 102180 ops/sec 9.787 seconds 1000000 operations;   11.3 MB/s (1000000 of 1000000 found)
      Train Dictionary
      readrandom   :       9.698 micros/op 103108 ops/sec 9.699 seconds 1000000 operations;   11.4 MB/s (1000000 of 1000000 found)
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D35720026
      
      Pulled By: cbi42
      
      fbshipit-source-id: 24d230fdff0fd28a1bb650658798f00dfcfb2a1f
      cc23b46d
    • D
      Bump nokogiri from 1.13.4 to 1.13.6 in /docs (#10019) · 6255ac72
dependabot[bot] committed
      Summary:
      Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.13.4 to 1.13.6.
      <details>
      <summary>Release notes</summary>
      <p><em>Sourced from <a href="https://github.com/sparklemotion/nokogiri/releases">nokogiri's releases</a>.</em></p>
      <blockquote>
      <h2>1.13.6 / 2022-05-08</h2>
      <h3>Security</h3>
      <ul>
      <li>[CRuby] Address <a href="https://nvd.nist.gov/vuln/detail/CVE-2022-29181">CVE-2022-29181</a>, improper handling of unexpected data types, related to untrusted inputs to the SAX parsers. See <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-xh29-r2w5-wx8m">GHSA-xh29-r2w5-wx8m</a> for more information.</li>
      </ul>
      <h3>Improvements</h3>
      <ul>
      <li><code>{HTML4,XML}::SAX::{Parser,ParserContext}</code> constructor methods now raise <code>TypeError</code> instead of segfaulting when an incorrect type is passed.</li>
      </ul>
      <hr />
      <p>sha256:</p>
      <pre><code>58417c7c10f78cd1c0e1984f81538300d4ea98962cfd3f46f725efee48f9757a  nokogiri-1.13.6-aarch64-linux.gem
      a2b04ec3b1b73ecc6fac619b41e9fdc70808b7a653b96ec97d04b7a23f158dbc  nokogiri-1.13.6-arm64-darwin.gem
      4437f2d03bc7da8854f4aaae89e24a98cf5c8b0212ae2bc003af7e65c7ee8e27  nokogiri-1.13.6-java.gem
      99d3e212bbd5e80aa602a1f52d583e4f6e917ec594e6aa580f6aacc253eff984  nokogiri-1.13.6-x64-mingw-ucrt.gem
      a04f6154a75b6ed4fe2d0d0ff3ac02f094b54e150b50330448f834fa5726fbba  nokogiri-1.13.6-x64-mingw32.gem
      a13f30c2863ef9e5e11240dd6d69ef114229d471018b44f2ff60bab28327de4d  nokogiri-1.13.6-x86-linux.gem
      63a2ca2f7a4f6bd9126e1695037f66c8eb72ed1e1740ef162b4480c57cc17dc6  nokogiri-1.13.6-x86-mingw32.gem
      2b266e0eb18030763277b30dc3d64337f440191e2bd157027441ac56a59d9dfe  nokogiri-1.13.6-x86_64-darwin.gem
      3fa37b0c3b5744af45f9da3e4ae9cbd89480b35e12ae36b5e87a0452e0b38335  nokogiri-1.13.6-x86_64-linux.gem
      b1512fdc0aba446e1ee30de3e0671518eb363e75fab53486e99e8891d44b8587  nokogiri-1.13.6.gem
      </code></pre>
      <h2>1.13.5 / 2022-05-04</h2>
      <h3>Security</h3>
      <ul>
      <li>[CRuby] Vendored libxml2 is updated to address <a href="https://nvd.nist.gov/vuln/detail/CVE-2022-29824">CVE-2022-29824</a>. See <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-cgx6-hpwq-fhv5">GHSA-cgx6-hpwq-fhv5</a> for more information.</li>
      </ul>
      <h3>Dependencies</h3>
      <ul>
      <li>[CRuby] Vendored libxml2 is updated from v2.9.13 to <a href="https://gitlab.gnome.org/GNOME/libxml2/-/releases/v2.9.14">v2.9.14</a>.</li>
      </ul>
      <h3>Improvements</h3>
      <ul>
      <li>[CRuby] The libxml2 HTML4 parser no longer exhibits quadratic behavior when recovering some broken markup related to start-of-tag and bare <code>&lt;</code> characters.</li>
      </ul>
      <h3>Changed</h3>
      <ul>
      <li>[CRuby] The libxml2 HTML4 parser in v2.9.14 recovers from some broken markup differently. Notably, the XML CDATA escape sequence <code>&lt;![CDATA[</code> and incorrectly-opened comments will result in HTML text nodes starting with <code>&amp;lt;!</code> instead of skipping the invalid tag. This behavior is a direct result of the <a href="https://gitlab.gnome.org/GNOME/libxml2/-/commit/798bdf1">quadratic-behavior fix</a> noted above. The behavior of downstream sanitizers relying on this behavior will also change. Some tests describing the changed behavior are in <a href="https://github.com/sparklemotion/nokogiri/blob/3ed5bf2b5a367cb9dc6e329c5a1c512e1dd4565d/test/html4/test_comments.rb#L187-L204"><code>test/html4/test_comments.rb</code></a>.</li>
      </ul>
      
      </blockquote>
      <p>... (truncated)</p>
      </details>
      <details>
      <summary>Changelog</summary>
      <p><em>Sourced from <a href="https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md">nokogiri's changelog</a>.</em></p>
      <blockquote>
      <h2>1.13.6 / 2022-05-08</h2>
      <h3>Security</h3>
      <ul>
      <li>[CRuby] Address <a href="https://nvd.nist.gov/vuln/detail/CVE-2022-29181">CVE-2022-29181</a>, improper handling of unexpected data types, related to untrusted inputs to the SAX parsers. See <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-xh29-r2w5-wx8m">GHSA-xh29-r2w5-wx8m</a> for more information.</li>
      </ul>
      <h3>Improvements</h3>
      <ul>
      <li><code>{HTML4,XML}::SAX::{Parser,ParserContext}</code> constructor methods now raise <code>TypeError</code> instead of segfaulting when an incorrect type is passed.</li>
      </ul>
      <h2>1.13.5 / 2022-05-04</h2>
      <h3>Security</h3>
      <ul>
      <li>[CRuby] Vendored libxml2 is updated to address <a href="https://nvd.nist.gov/vuln/detail/CVE-2022-29824">CVE-2022-29824</a>. See <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-cgx6-hpwq-fhv5">GHSA-cgx6-hpwq-fhv5</a> for more information.</li>
      </ul>
      <h3>Dependencies</h3>
      <ul>
      <li>[CRuby] Vendored libxml2 is updated from v2.9.13 to <a href="https://gitlab.gnome.org/GNOME/libxml2/-/releases/v2.9.14">v2.9.14</a>.</li>
      </ul>
      <h3>Improvements</h3>
      <ul>
      <li>[CRuby] The libxml2 HTML parser no longer exhibits quadratic behavior when recovering some broken markup related to start-of-tag and bare <code>&lt;</code> characters.</li>
      </ul>
      <h3>Changed</h3>
      <ul>
      <li>[CRuby] The libxml2 HTML parser in v2.9.14 recovers from some broken markup differently. Notably, the XML CDATA escape sequence <code>&lt;![CDATA[</code> and incorrectly-opened comments will result in HTML text nodes starting with <code>&amp;lt;!</code> instead of skipping the invalid tag. This behavior is a direct result of the <a href="https://gitlab.gnome.org/GNOME/libxml2/-/commit/798bdf1">quadratic-behavior fix</a> noted above. The behavior of downstream sanitizers relying on this behavior will also change. Some tests describing the changed behavior are in <a href="https://github.com/sparklemotion/nokogiri/blob/3ed5bf2b5a367cb9dc6e329c5a1c512e1dd4565d/test/html4/test_comments.rb#L187-L204"><code>test/html4/test_comments.rb</code></a>.</li>
      </ul>
      </blockquote>
      </details>
      <details>
      <summary>Commits</summary>
      <ul>
      <li><a href="https://github.com/sparklemotion/nokogiri/commit/b7817b6a62ac210203a451d1a691a824288e9eab"><code>b7817b6</code></a> version bump to v1.13.6</li>
      <li><a href="https://github.com/sparklemotion/nokogiri/commit/61b1a395cd512af2e0595a8e369465415e574fe8"><code>61b1a39</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2530">https://github.com/facebook/rocksdb/issues/2530</a> from sparklemotion/flavorjones-check-parse-memory-ty...</li>
      <li><a href="https://github.com/sparklemotion/nokogiri/commit/83cc451c3f29df397caa890afc3b714eae6ab8f7"><code>83cc451</code></a> fix: {HTML4,XML}::SAX::{Parser,ParserContext} check arg types</li>
      <li><a href="https://github.com/sparklemotion/nokogiri/commit/22c9e5b300c27a377fdde37c17eb9d07dd7322d0"><code>22c9e5b</code></a> version bump to v1.13.5</li>
      <li><a href="https://github.com/sparklemotion/nokogiri/commit/615588192572f7cfcb43eabbb070a6e07bf9e731"><code>6155881</code></a> doc: update CHANGELOG for v1.13.5</li>
      <li><a href="https://github.com/sparklemotion/nokogiri/commit/c519a47ab11f5e8fce77328fcb01a7b3befc2b9e"><code>c519a47</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2527">https://github.com/facebook/rocksdb/issues/2527</a> from sparklemotion/2525-update-libxml-2_9_14-v1_13_x</li>
      <li><a href="https://github.com/sparklemotion/nokogiri/commit/66c2886e78f6801def83a549c3e6581ac48e61e8"><code>66c2886</code></a> dep: update libxml2 to v2.9.14</li>
      <li><a href="https://github.com/sparklemotion/nokogiri/commit/b7c4cc35de38fcfdde4da1203d79ae38bc4324bf"><code>b7c4cc3</code></a> test: unpend the LIBXML_LOADED_VERSION test on freebsd</li>
      <li><a href="https://github.com/sparklemotion/nokogiri/commit/eac793487183a5e72464e53cccd260971d5f29b5"><code>eac7934</code></a> dev: require yaml</li>
      <li><a href="https://github.com/sparklemotion/nokogiri/commit/f3521ba3d38922d76dd5ed59705eab3988213712"><code>f3521ba</code></a> style(rubocop): pend Style/FetchEnvVar for now</li>
      <li>Additional commits viewable in <a href="https://github.com/sparklemotion/nokogiri/compare/v1.13.4...v1.13.6">compare view</a></li>
      </ul>
      </details>
      <br />
      
      [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=nokogiri&package-manager=bundler&previous-version=1.13.4&new-version=1.13.6)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
      
      Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `dependabot rebase`.
      
      [//]: # (dependabot-automerge-start)
      [//]: # (dependabot-automerge-end)
      
       ---
      
      <details>
      <summary>Dependabot commands and options</summary>
      <br />
      
      You can trigger Dependabot actions by commenting on this PR:
      - `dependabot rebase` will rebase this PR
      - `dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
      - `dependabot merge` will merge this PR after your CI passes on it
      - `dependabot squash and merge` will squash and merge this PR after your CI passes on it
      - `dependabot cancel merge` will cancel a previously requested merge and block automerging
      - `dependabot reopen` will reopen this PR if it is closed
      - `dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
      - `dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
      - `dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
      - `dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
      - `dependabot use these labels` will set the current labels as the default for future PRs for this repo and language
      - `dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language
      - `dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language
      - `dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language
      
      You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/facebook/rocksdb/network/alerts).
      
      </details>
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10019
      
      Reviewed By: riversand963
      
      Differential Revision: D36536897
      
      Pulled By: ajkr
      
      fbshipit-source-id: 368c24e86d5d39f0a3adc08a397ae074b1b18b1a
      6255ac72
  6. 20 May, 2022 5 commits
    • Y
      Add timestamp support to DBImplReadOnly (#10004) · 16bdb1f9
      Yu Zhang committed
      Summary:
      This PR adds timestamp support to a read only DB instance opened as `DBImplReadOnly`. A follow up PR will add the same support to `CompactedDBImpl`.
      
      With this, a read-only database has these timestamp-related APIs:
      
      `ReadOptions.timestamp` : a read returns the latest data visible at the specified timestamp
      `Iterator::timestamp()` : returns the timestamp associated with the current key/value
      `DB::Get(..., std::string* timestamp)` : returns the timestamp associated with the key/value in `timestamp`
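      The visibility rule behind `ReadOptions.timestamp` can be illustrated with a small self-contained model (hypothetical class, not the RocksDB API): each key keeps versions ordered by timestamp, and a read at `read_ts` returns the newest version whose timestamp is at or below `read_ts`.
      
      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <map>
      #include <optional>
      #include <string>
      
      // Toy model of timestamped point lookup. Get(key, read_ts) returns the
      // latest value whose timestamp is <= read_ts, mirroring the semantics of
      // ReadOptions.timestamp described above.
      class TimestampedKV {
       public:
        void Put(const std::string& key, uint64_t ts, const std::string& value) {
          versions_[key][ts] = value;
        }
        // Like DB::Get(..., std::string* timestamp), also reports the timestamp
        // actually associated with the returned value.
        std::optional<std::string> Get(const std::string& key, uint64_t read_ts,
                                       uint64_t* ts_out) const {
          auto it = versions_.find(key);
          if (it == versions_.end()) return std::nullopt;
          // upper_bound yields the first version with ts > read_ts; step back one.
          auto vit = it->second.upper_bound(read_ts);
          if (vit == it->second.begin()) return std::nullopt;  // nothing visible yet
          --vit;
          if (ts_out) *ts_out = vit->first;
          return vit->second;
        }
       private:
        std::map<std::string, std::map<uint64_t, std::string>> versions_;
      };
      ```
      
      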
      
      Test plan (on devserver):
      
      ```
      $COMPILE_WITH_ASAN=1 make -j24 all
      $./db_with_timestamp_basic_test --gtest_filter=DBBasicTestWithTimestamp.ReadOnlyDB*
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10004
      
      Reviewed By: riversand963
      
      Differential Revision: D36434422
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 5d949e65b1ffb845758000e2b310fdd4aae71cfb
      16bdb1f9
    • A
      Multi file concurrency in MultiGet using coroutines and async IO (#9968) · 57997dda
      anand76 committed
      Summary:
      This PR implements a coroutine version of batched MultiGet in order to concurrently read from multiple SST files in a level using async IO, thus reducing the latency of the MultiGet. The API from the user perspective is still synchronous and single threaded, with the RocksDB part of the processing happening in the context of the caller's thread. In Version::MultiGet, the decision is made whether to call synchronous or coroutine code.
      
      A good way to review this PR is to review the first 4 commits in order - de773b3, 70c2f70, 10b50e1, and 377a597 - before reviewing the rest.
      
      TODO:
      1. Figure out how to build it in CircleCI (requires some dependencies to be installed)
      2. Do some stress testing with coroutines enabled
      
      No regression in synchronous MultiGet between this branch and main -
      ```
      ./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=64 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=true -threads=16 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -statistics
      ```
      Branch - ```multireadrandom :       4.025 micros/op 3975111 ops/sec 60.001 seconds 238509056 operations; 2062.3 MB/s (14767808 of 14767808 found)```
      
      Main - ```multireadrandom :       3.987 micros/op 4013216 ops/sec 60.001 seconds 240795392 operations; 2082.1 MB/s (15231040 of 15231040 found)```
      
      More benchmarks in various scenarios are given below. The measurements were taken with ```async_io=false``` (no coroutines) and ```async_io=true``` (use coroutines). For an IO bound workload (with every key requiring an IO), the coroutines version shows a clear benefit, being ~2.6X faster. For CPU bound workloads, the coroutines version has ~6-15% higher CPU utilization, depending on how many keys overlap an SST file.
      
      1. Single thread IO bound workload on remote storage with sparse MultiGet batch keys (~1 key overlap/file) -
      No coroutines - ```multireadrandom :     831.774 micros/op 1202 ops/sec 60.001 seconds 72136 operations;    0.6 MB/s (72136 of 72136 found)```
      Using coroutines - ```multireadrandom :     318.742 micros/op 3137 ops/sec 60.003 seconds 188248 operations;    1.6 MB/s (188248 of 188248 found)```
      
      2. Single thread CPU bound workload (all data cached) with ~1 key overlap/file -
      No coroutines - ```multireadrandom :       4.127 micros/op 242322 ops/sec 60.000 seconds 14539384 operations;  125.7 MB/s (14539384 of 14539384 found)```
      Using coroutines - ```multireadrandom :       4.741 micros/op 210935 ops/sec 60.000 seconds 12656176 operations;  109.4 MB/s (12656176 of 12656176 found)```
      
      3. Single thread CPU bound workload with ~2 key overlap/file -
      No coroutines - ```multireadrandom :       3.717 micros/op 269000 ops/sec 60.000 seconds 16140024 operations;  139.6 MB/s (16140024 of 16140024 found)```
      Using coroutines - ```multireadrandom :       4.146 micros/op 241204 ops/sec 60.000 seconds 14472296 operations;  125.1 MB/s (14472296 of 14472296 found)```
      
      4. CPU bound multi-threaded (16 threads) with ~4 key overlap/file -
      No coroutines - ```multireadrandom :       4.534 micros/op 3528792 ops/sec 60.000 seconds 211728728 operations; 1830.7 MB/s (12737024 of 12737024 found) ```
      Using coroutines - ```multireadrandom :       4.872 micros/op 3283812 ops/sec 60.000 seconds 197030096 operations; 1703.6 MB/s (12548032 of 12548032 found) ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9968
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D36348563
      
      Pulled By: anand1976
      
      fbshipit-source-id: c0ce85a505fd26ebfbb09786cbd7f25202038696
      57997dda
    • B
      Address comments for PR #9988 and #9996 (#10020) · 5be1579e
      Bo Wang committed
      Summary:
      1. The latest change of DecideRateLimiterPriority in https://github.com/facebook/rocksdb/pull/9988 is reverted.
      2. For https://github.com/facebook/rocksdb/blob/main/db/builder.cc#L345-L349
        2.1. Remove `we will regrad this verification as user reads` from the comments.
        2.2. Do not set `read_options.rate_limiter_priority` to `Env::IO_USER`. Flush should be a background job.
        2.3. Update db_rate_limiter_test.cc.
      3. In IOOptions, mark `prio` as deprecated for future removal.
      4. In `file_system.h`, mark `IOPriority` as deprecated for future removal.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10020
      
      Test Plan: Unit tests.
      
      Reviewed By: ajkr
      
      Differential Revision: D36525317
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 011ba421822f8a124e6d25a2661c4e242df6ad36
      5be1579e
    • P
      Fix auto_prefix_mode performance with partitioned filters (#10012) · 280b9f37
      Peter Dillinger committed
      Summary:
      Essentially refactored the RangeMayExist implementation in
      FullFilterBlockReader to FilterBlockReaderCommon so that it applies to
      partitioned filters as well. (The function is not called for the
      block-based filter case.) RangeMayExist is essentially a series of checks
      around a possible PrefixMayExist, and I'm confident those checks should
      be the same for partitioned as for full filters. (I think it's likely
      that bugs remain in those checks, but this change is overall a simplifying
      one.)
      
      Added auto_prefix_mode support to db_bench
      
      Other small fixes as well
      
      Fixes https://github.com/facebook/rocksdb/issues/10003
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10012
      
      Test Plan:
      Expanded unit test that uses statistics to check for filter
      optimization, fails without the production code changes here
      
      Performance: populate two DBs with
      ```
      TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
      TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters
      ```
      
      Observe no measurable change in non-partitioned performance
      ```
      TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20
      ```
      Before: seekrandom [AVG 15 runs] : 11798 (± 331) ops/sec
      After: seekrandom [AVG 15 runs] : 11724 (± 315) ops/sec
      
      Observe big improvement with partitioned (also supported by bloom use statistics)
      ```
      TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20
      ```
      Before: seekrandom [AVG 12 runs] : 2942 (± 57) ops/sec
      After: seekrandom [AVG 12 runs] : 7489 (± 184) ops/sec
      
      Reviewed By: siying
      
      Differential Revision: D36469796
      
      Pulled By: pdillinger
      
      fbshipit-source-id: bcf1e2a68d347b32adb2b27384f945434e7a266d
      280b9f37
    • J
      Track SST unique id in MANIFEST and verify (#9990) · c6d326d3
      Jay Zhuang committed
      Summary:
      Start tracking SST unique id in MANIFEST, which is used to verify with
      SST properties to make sure the SST file is not overwritten or
      misplaced. A DB option `try_verify_sst_unique_id` is introduced to
      enable/disable the verification (default is false). If enabled, it opens
      all SST files during DB open to read the unique_id from table
      properties, so it's recommended to use it with `max_open_files = -1` to
      pre-open the files.
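      The check itself can be sketched as follows (hypothetical helper and types, not the RocksDB implementation): the MANIFEST records a unique id per SST file number, and at DB open the id read back from each file's table properties must match, otherwise the file was overwritten or misplaced.
      
      ```cpp
      #include <array>
      #include <cassert>
      #include <cstdint>
      #include <map>
      #include <string>
      
      // RocksDB's unique id is a multi-word value; model it as three 64-bit words.
      using UniqueId = std::array<uint64_t, 3>;
      
      // Hypothetical helper: compare the ids recorded in the MANIFEST against
      // the ids read from each SST file's table properties at DB open.
      bool VerifySstUniqueIds(
          const std::map<uint64_t, UniqueId>& manifest_ids,
          const std::map<uint64_t, UniqueId>& ids_from_table_properties,
          std::string* mismatch_msg) {
        for (const auto& [file_number, expected] : manifest_ids) {
          auto it = ids_from_table_properties.find(file_number);
          if (it == ids_from_table_properties.end() || it->second != expected) {
            if (mismatch_msg) {
              *mismatch_msg = "SST " + std::to_string(file_number) +
                              ": unique id mismatch (overwritten or misplaced?)";
            }
            return false;  // refuse to open the DB with a mismatched file
          }
        }
        return true;
      }
      ```
      
      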
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990
      
      Test Plan: unittests, format-compatible test, mini-crash
      
      Reviewed By: anand1976
      
      Differential Revision: D36381863
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f
      c6d326d3
  7. 19 May, 2022 6 commits
    • H
      Mark old reserve* option deprecated (#10016) · dde774db
      Hui Xiao committed
      Summary:
      **Context/Summary:**
      https://github.com/facebook/rocksdb/pull/9926 removed the inefficient `reserve*` option APIs but forgot to mark them deprecated in `block_based_table_type_info` for table format compatibility.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10016
      
      Test Plan: build-format-compatible
      
      Reviewed By: pdillinger
      
      Differential Revision: D36484247
      
      Pulled By: hx235
      
      fbshipit-source-id: c41b90cc99fb7ab7098934052f0af7290b221f98
      dde774db
    • G
      Set Read rate limiter priority dynamically and pass it to FS (#9996) · 4da34b97
      gitbw95 committed
      Summary:
      ### Context:
      Background compactions and flushes generate large reads and writes, and can be long running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.
      
      ### Solution
      User, Flush, and Compaction reads share some code path. For this task, we update the rate_limiter_priority in ReadOptions for code paths (e.g. FindTable (mainly in BlockBasedTable::Open()) and various iterators), and eventually update the rate_limiter_priority in IOOptions for FSRandomAccessFile.
      
      **This PR is for the Read path.** The dynamic **Read** priorities for the different states are listed as follows:
      
      | State | Normal | Delayed | Stalled |
      | ----- | ------ | ------- | ------- |
      |  Flush (verification read in BuildTable()) | IO_USER | IO_USER | IO_USER |
      |  Compaction | IO_LOW  | IO_USER | IO_USER |
      |  User | User provided | User provided | User provided |
      
      We will respect the read_options that the user provided and will not override it.
      The only SST read for Flush is the verification read in BuildTable(), which is regarded as a user read.
      
      **Details**
      1. Set read_options.rate_limiter_priority dynamically:
      - User: Do not update the read_options. Use the read_options that the user provided.
      - Compaction: Update read_options in CompactionJob::ProcessKeyValueCompaction().
      - Flush: Update read_options in BuildTable().
      
      2. Pass the rate limiter priority to FSRandomAccessFile functions:
      - After calling the FindTable(), read_options is passed through GetTableReader(table_cache.cc), BlockBasedTableFactory::NewTableReader(block_based_table_factory.cc), and BlockBasedTable::Open(). The Open() needs some updates for the ReadOptions variable and the updates are also needed for the called functions,  including PrefetchTail(), PrepareIOOptions(), ReadFooterFromFile(), ReadMetaIndexblock(), ReadPropertiesBlock(), PrefetchIndexAndFilterBlocks(), and ReadRangeDelBlock().
      - In RandomAccessFileReader, the functions to be updated include Read(), MultiRead(), ReadAsync(), and Prefetch().
      - Update the downstream functions of NewIndexIterator(), NewDataBlockIterator(), and BlockBasedTableIterator().
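      The priority table above can be encoded as a small pure decision function. This is a hedged sketch with hypothetical enum and function names, not the actual RocksDB code:
      
      ```cpp
      #include <cassert>
      
      // Toy encoding of the read-path priority table above.
      enum class IOPriority { kLow, kHigh, kUser, kUserProvided };
      enum class WriteState { kNormal, kDelayed, kStalled };
      enum class ReadCaller { kUser, kFlushVerification, kCompaction };
      
      IOPriority DecideReadRateLimiterPriority(ReadCaller caller, WriteState state) {
        switch (caller) {
          case ReadCaller::kUser:
            // Respect whatever the user put in ReadOptions; never override it.
            return IOPriority::kUserProvided;
          case ReadCaller::kFlushVerification:
            // The verification read in BuildTable() is treated as a user read.
            return IOPriority::kUser;
          case ReadCaller::kCompaction:
            // Compaction reads are background work (IO_LOW) unless user writes
            // are delayed or stalled, in which case they are promoted to IO_USER.
            return state == WriteState::kNormal ? IOPriority::kLow
                                                : IOPriority::kUser;
        }
        return IOPriority::kUserProvided;  // unreachable
      }
      ```
      
      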
      
      ### Test Plans
      Add unit tests.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9996
      
      Reviewed By: anand1976
      
      Differential Revision: D36452483
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 60978204a4f849bb9261cb78d9bc1cb56d6008cf
      4da34b97
    • S
      Remove two tests from platform dependent tests (#10017) · f1303bf8
      sdong committed
      Summary:
      Platform-dependent tests sometimes run too long and cause timeouts in Travis. Remove two tests that are less likely to be platform dependent.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10017
      
      Test Plan: Watch Travis tests.
      
      Reviewed By: pdillinger
      
      Differential Revision: D36486734
      
      fbshipit-source-id: 2a3ad1746791c893a790c2a69a3b70f81e7de260
      f1303bf8
    • Y
      Remove ROCKSDB_SUPPORT_THREAD_LOCAL define because it's a part of C++11 (#10015) · 0a43061f
      Yaroslav Stepanchuk committed
      Summary:
      The ROCKSDB_SUPPORT_THREAD_LOCAL definition has been removed.
      `__thread`(#define) has been replaced with `thread_local`(C++ keyword) across the code base.
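      For illustration, a minimal self-contained example of the C++11 `thread_local` keyword that replaced the GCC-specific `__thread` extension (example names are hypothetical): each thread gets its own independently initialized copy of the variable.
      
      ```cpp
      #include <cassert>
      #include <thread>
      
      // Each thread sees its own zero-initialized copy of tls_counter.
      thread_local int tls_counter = 0;
      
      int BumpAndGet() { return ++tls_counter; }
      
      // Run BumpAndGet() once on a brand-new thread and return its result.
      int RunInFreshThread() {
        int result = 0;
        std::thread t([&result] { result = BumpAndGet(); });
        t.join();
        return result;
      }
      ```
      
      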
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10015
      
      Reviewed By: siying
      
      Differential Revision: D36485491
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6522d212514ee190b90b4e2750c80c7e34013c78
      0a43061f
    • Y
      Avoid overwriting options loaded from OPTIONS (#9943) · e3a3dbf2
      Yanqin Jin committed
      Summary:
      This is similar to https://github.com/facebook/rocksdb/issues/9862, including the following fixes/refactoring:
      
      1. If OPTIONS file is specified via `-options_file`, majority of options will be loaded from the file. We should not
      overwrite options that have been loaded from the file. Instead, we configure only fields of options which are
      shared objects and not set by the OPTIONS file. We also configure a few fields, e.g. `create_if_missing` necessary
      for stress test to run.
      
      2. Refactor options initialization into three functions, `InitializeOptionsFromFile()`, `InitializeOptionsFromFlags()`
      and `InitializeOptionsGeneral()` similar to db_bench. I hope they can be shared in the future. The high-level logic is
      as follows:
      ```cpp
      if (!InitializeOptionsFromFile(...)) {
        InitializeOptionsFromFlags(...);
      }
      InitializeOptionsGeneral(...);
      ```
      
      3. Currently, the setting for `block_cache_compressed` does not seem correct because by default it specifies a
      size of `numeric_limits<size_t>::max()` ((size_t)-1). According to code comments, `-1` indicates the default value,
      which should be referring to the `num_shard_bits` argument.
      
      4. Clarify `fail_if_options_file_error`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9943
      
      Test Plan:
      1. make check
      2. Run stress tests, and manually check generated OPTIONS file and compare them with input OPTIONS files
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D36133769
      
      Pulled By: riversand963
      
      fbshipit-source-id: 35dacdc090a0a72c922907170cd132b9ecaa073e
      e3a3dbf2
    • S
      Log error message when LinkFile() is not supported when ingesting files (#10010) · a74f14b5
      sdong committed
      Summary:
      Right now, it is opaque to users whether moving a file was skipped because LinkFile() is not supported. Add a log message to help users debug.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10010
      
      Test Plan: Run existing test. Manual test verify the log message printed out.
      
      Reviewed By: riversand963
      
      Differential Revision: D36463237
      
      fbshipit-source-id: b00bd5041bd5c11afa4e326819c8461ee2c98a91
      a74f14b5
  8. 18 May 2022, 4 commits
    • G
      Set Write rate limiter priority dynamically and pass it to FS (#9988) · 05c678e1
      gitbw95 committed
      Summary:
      ### Context:
      Background compactions and flushes generate large reads and writes and can be long-running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.
      
      From the RocksDB perspective, there can be two kinds of rate limiters, the internal (native) one and the external one.
      - The internal (native) rate limiter is introduced in [the wiki](https://github.com/facebook/rocksdb/wiki/Rate-Limiter). Currently, only IO_LOW and IO_HIGH are used and they are set statically.
      - For the external rate limiter, in FSWritableFile functions, IOOptions is open for end users to set and get rate_limiter_priority for their own rate limiter. Currently, RocksDB doesn’t pass the rate_limiter_priority through IOOptions to the file system.
      
      ### Solution
      During user reads, flush writes, and compaction reads/writes, the WriteController is used to determine whether DB writes are stalled or slowed down. The rate limiter priority (Env::IOPriority) can be determined accordingly. We decided to always pass the priority in IOOptions. What the file system does with it should be a contract between the user and the file system. We would like to set the rate limiter priority at the file level, since the Flush/Compaction job level may be too coarse with multiple files, and the block IO level is too granular.
      
      **This PR is for the Write path.** The dynamic write priorities for the different states are listed as follows:
      
      | State | Normal | Delayed | Stalled |
      | ----- | ------ | ------- | ------- |
      |  Flush | IO_HIGH | IO_USER | IO_USER |
      |  Compaction | IO_LOW | IO_USER | IO_USER |
      
      Flush and Compaction writes share the same call path through BlockBasedTableBuilder, WritableFileWriter, and FSWritableFile. When a new FSWritableFile object is created, its io_priority_ can be set dynamically based on the state of the WriteController. In WritableFileWriter, before the call sites of FSWritableFile functions, WritableFileWriter::DecideRateLimiterPriority() determines the rate_limiter_priority. The options (IOOptions) argument of FSWritableFile functions will be updated with the rate_limiter_priority.
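
      The decision in the table above can be sketched as follows. The enums and the helper are simplified stand-ins for `Env::IOPriority`, the WriteController state, and `WritableFileWriter::DecideRateLimiterPriority()`, not the real signatures:

      ```cpp
      // Simplified stand-ins for Env::IOPriority and WriteController state.
      enum class IOPriority { IO_LOW, IO_HIGH, IO_USER };
      enum class WriteState { kNormal, kDelayed, kStalled };
      enum class JobKind { kFlush, kCompaction };

      IOPriority DecideRateLimiterPriority(JobKind job, WriteState state) {
        if (state != WriteState::kNormal) {
          // Foreground writes are delayed or stalled, so background IO is
          // boosted to the top priority to let compaction/flush catch up.
          return IOPriority::IO_USER;
        }
        // Normal state: flushes are latency-sensitive, compactions are not.
        return job == JobKind::kFlush ? IOPriority::IO_HIGH : IOPriority::IO_LOW;
      }
      ```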
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9988
      
      Test Plan: Add unit tests.
      
      Reviewed By: anand1976
      
      Differential Revision: D36395159
      
      Pulled By: gitbw95
      
      fbshipit-source-id: a7c82fc29759139a1a07ec46c37dbf7e753474cf
      05c678e1
    • J
      Add table_properties_collector_factories override (#9995) · b84e3363
      Jay Zhuang committed
      Summary:
      Add table_properties_collector_factories override on the remote
      side.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9995
      
      Test Plan: unit test added
      
      Reviewed By: ajkr
      
      Differential Revision: D36392623
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 3ba031294d90247ca063d7de7b43178d38e3f66a
      b84e3363
    • P
      Adjust public APIs to prefer 128-bit SST unique ID (#10009) · 0070680c
      Peter Dillinger committed
      Summary:
      128 bits should suffice almost always, including for tracking in the manifest.
      
      Note that this changes the output of sst_dump --show_properties to only show 128 bits.
      
      Also introduces InternalUniqueIdToHumanString for presenting internal IDs for debugging purposes.
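
      For illustration, a 128-bit ID can be rendered as a 32-digit hex string roughly like the sketch below. This is a hypothetical formatter; the actual output of sst_dump and `InternalUniqueIdToHumanString` may differ:

      ```cpp
      #include <cstdint>
      #include <cstdio>
      #include <string>

      // Format a 128-bit ID (two 64-bit halves) as 32 uppercase hex digits.
      std::string UniqueId128ToString(uint64_t hi, uint64_t lo) {
        char buf[33];
        std::snprintf(buf, sizeof(buf), "%016llX%016llX",
                      static_cast<unsigned long long>(hi),
                      static_cast<unsigned long long>(lo));
        return std::string(buf);
      }
      ```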
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10009
      
      Test Plan: unit tests updated
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D36458189
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 93ebc4a3b6f9c73ee154383a1f8b291a5d6bbef5
      0070680c
    • X
      fix: build on risc-v (#9215) · 8b1df101
      XieJiSS committed
      Summary:
      Patch is modified from ~~https://reviews.llvm.org/file/data/du5ol5zctyqw53ma7dwz/PHID-FILE-knherxziu4tl4erti5ab/file~~
      
      Tested on Arch Linux riscv64gc (qemu)
      
      UPDATE: Seems like the above link is broken, so I tried to find the original merge request. It turns out that the LLVM folks are cherry-picking from `google/benchmark`, and the upstream should be this:
      
      https://github.com/google/benchmark/blob/808571a52fd6cc7e9f0788e08f71f0f4175b6673/src/cycleclock.h#L190
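
      The RISC-V path in the linked cycleclock.h reads the cycle CSR with the `rdcycle` instruction. A sketch is below; the non-RISC-V fallback to a steady clock is my own addition so the function stays usable for illustration on other architectures, and is not what google/benchmark does:

      ```cpp
      #include <chrono>
      #include <cstdint>

      // Sketch of a per-arch cycle counter in the spirit of
      // google/benchmark's cycleclock.h.
      inline uint64_t CycleClockNow() {
      #if defined(__riscv)
        // Read the RISC-V cycle CSR (may require counter access to be
        // enabled for user mode by the kernel/firmware).
        uint64_t cycles;
        asm volatile("rdcycle %0" : "=r"(cycles));
        return cycles;
      #else
        // Portable monotonic fallback for this sketch only.
        return static_cast<uint64_t>(
            std::chrono::steady_clock::now().time_since_epoch().count());
      #endif
      }
      ```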
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9215
      
      Reviewed By: siying, jay-zhuang
      
      Differential Revision: D34170586
      
      Pulled By: riversand963
      
      fbshipit-source-id: 41b16b9f7f3bb0f3e7b26bb078eb575499c0f0f4
      8b1df101