提交 · 76383bea5df1136c95babf5f9f40b24f85e9ad8e · kvdb / rocksdb

01 4月, 2022 2 次提交

Add microbench document (#9781) · 76383bea

由 Jay Zhuang 提交于 3月 31, 2022

Summary:
Add basic microbenchmark document

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9781

Reviewed By: gitbw95

Differential Revision: D35272866

Pulled By: jay-zhuang

fbshipit-source-id: f482e652151fd05ca46e29629261833f038a6075

76383bea

Fix DB::Open() error logging (#9784) · bbcf7b19

由 sdong 提交于 3月 31, 2022

Summary:
Right now we log a wrong error when DB::Open() fails. Fix it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9784

Test Plan: CI runs should pass

Reviewed By: ajkr, riversand963

Differential Revision: D35290203

fbshipit-source-id: ffc640afa27f6b0a2382ee153dc43f28d9e242be

bbcf7b19

31 3月, 2022 6 次提交

Do not release and re-acquire dbmutex on memtable-switch if no listener (#9758) · de9df6e8

由 Yanqin Jin 提交于 3月 30, 2022

Summary:
There is no need to release-and-acquire immediately when no listener is registered. This is
what we have been doing for `NotifyOnFlushBegin()`, `NotifyOnFlushCompleted()`, `NotifyOnCompactionBegin()`,
`NotifyOnCompactionCompleted()`, and some other `NotifyOnXX` methods in event_helpers.cc.
Do the same for `NotifyOnMemTableSealed ()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9758

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D35159552

Pulled By: riversand963

fbshipit-source-id: 6e0aac50bd5c8f506d809b6638c33a7a28d1e87f

de9df6e8

fixing issue #8345 RocksDB does not work when using UNC network paths (#9384) · e55018a8

由 bbkot 提交于 3月 30, 2022

Summary:
Fix https://github.com/facebook/rocksdb/issues/8345
RocksDB does not work with network filesystem paths on Windows, e.g. "\\hostname\folder\..."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9384

Reviewed By: mrambacher

Differential Revision: D33830622

Pulled By: riversand963

fbshipit-source-id: 2a99dc3c94415eb1460e110784b97d71600218f1

e55018a8

Document SetOptions API (#9778) · 105d7f0c

由 Peter Dillinger 提交于 3月 30, 2022

Summary:
much needed

Some other minor tweaks also

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9778

Test Plan: existing tests

Reviewed By: ajkr

Differential Revision: D35258195

Pulled By: pdillinger

fbshipit-source-id: 974ddafc23a540aacceb91da72e81593d818f99c

105d7f0c

Add 'adaptive_readahead' and 'async_io' options to db_stress (#9750) · fd660056

由 Akanksha Mahajan 提交于 3月 30, 2022

Summary:
Same as title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9750

Test Plan:
export CRASH_TEST_EXT_ARGS=" --async_io=1 --adaptive_readahead=1;
make -j crash_test

Reviewed By: jay-zhuang

Differential Revision: D35114326

Pulled By: akankshamahajan15

fbshipit-source-id: 8b05c95be09f7aff6cb9eb757aa20a6520349d45

fd660056

Add 7.0.fb/7.1.fb to check_format_compatible.sh (#9772) · 60106b91

由 Hui Xiao 提交于 3月 30, 2022

Summary:
As titled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9772

Test Plan: `./tools/check_format_compatible.sh 7.1.fb` (and manually removed 2.7.fb due to pre-existing assertion failure) passed compatibility test

Reviewed By: ajkr

Differential Revision: D35233659

Pulled By: hx235

fbshipit-source-id: 6b93263a5724d752347e04f1396628804c24a880

60106b91

Upgrade gbenchmark to 1.6.1 (#9775) · d5c34fa8

由 Jay Zhuang 提交于 3月 30, 2022

Summary:
Upgrade google benchmark to the latest 1.6.1.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9775

Test Plan: CI

Reviewed By: riversand963

Differential Revision: D35252889

Pulled By: jay-zhuang

fbshipit-source-id: 4d60dd1c6f522d0af0b3942ae8fa88e5ae17f34a

d5c34fa8

30 3月, 2022 5 次提交

pristine code · 5a085d78

由 Jingjing Wang 提交于 3月 29, 2022

Summary:
This commit was generated using `mgt import`.
pristine code for third-party libraries:
third-party/benchmark

upgrade google benchmark to v1.6.1

contains a local patch that reverts [this](https://github.com/google/benchmark/pull/1227?fbclid=IwAR2CCmIJmjU62SPPQQf_t8kdAsMjYv_Pa_GxabYUOdQpGPZUHKwbnYS_1oE) and changs `enum Flags` to be `enum Flags : uint32_t`.

Reviewed By: chadaustin

Differential Revision: D35136540

fbshipit-source-id: f3662f953cd87956e5e9b767e55e3697f99d3b49

5a085d78

Fix FileStorageInfo fields from GetLiveFilesMetaData (#9769) · 40e3f30a

由 Peter Dillinger 提交于 3月 29, 2022

Summary:
In making `SstFileMetaData` inherit from `FileStorageInfo`, I
overlooked setting some `FileStorageInfo` fields when then default
`SstFileMetaData()` ctor is used. This affected `GetLiveFilesMetaData()`.

Also removed some buggy `static_cast<size_t>`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9769

Test Plan: Updated tests

Reviewed By: jay-zhuang

Differential Revision: D35220383

Pulled By: pdillinger

fbshipit-source-id: 05b4ee468258dbd3699517e1124838bf405fe7f8

40e3f30a

Fix broken zlib dependency, update it from 1.2.11 to 1.2.12 (#9764) · 5dbdb197

由 Jack Robison 提交于 3月 29, 2022

Summary:
Zlib (https://www.zlib.net/) has been updated to 1.2.12 due to CVE-2018-25032

- https://nvd.nist.gov/vuln/detail/CVE-2018-25032
- https://github.com/madler/zlib/issues/605

The source .tar.gz is no longer available, and the Makefile for rocksdb now fails as a result. This PR updates the dependency to the newer (and available) version, 1.2.12

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9764

Reviewed By: ajkr

Differential Revision: D35220367

Pulled By: jay-zhuang

fbshipit-source-id: 1f68ff8f048a6dba42077f048ac143468f0e2478

5dbdb197

Update the version of Visual Studio required (#9765) · f61df652

由 Adam Retter 提交于 3月 29, 2022

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9765

Reviewed By: ajkr

Differential Revision: D35220757

Pulled By: jay-zhuang

fbshipit-source-id: b7749aa9bd04e3c3d7757e5e64921ff422600ec0

f61df652

Fb 9718 verify checksums is ignored (#9767) · b6ad0d95

由 Alan Paxton 提交于 3月 29, 2022

Summary:
Fixes https://github.com/facebook/rocksdb/issues/9718

The verify_checksums flag of read_options should be passed to the read options used by the BlockFetcher in a couple of cases where it is not at present. It will now happen (but did not, previously) on iteration and on [multi]get, where a fetcher is created as part of the iterate/get call.

This may result in much better performance in a few workloads where the client chooses to remove verification.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9767

Reviewed By: mrambacher

Differential Revision: D35218986

Pulled By: jay-zhuang

fbshipit-source-id: 329d29764bb70fbc7f2673440bc46c107a813bc8

b6ad0d95

29 3月, 2022 1 次提交

Update HISTORY for db_bench changes (#9759) · a5e51305

由 Mark Callaghan 提交于 3月 28, 2022

Summary:
These should have been part of the original PRs that changed db_bench, but I forgot to do that.
The PRs are:
* https://github.com/facebook/rocksdb/pull/9740
* https://github.com/facebook/rocksdb/pull/9733

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9759

Test Plan: No test needed.

Reviewed By: jay-zhuang

Differential Revision: D35159553

Pulled By: mdcallag

fbshipit-source-id: b44d075527309ee0bd4c5a92e5dd94ebf72f363e

a5e51305

26 3月, 2022 5 次提交

Fix some errors in async prefetching in FilePrefetchBuffer (#9734) · 33f8a08a

由 Akanksha Mahajan 提交于 3月 25, 2022

Summary:
In ReadOption `async_io` which prefetches the data asynchronously, db_bench and db_stress runs were failing  because wrong data was prefetched which resulted in Error: Checksum mismatched. Wrong data was copied because capacity was less than actual size needed. It has been fixed in this PR.

Since there are two separate methods for async and sync prefetching, these changes are in async prefetching methods and any changes would not effect normal prefetching. I ran the regressions to make sure normal prefetching is fine.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9734

Test Plan:
1. CircleCI jobs

2.  Ran db_bench
```
. /db_bench -use_existing_db=true
-db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32
-value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680
-duration=120 -ops_between_duration_checks=1 -async_io=1 -adaptive_readahead=1

```
3. Ran db_stress test
```
export CRASH_TEST_EXT_ARGS=" --async_io=1 --adaptive_readahead=1"
make crash_test -j
```

4. Run regressions for async_io disabled.

Old flow without any async changes:
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 7.0
Date:       Thu Mar 17 13:11:34 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
```

With async prefetching changes and async_io disabled to make sure in normal prefetching there is no regression.
 ```
 ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 --async_io=0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 7.1
Date:       Wed Mar 23 15:56:37 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom   :  481819.816 micros/op 2 ops/sec;  340.2 MB/s (250 of 250 found)
```

Reviewed By: riversand963

Differential Revision: D35058471

Pulled By: akankshamahajan15

fbshipit-source-id: 9233a1e6d97cea0c7a8111bfb9e8ac3251c341ce

33f8a08a

Correctly set ThreadState::tid (#9757) · 37de4e1d

由 Mark Callaghan 提交于 3月 25, 2022

Summary:
Fixes a bug introduced by me in https://github.com/facebook/rocksdb/pull/9733
That PR added a counter so that the per-thread seeds in ThreadState would
be unique even when --benchmarks had more than one test. But it incorrectly
used this counter as the value for ThreadState::tid as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9757

Test Plan:
Confirm that unexpectedly good QPS results on the regression tests return
to normal with this fix. I have confirmed that the QPS increase starts with
the PR 9733 diff.

Reviewed By: jay-zhuang

Differential Revision: D35149303

Pulled By: mdcallag

fbshipit-source-id: dee5cc36b7faaba6c3be6d6a253d3c2eaad72864

37de4e1d

Clarify Options::rate_limiter api doc for #9607 Rate-limit automatic WAL flush... · e2cb9aa2

由 Hui Xiao 提交于 3月 25, 2022

Clarify Options::rate_limiter api doc for #9607 Rate-limit automatic WAL flush after each user write (#9745)

Summary:
As title for https://github.com/facebook/rocksdb/pull/9607

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9745

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D35096901

Pulled By: hx235

fbshipit-source-id: 6bd3671baecfdc04579b0a81a957bfaa7bed81e1

e2cb9aa2

jni: uniformly use GetByteArrayRegion() to copy bytes (#9380) · b83263bb

由 Jermy Li 提交于 3月 25, 2022

Summary:
Uniformly use GetByteArrayRegion() instead of GetByteArrayElements()
to copy bytes.
In addition, it can avoid an inefficient ReleaseByteArrayElements()
operation.
Some benefits of GetByteArrayRegion() can be referred to:
https://stackoverflow.com/a/2480493

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9380

Reviewed By: ajkr

Differential Revision: D35135474

Pulled By: jay-zhuang

fbshipit-source-id: a32c1774d37f2d22b9bcd105d83e0bb984b71b54

b83263bb

db_bench should use a good seed when --seed is not set or set to 0 (#9740) · 1a130fa3

由 Mark Callaghan 提交于 3月 25, 2022

Summary:
This is for https://github.com/facebook/rocksdb/issues/9737

I have wasted more than a few hours running db_bench benchmarks where --seed was not set
and getting better than expected results because cache hit rates are great because
multiple invocations of db_bench used the same value for --seed or did not set it,
and then all used 0. The result is that all see the same sequence of keys.

Others have done the same. The problem is worse in that it is easy to miss and the result is a benchmark with results that are misleading.

A good way to avoid this is to set it to the equivalent of gettimeofday() when either
--seed is not set or it is set to 0 (the default).

With this change the actual seed is printed when it was 0 at process start:
  Set seed to 1647992570365606 because --seed was 0

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9740

Test Plan:
Perf results:

./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000
  readrandom   :       6.469 micros/op 154583 ops/sec;   17.1 MB/s (4000000 of 4000000 found)

./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=0
  readrandom   :       6.565 micros/op 152321 ops/sec;   16.9 MB/s (4000000 of 4000000 found)

./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=1
  readrandom   :       6.461 micros/op 154777 ops/sec;   17.1 MB/s (4000000 of 4000000 found)

./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=2
  readrandom   :       6.525 micros/op 153244 ops/sec;   17.0 MB/s (4000000 of 4000000 found)

Reviewed By: jay-zhuang

Differential Revision: D35145361

Pulled By: mdcallag

fbshipit-source-id: 2b35b153ccec46b27d7c9405997523555fc51267

1a130fa3

25 3月, 2022 9 次提交

Enable READ_BLOCK_COMPACTION_MICROS to track stats (#9722) · 98130c5a

由 myasuka 提交于 3月 24, 2022

Summary:
After commit [d642c60b](https://github.com/facebook/rocksdb/commit/d642c60bdc100f7509ca77b383cd47b51d80d810), the stats `READ_BLOCK_COMPACTION_MICROS` cannot record any compaction read duration, and it always report zero.

This PR targets to distinguish `READ_BLOCK_COMPACTION_MICROS` with `READ_BLOCK_GET_MICROS` so that `READ_BLOCK_COMPACTION_MICROS` could record the correct stats.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9722

Reviewed By: ajkr

Differential Revision: D35021870

Pulled By: jay-zhuang

fbshipit-source-id: f1a804994265e51465de64c2a08f2e0eeb6fc5a3

98130c5a

Fix make clean fail after java build (#9710) · 81d1cdca

由 Jay Zhuang 提交于 3月 24, 2022

Summary:
Seems clean-rocksjava and clean-rocks conflict.
Also remove unnecessary step in java CI build, otherwise it will rebuild
the code again as java make sample do clean up first.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9710

Test Plan: `make rocksdbjava && make clean` should return success

Reviewed By: riversand963

Differential Revision: D35122872

Pulled By: jay-zhuang

fbshipit-source-id: 2a15b83e7a763c0fc0e42e1f35aac9551f951ece

81d1cdca

Add --slow_usecs option to determine when long op message is printed (#9732) · 409635cb

由 Mark Callaghan 提交于 3月 24, 2022

Summary:
This adds the --slow_usecs option with a default value of 1M. Operations that
take this much time have a message printed when --histogram=1, --stats_interval=0
and --stats_interval_seconds=0. The current code hardwired this to 20,000 usecs
and for some stress tests that reduced throughput by 20% or more.

This is for https://github.com/facebook/rocksdb/issues/9620

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9732

Test Plan:
./db_bench --benchmarks=fillrandom,readrandom --compression_type=lz4 --slow_usecs=100 --histogram=1
./db_bench --benchmarks=fillrandom,readrandom --compression_type=lz4 --slow_usecs=100000 --histogram=1

Reviewed By: jay-zhuang

Differential Revision: D35121522

Pulled By: mdcallag

fbshipit-source-id: daf27f937efd748980545d6395db332712fc078b

409635cb

Fix heap use-after-free race with DropColumnFamily (#9730) · cad80997

由 Peter Dillinger 提交于 3月 24, 2022

Summary:
Although ColumnFamilySet comments say that DB mutex can be
freed during iteration, as long as you hold a ref while releasing DB
mutex, this is not quite true because UnrefAndTryDelete might delete cfd
right before it is needed to get ->next_ for the next iteration of the
loop.

This change solves the problem by making a wrapper class that makes such
iteration easier while handling the tricky details of UnrefAndTryDelete
on the previous cfd only after getting next_ in operator++.

FreeDeadColumnFamilies should already have been obsolete; this removes
it for good. Similarly, ColumnFamilySet::iterator doesn't need to check
for cfd with 0 refs, because those are immediately deleted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9730

Test Plan:
was reported with ASAN on unit tests like
DBLogicalBlockSizeCacheTest.CreateColumnFamily (very rare); keep watching

Reviewed By: ltamasi

Differential Revision: D35038143

Pulled By: pdillinger

fbshipit-source-id: 0a5478d5be96c135343a00603711b7df43ae19c9

cad80997

Extend Java RocksDB iterators to support indirect Byte Buffers (#9222) · dec144f1

由 Alan Paxton 提交于 3月 24, 2022

Summary:
Extend Java RocksDB iterators to support indirect byte buffers, to add to the existing support for direct byte buffers.
Code to distinguish direct/indirect buffers is switched in Java, and a 2nd separate JNI call implemented to support indirect
buffers. Indirect support passes contained buffers using byte[]

There are some Java subclasses of iterator (WBWIIterator, SstFileReaderIterator) which also now have parallel JNI support functions implemented, along with direct/indirect switches in Java methods.

Closes https://github.com/facebook/rocksdb/issues/6282

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9222

Reviewed By: ajkr

Differential Revision: D35115283

Pulled By: jay-zhuang

fbshipit-source-id: f8d5d20b975aef700560fbcc99f707bb028dc42e

dec144f1

Add new checksum type kXXH3 to Java API (#9749) · 8ae0c33a

由 Alan Paxton 提交于 3月 24, 2022

Summary:
Fix https://github.com/facebook/rocksdb/issues/9720

And make a couple of incidental tests test the thing they were meant to test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9749

Reviewed By: ajkr

Differential Revision: D35115298

Pulled By: jay-zhuang

fbshipit-source-id: d687d1f070d29216be9693601c71131bbea87c79

8ae0c33a

db_bench should fail on bad values for --compaction_fadvice and... · f219e3d5

由 Mark Callaghan 提交于 3月 24, 2022

db_bench should fail on bad values for --compaction_fadvice and --value_size_distribution_type (#9741)

Summary:
db_bench quietly parses and ignores bad values for --compaction_fadvice and --value_size_distribution_type
I prefer that it fail for them as it does for bad option values in most other cases. Otherwise a benchmark
result will be provided for the wrong configuration and the result will be misleading.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9741

Test Plan:
These now fail:
./db_bench --compaction_fadvice=noney
Unknown compaction fadvice:noney

./db_bench --value_size_distribution_type=norma
Cannot parse distribution type 'norma'

While correct values continue to work:
 ./db_bench --value_size_distribution_type=normal
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags

./db_bench --compaction_fadvice=none
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags

Reviewed By: siying

Differential Revision: D35115973

Pulled By: mdcallag

fbshipit-source-id: c2b10de5c2d1ea7c7539e676f5bd556351f5d370

f219e3d5

Add two new targets to determinator (#9753) · 862304a1

由 Yanqin Jin 提交于 3月 24, 2022

Summary:
Test plan
```
build_tools/rocksdb-lego-determinator stress_crash_with_multiops_wc_txn
build_tools/rocksdb-lego-determinator stress_crash_with_multiops_wp_txn
```

Spot check the printed job spec.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9753

Reviewed By: jay-zhuang

Differential Revision: D35117116

Pulled By: riversand963

fbshipit-source-id: a7ed82e8cb9bc2fd13f4f00291c6a39457415fb0

862304a1

Remove DBGet P95/P99 benchmark metrics (#9742) · 18463f8c

由 Jay Zhuang 提交于 3月 24, 2022

Summary:
DBGet p95 and p99 have high variation, remove them for now.
Also increase the iteration to 3 to avoid false positive.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9742

Test Plan: Internal CI

Reviewed By: ajkr

Differential Revision: D35082820

Pulled By: jay-zhuang

fbshipit-source-id: facc1d56b94e54aa8c8852c207aae2ae4e4924b0

18463f8c

24 3月, 2022 8 次提交

Avoid seed reuse when --benchmarks has more than one test (#9733) · d583d23d

由 Mark Callaghan 提交于 3月 24, 2022

Summary:
When --benchmarks has more than one test then the threads in one benchmark
will use the same set of seeds as the threads in the previous benchmark.
This diff fixe that.

This fixes https://github.com/facebook/rocksdb/issues/9632

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9733

Test Plan:
For this command line the block cache is 8GB, so it caches at most 1024 8KB blocks. Note that without
this diff the second run of readrandom has a much better response time because seed reuse means the
second run reads the same 1000 blocks as the first run and they are cached at that point. But with
this diff that does not happen.

./db_bench --benchmarks=fillseq,flush,compact0,waitforcompaction,levelstats,readrandom,readrandom --compression_type=zlib --num=10000000 --reads=1000 --block_size=8192

...

```
Level Files Size(MB)
--------------------
  0        0        0
  1       11      238
  2        9      253
  3        0        0
  4        0        0
  5        0        0
  6        0        0
```

 --- perf results without this diff

DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom   :      46.212 micros/op 21618 ops/sec;    2.4 MB/s (1000 of 1000 found)

DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom   :      21.963 micros/op 45450 ops/sec;    5.0 MB/s (1000 of 1000 found)

 --- perf results with this diff

DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom   :      47.213 micros/op 21126 ops/sec;    2.3 MB/s (1000 of 1000 found)

DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom   :      42.880 micros/op 23299 ops/sec;    2.6 MB/s (1000 of 1000 found)

Reviewed By: jay-zhuang

Differential Revision: D35089763

Pulled By: mdcallag

fbshipit-source-id: 1b50143a07afe876b8c8e5fa50dd94a8ce57fc6b

d583d23d

Revise history of 7.1.0 for patch (#9746) · 727d11ce

由 Peter Dillinger 提交于 3月 24, 2022

Summary:
This updates main branch with a HISTORY update going into
7.1.fb branch before tagging 7.1.0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9746

Test Plan: HISTORY.md only

Reviewed By: ajkr, hx235

Differential Revision: D35099194

Pulled By: pdillinger

fbshipit-source-id: b74ea8b626118dac235e387038420829850b8da2

727d11ce

Add new determinators for multiops transactions stress test (#9708) · c18c4a08

由 Yanqin Jin 提交于 3月 23, 2022

Summary:
Add determinators for multiops transactions stress test with
write-committed and write-prepared policies.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9708

Test Plan: Internal CI

Reviewed By: jay-zhuang

Differential Revision: D34967263

Pulled By: riversand963

fbshipit-source-id: 170a0842d56dccb6ed6bc0c5adfd33849acd6b31

c18c4a08

Fix a race condition in WAL tracking causing DB open failure (#9715) · e0c84aa0

由 Yanqin Jin 提交于 3月 23, 2022

Summary:
There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.

The race condition is between two background flush threads trying to install flush results to the MANIFEST.

Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially,
both column families have one mutable (active) memtable whose data backed by 6.log.

1. Trigger a manual flush for "cf1", creating a 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst

```
Time  BgFlushThread1                                    BgFlushThread2
 |    mutex_.Lock()
 |    precompute min_wal_to_keep as 6
 |    mutex_.Unlock()
 |                                                     mutex_.Lock()
 |                                                     precompute min_wal_to_keep as 6
 |                                                     join MANIFEST write queue and mutex_.Unlock()
 |    write to MANIFEST
 |    mutex_.Lock()
 |    cfd1->log_number = 7
 |    Signal bg_flush_2 and mutex_.Unlock()
 |                                                     wake up and mutex_.Lock()
 |                                                     cfd0->log_number = 8
 |                                                     FindObsoleteFiles() with job_context->log_number == 7
 |                                                     mutex_.Unlock()
 |                                                     PurgeObsoleteFiles() deletes 6.log
 V
```

As shown in the above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6).
Similarly, BgThread1 thinks that min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6).
No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`,
due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.
The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e.
the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist.
If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true.

We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know
the correct min wal number until the other bg flush threads have finished committing to the manifest and updated
the `cfd::log_number`.
To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`,
and use it to track WAL file deletion in non-2pc mode as well.
This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.
`min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs
above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we
make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715

Test Plan:
```
make check
```
Also ran stress test below (with asan) to make sure it completes successfully.
```
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test
```

Reviewed By: ltamasi

Differential Revision: D34984412

Pulled By: riversand963

fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005

e0c84aa0

Return invalid argument if batch is null (#9744) · 29bec740

由 Yanqin Jin 提交于 3月 23, 2022

Summary:
Originally, a corruption will be returned by `DBImpl::WriteImpl(batch...)` if batch is
null. This is inaccurate since there is no data corruption.
Return `Status::InvalidArgument()` instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9744

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D35086268

Pulled By: riversand963

fbshipit-source-id: 677397b007a53bc25210eac0178d49c9797b5951

29bec740

db_bench should fail when an option uses an invalid compression type (#9729) · 6904fd0c

由 Mark Callaghan 提交于 3月 23, 2022

Summary:
This changes db_bench to fail at startup for invalid compression types. It had been
changing them to Snappy. For other invalid options it fails at startup.

This is for https://github.com/facebook/rocksdb/issues/9621

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9729

Test Plan:
This continues to work:
./db_bench --benchmarks=fillrandom --compression_type=lz4

This now fails rather than changing the compression type to Snappy
./db_bench --benchmarks=fillrandom --compression_type=lz44
Cannot parse compression type 'lz44'

Reviewed By: jay-zhuang

Differential Revision: D35081323

Pulled By: mdcallag

fbshipit-source-id: 9b38c835abddce11aa7feb235df63f53cf829981

6904fd0c

Fix a major performance bug in 7.0 re: filter compatibility (#9736) · 91687d70

由 Peter Dillinger 提交于 3月 23, 2022

Summary:
Bloom filters generated by pre-7.0 releases are not read by
7.0.x releases (and vice-versa) due to changes to FilterPolicy::Name()
in https://github.com/facebook/rocksdb/issues/9590. This can severely impact read performance and read I/O on
upgrade or downgrade with existing DB, but not data correctness.

To fix, we go back using the old, unified name in SST metadata but (for
a while anyway) recognize the aliases that could be generated by early
7.0.x releases. This unfortunately requires a public API change to avoid
interfering with all the good changes from https://github.com/facebook/rocksdb/issues/9590, but the API change
only affects users with custom FilterPolicy, which should be very few.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9736

Test Plan:
manual

Generate DBs with
```
./db_bench.7.0 -db=/dev/shm/rocksdb.7.0 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0
```
and similar. Compare with
```
for IMPL in 6.29 7.0 fixed; do for DB in 6.29 7.0 fixed; do echo "Testing $IMPL on $DB:"; ./db_bench.$IMPL -db=/dev/shm/rocksdb.$DB -use_existing_db -readonly -bloom_bits=10 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=10 2>&1 | grep micros/op; done; done
```

Results:
```
Testing 6.29 on 6.29:
readrandom   :      34.381 micros/op 29085 ops/sec;    3.2 MB/s (291999 of 291999 found)
Testing 6.29 on 7.0:
readrandom   :     190.443 micros/op 5249 ops/sec;    0.6 MB/s (52999 of 52999 found)
Testing 6.29 on fixed:
readrandom   :      40.148 micros/op 24907 ops/sec;    2.8 MB/s (249999 of 249999 found)
Testing 7.0 on 6.29:
readrandom   :     229.430 micros/op 4357 ops/sec;    0.5 MB/s (43999 of 43999 found)
Testing 7.0 on 7.0:
readrandom   :      33.348 micros/op 29986 ops/sec;    3.3 MB/s (299999 of 299999 found)
Testing 7.0 on fixed:
readrandom   :     152.734 micros/op 6546 ops/sec;    0.7 MB/s (65999 of 65999 found)
Testing fixed on 6.29:
readrandom   :      32.024 micros/op 31224 ops/sec;    3.5 MB/s (312999 of 312999 found)
Testing fixed on 7.0:
readrandom   :      33.990 micros/op 29390 ops/sec;    3.3 MB/s (294999 of 294999 found)
Testing fixed on fixed:
readrandom   :      28.714 micros/op 34825 ops/sec;    3.9 MB/s (348999 of 348999 found)
```

Just paying attention to order of magnitude of ops/sec (short test
durations, lots of noise), it's clear that with the fix we can read <= 6.29
& >= 7.0 at full speed, where neither 6.29 nor 7.0 can on both. And 6.29
release can properly read fixed DB at full speed.

Reviewed By: siying, ajkr

Differential Revision: D35057844

Pulled By: pdillinger

fbshipit-source-id: a46893a6af4bf084375ebe4728066d00eb08f050

91687d70

Add number of running flushes & compactions to --stats_per_interval output (#9726) · d71e5a5b

由 Mark Callaghan 提交于 3月 23, 2022

Summary:
This is for https://github.com/facebook/rocksdb/issues/9709 and add two lines to the end of DB Stats
for num-running-compactions and num-running-flushes.

For example ...

** DB Stats **
Uptime(secs): 6.0 total, 1.0 interval
Cumulative writes: 915K writes, 915K keys, 915K commit groups, 1.0 writes per commit group, ingest: 0.11 GB, 18.95 MB/s
Cumulative WAL: 915K writes, 0 syncs, 915000.00 writes per sync, written: 0.11 GB, 18.95 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 133K writes, 133K keys, 133K commit groups, 1.0 writes per commit group, ingest: 16.62 MB, 16.53 MB/s
Interval WAL: 133K writes, 0 syncs, 133000.00 writes per sync, written: 0.02 GB, 16.53 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
num-running-compactions: 0
num-running-flushes: 0

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9726

Reviewed By: jay-zhuang

Differential Revision: D35066759

Pulled By: mdcallag

fbshipit-source-id: c161fadd3c15c5aa715a820dab6bfedb46dc099b

d71e5a5b

23 3月, 2022 4 次提交

Print information about all column families when using ldb (#9719) · 3bd150c4

由 Yanqin Jin 提交于 3月 22, 2022

Summary:
Before this PR, the following command prints only the default column
family's information in the end:
```
ldb --db=. --hex manifest_dump --verbose
```

We should print all column families instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9719

Test Plan:
`make check` makes sure nothing breaks.

Generate a DB, use the above command to verify all column families are
printed.

Reviewed By: akankshamahajan15

Differential Revision: D34992453

Pulled By: riversand963

fbshipit-source-id: de1d38c4539cd89f74e1a6240ad7a6e2416bf198

3bd150c4

Add async_io read option in db_bench (#9735) · f07eec1b

由 Akanksha Mahajan 提交于 3月 22, 2022

Summary:
Add async_io Read option in db_bench

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9735

Test Plan:
./db_bench -use_existing_db=true
-db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32
-value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680
-duration=120 -ops_between_duration_checks=1 -async_io=1

Reviewed By: riversand963

Differential Revision: D35058482

Pulled By: akankshamahajan15

fbshipit-source-id: 1522b638c79f6d85bb7408c67f6ab76dbabeeee7

f07eec1b

For db_bench --benchmarks=fillseq with --num_multi_db load databases … (#9713) · 63a284a6

由 Mark Callaghan 提交于 3月 22, 2022

Summary:
…in order

This fixes https://github.com/facebook/rocksdb/issues/9650
For db_bench --benchmarks=fillseq --num_multi_db=X it loads databases in sequence
rather than randomly choosing a database per Put. The benefits are:
1) avoids long delays between flushing memtables
2) avoids flushing memtables for all of them at the same point in time
3) puts same number of keys per database so that query tests will find keys as expected

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9713

Test Plan:
Using db_bench.1 without the change and db_bench.2 with the change:

for i in 1 2; do rm -rf /data/m/rx/* ; time ./db_bench.$i --db=/data/m/rx --benchmarks=fillseq --num_multi_db=4 --num=10000000; du -hs /data/m/rx ; done

 --- without the change
    fillseq      :       3.188 micros/op 313682 ops/sec;   34.7 MB/s
    real    2m7.787s
    user    1m52.776s
    sys     0m46.549s
    2.7G    /data/m/rx

 --- with the change

    fillseq      :       3.149 micros/op 317563 ops/sec;   35.1 MB/s
    real    2m6.196s
    user    1m51.482s
    sys     0m46.003s
    2.7G    /data/m/rx

    Also, temporarily added a printf to confirm that the code switches to the next database at the right time
    ZZ switch to db 1 at 10000000
    ZZ switch to db 2 at 20000000
    ZZ switch to db 3 at 30000000

for i in 1 2; do rm -rf /data/m/rx/* ; time ./db_bench.$i --db=/data/m/rx --benchmarks=fillseq,readrandom --num_multi_db=4 --num=100000; du -hs /data/m/rx ; done

 --- without the change, smaller database, note that not all keys are found by readrandom because databases have < and > --num keys

    fillseq      :       3.176 micros/op 314805 ops/sec;   34.8 MB/s
    readrandom   :       1.913 micros/op 522616 ops/sec;   57.7 MB/s (99873 of 100000 found)

 --- with the change, smaller database, note that all keys are found by readrandom

    fillseq      :       3.110 micros/op 321566 ops/sec;   35.6 MB/s
    readrandom   :       1.714 micros/op 583257 ops/sec;   64.5 MB/s (100000 of 100000 found)

Reviewed By: jay-zhuang

Differential Revision: D35030168

Pulled By: mdcallag

fbshipit-source-id: 2a18c4ec571d954cf5a57b00a11802a3608823ee

63a284a6

Update Cache::Release param from force_erase to erase_if_last_ref (#9728) · 8102690a

由 gitbw95 提交于 3月 22, 2022

Summary:
The param name force_erase may be misleading, since the handle is erased only if it has last reference even if the param is set true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9728

Reviewed By: pdillinger

Differential Revision: D35038673

Pulled By: gitbw95

fbshipit-source-id: 0d16d1e8fed17b97eba7fb53207119332f659a5f

8102690a

kvdb / rocksdb 11 个月 前同步成功

kvdb / rocksdb
11 个月前同步成功