1. 29 Sep 2022, 1 commit
  2. 28 Sep 2022, 3 commits
  3. 27 Sep 2022, 5 commits
    • Fix segfault in Iterator::Refresh() (#10739) · df492791
      Committed by Changyu Bi
      Summary:
      When a new internal iterator is constructed during iterator refresh, the pointer to the previous memtable range tombstone iterator was not cleared. This could cause a segfault in future `Refresh()` calls when they try to free the memtable range tombstones. This PR fixes the issue.
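
      A minimal sketch of the dangling-pointer pattern being fixed, with hypothetical member and function names (not the actual patch):
      ```
      // Hypothetical sketch: when Refresh() rebuilds the internal iterator,
      // the old memtable range tombstone iterator is freed along with it, so
      // the cached pointer must be cleared. Leaving it dangling makes the
      // next Refresh() try to free it again, causing the segfault above.
      void RefreshInternalIterator() {
        delete memtable_range_tombstone_iter_;     // freed with the old iterator
        memtable_range_tombstone_iter_ = nullptr;  // the step the bug was missing
        // ... construct the new internal iterator ...
      }
      ```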
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10739
      
      Test Plan: Added a unit test in db_range_del_test.cc that reproduces this issue.
      
      Reviewed By: ajkr, riversand963
      
      Differential Revision: D39825283
      
      Pulled By: cbi42
      
      fbshipit-source-id: 3b59a2b73865aed39e28cdd5c1b57eed7991b94c
    • Support WriteCommit policy with sync_fault_injection=1 (#10624) · aed30ddf
      Committed by Hui Xiao
      Summary:
      **Context:**
      Prior to this PR, correctness testing with un-synced data loss [disabled](https://github.com/facebook/rocksdb/pull/10605) transactions (`use_txn=1`), and thus all `txn_write_policy` values. This PR improves that by adding support for one policy: WriteCommit (`txn_write_policy=0`).
      
      **Summary:**
      The key to this support is (a) handling Mark{Begin, End}Prepare/MarkCommit/MarkRollback correctly when constructing ExpectedState under the WriteCommit policy and (b) monitoring CI jobs and resolving any test incompatibility issues until the jobs are stable. (b) is covered by the test plan.
      
      For (a)
      - During prepare (i.e., between `MarkBeginPrepare()` and `MarkEndPrepare(xid)`), `ExpectedStateTraceRecordHandler` buffers all writes by adding them to an internal `WriteBatch`.
      - On `MarkEndPrepare()`, that `WriteBatch` is associated with the transaction's `xid`.
      - During commit (i.e., on `MarkCommit(xid)`), `ExpectedStateTraceRecordHandler` retrieves and iterates the internal `WriteBatch` and finally applies those writes to `ExpectedState`.
      - During rollback (i.e., on `MarkRollback(xid)`), `ExpectedStateTraceRecordHandler` erases the internal `WriteBatch` from the map. (See the sketch after this list.)
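
      A condensed, self-contained sketch of (a); the class and member names below are simplified stand-ins for the actual handler:
      ```
      #include <map>
      #include <string>
      #include "rocksdb/write_batch.h"

      // Sketch: buffer prepared writes per transaction and apply them to the
      // expected state on commit; a rollback simply drops the batch.
      class WriteCommitHandlerSketch {
       public:
        void MarkBeginPrepare() { buffered_writes_.Clear(); }
        void MarkEndPrepare(const std::string& xid) {
          // Associate the buffered writes with the transaction's xid.
          xid_to_batch_.emplace(xid, buffered_writes_);
          buffered_writes_.Clear();
        }
        void MarkCommit(const std::string& xid) {
          // Retrieve the batch and apply its writes to ExpectedState
          // (iteration via WriteBatch::Iterate() is elided here).
          ApplyToExpectedState(xid_to_batch_.at(xid));
          xid_to_batch_.erase(xid);
        }
        void MarkRollback(const std::string& xid) { xid_to_batch_.erase(xid); }

       private:
        void ApplyToExpectedState(const rocksdb::WriteBatch& /*batch*/) { /* elided */ }
        rocksdb::WriteBatch buffered_writes_;
        std::map<std::string, rocksdb::WriteBatch> xid_to_batch_;
      };
      ```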
      
      For (b) - one major issue described below:
      - TransactionDB in db_stress recovers prepared-but-not-committed txns from the previous crashed run by randomly committing or rolling them back at the start of the current run; see a historical [PR](https://github.com/facebook/rocksdb/commit/6d06be22c083ccf185fd38dba49fde73b644b4c1) that predates correctness testing.
      - We then verify those processed keys in the recovered db against their expected state.
      - However, now that we turn on `sync_fault_injection=1`, the expected state is constructed from the trace instead of from the LATEST.state of the previous run. The expected state used to verify those processed keys therefore won't contain UNKNOWN_SENTINEL as it should; see test 1 for a failing case.
      - Therefore, we decided to manually update the expected state to UNKNOWN_SENTINEL as part of the processing.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10624
      
      Test Plan:
      1. A test exposing the major issue described above. It fails without setting UNKNOWN_SENTINEL in the expected state during processing and passes with it:
      ```
      db=/dev/shm/rocksdb_crashtest_blackbox
      exp=/dev/shm/rocksdb_crashtest_expected
      dbt=$db.tmp
      expt=$exp.tmp
      
      rm -rf $db $exp
      mkdir -p $exp
      
      echo "RUN 1"
      ./db_stress \
      --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
      --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
      pid=$!
      sleep 0.2
      sleep 20
      kill $pid
      sleep 0.2
      
      echo "RUN 2"
      ./db_stress \
      --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
      --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
      pid=$!
      sleep 0.2
      sleep 20
      kill $pid
      sleep 0.2
      
      echo "RUN 3"
      ./db_stress \
      --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
      --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1
      ```
      
      2. Manual testing to ensure ExpectedState is constructed correctly during recovery by verifying it against previously crashed TransactionDB's WAL.
         - Run the following command to crash a TransactionDB with the WriteCommit policy, then run `./ldb dump_wal` on its WAL file:
      ```
      db=/dev/shm/rocksdb_crashtest_blackbox
      exp=/dev/shm/rocksdb_crashtest_expected
      rm -rf $db $exp
      mkdir -p $exp
      
      ./db_stress \
      	--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
      	--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
      pid=$!
      sleep 30
      kill $pid
      sleep 1
      ```
      - Run the following command to verify recovery of the crashed db under a debugger. Compare the step-wise results with the WAL records (e.g., WriteBatch contents, xids, prepare/commit/rollback markers):
      ```
         ./db_stress \
      	--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
      	--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1
      ```
      3. Automatic testing by triggering all RocksDB stress/crash test jobs for 3 rounds with no failure.
      
      Reviewed By: ajkr, riversand963
      
      Differential Revision: D39199373
      
      Pulled By: hx235
      
      fbshipit-source-id: 7a1dec0e3e2ee6ea86ddf5dd19ceb5543a3d6f0c
    • Add OpenSSL to docker image (#10741) · 5d7cf311
      Committed by anand76
      Summary:
      Update the docker image with OpenSSL, required by the folly build.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10741
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D39831081
      
      Pulled By: anand1976
      
      fbshipit-source-id: 900154f70a456d1b6f9e384b8bdbcc227af4adbc
    • Update HISTORY to mention PR #10724 (#10737) · 52f24117
      Committed by Yanqin Jin
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10737
      
      Reviewed By: cbi42
      
      Differential Revision: D39825386
      
      Pulled By: riversand963
      
      fbshipit-source-id: a3c55f2777e034d6ae6ff44ef0219d9fbbf1cc96
    • Small cleanup in NonBatchedOpsStressTest::VerifyDb (#10740) · 2280b261
      Committed by Levi Tamasi
      Summary:
      The PR cleans up the logic in `NonBatchedOpsStressTest::VerifyDb` so that
      the verification method is picked using a single random draw. It also
      eliminates some repeated key comparisons and makes some small code
      hygiene improvements.
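
      As an illustration only, a single-draw selection could look like the following; the method choices are placeholders, not the actual db_stress code:
      ```
      // One random draw decides the verification method, instead of a
      // separate coin flip per method.
      const int kNumMethods = 3;
      switch (thread->rand.Uniform(kNumMethods)) {
        case 0:
          // verify via Get()
          break;
        case 1:
          // verify via MultiGet()
          break;
        default:
          // verify via an iterator
          break;
      }
      ```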
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10740
      
      Test Plan: Ran a simple blackbox crash test.
      
      Reviewed By: riversand963
      
      Differential Revision: D39828646
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 60ee5a3bb1851278f62c7d83b0c93b902ed9702e
  4. 24 Sep 2022, 2 commits
  5. 23 Sep 2022, 5 commits
  6. 22 Sep 2022, 7 commits
  7. 21 Sep 2022, 1 commit
  8. 20 Sep 2022, 3 commits
  9. 19 Sep 2022, 1 commit
  10. 17 Sep 2022, 3 commits
    • Add enable_split_merge option for CompressedSecondaryCache (#10690) · 2cc5b395
      Committed by gitbw95
      Summary:
      `enable_custom_split_merge` is added to enable the custom split and merge feature, which splits the compressed value into chunks so that they better fit the jemalloc bins.
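
      A minimal usage sketch, assuming the `CompressedSecondaryCacheOptions` API of this release; the capacity value is an arbitrary example:
      ```
      #include "rocksdb/cache.h"

      rocksdb::CompressedSecondaryCacheOptions opts;
      opts.capacity = 64 << 20;               // 64 MB secondary cache (arbitrary)
      opts.enable_custom_split_merge = true;  // split values into chunks that
                                              // better fit jemalloc bins
      std::shared_ptr<rocksdb::SecondaryCache> secondary_cache =
          rocksdb::NewCompressedSecondaryCache(opts);
      ```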
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10690
      
      Test Plan:
      Unit Tests
      Stress Tests
      
      Reviewed By: anand1976
      
      Differential Revision: D39567604
      
      Pulled By: gitbw95
      
      fbshipit-source-id: f6d1d46200f365220055f793514601dcb0edc4b7
    • Fix an incorrect MultiGet assertion (#10695) · e053ccde
      Committed by anand76
      Summary:
      The assertion in ```FilePickerMultiGet::ReplaceRange()``` was incorrect. The function should only be called to replace the range after finishing the search in the current level, which is indicated by ```hit_file_ == nullptr```, i.e., no more overlapping files in this level.
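
      Illustratively, the corrected precondition amounts to something like the following (simplified):
      ```
      // ReplaceRange() may only be called once the search in the current
      // level has finished, i.e., no more overlapping files in this level.
      assert(hit_file_ == nullptr);
      ```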
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10695
      
      Reviewed By: gitbw95
      
      Differential Revision: D39583217
      
      Pulled By: anand1976
      
      fbshipit-source-id: d4cedfb2b62fb9f3a083e9848a403ae6342f0519
    • Call experimental new clock cache HyperClockCache (#10684) · 0f91c72a
      Committed by Peter Dillinger
      Summary:
      This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons:
      * We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache."
      * We want to highlight the key feature: it's fast (especially under parallel load)
      * Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements.
      * We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.)
      
      Some API detail:
      * To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`.
      * Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated.
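
      A minimal construction sketch using `MakeSharedCache()` as described above; the capacity and per-entry charge values are arbitrary examples:
      ```
      #include "rocksdb/cache.h"

      rocksdb::HyperClockCacheOptions opts(
          /*_capacity=*/1 << 30,                  // 1 GB (arbitrary)
          /*_estimated_entry_charge=*/8 * 1024);  // required per-entry estimate
      std::shared_ptr<rocksdb::Cache> cache = opts.MakeSharedCache();
      ```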
      
      For performance tests see https://github.com/facebook/rocksdb/pull/10626
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684
      
      Test Plan: no interesting functional changes; tests updated
      
      Reviewed By: anand1976
      
      Differential Revision: D39547800
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf
  11. 16 Sep 2022, 7 commits
    • Revamp, optimize new experimental clock cache (#10626) · 57243486
      Committed by Peter Dillinger
      Summary:
      * Consolidates most metadata into a single word per slot so that more
      can be accomplished with a single atomic update. In the common case,
      Lookup was previously about 4 atomic updates, now just 1 atomic update.
      Common case Release was previously 1 atomic read + 1 atomic update,
      now just 1 atomic update.
      * Eliminate spins / waits / yields, which likely threaten some "lock free"
      benefits. Compare-exchange loops are only used in explicit Erase, and
      strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
      exchange.
      * Relaxes some aggressiveness and guarantees. For example,
        * Duplicate Inserts will sometimes go undetected and the shadow duplicate
          will age out with eviction.
        * In many cases, the older Inserted value for a given cache key will be kept
        (i.e. Insert does not support overwrite).
        * Entries explicitly erased (rather than evicted) might not be freed
        immediately in some rare cases.
        * With strict_capacity_limit=false, capacity limit is not tracked/enforced as
        precisely as LRUCache, but is self-correcting and should only deviate by a
        very small number of extra or fewer entries.
      * Use smaller "computed default" number of cache shards in many cases,
      because benefits to larger usage tracking / eviction pools outweigh the small
      cost of more lock-free atomic contention. The improvement in CPU and I/O
      is dramatic in some limit-memory cases.
      * Even without the sharding change, the eviction algorithm is likely more
      effective than LRU overall because it's more stateful, even though the
      "hot path" state tracking for it is essentially free with ref counting. It
      is like a generalized CLOCK with aging (see code comments). I don't have
      performance numbers showing a specific improvement, but in theory, for a
      Poisson access pattern to each block, keeping some state allows better
      estimation of time to next access (Poisson interval) than strict LRU. The
      bounded randomness in CLOCK can also reduce "cliff" effect for repeated
      range scans approaching and exceeding cache size.
      
      ## Hot path algorithm comparison
      Rough descriptions, focusing on number and kind of atomic operations:
      * Old `Lookup()` (2-5 atomic updates per probe):
      ```
      Loop:
        Increment internal ref count at slot
        If possible hit:
          Check flags atomic (and non-atomic fields)
          If cache hit:
            Three distinct updates to 'flags' atomic
            Increment refs for internal-to-external
            Return
        Decrement internal ref count
      while atomic read 'displacements' > 0
      ```
      * New `Lookup()` (1-2 atomic updates per probe):
      ```
      Loop:
        Increment acquire counter in meta word (optimistic)
        If visible entry (already read meta word):
          If match (read non-atomic fields):
            Return
          Else:
            Decrement acquire counter in meta word
        Else if invisible entry (rare, already read meta word):
          Decrement acquire counter in meta word
      while atomic read 'displacements' > 0
      ```
      * Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
      ```
      Read atomic ref count
      If last reference and invisible (rare):
        Use CAS etc. to remove
        Return
      Else:
        Decrement ref count
      ```
      * New `Release()` (1 unconditional atomic update, rarely more):
      ```
      Increment release counter in meta word
      If last reference and invisible (rare):
        Use CAS etc. to remove
        Return
      ```
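
      A toy illustration of the single-meta-word idea (not RocksDB's actual bit layout): packing the acquire and release counters into one atomic word makes each hot-path operation a single unconditional fetch-add:
      ```
      #include <atomic>
      #include <cstdint>

      struct SlotMeta {
        // Hypothetical packing: low 32 bits count acquires, high 32 bits
        // count releases; the reference count is acquires minus releases.
        std::atomic<uint64_t> meta{0};
        static constexpr uint64_t kAcquire = uint64_t{1};
        static constexpr uint64_t kRelease = uint64_t{1} << 32;

        // Lookup hot path: one unconditional atomic update.
        uint64_t Acquire() {
          return meta.fetch_add(kAcquire, std::memory_order_acquire);
        }
        // Release hot path: also one unconditional atomic update.
        uint64_t Release() {
          return meta.fetch_add(kRelease, std::memory_order_release);
        }
      };
      ```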
      
      ## Performance test setup
      Build DB with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
      ```
      Test with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
      ```
      Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
      
      base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
      folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
      gt_clock: experimental ClockCache before this change
      new_clock: experimental ClockCache with this change
      
      ## Performance test results
      First test "hot path" read performance, with block cache large enough for whole DB:
      4181MB 1thread base -> kops/s: 47.761
      4181MB 1thread folly -> kops/s: 45.877
      4181MB 1thread gt_clock -> kops/s: 51.092
      4181MB 1thread new_clock -> kops/s: 53.944
      
      4181MB 16thread base -> kops/s: 284.567
      4181MB 16thread folly -> kops/s: 249.015
      4181MB 16thread gt_clock -> kops/s: 743.762
      4181MB 16thread new_clock -> kops/s: 861.821
      
      4181MB 24thread base -> kops/s: 303.415
      4181MB 24thread folly -> kops/s: 266.548
      4181MB 24thread gt_clock -> kops/s: 975.706
      4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
      
      4181MB 32thread base -> kops/s: 311.251
      4181MB 32thread folly -> kops/s: 274.952
      4181MB 32thread gt_clock -> kops/s: 1045.98
      4181MB 32thread new_clock -> kops/s: 1370.38
      
      4181MB 48thread base -> kops/s: 310.504
      4181MB 48thread folly -> kops/s: 268.322
      4181MB 48thread gt_clock -> kops/s: 1195.65
      4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
      
      4181MB 64thread base -> kops/s: 307.839
      4181MB 64thread folly -> kops/s: 272.172
      4181MB 64thread gt_clock -> kops/s: 1204.47
      4181MB 64thread new_clock -> kops/s: 1615.37
      
      4181MB 128thread base -> kops/s: 310.934
      4181MB 128thread folly -> kops/s: 267.468
      4181MB 128thread gt_clock -> kops/s: 1188.75
      4181MB 128thread new_clock -> kops/s: 1595.46
      
      Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
      
      Now test a large block cache with low miss ratio, but some eviction is required:
      1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
      1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
      1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
      1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
      
      1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
      1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
      1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
      1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
      
      1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
      1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
      1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
      1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
      
      610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
      610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
      610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
      610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
      
      610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
      610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
      610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
      610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
      
      610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
      610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
      610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
      610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
      
      The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
      
      Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
      
      233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
      233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
      233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
      233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
      
      233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
      233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
      233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
      233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
      
      233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
      233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
      233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
      233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
      
      89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
      89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
      89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
      89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
      
      89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
      89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
      89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
      89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
      
      89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
      89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
      89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
      89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
      
      ^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
      
      Even smaller cache size:
      34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
      34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
      34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
      34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
      
      34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
      34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
      34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
      34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
      
      34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
      34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
      34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
      34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
      
      As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
      
      13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
      13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
      13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
      13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
      
      13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
      13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
      13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
      13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
      
      13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
      13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
      13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
      13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
      
      gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
      
      13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
      13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
      13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
      13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
      
      13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
      13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
      13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
      13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
      
      Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
      
      Reviewed By: anand1976
      
      Differential Revision: D39368406
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
    • Fix some MultiGet stats (#10673) · 37b75e13
      Committed by anand76
      Summary:
      The stats were not accurate for the coroutine version of MultiGet. This PR fixes it.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10673
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D39492615
      
      Pulled By: anand1976
      
      fbshipit-source-id: b46c04e15ea27e66f4c31f00c66497aa283bf9d3
    • Re-enable user-defined timestamp and subcompactions (#10689) · 088b9844
      Committed by Yanqin Jin
      Summary:
      Hopefully, we can re-enable the combination of user-defined timestamp and subcompactions
      after https://github.com/facebook/rocksdb/issues/10658.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10689
      
      Test Plan:
      Make sure the following succeeds on devserver.
      make crash_test_with_ts
      
      Reviewed By: ltamasi
      
      Differential Revision: D39556558
      
      Pulled By: riversand963
      
      fbshipit-source-id: 4695f420b1bc9ebf3b24640b693746f4db82c149
    • Fix a MultiGet crash (#10688) · c206aebd
      Committed by anand76
      Summary:
      Fix a bug in the async IO/coroutine version of MultiGet that may cause a segfault or assertion failure due to accessing an invalid file index in a LevelFilesBrief. The bug is that when a MultiGetRange is split into two, we may re-process keys in the original range that were already marked to be skipped (in ```current_level_range_```) due to not overlapping the level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10688
      
      Reviewed By: gitbw95
      
      Differential Revision: D39556131
      
      Pulled By: anand1976
      
      fbshipit-source-id: 65e79438508a283cb19e64eca5c91d0714b81458
    • move db_stress locking to `StressTest::Test*()` functions (#10678) · 6ce782be
      Committed by Andrew Kryczka
      Summary:
      One problem of the previous strategy was `NonBatchedOpsStressTest::TestIngestExternalFile()` could release the lock for `rand_keys[0]` in `rand_column_families[0]`, and then subsequent operations in the same loop iteration (e.g., `TestPut()`) would run without locking. This PR changes the strategy so each `Test*()` function is responsible for acquiring and releasing its own locks.
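
      A hypothetical sketch of the new strategy; the helper names are illustrative rather than the actual db_stress code:
      ```
      // Each Test*() function takes and drops its own lock, so no operation
      // can run on a key whose lock was released by an earlier operation in
      // the same loop iteration.
      rocksdb::Status StressTest::TestPut(ThreadState* thread, int cf,
                                          int64_t key) {
        // RAII guard over the single key this operation touches.
        MutexLock l(thread->shared->GetMutexForKey(cf, key));
        // ... perform the Put and update the expected state under the lock ...
        return rocksdb::Status::OK();
      }
      ```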
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10678
      
      Reviewed By: hx235
      
      Differential Revision: D39516401
      
      Pulled By: ajkr
      
      fbshipit-source-id: bf67f12ebbd293ba8c24fdf8754ff28737bcd758
    • Support JemallocNodumpAllocator for the block/blob cache in db_bench (#10685) · 7dad4852
      Committed by Levi Tamasi
      Summary:
      The patch makes it possible to use the `JemallocNodumpAllocator` with the
      block/blob caches in `db_bench`. In addition to its stated purpose of excluding
      cache contents from core dumps, `JemallocNodumpAllocator` also uses
      a dedicated arena and jemalloc tcaches for cache allocations, which can
      reduce fragmentation and thus memory usage.
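
      A minimal wiring sketch, assuming a jemalloc-enabled build (otherwise `NewJemallocNodumpAllocator` returns a non-OK status); the capacity is an arbitrary example:
      ```
      #include "rocksdb/cache.h"
      #include "rocksdb/memory_allocator.h"

      rocksdb::JemallocAllocatorOptions jopts;
      std::shared_ptr<rocksdb::MemoryAllocator> allocator;
      rocksdb::Status s = rocksdb::NewJemallocNodumpAllocator(jopts, &allocator);
      if (s.ok()) {
        rocksdb::LRUCacheOptions cache_opts;
        cache_opts.capacity = 1 << 30;  // 1 GB block cache (arbitrary)
        cache_opts.memory_allocator = allocator;
        std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(cache_opts);
      }
      ```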
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10685
      
      Reviewed By: riversand963
      
      Differential Revision: D39552261
      
      Pulled By: ltamasi
      
      fbshipit-source-id: b5c58eab6b7c1baa9a307d9f1248df1d7a77d2b5
    • Disable PersistentCacheTierTest.BasicTest (#10683) · b418ace3
      Committed by Bo Wang
      Summary:
      Disable this flaky test since PersistentCache is not used.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10683
      
      Test Plan: Unit Tests
      
      Reviewed By: cbi42
      
      Differential Revision: D39545974
      
      Pulled By: gitbw95
      
      fbshipit-source-id: ac53e96f6ba880e7612e325eb5ff22ee2799efed
  12. 15 Sep 2022, 2 commits