1. 26 Jul 2022 (3 commits)
    • Account for DB ID in stress testing block cache keys (#10388) · 01a2e202
      Committed by Peter Dillinger
      Summary:
      I recently discovered that block cache keys are slightly lower
      quality than previously thought, because my stress testing tool failed
      to simulate the effect of DB ID differences. This change updates the
      tool and gives us data to guide future developments. (No changes to
      production code here and now.)
      
      Nevertheless, the following promise still holds
      
      ```
      // In fact, if our SST files are all < 4TB (see
      // BlockBasedTable::kMaxFileSizeStandardEncoding), then SST files generated
      // in a single process are guaranteed to have unique cache keys, unless/until
      // number session ids * max file number = 2**86 ...
      ```
      
      because although different DB IDs could cause collision in file number
      and offset data, that would have to be using the same DB session (lower)
      to cause a block cache key collision, which is not possible in the same
      process. (A session is associated with only one DB ID.)
      
       This change updates cache_bench -stress_cache_key to set and reset DB IDs in
       a parameterized way so we can evaluate the effect. Previous results, assumed
       to be representative (using -sck_keep_bits=43):
      
      ```
      15 collisions after 15 x 90 days, est 90 days between (1.03763e+20 corrected)
      ```
      
      or expected collision on a single machine every 104 billion billion
      days (see "corrected" value).
      
       After accounting for DB IDs that never really change, change at an intermediate
       rate, or change very frequently (using the default -sck_db_count=100):
      
      ```
      -sck_newdb_nreopen=1000000000:
      15 collisions after 2 x 90 days, est 12 days between (1.38351e+19 corrected)
      -sck_newdb_nreopen=10000:
      17 collisions after 2 x 90 days, est 10.5882 days between (1.22074e+19 corrected)
      -sck_newdb_nreopen=100:
      19 collisions after 2 x 90 days, est 9.47368 days between (1.09224e+19 corrected)
      ```
      
       or roughly 10x more often than previously thought (still extremely rare,
       if not practically impossible), and better than fully random base cache keys
       (with -sck_randomize), though less than 10x better than random:
      
      ```
      31 collisions after 1 x 90 days, est 2.90323 days between (3.34719e+18 corrected)
      ```
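       The "corrected" figures above are consistent with scaling the simulated
       estimate up by 2^(103 - sck_keep_bits), i.e. restoring the ~103 bits of
       session-ID entropy that the simulation truncates away. This reading is an
       inference from the reported numbers, not the tool's documented formula;
       a sketch of it:

       ```cpp
       #include <cmath>

       // Assumed interpretation of cache_bench's "corrected" output: scale the
       // estimated days-between-collisions from the truncated key space
       // (sck_keep_bits bits) back up to the ~103 bits of entropy that a DB
       // session ID provides. The constant 103 is taken from the text; the
       // scaling rule itself is an assumption that happens to reproduce the
       // reported figures.
       double CorrectedDaysBetween(double est_days, int sck_keep_bits) {
         const int kSessionIdEntropyBits = 103;
         return est_days * std::ldexp(1.0, kSessionIdEntropyBits - sck_keep_bits);
       }
       ```

       With est_days = 90 and sck_keep_bits = 43 this reproduces the 1.03763e+20
       figure above, and with est_days = 114 and sck_keep_bits = 39 it reproduces
       the 2.10293e+21 figure further below.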
      
      If we simply fixed this by ignoring DB ID for cache keys, we would
      potentially have a shortage of entropy for some cases, such as small
      file numbers and offsets (e.g. many short-lived processes each using
      SstFileWriter to create a small file), because existing DB session IDs
      only provide ~103 bits of entropy. We could upgrade the entropy in DB
      session IDs to accommodate, but it's not known what all would be
      affected by changing from 20 digit session IDs to something larger.
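       For reference, the ~103-bit figure matches a 20-character ID drawn from a
       base-36 alphabet (the base-36/20-character format is an assumption here,
       sketched to show where the number comes from):

       ```cpp
       #include <cmath>

       // Entropy of an n-character ID over an alphabet of the given size.
       // 20 base-36 characters give 20 * log2(36) ≈ 103.4 bits, matching the
       // "~103 bits" figure above.
       double IdEntropyBits(int num_chars, int alphabet_size) {
         return num_chars * std::log2(static_cast<double>(alphabet_size));
       }
       ```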
      
      Instead, my plan is to
      1) Move to block cache keys derived from SST unique IDs (so that we can
      derive block cache keys from manifest data without reading file on
      storage), and show no significant regression in expected collision
      rate.
      2) Generate better SST unique IDs in format_version=6 (https://github.com/facebook/rocksdb/issues/9058),
      which should have ~100x lower expected/predicted collision rate based
      on simulations with this stress test:
      ```
      ./cache_bench -stress_cache_key -sck_keep_bits=39 -sck_newdb_nreopen=100 -sck_footer_unique_id
      ...
      15 collisions after 19 x 90 days, est 114 days between (2.10293e+21 corrected)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10388
      
      Test Plan: no production changes
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37986714
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e759b2469e3365cb01c6661a69e0ab849ef4c3df
    • Fix a bug in hash linked list (#10401) · 4e007480
      Committed by sdong
      Summary:
       In hash linked list, with a bucket containing only one record, the following sequence can cause users to temporarily miss a record:
       
       Thread 1: Fetch the structure that bucket x points to, which is a node n1 for a key, with a null next pointer.
       Thread 2: Insert a key into bucket x that is larger than the existing key. This makes n1->next point to a new node n2 and updates bucket x to point to n1.
       Thread 1: Sees that n1->next is not null, so it assumes n1 is the header of a linked list and ignores the key of n1.
       
       Fix it by re-fetching the structure that bucket x points to when n1->next is seen to be non-null. This works because if n1->next is not null, bucket x must already point to a linked list or skip list header.
       
       A related change is to reverse the order of testing for linked list and skip list, because after re-fetching the bucket it might end up with a skip list rather than a linked list.
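       A minimal single-bucket sketch of the re-fetch logic described above
       (names and the bucket representation are simplified assumptions, not the
       actual memtable code):

       ```cpp
       #include <atomic>

       // Simplified node; real HashLinkList nodes carry key bytes after this.
       struct Node {
         std::atomic<Node*> next{nullptr};
       };

       // `loaded` is what a reader just fetched from the bucket. A bucket
       // points either directly at a single Node (whose next is null) or at
       // the header of a linked list / skip list. If the loaded Node's next
       // pointer turned non-null, a concurrent insert has converted the
       // bucket, so re-fetch: by then the bucket is guaranteed to point at a
       // proper header.
       void* ResolveBucketEntry(std::atomic<void*>& bucket, Node* loaded) {
         if (loaded != nullptr &&
             loaded->next.load(std::memory_order_acquire) != nullptr) {
           return bucket.load(std::memory_order_acquire);  // the fix: re-fetch
         }
         return loaded;
       }
       ```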
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10401
      
      Test Plan: Run existing tests and make sure at least it doesn't regress.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38064471
      
      fbshipit-source-id: 142bb85e1546c803f47e3357aef3e76debccd8df
    • Lock-free ClockCache (#10390) · 6a160e1f
      Committed by Guido Tagliavini Ponce
      Summary:
       This PR makes ClockCache completely free of locks. As part of it, we have also pushed the clock algorithm functionality out of ClockCacheShard into ClockHandleTable, so that ClockCacheShard acts more as an interface and less as an actual data structure.
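       A toy lock-free clock sweep, to illustrate the general technique (this is
       a generic sketch, not the actual ClockHandleTable, which also packs
       reference counts and handle state into its per-slot atomics):

       ```cpp
       #include <atomic>
       #include <cstddef>
       #include <cstdint>
       #include <vector>

       // Each slot has an atomic "recently used" bit. Lookups set the bit; the
       // evictor sweeps, clearing set bits (second chance) and evicting the
       // first slot found already clear. No mutex anywhere.
       class ClockSketch {
        public:
         explicit ClockSketch(size_t n) : used_(n), hand_(0) {}

         void Touch(size_t i) { used_[i].store(1, std::memory_order_relaxed); }

         // Returns the index of the evicted slot.
         size_t Evict() {
           for (;;) {
             size_t i =
                 hand_.fetch_add(1, std::memory_order_relaxed) % used_.size();
             uint8_t expected = 1;
             // Bit was set: clear it and keep sweeping. Bit was clear: evict.
             if (!used_[i].compare_exchange_strong(expected, 0,
                                                   std::memory_order_relaxed)) {
               return i;
             }
           }
         }

        private:
         std::vector<std::atomic<uint8_t>> used_;
         std::atomic<size_t> hand_;
       };
       ```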
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10390
      
      Test Plan:
      - ``make -j24 check``
      - ``make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache --cache_size=1073741824 --block_size=16384" blackbox_crash_test_with_atomic_flush``
      
      Reviewed By: pdillinger
      
      Differential Revision: D38106945
      
      Pulled By: guidotag
      
      fbshipit-source-id: 6cbf6bd2397dc9f582809ccff5118a8a33ea6cb1
  2. 25 Jul 2022 (1 commit)
    • Support subcmpct using reserved resources for round-robin priority (#10341) · 8860fc90
      Committed by Zichen Zhu
      Summary:
      Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows:
       * Constraint 1: We can only pick consecutive files
         - Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files
         - Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys)
       * Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes`
       * Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)`
       * Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3
      
      More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`.
      
       The above optimization accelerates the process of moving the compaction cursor, which further reduces write-amp. Since a single large compaction may lead to a long write stall, we break such a large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps:
       * Step 1: Initialize the plan against `max_output_file_limit`, the number of input files in the start level, and the range size limit `ranges.size()`
       * Step 2: Call `AcquireSubcompactionResources()` when the max subcompactions limit is not sufficient; we may or may not obtain the desired resources (any additional resources obtained are recorded in `extra_num_subcompaction_threads_reserved_`). The subcompaction limit is then changed and `num_planned_subcompactions` is updated via `GetSubcompactionLimit()`
       * Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources are released (extra resources may exist for round-robin compaction when the actual number of subcompactions is less than the number of planned subcompactions)
      
      More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`.
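       The plan/acquire/shrink flow in the steps above can be sketched as follows
       (a toy model with hypothetical names; the real logic lives in the
       functions named above and tracks thread-pool state):

       ```cpp
       #include <algorithm>
       #include <cstdint>

       // Step 1: the initial plan is bounded by the output-file limit, the
       // number of input files in the start level, and the number of ranges.
       uint64_t InitialPlannedSubcompactions(uint64_t max_output_file_limit,
                                             uint64_t start_level_input_files,
                                             uint64_t range_count) {
         return std::min(
             {max_output_file_limit, start_level_input_files, range_count});
       }

       struct SubcompactionPlan {
         uint64_t limit;           // current subcompaction limit
         uint64_t extra_reserved;  // extra threads reserved beyond the limit
       };

       // Step 2: if the plan exceeds the current limit, try to reserve extra
       // threads; we may obtain fewer than requested.
       void AcquireExtra(SubcompactionPlan& plan, uint64_t planned,
                         uint64_t available_threads) {
         if (planned > plan.limit) {
           plan.extra_reserved = std::min(planned - plan.limit, available_threads);
           plan.limit += plan.extra_reserved;
         }
       }

       // Step 3: release any reservation beyond what actually runs.
       void ShrinkExtra(SubcompactionPlan& plan, uint64_t actual) {
         if (actual < plan.limit && plan.extra_reserved > 0) {
           uint64_t release = std::min(plan.extra_reserved, plan.limit - actual);
           plan.extra_reserved -= release;
           plan.limit -= release;
         }
       }
       ```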
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341
      
      Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc`
      
      Reviewed By: ajkr, hx235
      
      Differential Revision: D37792644
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
  3. 24 Jul 2022 (1 commit)
    • Improve SubCompaction Partitioning (#10393) · 252bea40
      Committed by sdong
      Summary:
       Unit tests still haven't been fixed, and more tests need to be added. But I ran some simple fillrandom db_bench runs and the partitioning feels reasonable.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393
      
      Test Plan:
       1. Make sure existing tests pass; this should cover some basic subcompaction logic being correct and the partitioning result being reasonable;
       2. Add a new unit test for ApproximateKeyAnchors();
       3. Run some db_bench with max_subcompactions = 4 and verify that the compaction is indeed partitioned evenly.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38043783
      
      fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f
  4. 23 Jul 2022 (6 commits)
  5. 22 Jul 2022 (3 commits)
  6. 20 Jul 2022 (3 commits)
    • Fix explanation of XOR usage in KV checksum blog post (#10392) · a0c63083
      Committed by Andrew Kryczka
      Summary:
      Thanks pdillinger for reminding us that we are protected from swapping corruptions due to independent seeds (and for suggesting that approach in the first place).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10392
      
      Reviewed By: cbi42
      
      Differential Revision: D37981819
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3ed32982ae1dbc88eb92569010f9f2e8d190c962
    • Stop operating on DB in a stress test background thread (#10373) · b443d24f
      Committed by Yanqin Jin
      Summary:
       Stress test background threads do not coordinate with test worker
       threads on db reopen in the middle of a test run, so accessing the db
       object from a stress test background thread can race with test workers.
       Remove the TimestampedSnapshotThread.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10373
      
      Test Plan:
      ```
      ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=1 \
      --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 \
      --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 \
      --block_size=16384 --bloom_bits=7.580319535285394 --bottommost_compression_type=disable \
      --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache \
      --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 \
      --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \
      --compact_files_one_in=1000000 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=0 \
      --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 \
      --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 \
      --continuous_verification_interval=0 --create_timestamped_snapshot_one_in=20 --data_block_index_type=0 \
      --db=/dev/shm/rocksdb/ --db_write_buffer_size=0 --delpercent=5 --delrangepercent=0 --destroy_db_initially=1 \
      --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=1 --enable_pipelined_write=0 \
      --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=2 \
      --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \
      --get_sorted_wal_files_one_in=0 --index_block_restart_interval=11 --index_type=0 --ingest_external_file_one_in=0 \
      --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \
      --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 \
      --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 \
      --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 \
      --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.5 \
      --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True \
      --nooverwritepercent=1 --open_files=500000 --open_metadata_write_fault_one_in=0 \
      --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000 \
      --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=2 \
      --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=1 \
      --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=1000 \
      --readpercent=55 --recycle_log_file_num=0 --reopen=100 --ribbon_starting_level=8 \
      --secondary_cache_fault_one_in=0 --secondary_cache_uri= --snapshot_hold_ops=100000 \
      --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 \
      --subcompactions=3 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 \
      --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=1 \
      --txn_write_policy=0 --unordered_write=0 --unpartitioned_pinning=0 \
      --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=1 \
      --use_merge=1 --use_multiget=0 --use_txn=1 --user_timestamp_size=0 --value_size_mult=32 \
      --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 \
      --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none \
      --write_buffer_size=4194304 --write_dbid_to_manifest=0 --writepercent=35
      ```
      make crash_test_with_txn
      make crash_test_with_multiops_wc_txn
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37903189
      
      Pulled By: riversand963
      
      fbshipit-source-id: cd1728ad7ba4ce4cf47af23c4f65dda0956744f9
    • Fix race conditions in GenericRateLimiter (#10374) · e576f2ab
      Committed by Andrew Kryczka
      Summary:
      Made locking strict for all accesses of `GenericRateLimiter` internal state.
      
      `SetBytesPerSecond()` was the main problem since it had no locking, while the two updates it makes need to be done as one atomic operation.
      
      The test case, "ConfigOptionsTest.ConfiguringOptionsDoesNotRevertRateLimiterBandwidth", is for the issue fixed in https://github.com/facebook/rocksdb/issues/10378, but I forgot to include the test there.
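       A minimal sketch of the idea behind the fix (a hypothetical class; the
       real GenericRateLimiter has much more state): both fields derived from
       the rate must be updated as one atomic step under the same mutex, so a
       reader can never observe a new rate paired with a stale refill amount.

       ```cpp
       #include <cstdint>
       #include <mutex>
       #include <utility>

       class RateLimiterSketch {
        public:
         // Without the mutex, a concurrent reader could see the new
         // rate_bytes_per_sec_ with the old refill_bytes_per_period_.
         void SetBytesPerSecond(int64_t bytes_per_second) {
           std::lock_guard<std::mutex> lock(mu_);
           rate_bytes_per_sec_ = bytes_per_second;
           refill_bytes_per_period_ =
               bytes_per_second * kRefillPeriodMicros / 1000000;
         }

         std::pair<int64_t, int64_t> GetRateAndRefill() {
           std::lock_guard<std::mutex> lock(mu_);
           return {rate_bytes_per_sec_, refill_bytes_per_period_};
         }

        private:
         static constexpr int64_t kRefillPeriodMicros = 100000;  // assumed 100ms
         std::mutex mu_;
         int64_t rate_bytes_per_sec_ = 0;
         int64_t refill_bytes_per_period_ = 0;
       };
       ```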
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10374
      
      Reviewed By: pdillinger
      
      Differential Revision: D37906367
      
      Pulled By: ajkr
      
      fbshipit-source-id: ccde620d2a7f96d1401bdafd2bdb685cbefbafa5
  7. 19 Jul 2022 (6 commits)
  8. 17 Jul 2022 (2 commits)
  9. 16 Jul 2022 (8 commits)
  10. 15 Jul 2022 (4 commits)
    • DB::PutEntity() shouldn't be defined as =0 (#10364) · 00e68e7a
      Committed by sdong
      Summary:
       DB::PutEntity() is declared pure virtual (`= 0`), but it is actually implemented in db/db_impl/db_impl_write.cc. This is inconsistent, and might cause problems when users implement class DB themselves.
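       A minimal illustration of the mismatch (with hypothetical types, not the
       real DB/Status API): a pure-virtual declaration forces every user
       subclass of DB to implement the method, even when a sensible default
       exists; a plain virtual with a default body keeps user subclasses
       compiling.

       ```cpp
       #include <string>

       struct StatusSketch {
         bool ok;
         std::string msg;
       };

       class DBLike {
        public:
         virtual ~DBLike() = default;
         // Declaring `virtual StatusSketch PutEntity(...) = 0;` while also
         // providing a body elsewhere is inconsistent: user subclasses would
         // still be forced to override it. A regular virtual with a default:
         virtual StatusSketch PutEntity(const std::string& /*key*/) {
           return {false, "PutEntity not supported by this DB implementation"};
         }
       };
       ```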
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10364
      
      Test Plan: See existing tests pass
      
      Reviewed By: riversand963
      
      Differential Revision: D37874886
      
      fbshipit-source-id: b81713ddb707720b52d57a15de56a59414c24f66
    • Add seqno to time mapping (#10338) · a3acf2ef
      Committed by Jay Zhuang
      Summary:
       This will be used by tiered storage to preclude hot data from
       compacting to the cold tier (the last level).
       Internally, this adds a seqno-to-time mapping. A periodic_task is scheduled
       to record the current_seqno -> current_time mapping at a certain cadence.
       When a memtable is flushed, the mapping information is stored in an SST
       property. During compaction, the mapping information is merged to get the
       approximate time of a sequence number, which is used to determine whether
       a key was recently inserted and to preclude it from the last level if it
       was inserted recently (within `preclude_last_level_data_seconds`).
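       A toy version of the mapping described above (hypothetical names; in the
       real implementation the pairs are recorded by a periodic task and
       persisted in table properties, then merged during compaction):

       ```cpp
       #include <algorithm>
       #include <cstdint>
       #include <limits>
       #include <utility>
       #include <vector>

       class SeqnoToTimeSketch {
        public:
         // Called at a fixed cadence: record (current_seqno, current_time).
         void Append(uint64_t seqno, uint64_t time_secs) {
           pairs_.emplace_back(seqno, time_secs);  // seqnos arrive in order
         }

         // Approximate write time of `seqno`: the time recorded at the largest
         // recorded seqno <= seqno (0 if seqno predates all records).
         uint64_t GetApproximateTime(uint64_t seqno) const {
           auto it = std::upper_bound(
               pairs_.begin(), pairs_.end(),
               std::make_pair(seqno, std::numeric_limits<uint64_t>::max()));
           return it == pairs_.begin() ? 0 : std::prev(it)->second;
         }

        private:
         std::vector<std::pair<uint64_t, uint64_t>> pairs_;  // sorted by seqno
       };
       ```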
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338
      
      Test Plan: CI
      
      Reviewed By: siying
      
      Differential Revision: D37810187
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f
    • Fix HISTORY.md for misplaced items (#10362) · 66685d6a
      Committed by Siying Dong
      Summary:
       Some items were misplaced under 7.4 even though they are unreleased.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10362
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37859426
      
      fbshipit-source-id: e2ad099227309ed2e0f3ca450a9a43986d681c7c
    • Make InternalKeyComparator not configurable (#10342) · c8b20d46
      Committed by sdong
      Summary:
       InternalKeyComparator is an internal class that is a simple wrapper around Comparator. https://github.com/facebook/rocksdb/pull/8336 made Comparator customizable. As a side effect, the internal key comparator became configurable too. This introduces overhead to this simple wrapper: for example, every InternalKeyComparator has an std::vector attached to it, which consumes memory and possibly incurs allocation overhead.
       We stop InternalKeyComparator from being customizable by making it no longer a subclass of Comparator.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10342
      
      Test Plan: Run existing CI tests and make sure it doesn't fail
      
      Reviewed By: riversand963
      
      Differential Revision: D37771351
      
      fbshipit-source-id: 917256ee04b2796ed82974549c734fb6c4d8ccee
  11. 14 Jul 2022 (3 commits)