1. 28 Jun 2022 (1 commit)
  2. 26 Jun 2022 (1 commit)
• Add API for writing wide-column entities (#10242) · c73d2a9d
  Committed by Levi Tamasi
      Summary:
      The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
      a new API called `PutEntity` that can be used to write a wide-column entity
      to the database. The new API is added to both `DB` and `WriteBatch`. Note
      that currently there is no way to retrieve these entities; more precisely, all
      read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
      encounter a wide-column entity that is required to answer a query. Read-side
      support (as well as other missing functionality like `Merge`, compaction filter,
      and timestamp support) will be added in later PRs.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D37369748
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
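
A minimal usage sketch of the new `PutEntity` API, assuming the `WideColumns` container from the groundwork in https://github.com/facebook/rocksdb/pull/9915; the path and column names are illustrative only:

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/wide_column_demo", &db);
  assert(s.ok());

  // An entity is a key whose value is a set of named columns rather than a
  // single opaque value.
  rocksdb::WideColumns columns{{"address", "1 Main St"}, {"zip", "12345"}};
  s = db->PutEntity(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                    "customer_1", columns);
  assert(s.ok());

  // As of this commit, Get/MultiGet/iterators return NotSupported when they
  // encounter this entity; read-side support arrives in later PRs.
  delete db;
  return 0;
}
```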
3. 25 Jun 2022 (4 commits)
• Temporarily disable mempurge in crash test (#10252) · f322f273
  Committed by Andrew Kryczka
      Summary:
      Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252
      
      Reviewed By: riversand963
      
      Differential Revision: D37432948
      
      Pulled By: ajkr
      
      fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a
• Pass rate_limiter_priority through filter block reader functions to FS (#10251) · 8e63d90f
  Committed by Bo Wang
      Summary:
With https://github.com/facebook/rocksdb/pull/9996, we can pass `rate_limiter_priority` to the FS for most cases. This PR updates the code path for the filter block reader accordingly. (A usage sketch follows this entry.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10251
      
      Test Plan: Current unit tests should pass.
      
      Reviewed By: pdillinger
      
      Differential Revision: D37427667
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 1ce5b759b136efe4cfa48a6b97e2f837ff087433
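
A sketch of the knob whose plumbing this commit extends, assuming a DB opened with a rate limiter configured; after this change, filter block reads issued on behalf of the request are charged at the same priority:

```cpp
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/options.h"

rocksdb::Status RateLimitedGet(rocksdb::DB* db, const rocksdb::Slice& key,
                               std::string* value) {
  rocksdb::ReadOptions read_options;
  // Charge file-system reads for this Get(), now including filter block
  // reads, against the DB's rate limiter at user-read priority.
  read_options.rate_limiter_priority = rocksdb::Env::IO_USER;
  return db->Get(read_options, key, value);
}
```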
• Fix the flaky cursor persist test (#10250) · 410ca2ef
  Committed by zczhu
      Summary:
The 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test` may occasionally fail due to an inconsistent LSM state. The issue is fixed by adding `Flush()` and `WaitForFlushMemTable()` to produce a more predictable and stable LSM state.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10250
      
      Test Plan: 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test`
      
      Reviewed By: jay-zhuang, riversand963
      
      Differential Revision: D37426091
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 56fbaab0384c380c1f279a16dc8732b139c9f611
• Reduce overhead of SortFileByOverlappingRatio() (#10161) · 246d4697
  Committed by sdong
      Summary:
Currently `SortFileByOverlappingRatio()` is O(n log n). That is usually fine, but when there are many files in an LSM-tree, it can take a non-trivial amount of time. The problem is severe when the user is loading keys in sorted order, where every compaction is a trivial move, so this operation becomes the bottleneck and limits total throughput. This commit changes `SortFileByOverlappingRatio()` to find only the top 50 files based on score. 50 files are usually enough for the parallel compactions needed at the level, and if they are not, we fall back to random selection, which should be acceptable. (An illustrative sketch of the top-k selection follows this entry.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10161
      
      Test Plan:
Run a fillseq that generates a lot of files, and observe that throughput improved (although stalls are not yet eliminated). The command ran:
      
      TEST_TMPDIR=/dev/shm/ ./db_bench_sort --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000
      
      The throughput improved by 11%.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37129469
      
      fbshipit-source-id: 492da2ef5bfc7cdd6daa3986b50d2ff91f88542d
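
An illustrative sketch of the core idea (hypothetical types and function names, not the actual RocksDB code): replace a full O(n log n) sort with a top-k partition so only the ~50 best-scored files are fully ordered:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct FileScore {
  uint64_t file_number;
  double score;  // e.g. overlapping ratio; lower is better here
};

void SelectTopKByScore(std::vector<FileScore>& files, size_t k = 50) {
  if (k > files.size()) k = files.size();
  auto by_score = [](const FileScore& a, const FileScore& b) {
    return a.score < b.score;
  };
  // O(n): partition so the k best-scored files come first ...
  std::nth_element(files.begin(), files.begin() + k, files.end(), by_score);
  // ... then fully order just those k files, O(k log k).
  std::sort(files.begin(), files.begin() + k, by_score);
}
```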
4. 24 Jun 2022 (9 commits)
• BlobDB in crash test hitting assertion (#10249) · 052666ae
  Committed by Gang Liao
      Summary:
      This task is to fix assertion failures during the crash test runs. The cache entry size might not match value size because value size can include the on-disk (possibly compressed) size. Therefore, we removed the assertions.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10249
      
      Reviewed By: ltamasi
      
      Differential Revision: D37407576
      
      Pulled By: gangliao
      
      fbshipit-source-id: 577559f267c5b2437bcd0631cd0efabb6dde3b69
• Fix race condition between file purge and backup/checkpoint (#10187) · 725df120
  Committed by Yanqin Jin
      Summary:
      Resolves https://github.com/facebook/rocksdb/issues/10129
      
I extracted this fix from https://github.com/facebook/rocksdb/issues/7516 since it is already a bug in the main branch, and we want to
separate it from the main part of that PR.
      
      There can be a race condition between two threads. Thread 1 executes
      `DBImpl::FindObsoleteFiles()` while thread 2 executes `GetSortedWals()`.
      ```
      Time   thread 1                                thread 2
        |  mutex_.lock
        |  read disable_delete_obsolete_files_
        |  ...
        |  wait on log_sync_cv and release mutex_
        |                                          mutex_.lock
        |                                          ++disable_delete_obsolete_files_
        |                                          mutex_.unlock
        |                                          mutex_.lock
        |                                          while (pending_purge_obsolete_files > 0) { bg_cv.wait;}
        |                                          wake up with mutex_ locked
        |                                          compute WALs tracked by MANIFEST
        |                                          mutex_.unlock
        |  wake up with mutex_ locked
        |  ++pending_purge_obsolete_files_
        |  mutex_.unlock
        |
        |  delete obsolete WAL
        |                                          WAL missing but tracked in MANIFEST.
        V
      ```
      
The proposed fix eliminates the possibility of the above interleaving by incrementing
`pending_purge_obsolete_files_` before `FindObsoleteFiles()` can possibly release the mutex. (An illustrative sketch follows this entry.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10187
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D37214235
      
      Pulled By: riversand963
      
      fbshipit-source-id: 556ab1b58ae6d19150169dfac4db08195c797184
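
An illustrative sketch of the fix (simplified; not the actual RocksDB code): the purge thread publishes its intent while still holding the mutex, so a concurrent `GetSortedWals()` waiting for the count to reach zero cannot observe a window where a purge is in flight but uncounted:

```cpp
#include <condition_variable>
#include <mutex>

std::mutex mu;
std::condition_variable bg_cv;
int pending_purge_obsolete_files = 0;

void FindObsoleteFilesFixed() {
  std::unique_lock<std::mutex> lock(mu);
  // Increment BEFORE any wait that could release the mutex, closing the race.
  ++pending_purge_obsolete_files;
  // ... may now wait on other conditions and release the mutex safely ...
  lock.unlock();
  // Delete obsolete WALs outside the mutex.
  lock.lock();
  --pending_purge_obsolete_files;
  bg_cv.notify_all();  // wake threads waiting for purges to drain
}
```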
• Wrapper for benchmark.sh to run a sequence of db_bench tests (#10215) · 60619057
  Committed by Mark Callaghan
      Summary:
      This provides two things:
      1) Runs a sequence of db_bench tests. This sequence was chosen to provide
      good coverage with less variance.
      2) Makes it easier to do A/B testing for multiple binaries. This combines
      the report.tsv files into summary.tsv to make it easier to compare results
      across multiple binaries.
      
      Example output for 2) is:
      
```
ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
1115171 446.7   9GB             8.9     1.0     454.7   26      26      0       0       0.9     0.5     2       7       51      5547    20      0.0     0       0.1     0.1     0.2     fillseq.wal_disabled.v400       2022-04-12T08:53:51     6.0
1045726 418.9   8GB     0.0GB   8.4     1.0     432.4   27      26      0       0       1.0     0.5     2       6       102     5618    20      0.0     0       0.1     0.0     0.1     fillseq.wal_disabled.v400       2022-04-12T12:25:36     6.28

ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
2969192 1189.3  16GB            0.0             0.0     0       0       0       0       10.8    9.3     25      33      49      13551   1781    0.0     0       48.2    6.8     16.8    readrandom.t32  2022-04-12T08:54:28     6.0
2692922 1078.6  16GB    0.0GB   0.0             0.0     0       0       0       0       11.9    10.2    30      38      56      49735   1781    0.0     0       47.8    6.7     16.8    readrandom.t32  2022-04-12T12:26:15     6.28

...

ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
180227  72.2    38GB            1126.4  8.7     643.2   3286    3218    0       0       177.6   50.2    2687    4083    6148    854083  1793    68.4    7804    17.0    5.9     0.5     overwrite.t32.s0        2022-04-12T11:55:21     6.0
236512  94.7    31GB    0.0GB   1502.9  8.9     862.2   5242    5125    0       0       135.3   59.9    2537    3268    5404    18545   1785    49.7    5112    25.5    8.0     9.4     overwrite.t32.s0        2022-04-12T15:27:25     6.28
```
      
      Example output with formatting preserved is here:
      https://gist.github.com/mdcallag/4432e5bbaf91915c916d46bd6ce3c313
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10215
      
      Test Plan: run it
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37299892
      
      Pulled By: mdcallag
      
      fbshipit-source-id: e6e0ed638fd7e8deeb869d700593fdc3eba899c8
• Add suggest_compact_range() and suggest_compact_range_cf() to C API. (#10175) · 2a3792ed
  Committed by Yueh-Hsuan Chiang
      Summary:
      Add suggest_compact_range() and suggest_compact_range_cf() to C API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10175
      
      Test Plan:
Since verifying the result requires SyncPoint, which is not available in c_test.c,
the test currently just invokes the functions and makes sure they do not crash. (A call sketch follows this entry.)
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37305191
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0fe257b45914f6c9aeb985d8b1820dafc57a20db
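
A call sketch in C++; the exact `c.h` signature is an assumption here, inferred from the usual RocksDB C API shape of `(db, key, key_len, ..., errptr)`:

```cpp
#include <stddef.h>

#include "rocksdb/c.h"

void SuggestFullRangeCompaction(rocksdb_t* db) {
  char* err = NULL;
  // Hint that the key range ["a", "z") is worth compacting; any compaction
  // is still scheduled by the DB in the background.
  rocksdb_suggest_compact_range(db, "a", 1, "z", 1, &err);
  if (err != NULL) {
    rocksdb_free(err);
  }
}
```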
• Cut output files at compaction cursors (#10227) · 17a1d65e
  Committed by zczhu
      Summary:
The files behind the compaction cursor contain newer data than the files ahead of it. If a compaction writes a file that spans from before its output level's cursor to after it, the data before the cursor would be contaminated with the old timestamp from the data after the cursor. To avoid this, we split the output file into two: one entirely before the cursor and one entirely after it. Note that, in rare cases, we **DO NOT** need to cut the file if the compaction is a trivial move, since the file will not be contaminated by older files. In such a case, the compact cursor is not guaranteed to be a file boundary, but it does not hurt the round-robin selection process. (A schematic sketch of the cutting rule follows this entry.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10227
      
      Test Plan:
      Add 'RoundRobinCutOutputAtCompactCursor' unit test in `db_compaction_test`
      
      Task: [T122216351](https://www.internalfb.com/intern/tasks/?t=122216351)
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37388088
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 9246a6a084b6037b90d6ab3183ba4dfb75a3378d
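
A schematic sketch of the cutting rule described above (hypothetical types and names; the real logic lives in the compaction output path):

```cpp
#include <string>

struct OutputFileRange {
  std::string smallest_key;
  std::string largest_key;
};

// True when a compaction output spanning the output level's cursor must be
// split into a part entirely before the cursor and a part entirely after it.
bool ShouldCutAtCompactCursor(const OutputFileRange& file,
                              const std::string& cursor,
                              bool is_trivial_move) {
  if (is_trivial_move) {
    return false;  // a trivial move cannot be contaminated by older files
  }
  return file.smallest_key < cursor && cursor <= file.largest_key;
}
```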
• Read from blob cache first when MultiGetBlob() (#10225) · ba1f62dd
  Committed by Gang Liao
      Summary:
      There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10225
      
      Test Plan:
Added test cases for `MultiGetBlob()`, the new API that this PR adds to `BlobSource`.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156. (A configuration sketch follows this entry.)
      
      Reviewed By: ltamasi
      
      Differential Revision: D37358364
      
      Pulled By: gangliao
      
      fbshipit-source-id: aff053a37615d96d768fb9aedde17da5618c7ae6
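
A configuration sketch, assuming the `blob_cache` column family option from the blob-caching work tracked in issue #10156; sizes and the path are illustrative:

```cpp
#include "rocksdb/cache.h"
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.enable_blob_files = true;  // store large values in blob files
  options.min_blob_size = 1024;      // values >= 1 KiB go to blob files
  // With a blob cache configured, GetBlob()/MultiGetBlob() consult the
  // cache before reading from the blob file.
  options.blob_cache = rocksdb::NewLRUCache(64 << 20);

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/blob_cache_demo", &db);
  delete db;
  return s.ok() ? 0 : 1;
}
```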
• Fix key size in cache_bench (#10234) · b52620ab
  Committed by Guido Tagliavini Ponce
      Summary:
cache_bench wasn't generating 16B keys, which are necessary for FastLRUCache. Also:
- Added asserts in cache_bench, which assumes that inserts never fail. When they fail (for example, if we use keys of the wrong size), the memory allocated to the values is leaked, and eventually the program crashes.
- Moved kCacheKeySize to the right spot.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10234
      
      Test Plan:
      ``make -j24 check``. Also, run cache_bench with FastLRUCache and check that memory usage doesn't blow up:
      ``./cache_bench -cache_type=fast_lru_cache -num_shard_bits=6 -skewed=true \
                              -lookup_insert_percent=100 -lookup_percent=0 -insert_percent=0 -erase_percent=0 \
                              -populate_cache=true -cache_size=1073741824 -ops_per_thread=10000000 \
                              -value_bytes=8192 -resident_ratio=1 -threads=16``
      
      Reviewed By: pdillinger
      
      Differential Revision: D37382949
      
      Pulled By: guidotag
      
      fbshipit-source-id: b697a942ebb215de5d341f98dc8566763436ba9b
• Don't count no prefix as Bloom hit (#10244) · f81ea75d
  Committed by Peter Dillinger
      Summary:
When a key is "out of domain" for the prefix_extractor (no
prefix assigned), the Bloom filter is not queried. PerfContext
was counting this as a Bloom "hit" while Statistics doesn't count this
as a prefix Bloom checked. It is more accurate to call it neither
hit nor miss, so this changes the counting to make PerfContext
more like Statistics. (A sketch of an out-of-domain key follows this entry.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10244
      
      Test Plan:
tests updated and expanded (Get and MultiGet). Iterator test
coverage of the change will come in the next PR
      
      Reviewed By: bjlemaire
      
      Differential Revision: D37371297
      
      Pulled By: pdillinger
      
      fbshipit-source-id: fed132fba6a92b2314ab898d449fce2d1586c157
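
A minimal sketch of an "out of domain" key, assuming a fixed-length prefix extractor: keys shorter than the prefix length have no prefix, so the prefix Bloom filter is never consulted for them:

```cpp
#include "rocksdb/options.h"
#include "rocksdb/slice_transform.h"

int main() {
  rocksdb::Options options;
  // Every key of length >= 4 is in domain; its prefix is the first 4 bytes.
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(4));
  // A lookup of the 3-byte key "abc" is out of domain: after this change it
  // counts as neither a Bloom hit nor a miss in PerfContext, matching
  // Statistics, which does not count it as a prefix Bloom check.
  return 0;
}
```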
• Dynamically changeable `MemPurge` option (#10011) · 5879053f
  Committed by Baptiste Lemaire
      Summary:
      **Summary**
      Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled.
      
      **Motivation**
RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not currently seem possible.
Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement.
      
      **Content of this PR**
This PR removes the `experimental_mempurge_threshold` flag as a DB option flag and re-introduces it as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the ability to toggle the MemPurge feature on and off (sketched after this entry), as well as the addition in the `db_stress` module of two different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes.
      
      **Benchmarking**
      I will add numbers to prove that there is no performance impact within the next 12 hours.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011
      
      Reviewed By: pdillinger
      
      Differential Revision: D36462357
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802
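
A sketch of the runtime toggle this commit enables, using the standard `SetOptions()` API with the now-mutable `experimental_mempurge_threshold` flag:

```cpp
#include <cassert>

#include "rocksdb/db.h"

void SetMemPurge(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cfh,
                 bool enable) {
  // 0.0 disables mempurge; a positive threshold (e.g. 1.0) enables it.
  // Because the flag is now per column family, it can differ across CFs.
  rocksdb::Status s = db->SetOptions(
      cfh, {{"experimental_mempurge_threshold", enable ? "1.0" : "0.0"}});
  assert(s.ok());
}
```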
5. 23 Jun 2022 (4 commits)
• Add the blob cache to the stress tests and the benchmarking tool (#10202) · 2352e2df
  Committed by Gang Liao
      Summary:
      In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`.
      As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.
      
      This PR is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202
      
      Reviewed By: ltamasi
      
      Differential Revision: D37325739
      
      Pulled By: gangliao
      
      fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8
• Fix typo in comments and code (#10233) · c073ed76
  Committed by Bo Wang
      Summary:
      Fix typo in comments and code.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10233
      
      Test Plan: Existing unit tests should pass.
      
      Reviewed By: jay-zhuang, anand1976
      
      Differential Revision: D37356702
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 32c019adcc6dcc95a9882b38147a310091368e51
• Add get_column_family_metadata() and related functions to C API (#10207) · e103b872
  Committed by Yueh-Hsuan Chiang
      Summary:
      * Add metadata related structs and functions in C API, including
        - `rocksdb_get_column_family_metadata()` and `rocksdb_get_column_family_metadata_cf()`
           that returns `rocksdb_column_family_metadata_t`.
  - `rocksdb_column_family_metadata_t` and its get functions & destroy function.
  - `rocksdb_level_metadata_t` and its get functions & destroy function.
  - `rocksdb_file_metadata_t` and its get functions & destroy function.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10207
      
      Test Plan:
      Extend the existing c_test.c to include additional checks for column_family_metadata
      inside CheckCompaction.
      
      Reviewed By: riversand963
      
      Differential Revision: D37305209
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0a5183206353acde145f5f9b632c3bace670aa6e
• Adapt benchmark result script to new fields. (#10120) · a16e2ff8
  Committed by Alan Paxton
      Summary:
      Recently merged CI benchmark scripts were failing.
      
There has clearly been a major revision of the fields of benchmark output. The upload script expects and sanity-checks the existence of some fields (and changes the date to conform to the OpenSearch format), so the script needs to change.
      
      Also add a bit more exception checking to make it more obvious when this happens again.
      
We have deleted the existing report.tsv from the benchmark machine. An existing report.tsv is appended to by default, so if the fields change, later rows no longer match the header. This makes for an upload that dies halfway through the report file, when the format no longer matches the header.
      
      Re-instate the config.yml for running the benchmarks, so we can once again test it in situ.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10120
      
      Reviewed By: pdillinger
      
      Differential Revision: D37314908
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 34f5243fee694b75c6838eb55d3398e4273254b2
6. 22 Jun 2022 (8 commits)
• Continue to deflake BackupEngineTest.Concurrency (#10228) · 36fefd7e
  Committed by Yanqin Jin
      Summary:
Even after https://github.com/facebook/rocksdb/issues/10069, `BackupEngineTest.Concurrency` is still flaky, though with a lower probability of failure.
      
      Repro steps as follows
      ```bash
      make backup_engine_test
      gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency
      ```
      
The first two commits of this PR demonstrate how the test is flaky. https://github.com/facebook/rocksdb/issues/10069 handles the case in which
`Rename()` returns `IOError` with subcode `PathNotFound`, and `CreateLoggerFromOptions()`
allows the operation to succeed, as expected by the test. However, `BackupEngineTest` uses
`RemapFileSystem` on top of `ChrootFileSystem`, which can return `NotFound` instead of `IOError`.

This behavior differs from `Env::Default()`, which returns `PathNotFound` if the src of `rename()`
does not exist. We should make the behavior of the test Env/FS match a real Env/FS.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10228
      
      Test Plan:
      ```bash
      make check
      gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D37337241
      
      Pulled By: riversand963
      
      fbshipit-source-id: 07a53115e424467b55a731866e571f0ad4c6635d
• Expose the initial logger creation error (#10223) · 9586dcf1
  Committed by Yanqin Jin
      Summary:
https://github.com/facebook/rocksdb/issues/9984 changes the behavior of RocksDB: if logger creation fails during `SanitizeOptions()`,
`DB::Open()` will fail. However, since `SanitizeOptions()` is called in `DBImpl::DBImpl()`, we cannot
directly expose the error to the caller without some additional work.
This is a first-version proposal which:
- Adds a new member `init_logger_creation_s` to `DBImpl` to store the result of initial logger creation
- Checks the error during `DB::Open()` and returns it to the caller if non-OK

This is not very ideal. We could alternatively move the logger creation logic out of `SanitizeOptions()`.
Since `SanitizeOptions()` is used in other places, we would need to check whether this change breaks anything,
in case other callers of `SanitizeOptions()` assume that a logger should be created.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10223
      
      Test Plan: make check
      
      Reviewed By: pdillinger
      
      Differential Revision: D37321717
      
      Pulled By: riversand963
      
      fbshipit-source-id: 58042358a86369d606549dd9938933dd47591c4b
• Update API comment about Options::best_efforts_recovery (#10180) · 42c631b3
  Committed by Yanqin Jin
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10180
      
      Reviewed By: pdillinger
      
      Differential Revision: D37182037
      
      Pulled By: riversand963
      
      fbshipit-source-id: a8dc865b86e2249beb7a543c317e94a14781e910
• Add data block hash index to crash test, fix MultiGet issue (#10220) · 84210c94
  Committed by Peter Dillinger
      Summary:
There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data
block hash index, which was not caught because data block hash index was
never added to stress tests. This change fixes both issues. (A configuration sketch follows this entry.)

Fixes https://github.com/facebook/rocksdb/issues/10186

I intend to pick this into the 7.4.0 release candidate.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220
      
      Test Plan:
Failure quickly reproduces in the crash test with
kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing
the failure with a unit test would, I believe, be too tricky and fragile
to be worthwhile.
      
      Reviewed By: anand1976
      
      Differential Revision: D37315647
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba
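
A configuration sketch for the data block hash index feature this change adds to the stress tests; both options are existing `BlockBasedTableOptions` fields:

```cpp
#include "rocksdb/options.h"
#include "rocksdb/table.h"

int main() {
  rocksdb::BlockBasedTableOptions table_options;
  // Build a hash index inside each data block in addition to binary search.
  table_options.data_block_index_type =
      rocksdb::BlockBasedTableOptions::kDataBlockBinaryAndHash;
  table_options.data_block_hash_table_util_ratio = 0.75;  // the default

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return 0;
}
```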
• Refactor wal filter processing during recovery (#10214) · d654888b
  Committed by Yanqin Jin
      Summary:
Refactor so that `DBImpl::RecoverLogFiles` does not have to deal with implementation
details of `WalFilter`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10214
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D37299122
      
      Pulled By: riversand963
      
      fbshipit-source-id: acf1a80f1ef75da393d375f55968b2f3ac189816
• Update LZ4 library for platform009 (#10224) · f7605ec6
  Committed by Bo Wang
      Summary:
      Update LZ4 library for platform009.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10224
      
      Test Plan: Current unit tests should pass.
      
      Reviewed By: anand1976
      
      Differential Revision: D37321801
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 8a3d3019d9f7478ac737176f2d2f443c0159829e
• Add basic kRoundRobin compaction policy (#10107) · 30141461
  Committed by zczhu
      Summary:
Add `kRoundRobin` as a compaction priority. The implementation is as follows. (A configuration sketch follows this entry.)

- Define a cursor as the smallest internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` to `VersionStorageInfo`, where each element (`InternalKey`) represents the cursor for one level. Under the round-robin compaction policy, we select the first file (assuming files are sorted) whose smallest InternalKey is larger than or equal to the cursor. After a file is chosen, we create a new `Fsize` vector in which the selected file is placed at the first position in `temp`; the next cursor is then updated to the smallest InternalKey in the successor of the selected file (this logic is implemented in `SortFileByRoundRobin`).
- After a compaction succeeds, typically in `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level in a local variable and finally apply the change to `vstorage` in the `SaveTo` function.
- Cursors are persisted pair by pair (`<level, InternalKey>`) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor is not encoded to the MANIFEST.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107
      
      Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test`
      
      Reviewed By: ajkr
      
      Differential Revision: D37316037
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
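
A minimal sketch of enabling the new policy; the `kRoundRobin` enumerator is what this commit introduces:

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Pick compaction input files round-robin per level, resuming at the
  // persisted cursor after a reopen.
  options.compaction_pri = rocksdb::kRoundRobin;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/round_robin_demo", &db);
  delete db;
  return s.ok() ? 0 : 1;
}
```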
• Destroy initial db dir for a test in DBWALTest (#10221) · b012d235
  Committed by Yanqin Jin
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10221
      
      Reviewed By: hx235
      
      Differential Revision: D37316280
      
      Pulled By: riversand963
      
      fbshipit-source-id: 062781acec2f36beebc62003bcc8ec280488d572
7. 21 Jun 2022 (6 commits)
  8. 20 Jun 2022 (2 commits)
  9. 19 Jun 2022 (1 commit)
  10. 18 Jun 2022 (4 commits)
• Fix race condition with WAL tracking and `FlushWAL(true /* sync */)` (#10185) · d5d8920f
  Committed by Andrew Kryczka
      Summary:
`FlushWAL(true /* sync */)` is used internally and for manual WAL sync. It had a bug when used together with `track_and_verify_wals_in_manifest` where the synced size tracked in MANIFEST was larger than the number of bytes actually synced. (A usage sketch follows this entry.)
      
      The bug could be repro'd almost immediately with the following crash test command: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --max_bytes_for_level_base=2097152 --target_file_size_base=524288 --duration=3600 --interval=10 --sync_fault_injection=1 --disable_wal=0 --checkpoint_one_in=1000 --max_key=10000 --value_size_mult=33`.
      
      An example error message produced by the above command is shown below. The error sometimes arose from the checkpoint and other times arose from the main stress test DB.
      
      ```
      Corruption: Size mismatch: WAL (log number: 119) in MANIFEST is 27938 bytes , but actually is 27859 bytes on disk.
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10185
      
      Test Plan:
      - repro unit test
- the above crash test command no longer finds the error. It does find a different error after a while longer, such as "Corruption: WAL file 481 required by manifest but not in directory list"
      
      Reviewed By: riversand963
      
      Differential Revision: D37200993
      
      Pulled By: ajkr
      
      fbshipit-source-id: 98e0071c1a89f4d009888512ed89f9219779ae5f
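
A usage sketch of the combination this fix concerns: WAL tracking in the MANIFEST plus a manual synced `FlushWAL()`; the path is illustrative:

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.track_and_verify_wals_in_manifest = true;  // record synced WAL sizes

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/wal_sync_demo", &db);
  if (s.ok()) {
    s = db->Put(rocksdb::WriteOptions(), "key", "value");
  }
  if (s.ok()) {
    // Before this fix, the synced size recorded in the MANIFEST here could
    // exceed the bytes actually synced to the WAL on disk.
    s = db->FlushWAL(true /* sync */);
  }
  delete db;
  return s.ok() ? 0 : 1;
}
```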
• Add rate-limiting support to batched MultiGet() (#10159) · a5d773e0
  Committed by Hui Xiao
      Summary:
      **Context/Summary:**
https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it is harder (compared with `RandomAccessFileReader::Read()`) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for the allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.

However, due to a recent urgent user request, we decided to pursue an elementary (but conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge, as we would consume a huge amount of rate-limiter bandwidth at once, causing a burst on disk. This is not what we ultimately want with the rate limiter, so follow-up work is noted through TODO comments. (A configuration sketch follows this entry.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159
      
      Test Plan:
      - Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
      - Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of  `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
      - Stress test:
         - Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work
      
      Reviewed By: ajkr, anand1976
      
      Differential Revision: D37135172
      
      Pulled By: hx235
      
      fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
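
A configuration sketch for rate-limited user reads covering batched `MultiGet()`, assuming a rate limiter created in `kAllIo` mode so user reads can be charged:

```cpp
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/options.h"
#include "rocksdb/rate_limiter.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(
      1 << 20 /* rate_bytes_per_sec */, 100 * 1000 /* refill_period_us */,
      10 /* fairness */, rocksdb::RateLimiter::Mode::kAllIo));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/multiget_rl_demo", &db);
  if (s.ok()) {
    rocksdb::ReadOptions read_options;
    // Charge this batched MultiGet() against the rate limiter; with this PR
    // the underlying MultiRead() is covered as well.
    read_options.rate_limiter_priority = rocksdb::Env::IO_USER;
    std::vector<rocksdb::Slice> keys{"k1", "k2"};
    std::vector<std::string> values;
    std::vector<rocksdb::Status> statuses =
        db->MultiGet(read_options, keys, &values);
  }
  delete db;
  return s.ok() ? 0 : 1;
}
```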
• Read blob from blob cache if exists when GetBlob() (#10178) · c965c9ef
  Committed by Gang Liao
      Summary:
      There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
      In this task, we added a new abstraction layer `BlobSource` to retrieve blobs from either blob cache or raw blob file. Note: For simplicity, the current PR only includes `GetBlob()`.  `MultiGetBlob()` will be included in the next PR.
      
      This PR is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10178
      
      Reviewed By: ltamasi
      
      Differential Revision: D37250507
      
      Pulled By: gangliao
      
      fbshipit-source-id: 3fc4a55a0cea955a3147bdc7dba06430e377259b
• Use optimized folly DistributedMutex in LRUCache when available (#10179) · 1aac8145
  Committed by Peter Dillinger
      Summary:
folly DistributedMutex is faster than standard mutexes, though it
imposes some static obligations on usage. See
https://github.com/facebook/folly/blob/main/folly/synchronization/DistributedMutex.h
for details. Here we use this alternative for our Cache implementations
(especially LRUCache) for better locking performance when RocksDB is
compiled with folly. (A usage sketch follows this entry.)
      
      Also added information about which distributed mutex implementation is
      being used to cache_bench output and to DB LOG.
      
      Intended follow-up:
      * Use DMutex in more places, perhaps improving API to support non-scoped
      locking
      * Fix linking with fbcode compiler (needs ROCKSDB_NO_FBCODE=1 currently)
      
      Credit: Thanks Siying for reminding me about this line of work that was previously
      left unfinished.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10179
      
      Test Plan:
      for correctness, existing tests. CircleCI config updated.
      Also Meta-internal buck build updated.
      
      For performance, ran simultaneous before & after cache_bench. Out of three
      comparison runs, the middle improvement to ops/sec was +21%:
      
      Baseline: USE_CLANG=1 DEBUG_LEVEL=0 make -j24 cache_bench (fbcode
      compiler)
      
      ```
      Complete in 20.201 s; Rough parallel ops/sec = 1584062
      Thread ops/sec = 107176
      
      Operation latency (ns):
      Count: 32000000 Average: 9257.9421  StdDev: 122412.04
      Min: 134  Median: 3623.0493  Max: 56918500
      Percentiles: P50: 3623.05 P75: 10288.02 P99: 30219.35 P99.9: 683522.04 P99.99: 7302791.63
      ```
      
      New: (add USE_FOLLY=1)
      
      ```
      Complete in 16.674 s; Rough parallel ops/sec = 1919135  (+21%)
      Thread ops/sec = 135487
      
      Operation latency (ns):
      Count: 32000000 Average: 7304.9294  StdDev: 108530.28
      Min: 132  Median: 3777.6012  Max: 91030902
      Percentiles: P50: 3777.60 P75: 10169.89 P99: 24504.51 P99.9: 59721.59 P99.99: 1861151.83
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D37182983
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a17eb05f25b832b6a2c1356f5c657e831a5af8d1
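
A minimal usage sketch of `folly::DistributedMutex`, assuming a build with `USE_FOLLY=1`; the lock()/unlock(token) proxy shape is one of the "static obligations" mentioned above:

```cpp
#include <utility>

#include <folly/synchronization/DistributedMutex.h>

int counter = 0;
folly::DistributedMutex mutex;

void Increment() {
  // lock() returns a proxy token that must be moved back into unlock();
  // this is what lets DistributedMutex avoid shared-state contention.
  auto token = mutex.lock();
  ++counter;
  mutex.unlock(std::move(token));
}
```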