1. 23 May 2023, 4 commits
    • Y
      Add utils to use for handling user defined timestamp size record in WAL (#11451) · 11ebddb1
      Committed by Yu Zhang
      Summary:
      Add a util method `HandleWriteBatchTimestampSizeDifference` to handle a `WriteBatch` read from the WAL when a user-defined timestamp size record has been written and read. Two check modes are added: `kVerifyConsistency`, which just verifies that the recorded timestamp sizes are consistent with the running ones. This mode is to be used by `db_impl_secondary` for opening a DB as a secondary instance. It will also be used by `db_impl_open` until user comparator switch support is added, so that a column family cannot switch between enabling and disabling the UDT feature. The other mode, `kReconcileInconsistency`, will be used by `db_impl_open` later, when the user comparator can be changed.
      
      Another change is to extract the method `CollectColumnFamilyIdsFromWriteBatch` from db_secondary_impl.h into a standalone util file so it can be shared.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11451
      
      Test Plan:
      ```
      make check
      ./udt_util_test
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D45894386
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: b96790777f154cddab6d45d9ba2e5d20ebc6fe9d
      11ebddb1
    • Y
      Refactor WriteUnpreparedStressTest to be a unit test (#11424) · ffb5f1f4
      Committed by Yu Zhang
      Summary:
      This patch removes the "stress" aspect from WriteUnpreparedStressTest and leaves it as a unit test for correctness testing w.r.t. snapshot functionality. I added some read-your-write verification to the transaction test in db_stress.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11424
      
      Test Plan:
      `./write_unprepared_transaction_test`
      `./db_crashtest.py whitebox --txn`
      `./db_crashtest.py blackbox --txn`
      
      Reviewed By: hx235
      
      Differential Revision: D45551521
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 20c3d510eb4255b08ddd7b6c85bdb4945436f6e8
      ffb5f1f4
    • F
      fix typo in detecting HAVE_AUXV_GETAUXVAL (#10913) · 5b945adf
      Committed by FishAndBird
      Summary:
      crc32 is CPU-intensive; arm64 and ppc will benefit from crc32 acceleration.
      
      Only builds via `cmake` are affected.
      
      - arm64: tested OK; with crc32 accelerated, throughput of writing 30 GB of data improved by 30%.
      - ppc: not tested.
      - x86_64: appears unaffected.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10913
      
      Reviewed By: pdillinger
      
      Differential Revision: D45959843
      
      Pulled By: ajkr
      
      fbshipit-source-id: 93c91f2702fec33cca69139a2544d7c5ebeac4c6
      5b945adf
    • A
      Repair/instate jemalloc build on M1 (#11257) · 6eb3770b
      Committed by Alan Paxton
      Summary:
      jemalloc was not building on M1 Macs. This makes it work.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11257
      
      Reviewed By: anand1976
      
      Differential Revision: D45959570
      
      Pulled By: ajkr
      
      fbshipit-source-id: 08c2b81b399f5003a2c159d037f9bcc5d0059556
      6eb3770b
  2. 20 May 2023, 2 commits
    • Y
      Update HISTORY.md/version.h/format compatibility test for 8.3 release (#11464) · 509116c5
      Committed by Yu Zhang
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11464
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D46041333
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 7d83cf9e611451fcc7f7e4a837681ed0d4271df4
      509116c5
    • P
      Much better stats for seeks and prefix filtering (#11460) · 39f5846e
      Committed by Peter Dillinger
      Summary:
      We want to know more about opportunities for better range filters, and the effectiveness of our own range filters. Currently the stats are very limited, essentially logging just hits and misses against prefix filters for range scans in BLOOM_FILTER_PREFIX_* without tracking the false positive rate. Perhaps confusingly, when prefix filters are used for point queries, the stats are currently going into the non-PREFIX tickers.
      
      This change does several things:
      * Introduce new stat tickers for seeks and related filtering, \*LEVEL_SEEK\*
        * Most importantly, allows us to see opportunities for range filtering. Specifically, we can count how many times a seek in an SST file accesses at least one data block, and how many times at least one value() is then accessed. If a data block was accessed but no value(), we can generally assume that the key(s) seen was(were) not of interest so could have been filtered with the right kind of filter, avoiding the data block access.
        * We can get the same level of detail when a filter (for now, prefix Bloom/ribbon) is used, or not. Specifically, we can infer a false positive rate for prefix filters (not available before) from the seek "false positive" rate: when a data block is accessed but no value() is called. (There can be other explanations for a seek false positive, but in typical iterator usage it would indicate a filter false positive.)
        * For efficiency, I wanted to avoid making additional calls to the prefix extractor (or key comparisons, etc.), which would be required if we wanted to more precisely detect filter false positives. I believe that instrumenting value() is the best balance of efficiency vs. accurately measuring what we are often interested in.
        * The stats are divided between last level and non-last levels, to help understand potential tiered storage use cases.
      * The old BLOOM_FILTER_PREFIX_* stats have a different meaning: no longer referring to iterators but to point queries using prefix filters. BLOOM_FILTER_PREFIX_TRUE_POSITIVE is added for computing the prefix false positive rate on point queries, which can be due to filter false positives as well as different keys with the same prefix.
      * Similarly, the non-PREFIX BLOOM_FILTER stats are now for whole key filtering only.
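      The false-positive-rate inference described above is simple arithmetic over the new seek tickers. A minimal sketch (the function and parameter names here are hypothetical, not the RocksDB statistics API):

      ```cpp
      #include <cassert>
      #include <cstdint>

      // A seek that accessed at least one data block but never had value()
      // called is treated as a filter false positive (per the reasoning above).
      double InferredPrefixFpRate(uint64_t seeks_touching_data_block,
                                  uint64_t seeks_calling_value) {
        if (seeks_touching_data_block == 0) return 0.0;
        return static_cast<double>(seeks_touching_data_block -
                                   seeks_calling_value) /
               static_cast<double>(seeks_touching_data_block);
      }

      int main() {
        // 1000 seeks passed the filter and touched a data block; 900 of them
        // went on to access a value, so ~10% look like false positives.
        double fp = InferredPrefixFpRate(1000, 900);
        assert(fp > 0.099 && fp < 0.101);
        return 0;
      }
      ```

      As the summary notes, other explanations for a "seek false positive" exist, so this is an estimate rather than an exact filter false positive rate.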
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11460
      
      Test Plan:
      unit tests updated, including updating many to pop the stat value since last read to improve test
      readability and maintainability.
      
      Performance test shows a consistent small improvement with these changes, both with clang and with gcc. CPU profile indicates that RecordTick is using less CPU, and this makes sense at least for a high filter miss rate. Before, we were recording two ticks per filter miss in iterators (CHECKED & USEFUL) and now recording just one (FILTERED).
      
      Create DB with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
      ```
      And run simultaneous before&after with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -readonly -benchmarks=seekrandom[-X1000] -num=10000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -seek_nexts=1 -duration=20 -seed=43 -threads=8 -cache_size=1000000000 -statistics
      ```
      Before: seekrandom [AVG 275 runs] : 189680 (± 222) ops/sec;   18.4 (± 0.0) MB/sec
      After: seekrandom [AVG 275 runs] : 197110 (± 208) ops/sec;   19.1 (± 0.0) MB/sec
      
      Reviewed By: ajkr
      
      Differential Revision: D46029177
      
      Pulled By: pdillinger
      
      fbshipit-source-id: cdace79a2ea548d46c5900b068c5b7c3a02e5822
      39f5846e
  3. 19 May 2023, 4 commits
    • P
      Compatibility step for separating BlockCache and GeneralCache APIs (#11450) · 4067acab
      Committed by Peter Dillinger
      Summary:
      Add two type aliases for Cache: BlockCache and GeneralCache, and add LRUCacheOptions::MakeSharedGeneralCache(). This will ease upgrade to an intended future change to separate the cache API between block cache and other (general) uses, including row cache. Separating the APIs will make it easier to expose more details of block caching for customization. For example, it would be nice to pass the file unique ID and offset as the logical cache key instead of using a Slice, which could facilitate some file-specific customizations in block cache. This would also make it clear that HyperClockCache is not usable as a general cache, because it can only deal with fixed-size block cache keys.
      
      block_cache, row_cache, and blob_cache are the uses of Cache in the public API. blob_cache should be able to use BlockCache while row_cache is a GeneralCache user, as its keys are of arbitrary size.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11450
      
      Test Plan: see updated unit test.
      
      Reviewed By: anand1976
      
      Differential Revision: D45882067
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ff5d9f0b644f87ae337a29a7728ce3ed07b2a4b2
      4067acab
    • M
      Support Clip DB to KeyRange (#11379) · 8d8eb0e7
      Committed by mayue.fight
      Summary:
      This PR is part of the request https://github.com/facebook/rocksdb/issues/11317.
      (Another part is https://github.com/facebook/rocksdb/pull/11378)
      
      ClipDB() will clip the entries in the CF to the range [begin_key, end_key). All entries outside this range will be completely deleted (including tombstones).
      This feature is mainly used to ensure that there are no overlapping keys when calling CreateColumnFamilyWithImports() to import multiple CFs.
      
      When calling ClipDB with range [begin, end), the following steps are performed:
      
      1. Quickly and directly delete files with no overlap:
      DeleteFilesInRanges(nullptr, begin) + DeleteFilesInRanges(end, nullptr)
      2. Delete the keys outside the range:
      Delete[smallest_key, begin) + Delete[end, largest_key]
      3. Delete the tombstones through a manual compaction:
      CompactRange(option, nullptr, nullptr)
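      The [begin, end) clipping semantics can be illustrated, independent of the RocksDB API, on a plain ordered map (a sketch only; `ClipToRange` is a hypothetical name):

      ```cpp
      #include <cassert>
      #include <map>
      #include <string>

      // Everything outside [begin, end) is removed; everything inside survives.
      // This mirrors steps 1-2 above, collapsed into two range erases.
      static void ClipToRange(std::map<std::string, std::string>& kv,
                              const std::string& begin, const std::string& end) {
        kv.erase(kv.begin(), kv.lower_bound(begin));  // drop [smallest, begin)
        kv.erase(kv.lower_bound(end), kv.end());      // drop [end, largest]
      }

      int main() {
        std::map<std::string, std::string> kv{
            {"a", "1"}, {"c", "2"}, {"m", "3"}, {"x", "4"}};
        ClipToRange(kv, "c", "x");
        assert(kv.size() == 2);  // only "c" and "m" remain
        assert(kv.count("a") == 0 && kv.count("x") == 0);
        return 0;
      }
      ```

      The real implementation additionally needs step 3, the manual compaction, because DeleteRange leaves tombstones behind.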
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11379
      
      Reviewed By: ajkr
      
      Differential Revision: D45840358
      
      Pulled By: cbi42
      
      fbshipit-source-id: 54152e8a45fd8ede137f99787eb252f0b51440a4
      8d8eb0e7
    • H
      Improve comment of ExpectedValue in db stress (#11456) · 7263f51d
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      https://github.com/facebook/rocksdb/pull/11424 made me realize there are a couple of gaps in my `ExpectedValue` comments, so I updated them, along with moving `ExpectedValue` into separate files so it's clearer that `ExpectedValue` can be used without updating `ExpectedState` (e.g., TestMultiGet(), where we care about the value base of the expected value but do not update the ExpectedState).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11456
      
      Test Plan: CI
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D45965070
      
      Pulled By: hx235
      
      fbshipit-source-id: dcee690c13b00a3119757ea9d43b646f9644e1a9
      7263f51d
    • H
      Add `rocksdb.file.read.db.open.micros` (#11455) · 50046869
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      `rocksdb.file.read.db.open.micros` was left out of https://github.com/facebook/rocksdb/pull/11288
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11455
      
      Test Plan:
      - db bench
      Setup: `./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3`
      Run:
      `./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/  -benchmarks=readrandom  -key_size=3200 -value_size=512 -num=0 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none -file_checksum=1 -cache_size=1`
      
      ```
      rocksdb.sst.read.micros P50 : 3.979798 P95 : 9.738420 P99 : 19.566667 P100 : 39.000000 COUNT : 2360 SUM : 12148
      rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
      rocksdb.file.read.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
      rocksdb.file.read.db.open.micros P50 : 3.979798 P95 : 9.738420 P99 : 19.566667 P100 : 39.000000 COUNT : 2360 SUM : 12148
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D45951934
      
      Pulled By: hx235
      
      fbshipit-source-id: 6c88639dc1b10d98ecccc963ce32a8800495f55b
      50046869
  4. 18 May 2023, 3 commits
    • A
      Minimal RocksJava compliance with Java 8 language level (EB 1046) (#10951) · e110d713
      Committed by Alan Paxton
      Summary:
      Apply a small (and automatic) set of IntelliJ Java inspections/repairs to the Java interface of RocksDB and its tests.
      Partly enabled by the fact that we now (since RocksDB 7) require Java 8.
      
      Explicit <p> tags were added to empty lines in javadoc comments.
      
      Parameters and variables were made final where possible.
      Anonymous subclasses were converted to lambdas.
      
      Some tests which previously used other assertion models were converted to assertj, e.g. assertThat(actual).isEqualTo(expected).
      
      In a very few cases tests were found to be inoperative or broken, and were repaired. No problems with actual RocksDB behaviour were observed.
      
      This PR is intended to replace https://github.com/facebook/rocksdb/pull/9618 - that PR was not merged, and attempts to rebase it have yielded a questionable-looking diff, so we chose to go back to square one here and implement a conservative set of changes.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10951
      
      Reviewed By: anand1976
      
      Differential Revision: D45057849
      
      Pulled By: ajkr
      
      fbshipit-source-id: e4ea46bfc80518ae86f37702b03ca9352bc11c3d
      e110d713
    • J
      Remove wait_unscheduled from waitForCompact internal API (#11443) · 586d78b3
      Committed by Jay Huh
      Summary:
      Context:
      
      In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.
      
      In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.
      
      This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.
      
      Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.
      
      Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.
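      The decision logic described above can be sketched without the RocksDB types (the enum and function names here are hypothetical; the real API returns rocksdb::Status from the internal WaitForCompact()):

      ```cpp
      #include <cassert>

      enum class WaitResult { kOK, kShutdownInProgress, kAborted };

      // Sketch: wait for unscheduled jobs, but bail out on shutdown, and on
      // pause either abort (abort_on_pause=true) or keep waiting indefinitely.
      WaitResult WaitForCompactSketch(bool shutting_down, bool bg_paused,
                                      bool abort_on_pause,
                                      int unscheduled_jobs) {
        while (unscheduled_jobs > 0) {
          if (shutting_down) return WaitResult::kShutdownInProgress;
          if (bg_paused && abort_on_pause) return WaitResult::kAborted;
          if (bg_paused) break;  // real code would block here, possibly forever
          --unscheduled_jobs;    // stand-in for a job being scheduled and done
        }
        return WaitResult::kOK;
      }

      int main() {
        assert(WaitForCompactSketch(true, false, false, 3) ==
               WaitResult::kShutdownInProgress);
        assert(WaitForCompactSketch(false, true, true, 3) ==
               WaitResult::kAborted);
        assert(WaitForCompactSketch(false, false, false, 3) == WaitResult::kOK);
        return 0;
      }
      ```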
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443
      
      Test Plan:
      Existing tests that involve a shutdown in progress:
      
      - DBCompactionTest::CompactRangeShutdownWhileDelayed
      - DBTestWithParam::PreShutdownMultipleCompaction
      - DBTestWithParam::PreShutdownCompactionMiddle
      
      Reviewed By: pdillinger
      
      Differential Revision: D45923426
      
      Pulled By: jaykorean
      
      fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
      586d78b3
    • P
      Change internal headers with duplicate names (#11408) · 206fdea3
      Committed by Peter Dillinger
      Summary:
      In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement.
      
      No public API changes.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408
      
      Test Plan: existing tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D45456696
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373
      206fdea3
  5. 16 May 2023, 2 commits
    • H
      Support parallel read and write/delete to same key in NonBatchedOpsStressTest (#11058) · 5fc57eec
      Committed by Hui Xiao
      Summary:
      **Context:**
      The current `NonBatchedOpsStressTest` does not allow multi-threaded reads (i.e., Get, Iterator) and writes (i.e., Put, Merge) or deletes to the same key. Every read or write/delete operation will acquire a lock (`GetLocksForKeyRange`) on the target key to gain exclusive access to it. This does not align with RocksDB's nature of allowing multi-threaded reads and writes/deletes to the same key; that is, concurrent threads can issue reads/writes/deletes to RocksDB without external locking. Therefore this is a gap in our testing coverage.
      
      To close the gap, the biggest challenge lies in verifying db values against the expected state in the presence of parallel reads and writes/deletes. The challenge arises because a read/write/delete to the db and the corresponding read/write to the expected state do not happen within one atomic operation. Therefore we may not know the exact expected state of a certain db read: by the time we read the expected state for that db read, another write to the expected state, for another db write to the same key, might have changed it.
      
      **Summary:**
      Credit to ajkr's idea: we now solve this challenge by breaking the 32-bit expected value of a key into different parts that can be read and written in parallel.
      
      Basically we divide the 32-bit expected value into `value_base` (corresponding to the previous whole 32 bits, but now with some shrinking of the allowed value base range), `pending_write` (i.e., whether there is an ongoing concurrent write), `del_counter` (i.e., the number of times a value has been deleted, analogous to value_base for writes), `pending_delete` (similar to pending_write), and `deleted` (i.e., whether a key is deleted).
      
      Also, we need to use an incremental `value_base` instead of a random value base as before, because we want to control the range of value bases a correct db read result can fall in, in the presence of parallel reads and writes. That way we can more easily verify the correctness of the read against the expected state. This comes at the cost of reducing the randomness of the values generated in NonBatchedOpsStressTest, which we are willing to accept.
      
      (For detailed algorithm of how to use these parts to infer expected state of a key, see the PR)
      
      Misc: hide the value_base detail from callers of ExpectedState by abstracting the related logic into an ExpectedValue class
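      The parts described above can be sketched as a bit-packed 32-bit value. The field widths and helper names below are hypothetical (the actual layout lives in the stress test's ExpectedValue class); the point is only that each part can be updated independently:

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Illustrative packing of the 32-bit expected value:
      //   bits  0..13  value_base     (incremented per write)
      //   bit   14     pending_write  (a concurrent write is in flight)
      //   bits 15..28  del_counter    (incremented per delete)
      //   bit   29     pending_delete (a concurrent delete is in flight)
      //   bit   30     deleted        (key is currently deleted)
      struct ExpectedValueSketch {
        uint32_t raw = 0;
        uint32_t ValueBase() const { return raw & 0x3FFF; }
        bool PendingWrite() const { return (raw >> 14) & 1; }
        uint32_t DelCounter() const { return (raw >> 15) & 0x3FFF; }

        void BeginWrite() { raw |= (1u << 14); }  // announce concurrent write
        void FinishWrite() {
          raw = (raw & ~0x3FFFu) | ((ValueBase() + 1) & 0x3FFF);  // bump base
          raw &= ~(1u << 14);  // clear pending_write
          raw &= ~(1u << 30);  // key is no longer deleted
        }
      };

      int main() {
        ExpectedValueSketch v;
        v.BeginWrite();
        assert(v.PendingWrite());  // a reader can now tolerate either outcome
        v.FinishWrite();
        assert(!v.PendingWrite() && v.ValueBase() == 1);
        return 0;
      }
      ```

      A reader that observes `pending_write` set knows a concurrent write is in flight and can accept a bounded range of value bases, which is why the incremental value base matters.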
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11058
      
      Test Plan:
      - Manual test of small number of keys (i.e, high chances of parallel read and write/delete to same key) with equally distributed read/write/deleted for 30 min
      ```
      python3 tools/db_crashtest.py --simple {blackbox|whitebox} --sync_fault_injection=1 --skip_verifydb=0 --continuous_verification_interval=1000 --clear_column_family_one_in=0 --max_key=10 --column_families=1 --threads=32 --readpercent=25 --writepercent=25 --nooverwritepercent=0 --iterpercent=25 --verify_iterator_with_expected_state_one_in=1 --num_iterations=5 --delpercent=15 --delrangepercent=10 --range_deletion_width=5 --use_merge={0|1} --use_put_entity_one_in=0 --use_txn=0 --verify_before_write=0 --user_timestamp_size=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --flush_one_in=1000 --get_property_one_in=1000 --ingest_external_file_one_in=100 --backup_one_in=100 --checkpoint_one_in=100 --approximate_size_one_in=0 --acquire_snapshot_one_in=100 --use_multiget=0 --prefixpercent=0 --get_live_files_one_in=1000 --manual_wal_flush_one_in=1000 --pause_background_one_in=1000 --target_file_size_base=524288 --write_buffer_size=524288 --verify_checksum_one_in=1000 --verify_db_one_in=1000
      ```
      - Rehearsal stress tests with normal parameters and aggressive parameters, to see if this change can find what the existing stress test can find (i.e., no regression in testing capability)
      - [Ongoing] Try to find new bugs with this change that are not found by the current NonBatchedOpsStressTest with no parallel reads and writes/deletes to the same key
      
      Reviewed By: ajkr
      
      Differential Revision: D42257258
      
      Pulled By: hx235
      
      fbshipit-source-id: e6fdc18f1fad3753e5ac91731483a644d9b5b6eb
      5fc57eec
    • A
      Fix write stall stats dump format (#11445) · fb636f24
      Committed by Andrew Kryczka
      Summary:
      I noticed in https://github.com/facebook/rocksdb/issues/11426 that a line break was missing. Before this PR the output looked like
      
      ```
      $ ./db_bench -stats_per_interval=1 -stats_interval=100000
      ...
      Write Stall (count): cf-l0-file-count-limit-delays-with-ongoing-compaction: 0, cf-l0-file-count-limit-stops-with-ongoing-compaction: 0, l0-file-count-limit-delays: 0, l0-file-count-limit-stops: 0, memtable-limit-delays: 0, memtable-limit-stops: 0, pending-compaction-bytes-delays: 0, pending-compaction-bytes-stops: 0, total-delays: 0, total-stops: 0, Block cache LRUCache@0x7f8695831b50#2766536 capacity: 32.00 MB seed: 1155354975 usage: 0.09 KB table_size: 1024 occupancy: 1 collections: 1 last_copies: 0 last_secs: 9.3e-05 secs_since: 2
      ...
      Write Stall (count): write-buffer-manager-limit-stops: 0, num-running-compactions: 0
      ...
      ```
      
      After this PR it looks like
      
      ```
      $ ./db_bench -stats_per_interval=1 -stats_interval=100000
      ...
      Write Stall (count): cf-l0-file-count-limit-delays-with-ongoing-compaction: 0, cf-l0-file-count-limit-stops-with-ongoing-compaction: 0, l0-file-count-limit-delays: 0, l0-file-count-limit-stops: 0, memtable-limit-delays: 0, memtable-limit-stops: 0, pending-compaction-bytes-delays: 0, pending-compaction-bytes-stops: 0, total-delays: 0, total-stops: 0
      Block cache LRUCache@0x7f8e0d231b50#2736585 capacity: 32.00 MB seed: 920433955 usage: 0.09 KB table_size: 1024 occupancy: 1 collections: 1 last_copies: 1 last_secs: 6.5e-05 secs_since: 4
      ...
      Write Stall (count): write-buffer-manager-limit-stops: 0
      num-running-compactions: 0
      ...
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11445
      
      Reviewed By: hx235
      
      Differential Revision: D45844752
      
      Pulled By: ajkr
      
      fbshipit-source-id: 1c708cb05b6e270922ac2fa95f5d011f273347eb
      fb636f24
  6. 13 May 2023, 2 commits
    • A
      Delete temp OPTIONS file on failure to write it (#11423) · 2084cdf2
      Committed by anand76
      Summary:
      When the DB is opened, RocksDB creates a temp OPTIONS file, writes the current options to it, and renames it. In case of a failure, the temp file is left behind, and is not deleted by PurgeObsoleteFiles(). Fix this by explicitly deleting the temp file if writing to it or renaming it fails.
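      The shape of the fix can be sketched in plain C++ (illustrative only; the real code goes through RocksDB's FileSystem API, and `WriteOptionsFile` here is a hypothetical stand-in):

      ```cpp
      #include <cassert>
      #include <cstdio>
      #include <fstream>
      #include <string>

      // Write to a temp file, then rename into place. On any failure, delete
      // the temp file explicitly instead of leaving it behind.
      bool WriteOptionsFile(const std::string& final_path,
                            bool simulate_failure) {
        const std::string temp_path = final_path + ".tmp";
        {
          std::ofstream out(temp_path);
          out << "# OPTIONS contents\n";
        }
        if (simulate_failure ||
            std::rename(temp_path.c_str(), final_path.c_str()) != 0) {
          std::remove(temp_path.c_str());  // the fix: clean up the temp file
          return false;
        }
        return true;
      }

      int main() {
        assert(!WriteOptionsFile("OPTIONS-sketch", /*simulate_failure=*/true));
        std::ifstream leftover("OPTIONS-sketch.tmp");
        assert(!leftover.good());  // temp file was not left behind
        return 0;
      }
      ```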
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11423
      
      Test Plan: Add a unit test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D45540454
      
      Pulled By: anand1976
      
      fbshipit-source-id: 47facdc30d8cc5667036312d04b21d3fc253c92e
      2084cdf2
    • A
      Add block checksum mismatch ticker stat (#11438) · 113f3250
      Committed by Andrew Kryczka
      Summary:
      Added a ticker stat, `BLOCK_CHECKSUM_MISMATCH_COUNT`, to count how many block checksum verifications detected a mismatch.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11438
      
      Test Plan: new unit test
      
      Reviewed By: pdillinger
      
      Differential Revision: D45788179
      
      Pulled By: ajkr
      
      fbshipit-source-id: e2b44eba7c23b3e110ebe69eaa78a710dec2590f
      113f3250
  7. 12 May 2023, 2 commits
    • Y
      Add support in log writer and reader for a user-defined timestamp size record (#11433) · 47235dda
      Committed by Yu Zhang
      Summary:
      This patch adds support for writing and reading a user-defined timestamp size record in the log writer and log reader. It will be used by WAL logs to persist the user-defined timestamp format for subsequent WriteBatch records. Reading and writing UDT sizes for WAL logs is not included in this patch; it will come in a follow-up.
      
      The semantics of the record are: at write time, one such record is added when the log writer encounters a non-zero UDT size it hasn't recorded so far. At read time, all such records read up to a point accumulate and apply to all subsequent WriteBatch records.
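      The read-side accumulation rule can be sketched as follows (a sketch only; `TsSizeRecord` and `Accumulate` are hypothetical names, not the log reader API):

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <map>
      #include <utility>
      #include <vector>

      // Each record maps a column family to its user-defined timestamp size.
      using TsSizeRecord = std::pair<uint32_t /*cf_id*/, size_t /*ts_size*/>;

      // Records accumulate in order; later records for the same CF override
      // earlier ones, and the running map applies to subsequent WriteBatches.
      std::map<uint32_t, size_t> Accumulate(
          const std::vector<TsSizeRecord>& records) {
        std::map<uint32_t, size_t> running;
        for (const auto& rec : records) {
          running[rec.first] = rec.second;
        }
        return running;
      }

      int main() {
        auto running = Accumulate({{1, 8}, {2, 16}, {1, 0}});
        assert(running[1] == 0);   // the last record for CF 1 wins
        assert(running[2] == 16);
        return 0;
      }
      ```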
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11433
      
      Test Plan:
      ```
      make clean && make -j32 all
      ./log_test --gtest_filter="*WithTimestampSize*"
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D45678708
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: b770c8f45bb7b9383b14aac9f22af781304fb41d
      47235dda
    • C
      Support compacting files to different temperatures in FIFO compaction (#11428) · 8827cd06
      Committed by Changyu Bi
      Summary:
      - Add a new option `CompactionOptionsFIFO::file_temperature_age_thresholds` that allows user to specify age thresholds for compacting files to different temperatures. File temperature can be used to store files in different storage media. The new options allows specifying multiple temperature-age pairs. The option uses struct for a temperature-age pair to use the existing parsing functionality to make the option dynamically settable.
      - Deprecate the old option `age_for_warm` that was added for a similar purpose.
      - Compaction score calculation logic is updated to check if a file needs to be compacted to change its temperature.
      - Some refactoring is done in `FIFOCompactionPicker::PickTemperatureChangeCompaction`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11428
      
      Test Plan: adapted unit tests that were for `age_for_warm` to this new option.
      
      Reviewed By: ajkr
      
      Differential Revision: D45611412
      
      Pulled By: cbi42
      
      fbshipit-source-id: 2dc384841f61cc04abb9681e31aa2de0f0b06106
      8827cd06
  8. 11 May 2023, 1 commit
    • J
      Clean up rate limiter refill logic (#11425) · 7531cbda
      Committed by Jay Huh
      Summary:
      Context:
      
      This pull request is in response to a comment made on https://github.com/facebook/rocksdb/pull/8596#discussion_r680264932. The current implementation of RefillBytesAndGrantRequestsLocked() drains all available_bytes, but the first request after the last wave of granting is not handled the same way.
      
      This creates a scenario where if a request for a large amount of bytes is enqueued first, but there are not enough available_bytes to fulfill it, the request is put to sleep until the next refill time. Meanwhile, a later request for a smaller number of bytes comes in and is granted immediately. This behavior is not fair as the earlier request was made first.
      
      To address this issue, we changed the code to grant the front request whatever bytes are available and queue the remainder. With this change, requests are granted in the order they are received, ensuring that earlier requests are not unfairly delayed by later, smaller ones. The specific scenario described above will no longer occur. We also consolidated `granted` and `request_bytes` as part of the change, since `granted` is equivalent to `request_bytes == 0`.
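      The fairness fix can be sketched as a FIFO queue whose front request is partially granted rather than skipped (illustrative only; this is not the actual GenericRateLimiter code):

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstdint>
      #include <deque>

      struct Req {
        int64_t bytes;  // bytes still requested
      };

      // Grant available bytes strictly front-to-back. A partially satisfied
      // front request keeps its place, so a later, smaller request cannot
      // jump ahead of it.
      int64_t Refill(std::deque<Req>& queue, int64_t available) {
        while (!queue.empty() && available > 0) {
          Req& front = queue.front();
          int64_t grant = std::min(front.bytes, available);
          front.bytes -= grant;
          available -= grant;
          if (front.bytes == 0) queue.pop_front();  // fully satisfied
        }
        return available;  // leftover bytes, if any
      }

      int main() {
        std::deque<Req> q{{100}, {10}};  // the big request arrived first
        int64_t left = Refill(q, 60);    // not enough for the big one
        assert(left == 0);
        assert(q.size() == 2);
        assert(q.front().bytes == 40);   // big request keeps its place in line
        return 0;
      }
      ```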
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11425
      
      Test Plan: Added `AvailableByteSizeExhaustTest`
      
      Reviewed By: hx235
      
      Differential Revision: D45570711
      
      Pulled By: jaykorean
      
      fbshipit-source-id: a7117ed17bf4b8a7ae0f76124cb41902db1a2592
      7531cbda
  9. 10 May 2023, 3 commits
    • P
      Simplify detection of x86 CPU features (#11419) · 459969e9
      Committed by Peter Dillinger
      Summary:
      **Background** - runtime detection of certain x86 CPU features was added for optimizing CRC32c checksums, where performance is dramatically affected by the availability of certain CPU instructions and code using intrinsics for those instructions. And Java builds with native library try to be broadly compatible but performant.
      
      What has changed is that CRC32c is no longer the most efficient checksum on contemporary x86_64 hardware, nor the default checksum. XXH3 is generally faster and not as dramatically affected by the availability of certain CPU instructions. For example, on my Skylake system using db_bench (similar on an older Skylake system without AVX512):
      
      ```
      PORTABLE=1 empty USE_SSE  : xxh3->8 GB/s   crc32c->0.8 GB/s  (no SSE4.2 nor AVX2 instructions)
      PORTABLE=1 USE_SSE=1      : xxh3->19 GB/s  crc32c->16 GB/s   (with SSE4.2 and AVX2)
      PORTABLE=0 USE_SSE ignored: xxh3->28 GB/s  crc32c->16 GB/s   (also some AVX512)
      ```
      
      Testing a ~10-year-old system with SSE4.2 but without AVX2: crc32c is a similar speed to the new systems, but xxh3 is only about half that speed, also 8 GB/s like the non-AVX2 compile above. Given that xxh3 has specific optimizations for AVX2, I think we can infer that crc32c is only fastest for the ~2008-2013 period when SSE4.2 was included but not AVX2. And given that xxh3 is only about 2x slower on these systems (not >10x slower like unoptimized crc32c), I don't think we need to invest too much in optimally adapting to these old cases.
      
      x86 hardware that doesn't support fast CRC32c is now extremely rare, so requiring a custom build to support such hardware is fine IMHO.
      
      **This change** does two related things:
      * Remove runtime CPU detection for optimizing CRC32c on x86. Maintaining this code is non-zero work, and compiling special code that doesn't work on the configured target instruction set for code generation is always dubious. (On the one hand we have to ensure the CRC32c code uses SSE4.2 but on the other hand we have to ensure nothing else does.)
      * Detect CPU features in source code, not in build scripts. Although there are some hypothetical advantages to detecting in build scripts (compiler generality), RocksDB supports at least three build systems: make, cmake, and buck. It's not practical to support feature detection in all three, and we have suffered missed optimization opportunities by relying on missing or incomplete detection in cmake and buck. We also depend on some components, like xxhash, that do source code detection anyway.
      
      **In more detail:**
      * `HAVE_SSE42`, `HAVE_AVX2`, and `HAVE_PCLMUL` replaced by standard macros `__SSE4_2__`, `__AVX2__`, and `__PCLMUL__`.
      * MSVC does not provide high fidelity defines for SSE, PCLMUL, or POPCNT, but we can infer those from `__AVX__` or `__AVX2__` in a compatibility header. In rare cases of false negative or false positive feature detection, a build engineer should be able to set defines to work around the issue.
      * `__POPCNT__` is another standard define, but we happen to only need it on MSVC, where it is set by that compatibility header, or can be set by the build engineer.
      * `PORTABLE` can be set to a CPU type, e.g. "haswell", to compile for that CPU type.
      * `USE_SSE` is deprecated, now equivalent to PORTABLE=haswell, which roughly approximates its old behavior.
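      As an illustration of source-level feature detection with the standard macros above, here is a minimal CRC32C sketch (Castagnoli polynomial; deliberately not RocksDB's optimized implementation) that dispatches at compile time with no runtime CPU detection:

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <cstdint>
      #if defined(__SSE4_2__)
      #include <nmmintrin.h>  // _mm_crc32_u8
      #endif

      // Minimal CRC32C (Castagnoli) sketch: the hardware path is chosen
      // purely from the standard __SSE4_2__ macro set by the compiler for
      // the configured target, mirroring the approach in this change.
      uint32_t Crc32c(const uint8_t* data, size_t n) {
        uint32_t crc = 0xFFFFFFFFu;
      #if defined(__SSE4_2__)
        for (size_t i = 0; i < n; ++i) {
          crc = _mm_crc32_u8(crc, data[i]);  // SSE4.2 CRC32 instruction
        }
      #else
        for (size_t i = 0; i < n; ++i) {  // bitwise software fallback
          crc ^= data[i];
          for (int k = 0; k < 8; ++k) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
          }
        }
      #endif
        return crc ^ 0xFFFFFFFFu;
      }
      ```

      Both paths compute the same standard CRC-32C, e.g. 0xE3069283 for the input "123456789", so the macro only selects speed, not behavior.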
      
      Notably, this change should enable more builds to use the AVX2-optimized Bloom filter implementation.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11419
      
      Test Plan:
      existing tests, CI
      
      Manual performance tests after the change match the before above (none expected with make build).
      
      We also see AVX2 optimized Bloom filter code enabled when expected, by injecting a compiler error. (Performance difference is not big on my current CPU.)
      
      Reviewed By: ajkr
      
      Differential Revision: D45489041
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 60ceb0dd2aa3b365c99ed08a8b2a087a9abb6a70
      459969e9
    • P
      Add hash_seed to Caches (#11391) · f4a02f2c
      Peter Dillinger 提交于
      Summary:
      See motivation and description in new ShardedCacheOptions::hash_seed option.
      
      Updated db_bench so that its seed param is used for the cache hash seed.
      Made its code safer by ensuring the seed is set before use.
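      A toy sketch of why a per-cache hash seed matters for shard selection (the hash here is FNV-1a for illustration; RocksDB uses its own hash, and `ShardFor` is a hypothetical name): the seed perturbs the hash so separate cache instances are unlikely to share the same hot shards for the same keys.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <string>

      // Illustrative shard selection seeded per cache instance.
      // FNV-1a 32-bit with the seed folded into the offset basis;
      // not RocksDB's actual hash or sharding code.
      uint32_t ShardFor(const std::string& key, uint32_t seed,
                        uint32_t num_shards) {
        uint32_t h = 2166136261u ^ seed;  // seed perturbs the whole hash
        for (unsigned char c : key) {
          h = (h ^ c) * 16777619u;
        }
        return h % num_shards;
      }
      ```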
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11391
      
      Test Plan:
      unit tests added / updated
      
      **Performance** - no discernible difference seen running cache_bench repeatedly before & after. With lru_cache and hyper_clock_cache.
      
      Reviewed By: hx235
      
      Differential Revision: D45557797
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 40bf4da6d66f9d41a8a0eb8e5cf4246a4aa07934
      f4a02f2c
    • A
      Fix build error: variable 'base_level' may be uninitialized (#11435) · 6ba4717f
      akankshamahajan 提交于
      Summary:
      Fix build error: variable 'base_level' may be uninitialized
      ```
       db_impl_compaction_flush.cc:1195:21: error: variable 'base_level' may be uninitialized when used here [-Werror,-Wconditional-uninitialized]
                  level = base_level;
                          ^~~~~~~~~~
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11435
      
      Test Plan: CircleCI jobs
      
      Reviewed By: cbi42
      
      Differential Revision: D45708176
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 851b1205b22b63d728495e5735fa91b0ad8e012b
      6ba4717f
  10. 09 5月, 2023 3 次提交
    • H
      Record and use the tail size to prefetch table tail (#11406) · 8f763bde
      Hui Xiao 提交于
      Summary:
      **Context:**
      We prefetch the tail part of a SST file (i.e., the blocks after the data blocks till the end of the file) during each SST file open, in hope of prefetching all the stuff at once ahead of time for later reads, e.g., footer, meta index, filter/index etc. The existing approach to estimate the tail size to prefetch is through `TailPrefetchStats` heuristics introduced in https://github.com/facebook/rocksdb/pull/4156, which has caused small reads in unlucky cases (e.g., a small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetching use a small size that it shouldn't) and is hard to debug. Therefore we decided to record the exact tail size and use it directly to prefetch the tail of the SST instead of relying on heuristics.
      
      **Summary:**
      - Obtain and record in manifest the tail size in `BlockBasedTableBuilder::Finish()`
         - For backward compatibility, we fall back to TailPrefetchStats and, as a last resort, to a simple heuristic that the tail size is a linear portion of the file size - see PR conversation for more.
      - Make `tail_start_offset` part of the table properties and derive the tail size to record in manifest for external files (e.g., file ingestion, import CF) and db repair (with no access to manifest).
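      The relationship between the recorded property and the prefetched size can be sketched as follows (illustrative helper name; the actual derivation lives in the PR): with `tail_start_offset` in the table properties, the tail to prefetch is simply the remainder of the file.

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Sketch: derive the tail size from the recorded tail_start_offset
      // table property, for files with no manifest-recorded tail size
      // (external files, repair). Not RocksDB's actual code.
      uint64_t TailSizeFromOffset(uint64_t file_size,
                                  uint64_t tail_start_offset) {
        assert(tail_start_offset <= file_size);
        return file_size - tail_start_offset;  // footer + meta blocks
      }
      ```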
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11406
      
      Test Plan:
      1. New UT
      2. db bench
      Note: db_bench on /tmp/, where direct read is supported, is too slow to finish, and db_bench's default pinning setting is not helpful for profiling the number of SST reads per Get. Therefore I hacked in the following to obtain the comparison below.
      ```
       diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc
      index bd5669f0f..791484c1f 100644
       --- a/table/block_based/block_based_table_reader.cc
      +++ b/table/block_based/block_based_table_reader.cc
      @@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail(
                                  &tail_prefetch_size);
      
         // Try file system prefetch
      -  if (!file->use_direct_io() && !force_direct_prefetch) {
      +  if (false && !file->use_direct_io() && !force_direct_prefetch) {
           if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority)
                    .IsNotSupported()) {
             prefetch_buffer->reset(new FilePrefetchBuffer(
       diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc
      index ea40f5fa0..39a0ac385 100644
       --- a/tools/db_bench_tool.cc
      +++ b/tools/db_bench_tool.cc
      @@ -4191,6 +4191,8 @@ class Benchmark {
                 std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options));
           } else {
             BlockBasedTableOptions block_based_options;
      +      block_based_options.metadata_cache_options.partition_pinning =
      +      PinningTier::kAll;
             block_based_options.checksum =
                 static_cast<ChecksumType>(FLAGS_checksum_type);
             if (FLAGS_use_hash_search) {
      ```
      Create DB
      ```
      ./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
      ```
      ReadRandom
      ```
      ./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
      ```
      (a) Existing (Use TailPrefetchStats for tail size + use separate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies())
      ```
      rocksdb.table.open.prefetch.tail.hit COUNT : 3395
      rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614
      ```
      
      (b) This PR (Record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies())
      ```
      rocksdb.table.open.prefetch.tail.hit COUNT : 14257
      rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540
      ```
      
      As we can see, this PR increases the prefetch tail hit count and decreases the SST read count.
      
      3. Test backward compatibility by stepping through reading with post-PR code on a db generated pre-PR.
      
      Reviewed By: pdillinger
      
      Differential Revision: D45413346
      
      Pulled By: hx235
      
      fbshipit-source-id: 7d5e36a60a72477218f79905168d688452a4c064
      8f763bde
    • P
      Organize + modernize ReadOptions (#11430) · e1d1c503
      Peter Dillinger 提交于
      Summary:
      Roughly group ReadOptions into those that apply generally and those that only apply to range scans. Also use field assignment idiom to simplify specification of default values.
      
      Also some rearranging to reduce unused padding. sizeof(ReadOptions) was 144 on my system, now 136.
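      The padding effect can be illustrated with toy structs (hypothetical fields, not the real ReadOptions): grouping same-size members avoids alignment holes, and the field assignment idiom puts defaults on the fields themselves rather than in a constructor initializer list.

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Toy structs showing why member ordering matters for padding.
      // Field names are illustrative only.
      struct Ordered {
        const void* snapshot = nullptr;  // pointer-size members first
        uint64_t deadline_us = 0;
        bool verify_checksums = true;    // 1-byte members packed together
        bool fill_cache = true;
      };

      struct Unordered {
        bool verify_checksums = true;    // 1 byte + padding before the pointer
        const void* snapshot = nullptr;
        bool fill_cache = true;          // 1 byte + padding before the uint64_t
        uint64_t deadline_us = 0;
      };
      ```

      On a typical 64-bit ABI, `Ordered` packs into fewer bytes than `Unordered` because the two bools share one alignment hole instead of each creating their own.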
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11430
      
      Test Plan: existing tests, no functional change intended
      
      Reviewed By: hx235
      
      Differential Revision: D45626508
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 227d4158c5123405324f273ded2eb9d8bce86364
      e1d1c503
    • A
      Added encryption plugin based on Intel open-source ipp-crypto library (#11429) · 736b3c49
      Akanksha koul 提交于
      Summary:
      This PR adds a plugin that supports AES-CTR encryption for RocksDB, based on the highly performant Intel open-source cryptographic library IPP-Crypto.
      
      Details:
      - supports AES-128, AES-192, and AES-256.
      - uses the CTR mode of operation.
      - based on the Intel® crypto library -- https://github.com/intel/ipp-crypto.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11429
      
      Reviewed By: cbi42
      
      Differential Revision: D45622342
      
      Pulled By: ajkr
      
      fbshipit-source-id: 2463fa2b8ae625fdd7d83768e274c74e3f2a0f46
      736b3c49
  11. 05 5月, 2023 1 次提交
  12. 04 5月, 2023 3 次提交
  13. 03 5月, 2023 2 次提交
    • L
      DBIter::FindValueForCurrentKey: remove unused Status s (#11394) · a475e9f7
      leipeng 提交于
      Summary:
      This PR removes a piece of historical dead code.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11394
      
      Reviewed By: ajkr
      
      Differential Revision: D45506226
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 32c98627100c9ad131bf65c4a1fe97ab61502daf
      a475e9f7
    • A
      Delete empty WAL files on reopen (#11409) · 03a892a9
      anand76 提交于
      Summary:
      When a DB is opened, RocksDB creates an empty WAL file. When the DB is reopened and the WAL is empty, the min log number to keep is not advanced until a memtable flush happens. If a process crashes soon after reopening the DB, it's likely that no memtable flush would have happened, which means the empty WAL file is not deleted. In a crash loop scenario, this leads to empty WAL files accumulating. Fix this by ensuring the min log number is advanced if the WAL is empty.
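      The rule the fix enforces can be sketched as illustrative pseudologic (hypothetical names, not the PR's actual code): if the newest WAL is empty at recovery time, the minimum log number to keep may advance past it, making the empty file deletable.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <vector>

      struct WalFile {
        uint64_t log_number;
        uint64_t size_bytes;
      };

      // Sketch: an empty trailing WAL holds nothing to recover, so it
      // need not hold back the min log number to keep.
      uint64_t MinLogNumberToKeep(const std::vector<WalFile>& wals,
                                  uint64_t min_from_memtables) {
        uint64_t min_keep = min_from_memtables;
        if (!wals.empty() && wals.back().size_bytes == 0) {
          uint64_t past_empty = wals.back().log_number + 1;
          if (past_empty > min_keep) min_keep = past_empty;
        }
        return min_keep;
      }
      ```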
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11409
      
      Test Plan: Add a unit test
      
      Reviewed By: ajkr
      
      Differential Revision: D45281685
      
      Pulled By: anand1976
      
      fbshipit-source-id: 0225877c613e65ffb30972a0051db2830105423e
      03a892a9
  14. 02 5月, 2023 4 次提交
    • P
      Avoid long parameter lists configuring Caches (#11386) · 41a7fbf7
      Peter Dillinger 提交于
      Summary:
      For better clarity, encouraging more options explicitly specified using fields rather than positionally via constructor parameter lists. Simplifies code maintenance as new fields are added. Deprecate some cases of the confusing pattern of NewWhatever() functions returning shared_ptr.
      
      Net reduction of about 70 source code lines (including comments).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11386
      
      Test Plan: existing tests
      
      Reviewed By: ajkr
      
      Differential Revision: D45059075
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d53fa09b268024f9c55254bb973b6c69feebf41a
      41a7fbf7
    • P
      Optionally support lldb for stack traces and debugger attach (#11413) · e0e318f3
      Peter Dillinger 提交于
      Summary:
      lldb is more supported for Meta infrastructure than gdb, so adding support for it in generating stack traces and attaching debugger on crash. For now you need to set ROCKSDB_LLDB_STACK=1 for stack traces or ROCKSDB_DEBUG=lldb for interactive debugging.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11413
      
      Test Plan: some manual testing (no production code changes)
      
      Reviewed By: ajkr
      
      Differential Revision: D45360952
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 862bc8800eb03e3bdc1be8c0702960a19db45be8
      e0e318f3
    • P
      Fix duplicate symbols in linking with buck2 (#11421) · 76a40286
      Peter Dillinger 提交于
      Summary:
      Seen in Meta-internal builds that manually depend on rocksdb_whole_archive_lib and want to automatically depend on rocksdb_lib. This change puts rocksdb_lib in the ancestry of rocksdb_whole_archive_lib, and buck2 appears to recognize that even if rocksdb_lib is listed as a separate dependency downstream.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11421
      
      Test Plan: Run failing internal build with the change. See T147085939
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D45446689
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e8a891fa020dfcf0564b35d30511d70347650fa8
      76a40286
    • A
      Shard JemallocNodumpAllocator (#11400) · 925d8252
      Andrew Kryczka 提交于
      Summary:
      RocksDB's jemalloc no-dump allocator (`NewJemallocNodumpAllocator()`) was using a single manual arena. This arena's lock contention could be very high when thread caching is disabled for RocksDB blocks (e.g., when using `MALLOC_CONF='tcache_max:4096'` and `rocksdb_block_size=16384`).
      
      This PR changes the jemalloc no-dump allocator to use a configurable number of manual arenas. That number is required to be a power of two so we can avoid division. The allocator shards allocation requests randomly across those manual arenas.
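      The power-of-two sharding described above can be sketched as follows (class and method names are illustrative, not RocksDB's): requiring the arena count to be a power of two lets the modulo reduce to a bitwise AND.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <random>

      // Sketch of random sharding across N manual arenas, N a power of two
      // so selection avoids division. Not the actual allocator code.
      class ShardedArenaPicker {
       public:
        explicit ShardedArenaPicker(uint32_t num_arenas)
            : mask_(num_arenas - 1) {
          // Power of two: exactly one bit set.
          assert(num_arenas > 0 && (num_arenas & (num_arenas - 1)) == 0);
        }
        // Randomly pick an arena for an allocation request.
        uint32_t PickArena() { return static_cast<uint32_t>(rng_()) & mask_; }

       private:
        uint32_t mask_;
        std::mt19937 rng_;
      };
      ```

      Spreading requests over several arenas trades a small RSS increase for much lower arena lock contention, which matches the benchmark table below.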
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11400
      
      Test Plan:
      - mysqld setup
        - Branch: fb-mysql-8.0.28 (https://github.com/facebook/mysql-5.6/commit/653eba2e56cfba4eac0c851ac9a70b2da9607527)
        - Build: `mysqlbuild.sh --clean --release`
        - Set env var `MALLOC_CONF='tcache_max:$tcache_max'`
        - Added CLI args `--rocksdb_cache_dump=false --rocksdb_block_cache_size=4294967296 --rocksdb_block_size=16384`
        - Ran under /usr/bin/time
      - Large database scenario
        - Setup command: `mysqlslap -h 127.0.0.1 -P 13020 --auto-generate-sql=1 --auto-generate-sql-load-type=write --auto-generate-sql-guid-primary=1 --number-char-cols=8 --auto-generate-sql-execute-number=262144 --concurrency=32 --no-drop`
        - Benchmark command: `mysqlslap -h 127.0.0.1 -P 13020 --query='select count(*) from mysqlslap.t1;' --number-of-queries=320 --concurrency=32`
        - Results:
      
      | tcache_max | num_arenas | Peak RSS MB (% change) | Query latency seconds (% change) |
      |---|---|---|---|
      | 4096 | **(baseline)** | 4541 | 37.1 |
      | 4096 | 1 | 4535 (-0.1%) | 36.7 (-1%) |
      | 4096 | 8 | 4687 (+3%) | 10.2 (-73%) |
      | 16384 | **(baseline)** | 4514 | 8.4 |
      | 16384 | 1 | 4526 (+0.3%) | 8.5 (+1%) |
      | 16384 | 8 | 4580 (+1%) | 8.5 (+1%) |
      
      Reviewed By: pdillinger
      
      Differential Revision: D45220794
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9a50c9872bdef5d299e52b115a65ee8a5557d58d
      925d8252
  15. 29 4月, 2023 1 次提交
    • L
      Deflake some old BlobDB test cases (#11417) · d3ed7968
      Levi Tamasi 提交于
      Summary:
      The old `StackableDB` based BlobDB implementation relies on a DB listener to track the total size of the SST files in the database and to trigger FIFO eviction. Some test cases in `BlobDBTest` assume that the listener is notified by the time `DB::Flush` returns, which is not guaranteed (side note: `TEST_WaitForFlushMemTable` would not guarantee this either). The patch fixes these tests by using `SyncPoint`s to make sure the listener is actually called before verifying the FIFO behavior.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11417
      
      Test Plan:
      ```
      make -j56 COERCE_CONTEXT_SWITCH=1 blob_db_test
      ./blob_db_test --gtest_filter=BlobDBTest.FIFOEviction_TriggerOnSSTSizeChange
      ./blob_db_test --gtest_filter=BlobDBTest.FilterForFIFOEviction
      ./blob_db_test --gtest_filter=BlobDBTest.FIFOEviction_NoEnoughBlobFilesToEvict
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D45407135
      
      Pulled By: ltamasi
      
      fbshipit-source-id: fcd63d76937d2c975f569a6635ce8730772a3d75
      d3ed7968
  16. 26 4月, 2023 2 次提交
    • C
      Block per key-value checksum (#11287) · 62fc15f0
      Changyu Bi 提交于
      Summary:
      Add option `block_protection_bytes_per_key` and an implementation for block per key-value checksums. The main changes are
      1. checksum construction and verification in block.cc/h
      2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
      3. unit tests/crash test updates
      
      Tests:
      * Added unit tests
      * Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`
      
      Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled.
      
      Performance:
      Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory.
      For memory, each block includes an additional 24 bytes for checksum-related state besides the checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32-byte keys and 200-byte values) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache sizes (to mimic various cache hit rates):
      
      ```
      SETUP
      make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
      ./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none
      
      BENCHMARK
      ./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE
      
      The readrandom ops/sec looks like the following:
      Block cache size:  2GB        1.2GB * 0.9    1.2GB * 0.8     1.2GB * 0.5   8MB
      Main              240805     223604         198176           161653       139040
      PR prot_bytes=0   238691     226693         200127           161082       141153
      PR prot_bytes=1   214983     193199         178532           137013       108211
      prot_bytes=1 vs    -10%        -15%          -10.8%          -15%        -23%
      prot_bytes=0
      ```
      
      The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.
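      The per key-value protection idea can be sketched as follows (an illustrative FNV-1a-based stand-in; RocksDB's actual scheme lives in block.cc/h and differs): hash the key and value together, truncate to `block_protection_bytes_per_key` bytes, store beside the entry, and recheck on read.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <string>

      // Illustrative per key-value protection value, truncated to
      // protection_bytes_per_key low bytes. Not RocksDB's actual scheme.
      uint64_t KvProtection(const std::string& key, const std::string& value,
                            int protection_bytes_per_key) {
        uint64_t h = 1469598103934665603ull;  // FNV-1a 64-bit offset basis
        auto mix = [&h](const std::string& s) {
          for (unsigned char c : s) {
            h = (h ^ c) * 1099511628211ull;  // FNV-1a 64-bit prime
          }
        };
        mix(key);
        mix(value);
        int bits = protection_bytes_per_key * 8;
        return bits >= 64 ? h : (h & ((1ull << bits) - 1));
      }
      ```

      A corrupted value yields a different protection value, so a read-side recheck catches the corruption; larger `protection_bytes_per_key` lowers the false-negative rate at the memory and CPU cost measured above.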
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287
      
      Reviewed By: ajkr
      
      Differential Revision: D43970708
      
      Pulled By: cbi42
      
      fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
      62fc15f0
    • L
      DBImpl::MultiGet: delete unused var `superversions_to_delete` (#11395) · 40d69b59
      leipeng 提交于
      Summary:
      In db_impl.cc DBImpl::MultiGet: delete unused var `superversions_to_delete`
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11395
      
      Reviewed By: ajkr
      
      Differential Revision: D45240896
      
      Pulled By: cbi42
      
      fbshipit-source-id: 0fff99b0d794b6f6d4aadee6036bddd6cb19eb31
      40d69b59
  17. 25 4月, 2023 1 次提交
    • H
      Add back io_uring stress test hack with DbStressFSWrapper for FS not supporting read async (#11404) · 3622cfa3
      Hui Xiao 提交于
      Summary:
      **Context/Summary:**
      To better utilize `DbStressFSWrapper` for some assertion, https://github.com/facebook/rocksdb/pull/11288 removed an io_uring stress test hack for POSIX FS not supporting read async added in https://github.com/facebook/rocksdb/pull/11242. It was removed based on the assumption that a later PR https://github.com/facebook/rocksdb/pull/11296 would be sufficient to serve as an alternative workaround.
      
      But recent stress tests have shown the opposite, mostly because the approach in 11296 might be incomplete when more `ReadOptions` are passed down, as https://github.com/facebook/rocksdb/pull/11288 has done.
      
      As a short-term solution to both work around the POSIX FS constraint above and utilize `DbStressFSWrapper` for 11288's assertion, I proposed this PR.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11404
      
      Test Plan:
      - Stress test ensures 11288's assertion is still effective in `DbStressFSWrapper`
      ```
      ./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=16 --bottommost_compression_type=disable --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=32767 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4 --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --enable_thread_tracking=0 --expected_values_dir=$exp --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --kill_random_test=888887 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 
--max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=8388608 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --ribbon_starting_level=1 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=2 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=1 --writepercent=35
      ```
      - Monitor future stress test to show `MultiGet error: Not implemented: ReadAsync` is gone
      
      Reviewed By: ltamasi
      
      Differential Revision: D45242280
      
      Pulled By: hx235
      
      fbshipit-source-id: 9823e3fbd4e9672efdd31478a2f2cbd68a98bdf5
      3622cfa3