1. Oct 19, 2022 (1 commit)
    • Enable a multi-level db to smoothly migrate to FIFO via DB::Open (#10348) · e267909e
      Authored by Yueh-Hsuan Chiang
      Summary:
      In principle, FIFO compaction should be able to open a DB previously
      written with any compaction style. However, the current code only allows
      FIFO compaction to open a DB that has a single level.
      
      This PR relaxes the limitation of FIFO compaction and allows it to open a
      DB with multiple levels.  Below is the read / write / compaction behavior:
      
      * The read behavior is untouched, and it works like a regular rocksdb instance.
      * The write behavior is untouched as well.  When a FIFO compacted DB
      is opened with multiple levels, all new files will still be in level 0, and no files
      will be moved to a different level.
      * Compaction logic is extended.  It will first identify the bottom-most non-empty level.
      Then, it will delete the oldest file in that level.
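The extended compaction logic above can be modeled with a small sketch (illustrative C++ only, not the RocksDB implementation; `FileModel`, `PickFifoDeletionLevel`, and friends are made-up names): find the bottom-most non-empty level, then drop its oldest file.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical model: each level holds files ordered oldest-first. FIFO
// compaction on a multi-level DB deletes the oldest file in the
// bottom-most non-empty level.
struct FileModel {
  uint64_t file_number;  // smaller number == older file
};

using LevelsModel = std::vector<std::deque<FileModel>>;

// Returns the level whose oldest file should be deleted next, or -1 if
// the DB is empty. Mirrors the "bottom-most non-empty level" rule above.
int PickFifoDeletionLevel(const LevelsModel& levels) {
  for (int level = static_cast<int>(levels.size()) - 1; level >= 0; --level) {
    if (!levels[level].empty()) return level;
  }
  return -1;
}

// Deletes the oldest file per the FIFO rule; returns true if a file was
// deleted.
bool FifoDeleteOldest(LevelsModel& levels) {
  int level = PickFifoDeletionLevel(levels);
  if (level < 0) return false;
  levels[level].pop_front();  // the oldest file sits at the front
  return true;
}
```

Note that new writes still land in level 0 (per the write behavior above), so over time the lower levels drain out and the DB converges to the single-level shape FIFO normally uses.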
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348
      
      Test Plan:
      Added a new test to verify the migration from level to FIFO where the db has multiple levels.
      Extended existing test cases in db_test and db_basic_test to also verify
      all entries of a key after reopening the DB with FIFO compaction.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D40233744
      
      fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1
  2. Oct 18, 2022 (3 commits)
    • Print stack traces on frozen tests in CI (#10828) · e466173d
      Authored by Peter Dillinger
      Summary:
      Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.
      
      For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)
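The Yama work-around mentioned above can be sketched as follows (Linux-only, illustrative; the real `InstallStackTraceHandler()` does more than this). With `/proc/sys/kernel/yama/ptrace_scope == 1`, a process must opt in before a non-parent debugger can attach; `PR_SET_PTRACER_ANY` allows any process of the same user to attach.

```cpp
#include <cassert>
#include <sys/prctl.h>

// Illustrative helper: opt in to being debugged by any same-user
// process, working around Yama's ptrace_scope restriction. Returns
// true on success (prctl may fail, e.g. on kernels without Yama).
bool AllowSameUserDebuggers() {
  return prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, 0, 0, 0) == 0;
}
```

A test binary would call something like this early in `main()`, so that a watchdog running `pstack`/`gdb` against a frozen process is not blocked by the kernel.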
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828
      
      Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI
      
      Reviewed By: hx235
      
      Differential Revision: D40447634
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
    • Improve / refactor anonymous mmap capabilities (#10810) · 8367f0d2
      Authored by Peter Dillinger
      Summary:
      The motivation for this change is a planned feature (related to HyperClockCache) that will depend on a large array that can essentially grow automatically, up to some bound, without the pointer address changing and with guaranteed zero-initialization of the data. Anonymous mmaps provide such functionality, and this change provides an internal API for that.
      
      The other existing use of anonymous mmap in RocksDB is for allocating in huge pages. That code and other related Arena code used some awkward non-RAII and pre-C++11 idioms, so I cleaned up much of that as well, with RAII, move semantics, constexpr, etc.
      
      More specifics:
      * Minimize conditional compilation
      * Add Windows support for anonymous mmaps
      * Use std::deque instead of std::vector for more efficient bag
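A minimal sketch of the kind of RAII anonymous-mmap owner described above (POSIX-only, illustrative; `AnonMmap` is a made-up name, not the class added by this PR). The kernel guarantees the pages are zero-initialized, and the base address never changes for the lifetime of the mapping:

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Illustrative RAII owner for an anonymous mmap: zero-initialized
// memory at a stable address, released in the destructor.
class AnonMmap {
 public:
  explicit AnonMmap(size_t length) : length_(length) {
    addr_ = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, /*fd=*/-1, /*offset=*/0);
    if (addr_ == MAP_FAILED) addr_ = nullptr;
  }
  // Move-only, matching the RAII / move-semantics clean-up described above.
  AnonMmap(AnonMmap&& other) noexcept
      : addr_(other.addr_), length_(other.length_) {
    other.addr_ = nullptr;
    other.length_ = 0;
  }
  AnonMmap(const AnonMmap&) = delete;
  AnonMmap& operator=(const AnonMmap&) = delete;
  ~AnonMmap() {
    if (addr_ != nullptr) munmap(addr_, length_);
  }
  void* addr() const { return addr_; }
  size_t length() const { return length_; }

 private:
  void* addr_ = nullptr;
  size_t length_ = 0;
};
```

Because the mapping is reserved up front but pages are only materialized on first touch, such an array can "grow automatically, up to some bound" without the pointer moving, which is the property the planned HyperClockCache feature depends on.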
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10810
      
      Test Plan: unit test added for new functionality
      
      Reviewed By: riversand963
      
      Differential Revision: D40347204
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ca83fcc47e50fabf7595069380edd2954f4f879c
    • Do not adjust test_batches_snapshots to avoid mixing runs (#10830) · 11c0d131
      Authored by Levi Tamasi
      Summary:
      This is a small follow-up to https://github.com/facebook/rocksdb/pull/10821. The goal of that PR was to hold `test_batches_snapshots` fixed across all `db_stress` invocations; however, that patch didn't address the case when `test_batches_snapshots` is unset due to a conflicting `enable_compaction_filter` or `prefix_size` setting. This PR updates the logic so the other parameter is sanitized instead in the case of such conflicts.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10830
      
      Reviewed By: riversand963
      
      Differential Revision: D40444548
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 0331265704904b729262adec37139292fcbb7805
  3. Oct 17, 2022 (2 commits)
  4. Oct 15, 2022 (1 commit)
  5. Oct 14, 2022 (4 commits)
  6. Oct 13, 2022 (2 commits)
    • Several small improvements (#10803) · 6ff0c204
      Authored by Mark Callaghan
      Summary:
      This has several small improvements.
      
      benchmark.sh
      * add BYTES_PER_SYNC as an env variable
      * use --prepopulate_block_cache when O_DIRECT is used
      * use --undefok to list options that don't work for all 7.x releases
      * print "failure" in report.tsv when a benchmark fails
      * parse the slightly different throughput line used by db_bench for multireadrandom
      * remove the trailing comma for BlobDB size before printing it in report.tsv
      * use the last line of the output from /bin/time as there can be more than one line when db_bench has a non-zero exit
      * fix more bash lint warnings
      * add ",stats" to the --benchmark=... lines to get stats at the end of each benchmark
      
      benchmark_compare.sh
      * run revrange immediately after fillseq to let compaction debt get removed
      * add --multiread_batched when --benchmarks=multireadrandom is used
      * use --benchmarks=overwriteandwait when supported to get a more accurate measure of write-amp
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10803
      
      Test Plan: Run it for leveled, universal and BlobDB
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D40278315
      
      Pulled By: mdcallag
      
      fbshipit-source-id: 793134ddc7d48d05a07436cd8942c375a23983a7
    • Check columns in CfConsistencyStressTest::VerifyDb (#10804) · 23b7dc2f
      Authored by Levi Tamasi
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10804
      
      Reviewed By: riversand963
      
      Differential Revision: D40279057
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 9efc3dae7f5eaab162d55a41c58c2535b0a53054
  7. Oct 12, 2022 (1 commit)
    • Consider wide columns when checksumming in the stress tests (#10788) · 85399b14
      Authored by Levi Tamasi
      Summary:
      There are two places in the stress test code where we compute the CRC
      for a range of KVs for the purposes of checking consistency, namely in the
      CF consistency test (to make sure CFs contain the same data), and when
      performing `CompactRange` (to make sure the pre- and post-compaction
      states are equivalent). The patch extends the logic so that wide columns
      are also considered in both cases.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10788
      
      Test Plan: Tested using some simple blackbox crash test runs.
      
      Reviewed By: riversand963
      
      Differential Revision: D40191134
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 542c21cac9077c6d225780deb210319bb5eee955
  8. Oct 11, 2022 (9 commits)
  9. Oct 08, 2022 (4 commits)
    • Add option `preserve_internal_time_seconds` to preserve the time info (#10747) · c401f285
      Authored by Jay Zhuang
      Summary:
      Add option `preserve_internal_time_seconds` to preserve the internal
      time information.
      It is mostly intended for migrating existing data to tiered storage
      (`preclude_last_level_data_seconds`). When the tiering feature is first
      enabled, the existing data has no time information to decide whether it
      is hot or cold. Enabling this option starts collecting and preserving
      the time information for newly written data.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747
      
      Reviewed By: siying
      
      Differential Revision: D39910141
      
      Pulled By: siying
      
      fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138
    • Blog post for asynchronous IO (#10789) · f366f90b
      Authored by anand76
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10789
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D40198988
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 5db74f12dd8854f6288fbbf8775c8e759778c307
    • Exclude timestamp when checking compaction boundaries (#10787) · 11943e8b
      Authored by Yanqin Jin
      Summary:
      When checking whether a range [start, end) overlaps with a compaction whose range is [start1, end1), always exclude the timestamp from start, end, start1, and end1; otherwise, some versions of a user key may be compacted to the bottommost level while other versions of the same key remain in the original level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10787
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D40187672
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 81226267fd3e33ffa79665c62abadf2ebec45496
    • Verify wide columns during prefix scan in stress tests (#10786) · 7af47c53
      Authored by Levi Tamasi
      Summary:
      The patch adds checks to the
      `{NonBatchedOps,BatchedOps,CfConsistency}StressTest::TestPrefixScan` methods
      to make sure the wide columns exposed by the iterators are as expected (based on
      the value base encoded into the iterator value). It also makes some code hygiene
      improvements in these methods.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10786
      
      Test Plan:
      Ran some simple blackbox tests in the various modes (non-batched, batched,
      CF consistency).
      
      Reviewed By: riversand963
      
      Differential Revision: D40163623
      
      Pulled By: riversand963
      
      fbshipit-source-id: 72f4c3b51063e48c15f974c4ec64d751d3ed0a83
  10. Oct 07, 2022 (4 commits)
    • Expand stress test coverage for min_write_buffer_number_to_merge (#10785) · 943247b7
      Authored by Yanqin Jin
      Summary:
      As title.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10785
      
      Test Plan: CI
      
      Reviewed By: ltamasi
      
      Differential Revision: D40162583
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 4e01f9b682f397130e286cf5d82190b7973fa3c1
    • Use `sstableKeyCompare()` for compaction output boundary check (#10763) · 23fa5b77
      Authored by Jay Zhuang
      Summary:
      To make it consistent with the compaction picker which uses the `sstableKeyCompare()` to pick the overlap files. For example, without this change, it may cut L1 files like:
      ```
       L1: [2-21]  [22-30]
       L2: [1-10] [21-30]
      ```
      Because "21" on L1 compares as smaller than "21" on L2. But for compaction purposes, these 2 files are overlapped.
      `sstableKeyCompare()` also takes range deletions into consideration, which may cut a file at the same user key.
      It also makes the `max_compaction_bytes` calculation more accurate for cases like the above, where the overlapped bytes were underestimated. Also make sure 2 versions of the same key won't be split into 2 files because of reaching `max_compaction_bytes`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10763
      
      Reviewed By: cbi42
      
      Differential Revision: D39971904
      
      Pulled By: cbi42
      
      fbshipit-source-id: bcc309e9c3dc61a8f50667a6f633e6132c0154a8
    • Verify columns in NonBatchedOpsStressTest::VerifyDb (#10783) · d6d8c007
      Authored by Levi Tamasi
      Summary:
      As the first step of covering the wide-column functionality of iterators
      in our stress tests, the patch adds verification logic to
      `NonBatchedOpsStressTest::VerifyDb` that checks whether the
      iterator's value and columns are in sync. Note: I plan to update the other
      types of stress tests and add similar verification for prefix scans etc.
      in separate PRs.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10783
      
      Test Plan: Ran some simple blackbox crash tests.
      
      Reviewed By: riversand963
      
      Differential Revision: D40152370
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8f9d17d7af5da58ccf1bd2057cab53cc9645ac35
    • Fix bug in HyperClockCache ApplyToEntries; cleanup (#10768) · b205c6d0
      Authored by Peter Dillinger
      Summary:
      We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.)
      
      This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768
      
      Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test)
      
      Reviewed By: anand1976
      
      Differential Revision: D40034747
      
      Pulled By: anand1976
      
      fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68
  11. Oct 06, 2022 (3 commits)
  12. Oct 05, 2022 (5 commits)
  13. Oct 04, 2022 (1 commit)
    • Some clean-up of secondary cache (#10730) · 5f4391dd
      Authored by Peter Dillinger
      Summary:
      This is intended as a step toward possibly separating secondary cache integration from the
      Cache implementation as much as possible, to (hopefully) minimize code duplication in
      adding secondary cache support to HyperClockCache.
      * Major clarifications to API docs of secondary cache compatible parts of Cache. For example, previously the docs seemed to suggest that Wait() was not needed if IsReady()==true. And it wasn't clear what operations were actually supported on pending handles.
      * Add some assertions related to these requirements, such as that we don't Release() before Wait() (which would leak a secondary cache handle).
      * Fix a leaky abstraction with dummy handles, which are supposed to be internal to the Cache. Previously, these just used value=nullptr to indicate dummy handle, which meant that they could be confused with legitimate value=nullptr cases like cache reservations. Also fixed blob_source_test which was relying on this leaky abstraction.
      * Drop "incomplete" terminology, which was another name for "pending".
      * Split handle flags into "mutable" ones requiring mutex and "immutable" ones which do not. Because of single-threaded access to pending handles, the "Is Pending" flag can be in the "immutable" set. This allows removal of a TSAN work-around and removing a mutex acquire-release in IsReady().
      * Remove some unnecessary handling of charges on handles of failed lookups. Keeping total_charge=0 means no special handling needed. (Removed one unnecessary mutex acquire/release.)
      * Simplify handling of dummy handle in Lookup(). There is no need to explicitly Ref & Release w/Erase if we generally overwrite the dummy anyway. (Removed one mutex acquire/release, a call to Release().)
      
      Intended follow-up:
      * Clarify APIs in secondary_cache.h
        * Doesn't SecondaryCacheResultHandle transfer ownership of the Value() on success (implementations should not release the value in destructor)?
        * Does Wait() need to be called if IsReady() == true? (This would be different from Cache.)
        * Do Value() and Size() have undefined behavior if IsReady() == false?
        * Why have a custom API for what is essentially a std::future<std::pair<void*, size_t>>?
      * Improve unit testing of standalone handle case
      * Apparent null `e` bug in `free_standalone_handle` case
      * Clean up secondary cache testing in lru_cache_test
        * Why does TestSecondaryCacheResultHandle hold on to a Cache::Handle?
        * Why does TestSecondaryCacheResultHandle::Wait() do nothing? Shouldn't it establish the post-condition IsReady() == true?
        * (Assuming that is sorted out...) Shouldn't TestSecondaryCache::WaitAll simply wait on each handle in order (no casting required)? How about making that the default implementation?
        * Why does TestSecondaryCacheResultHandle::Size() check Value() first? If the API is intended to be returning 0 before IsReady(), then that is weird but should at least be documented. Otherwise, if it's intended to be undefined behavior, we should assert IsReady().
      * Consider replacing "standalone" and "dummy" entries with a single kind of "weak" entry that deletes its value when it reaches zero refs. Suppose you are using compressed secondary cache and have two iterators at similar places. It will probably be common for one iterator to have standalone results pinned (out of cache) when the second iterator needs those same blocks and has to re-load them from secondary cache and duplicate the memory. Combining the dummy and the standalone should fix this.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10730
      
      Test Plan:
      existing tests (minor update), and crash test with sanitizers and secondary cache
      
      Performance test for any regressions in LRUCache (primary only):
      Create DB with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
      ```
      Test before & after (run at same time) with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X100] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=233000000 -duration 30 -threads=16
      ```
      Before: readrandom [AVG    100 runs] : 22234 (± 63) ops/sec;    1.6 (± 0.0) MB/sec
      After: readrandom [AVG    100 runs] : 22197 (± 64) ops/sec;    1.6 (± 0.0) MB/sec
      That's within 0.2%, which is not significant by the confidence intervals.
      
      Reviewed By: anand1976
      
      Differential Revision: D39826010
      
      Pulled By: anand1976
      
      fbshipit-source-id: 3202b4a91f673231c97648ae070e502ae16b0f44