- 07 9月, 2023 4 次提交
-
-
由 Changyu Bi 提交于
Summary: Tests a scenario where range tombstone reseek used to cause MergingIterator to discard non-ok status. Ran on main without https://github.com/facebook/rocksdb/issues/11786: ``` ./db_range_del_test --gtest_filter="*RangeDelReseekAfterFileReadError*" Note: Google Test filter = *RangeDelReseekAfterFileReadError* [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBRangeDelTest [ RUN ] DBRangeDelTest.RangeDelReseekAfterFileReadError db/db_range_del_test.cc:3577: Failure Value of: iter->Valid() Actual: true Expected: false [ FAILED ] DBRangeDelTest.RangeDelReseekAfterFileReadError (64 ms) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11790 Reviewed By: ajkr Differential Revision: D48972869 Pulled By: cbi42 fbshipit-source-id: b1a71867533b0fb60af86f8ce8a9e391ba84dd57
-
由 anand76 提交于
Summary: https://github.com/facebook/rocksdb/issues/11789 added error injection during compaction to db_stress. However, error injection was not disabled after compaction completion, which resulted in some test failures due to stale errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11798 Reviewed By: cbi42 Differential Revision: D49022821 Pulled By: anand1976 fbshipit-source-id: 3cbfe18d55bee393697e063d05e7a7a7f88b7635
-
由 Changyu Bi 提交于
Summary: Some repro unit tests for the bug fixed in https://github.com/facebook/rocksdb/pull/11782. Ran on main without https://github.com/facebook/rocksdb/pull/11782: ``` ./db_compaction_test --gtest_filter='*ErrorWhenReadFileHead' Note: Google Test filter = *ErrorWhenReadFileHead [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBCompactionTest [ RUN ] DBCompactionTest.ErrorWhenReadFileHead db/db_compaction_test.cc:10105: Failure Value of: s.IsIOError() Actual: false Expected: true [ FAILED ] DBCompactionTest.ErrorWhenReadFileHead (3960 ms) ./db_iterator_test --gtest_filter="*ErrorWhenReadFile*" Note: Google Test filter = *ErrorWhenReadFile* [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBIteratorTest [ RUN ] DBIteratorTest.ErrorWhenReadFile db/db_iterator_test.cc:3399: Failure Value of: (iter->status()).ok() Actual: true Expected: false [ FAILED ] DBIteratorTest.ErrorWhenReadFile (280 ms) [----------] 1 test from DBIteratorTest (280 ms total) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11788 Reviewed By: ajkr Differential Revision: D48940284 Pulled By: cbi42 fbshipit-source-id: 06f3c5963f576db3f85d305ffb2745ee13d209bb
-
由 git-hulk 提交于
Summary: Currently, rocksdb users would use the event listener to catch the compaction/flush event and log them if any. But now the reason is an integer type instead of a human-readable string, so we would like to convert them into a human-readable string. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11778 Reviewed By: jaykorean Differential Revision: D49012934 Pulled By: ajkr fbshipit-source-id: a4935b95d70c1be02aec65da7bf1c98a8cf8b933
-
- 06 9月, 2023 2 次提交
-
-
由 Peter Dillinger 提交于
Summary: There was a `#include "port/lang.h"` situated inside an `extern "C" {` which just started causing the header to be unusuable in some contexts. This was a regression on the CircleCI job build-linux-unity-and-headers in https://github.com/facebook/rocksdb/issues/11792 The include, and another like it, now appears obsolete so removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11797 Test Plan: local `make check-headers` and `make`, CI Reviewed By: jaykorean Differential Revision: D48976826 Pulled By: pdillinger fbshipit-source-id: 131d66969e045f2ded0f8936924ee30c9ef2655a
-
由 Andrew Kryczka 提交于
Summary: - Fixed misspellings of "inject" - Made user read errors retryable when `FLAGS_inject_error_severity == 1` - Added compaction read errors when `FLAGS_read_fault_one_in > 0`. These are always retryable so that the DB will keep accepting writes - Reenabled setting `compaction_readahead_size` in crash test. The reason for disabling it was to "keep the test clean", which is not a good enough reason to skip testing it Pull Request resolved: https://github.com/facebook/rocksdb/pull/11789 Test Plan: With https://github.com/facebook/rocksdb/issues/11782 reverted, reproduced the bug: - Build: `make -j56 db_stress` - Command: `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --interval=10 --max_key=1000000` - Output: ``` stderr has error message: ***put or merge error: Corruption: Compaction number of input keys does not match number of keys processed.*** ``` Reviewed By: cbi42 Differential Revision: D48939994 Pulled By: ajkr fbshipit-source-id: a1efb799efecdfd5d9cfd185e4a6321db8fccfbb
-
- 05 9月, 2023 1 次提交
-
-
由 Peter Dillinger 提交于
Summary: Forgot to run TSAN test on latest revision of https://github.com/facebook/rocksdb/issues/11738 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11792 Test Plan: Use cache_bench to reproduce TSAN errors and observe fix Reviewed By: ajkr Differential Revision: D48953196 Pulled By: pdillinger fbshipit-source-id: 9e358b4768d8ddde86f84b451863263f661d7b80
-
- 02 9月, 2023 4 次提交
-
-
由 hulk 提交于
Summary: [Apache Kvrocks](https://github.com/apache/kvrocks) is an open-source distributed key-value NoSQL database built on top of RocksDB. It serves as a cost-saving and capacity-increasing alternative drop-in replacement for Redis Pull Request resolved: https://github.com/facebook/rocksdb/pull/11779 Reviewed By: ajkr Differential Revision: D48872257 Pulled By: cbi42 fbshipit-source-id: 507f67d69b826607a1464a22ec7c60abe11c5124
-
由 Peter Dillinger 提交于
Summary: This change add an experimental next-generation HyperClockCache (HCC) with automatic sizing of the underlying hash table. Both the existing version (stable) and the new version (experimental for now) of HCC are available depending on whether an estimated average entry charge is provided in HyperClockCacheOptions. Internally, we call the two implementations AutoHyperClockCache (new) and FixedHyperClockCache (existing). The performance characteristics and much of the underlying logic are similar enough that AutoHCC is likely to make FixedHCC obsolete, and so it's best considered an evolution of the same technology or solution rather than an alternative. More specifically, both implementations share essentially the same logic for managing the state of individual entries in the cache, including metadata for reference counting and counting clocks for eviction. This metadata, which I like to call the "low-level HCC protocol," includes a read-write lock on entries, but relaxed consistency requirements on the cache (e.g. allowing rare duplication) means high-level cache operations never wait for these low-level per-entry locks. FixedHCC is fully wait-free. AutoHCC is different in how entries are indexed into an efficient hash table. AutoHCC is "essentially wait-free" as there is no pattern of typical high-level operations on a large cache that can lead to one thread waiting on another to complete some work, though it can happen in some unusual/unlucky cases, or atypical uses such as erasing specific cache keys. Table growth and entry reclamation is more complex in AutoHCC compared to FixedHCC, so uses some localized locking to manage that. AutoHCC uses linear hashing to grow the table as needed, with low latency and to a precise size. AutoHCC depends on anonymous mmap support from the OS (currently verified working on Linux, MacOS, and Windows) to allow the array underlying a hash table to grow in place without wasting resident memory on space reserved but unused. AutoHCC uses a form of chaining while FixedHCC uses open addressing and double hashing. More specifics: * In developing this PR, a rare availability bug (minor) was noticed in the existing HCC implementation of Release()+erase_if_last_ref, which is now inherited into AutoHCC. Fixing this without a performance regression will not be simple, so is left for follow-up work. * Some existing unit tests required adjustment of operational parameters or conditions to work with the new behaviors of AutoHCC. A number of bugs were found and fixed in the validation process, including getting unit tests in good working order. * Added an option to cache_bench, `-degenerate_hash_bits` for correctness stress testing described below. For this, the tool uses the reverse-engineered hash function for HCC to generate keys in which the specified number of hash bits, in critical positions, have a fixed value. Essentially each degenerate hash bit will half the number of chain heads utilized and double the average chain length. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11738 Test Plan: unit tests updated, and already added to db crash test. Also ## Correctness The code includes generous assertions to check for unexpected states, especially at destruction time, so should be able to detect critical concurrency bugs. Less serious "availability bugs" in which cache data is hidden or cleanly lost are more difficult to detect, but also less scary for data correctness (as long as performance is good and the design is sound). In average operation, the structure is extremely low stress and low contention (see next section) so stressing the corner case logic requires artificially stressing the operating conditions. First, we keep the structure small to increase the number of threads hitting the same chain or entry, and just one cache shard. Second, we artificially degrade the hashing so that chains are much longer than typical, using the new `-degenerate_hash_bits` option to cache_bench. Third, we re-create the structure from scratch frequently in order to exercise the Grow logic repeatedly and to get the benefit of the consistency checks in the structure's destructor in debug builds. For cache_bench this also means disabling the single-threaded "populate cache" step (normally used for steady state performance testing). And of course use many more threads than cores to have many preemptions. An effective test for working out bugs was this (using debug build of course): ``` while ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -cache_size=8000000 -threads=100 -populate_cache=0 -ops_per_thread=10000 -degenerate_hash_bits=6 -num_shard_bits=0; do :; done ``` Or even smaller cases. This setup has around 27 utilized chains, with around 35 entries each, and yield-waits more than 1 million times per second (very high contention; see next section). I have let this run for hours searching for any lingering issues. I've also run cache_bench under ASAN, UBSAN, and TSAN. ## Essentially wait free There is a counter for number of yield() calls when one thread is waiting on another. When we pre-populate the structure in a single thread, ``` ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -populate_cache=1 -ops_per_thread=200000 2>&1 | grep Yield ``` We see something on the order of 1 yield call per second across 16 threads, even when we load the system other other jobs (parallel compilation). With -populate_cache=0, there are more yield opportunities with parallel table growth. On an otherwise unloaded system, we still see very small (single digit) yield counts, with a chance of getting into the thousands, and getting into 10s of thousands per second during table growth phase if the system is loaded with other jobs. However, I am not worried about this if performance is still good (see next section). ## Overall performance Although cache_bench initially suggested performance very close to FixedHCC, there was a very noticeable performance hit under a db_bench setup like used in validating https://github.com/facebook/rocksdb/issues/10626. Much of the difference has been reduced by optimizing Lookup with a "naive" pass that will almost always find entries quickly, and only falling back to the careful Lookup algorithm when not found in the first pass. Setups (chosen to be sensitive to block cache performance), and compiled with USE_CLANG=1 JEMALLOC=1 PORTABLE=0 DEBUG_LEVEL=0: ``` TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 ``` ### No regression on FixedHCC Running before & after builds at the same time on a 48 core machine. ``` TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=readrandom[-X10],block_cache_entry_stats,cache_report_problems -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=24 -cache_type=fixed_hyper_clock_cache -seed=1234 ``` Before: readrandom [AVG 10 runs] : 847234 (± 8150) ops/sec; 59.2 (± 0.6) MB/sec 703MB max RSS After: readrandom [AVG 10 runs] : 851021 (± 7929) ops/sec; 59.5 (± 0.6) MB/sec 706MB max RSS Probably no material difference. ### Single-threaded performance Using `[-X2]` and `-threads=1` and `-duration=30`, running all three at the same time: lru_cache: 55100 ops/sec, then 55862 ops/sec (627MB max RSS) fixed_hyper_clock_cache: 60496 ops/sec, then 61231 ops/sec (626MB max RSS) auto_hyper_clock_cache: 47560 ops/sec, then 56081 ops/sec (626MB max RSS) So AutoHCC has more ramp-up cost in the first pass as the cache grows to the appropriate size. (In single-threaded operation, the parallelizability and per-op low latency of table growth is overall slower.) However, once up to size, its performance is comparable to LRUCache. FixedHCC's lean operations still win overall when a good estimate is available. If we look at HCC table stats, we can see that this configuration is not favorable to AutoHCC (and I have verified that other memory sizes do not yield substantially different results, until shards are under-sized for the full filters): FixedHCC: Slot occupancy stats: Overall 47% (124991/262144), Min/Max/Window = 28%/64%/500, MaxRun{Pos/Neg} = 17/22 AutoHCC: Slot occupancy stats: Overall 59% (125781/209682), Min/Max/Window = 43%/82%/500, MaxRun{Pos/Neg} = 76/16 Head occupancy stats: Overall 43% (92259/209682), Min/Max/Window = 24%/74%/500, MaxRun{Pos/Neg} = 19/26 Entries at home count: 53350 FixedHCC configuration is relatively good for speed, and not ideal for space utilization. As is typical, AutoHCC has tighter control on metadata usage (209682 x 64 bytes rather than 262144 x 64 bytes), and the higher load factor is slightly worse for speed. LRUCache also has more metadata usage, at 199680 x 96 bytes of tracked metadata (plus roughly another 10% of that untracked in the head pointers), and that metadata is subject to fragmentation. ### Parallel performance, high hit rate Now using `[-X10]` and `-threads=10`, all three at the same time lru_cache: [AVG 10 runs] : 263629 (± 1425) ops/sec; 18.4 (± 0.1) MB/sec 655MB max RSS, 97.1% cache hit rate fixed_hyper_clock_cache: [AVG 10 runs] : 479590 (± 8114) ops/sec; 33.5 (± 0.6) MB/sec 651MB max RSS, 97.1% cache hit rate auto_hyper_clock_cache: [AVG 10 runs] : 418687 (± 5915) ops/sec; 29.3 (± 0.4) MB/sec 657MB max RSS, 97.1% cache hit rate Even with just 10-way parallelism for each cache (though 30+/48 cores busy overall), LRUCache is already showing performance degradation, while AutoHCC is in the neighborhood of FixedHCC. And that brings us to the question of how AutoHCC holds up under extreme parallelism, so now independent runs with `-threads=100` (overloading 48 cores). lru_cache: 438613 ops/sec, 827MB max RSS fixed_hyper_clock_cache: 1651310 ops/sec, 812MB max RSS auto_hyper_clock_cache: 1505875 ops/sec, 821MB max RSS (Yield count: 1089 over 30s) Clearly, AutoHCC holds up extremely well under extreme parallelism, even closing some of the modest performance gap with FixedHCC. ### Parallel performance, low hit rate To get down to roughly 50% cache hit rate, we use `-cache_index_and_filter_blocks=0 -cache_size=1650000000` with `-threads=10`. Here the extra cost of running counting clock eviction, especially on the chains of AutoHCC, are evident, especially with the lower contention of cache_index_and_filter_blocks=0: lru_cache: 725231 ops/sec, 1770MB max RSS, 51.3% hit rate fixed_hyper_clock_cache: 638620 ops/sec, 1765MB max RSS, 50.2% hit rate auto_hyper_clock_cache: 541018 ops/sec, 1777MB max RSS, 50.8% hit rate Reviewed By: jowlyzhang Differential Revision: D48784755 Pulled By: pdillinger fbshipit-source-id: e79813dc087474ac427637dd282a14fa3011a6e4
-
由 Changyu Bi 提交于
Summary: This should only affect iterator when - user uses DeleteRange(), - An iterator from level L has a non-ok status (such non-ok status may not be caught before the bug fix in https://github.com/facebook/rocksdb/pull/11783), and - A range tombstone covers a key from level > L and triggers a reseek sets the status_ to OK in SeekImpl()/SeekPrevImpl() e.g. https://github.com/facebook/rocksdb/blob/bd6a8340c3a2db764620e90b3ac5be173fc68a0c/table/merging_iterator.cc#L801 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11786 Differential Revision: D48908830 Pulled By: cbi42 fbshipit-source-id: eb564be375af4e33dc27542eff753260186e6d5d
-
由 Changyu Bi 提交于
Summary: This happens in (Compaction)MergingIterator layer, and can cause data loss during compaction or read/scan return incorrect result Pull Request resolved: https://github.com/facebook/rocksdb/pull/11782 Reviewed By: ajkr Differential Revision: D48880575 Pulled By: cbi42 fbshipit-source-id: 2294ad284a6d653d3674bebe55380f12ee4b645b
-
- 01 9月, 2023 1 次提交
-
-
由 Jay Huh 提交于
Summary: As mentioned in https://github.com/facebook/rocksdb/issues/11754 , refactor to clean up some nearly identical logic. This PR changes the existing debugging string format of Scan command as the following. ``` ❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan --hex ``` Before ``` 0x6669727374 : :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172 0x7365636F6E64 : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572 0x7468697264 : 0x62617A ``` After ``` 0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172 0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572 0x7468697264 ==> 0x62617A ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11777 Test Plan: ``` ❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ dump first ==> :hello attr_name1:foo attr_name2:bar second ==> attr_one:two attr_three:four third ==> baz Keys in range: 3 ❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan first ==> :hello attr_name1:foo attr_name2:bar second ==> attr_one:two attr_three:four third ==> baz ❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ dump --hex 0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172 0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572 0x7468697264 ==> 0x62617A Keys in range: 3 ❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan --hex 0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172 0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572 0x7468697264 ==> 0x62617A ``` Reviewed By: jowlyzhang Differential Revision: D48876755 Pulled By: jaykorean fbshipit-source-id: b1c608a810fe038999ac528b690d398abf5f21d7
-
- 31 8月, 2023 6 次提交
-
-
由 Peter Dillinger 提交于
Summary: ... in info_log. Becoming more important with disaggregated storage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11776 Test Plan: manual Reviewed By: jaykorean Differential Revision: D48849471 Pulled By: pdillinger fbshipit-source-id: 9a8fd8b2564a4f133526ecd7c1414cb667e4ba54
-
由 Hui Xiao 提交于
Summary: **Context/Summary:** After https://github.com/facebook/rocksdb/pull/11631, we rely on `compaction_readahead_size` for how much to read ahead for compaction read under non-direct IO case. https://github.com/facebook/rocksdb/pull/11658 therefore also sanitized 0 `compaction_readahead_size` to 2MB under non-direct IO, which is consistent with the existing sanitization with direct IO. However, this makes disabling compaction readahead impossible as well as add one more scenario to the inconsistent effects between `Options.compaction_readahead_size=0` during DB open and `SetDBOptions("compaction_readahead_size", "0")` . - `SetDBOptions("compaction_readahead_size", "0")` will disable compaction readahead as its logic never goes through sanitization above while `Options.compaction_readahead_size=0` will go through sanitization. Therefore we decided to do this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11762 Test Plan: Modified existing UTs to cover this PR Reviewed By: ajkr Differential Revision: D48759560 Pulled By: hx235 fbshipit-source-id: b3f85e58bda362a6fa1dc26bd8a87aa0e171af79
-
由 Yu Zhang 提交于
Summary: For a SST file that uses user-defined timestamp aware comparators, if a lower or upper bound is set, sst_dump tool doesn't handle it well. This PR adds support for that. While working on this `MaybeAddTimestampsToRange` is moved to the udt_util.h file to be shared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11757 Test Plan: make all check for changes in db_impl.cc and db_impl_compaction_flush.cc for changes in sst_file_dumper.cc, I manually tested this change handles specifying bounds for UDT use cases. It probably should have a unit test file eventually. Reviewed By: ltamasi Differential Revision: D48668048 Pulled By: jowlyzhang fbshipit-source-id: 1560465f40e44668d6d82a7439fe9012be0e74a8
-
由 Jay Huh 提交于
Summary: wide_columns can now be pretty-printed in the following commands - `./ldb dump_wal` - `./ldb dump` - `./ldb idump` - `./ldb dump_live_files` - `./ldb scan` - `./sst_dump --command=scan` There are opportunities to refactor to reduce some nearly identical code. This PR is initial change to add wide column support in `ldb` and `sst_dump` tool. More PRs to come for the refactor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11754 Test Plan: **New Tests added** - `WideColumnsHelperTest::DumpWideColumns` - `WideColumnsHelperTest::DumpSliceAsWideColumns` **Changes added to existing tests** - `ExternalSSTFileTest::BasicMixed` added to cover mixed case (This test should have been added in https://github.com/facebook/rocksdb/issues/11688). This test does not verify the ldb or sst_dump output. This test was used to create test SST files having some rows with wide columns and some without and the generated SST files were used to manually test sst_dump_tool. - `createSST()` in `sst_dump_test` now takes `wide_column_one_in` to add wide column value in SST **dump_wal** ``` ./ldb dump_wal --walfile=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/000004.log --print_value --header ``` ``` Sequence,Count,ByteSize,Physical Offset,Key(s) : value 1,1,59,0,PUT_ENTITY(0) : 0x:0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172 2,1,34,42,PUT_ENTITY(0) : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572 3,1,17,7d,PUT(0) : 0x7468697264 : 0x62617A ``` **idump** ``` ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ idump ``` ``` 'first' seq:1, type:22 => :hello attr_name1:foo attr_name2:bar 'second' seq:2, type:22 => attr_one:two attr_three:four 'third' seq:3, type:1 => baz Internal keys in range: 3 ``` **SST Dump from dump_live_files** ``` ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ compact ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump_live_files ``` ``` ... ============================== SST Files ============================== /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst level:1 ------------------------------ Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst Sst file format: block-based 'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar 'second' seq:0, type:22 => attr_one:two attr_three:four 'third' seq:0, type:1 => baz ... ``` **dump** ``` ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump ``` ``` first ==> :hello attr_name1:foo attr_name2:bar second ==> attr_one:two attr_three:four third ==> baz Keys in range: 3 ``` **scan** ``` ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ scan ``` ``` first : :hello attr_name1:foo attr_name2:bar second : attr_one:two attr_three:four third : baz ``` **sst_dump** ``` ./sst_dump --file=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst --command=scan ``` ``` options.env is 0x7ff54b296000 Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst Sst file format: block-based from [] to [] 'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar 'second' seq:0, type:22 => attr_one:two attr_three:four 'third' seq:0, type:1 => baz ``` Reviewed By: ltamasi Differential Revision: D48837999 Pulled By: jaykorean fbshipit-source-id: b0280f0589d2b9716bb9b50530ffcabb397d140f
-
由 Hui Xiao 提交于
Summary: …n change (https://github.com/facebook/rocksdb/issues/11755)" This reverts commit 45131659. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11773 Reviewed By: ajkr Differential Revision: D48832320 Pulled By: hx235 fbshipit-source-id: 96cef26a885134360766a83505f6717598eac6a9
-
由 Yu Zhang 提交于
Summary: This PR adds a missing piece for the UDT in memtable only feature, which is to automatically increase `full_history_ts_low` when flush happens during recovery. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11774 Test Plan: Added unit test make all check Reviewed By: ltamasi Differential Revision: D48799109 Pulled By: jowlyzhang fbshipit-source-id: fd681ed66d9d40904ca2c919b2618eb692686035
-
- 30 8月, 2023 4 次提交
-
-
由 jsteemann 提交于
Summary: the value of `done` is always false here, so the sub-condition `!done` will always be true and the check can be removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11746 Reviewed By: anand1976 Differential Revision: D48656845 Pulled By: ajkr fbshipit-source-id: 523ba3d07b3af7880c8c8ccb20442fd7c0f49417
-
由 Andrew Kryczka 提交于
Summary: Having a synthetic implementation of `TimedWait()` in `SystemClock` will allow us to add `SyncPoint`s while mutex is released, which was previously impossible since the lock was released and reacquired all within `pthread_cond_timedwait()`. Additionally, integrating `TimedWait()` with `MockSystemClock` allows us to cleanup some workarounds in the test code. In this PR I only cleaned up the `GenericRateLimiter` test workaround. This is related to the intended follow-up mentioned in https://github.com/facebook/rocksdb/issues/7101's description. There are a couple differences: (1) This PR does not include removing the particular workaround that initially motivated it. Actually, the `Timer` class uses `InstrumentedCondVar`, so the interface introduced here is inadequate to remove that workaround. On the bright side, the interface introduced in this PR can be changed as needed since it can neither be used nor extended externally, due to using forward-declared `port::CondVar*` in the interface. (2) This PR only makes the change in `SystemClock` not `Env`. Older revisions of this PR included `Env::TimedWait()` and `SpecialEnv::TimedWait()`; however, since they were unused it probably makes sense to defer adding them until when they are needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11753 Reviewed By: pdillinger Differential Revision: D48654995 Pulled By: ajkr fbshipit-source-id: 15e19f2454b64d4ec7f50e328691c66ca9911122
-
由 jsteemann 提交于
Summary: when a key is recorded for locking in a pessimistic transaction, the key is first looked up in a map, and then inserted into the map if it was not already contained. this can be simplified to an unconditional insert. in the ideal case that all keys are unique, this saves all the find() operations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11743 Reviewed By: anand1976 Differential Revision: D48656798 Pulled By: ajkr fbshipit-source-id: d0150de2db757e0c05e1797cfc24380790c71276
-
由 Yu Zhang 提交于
Summary: The user-defined timestamps feature only enforces that for the same key, user-defined timestamps should be non-decreasing. For the user-defined timestamps in memtable only feature, during flush, we check the user-defined timestamps in each memtable to examine if the data is considered expired with regard to `full_history_ts_low`. In this process, it's assuming that a newer memtable should not have smaller user-defined timestamps than an older memtable. This check however is enforcing ordering of user-defined timestamps across keys, as apposed to the vanilla UDT feature, that only enforce ordering of user-defined timestamps for the same key. This more strict user-defined timestamp ordering requirement could be an issue for secondary instances where commits can be out of order. And after thinking more about it, this requirement is really an overkill to keep the invariants of `full_history_ts_low` which are: 1) users cannot read below `full_history_ts_low` 2) users cannot write at or below `full_history_ts_low` 3) `full_history_ts_low` can only be increasing As long as RocksDB enforces these 3 checks, we can prohibit inconsistent read that returns a different value. And these three checks are covered in existing APIs. So this PR removes the extra checks in the UDT in memtable only feature that requires user-defined timestamps to be non decreasing across keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11732 Reviewed By: ltamasi Differential Revision: D48541466 Pulled By: jowlyzhang fbshipit-source-id: 95453c6e391cbd511c0feab05f0b11c312d17186
-
- 29 8月, 2023 2 次提交
-
-
由 akankshamahajan 提交于
Summary: Fix seg fault in auto_readahead_size with async_io when readahead_size = 0. If readahead_size is trimmed and is 0, it's not eligible for further prefetching and should return. Error occured when the first buffer already contains data and it goes for prefetching in second buffer leading to assertion failure - `assert(roundup_len1 >= alignment); ` because roundup_len1 = length + readahead_size. length is 0 and readahead_size is also 0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11769 Test Plan: Reproducible with db_stress with async_io enabled. Reviewed By: anand1976 Differential Revision: D48743031 Pulled By: akankshamahajan15 fbshipit-source-id: 0e08c41f862f6287ca223fbfaf6cd42fc97b3c87
-
由 Andrew Kryczka 提交于
Summary: Fixes https://github.com/facebook/rocksdb/issues/11742 Even after performing duty (1) ("Waiting for the next refill time"), it is possible the remaining threads are all in `Wait()`. Waking up at least one thread is enough to ensure progress continues, even if no new requests arrive. The repro unit test (https://github.com/facebook/rocksdb/commit/bb54245e6) is not included as it depends on an unlanded PR (https://github.com/facebook/rocksdb/issues/11753) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11763 Reviewed By: jaykorean Differential Revision: D48710130 Pulled By: ajkr fbshipit-source-id: 9d166bd577ea3a96ccd81dde85871fec5e85a4eb
-
- 26 8月, 2023 3 次提交
-
-
由 Jan 提交于
Summary: `VersionBuilderMap` type alias definition seem unused. If this PR can be compiled fine then the alias is probably not needed anymore. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11286 Reviewed By: jaykorean Differential Revision: D48656747 Pulled By: ajkr fbshipit-source-id: ac8554922aead7dc3d24fe7e6544a4622578c514
-
由 Richard Barnes 提交于
Del `(object)` from 200 inc instagram-server/distillery/slipstream/thrift_models/StoryFeedMediaSticker/ttypes.py Summary: Python3 makes the use of `(object)` in class inheritance unnecessary. Let's modernize our code by eliminating this. Reviewed By: itamaro Differential Revision: D48673915 fbshipit-source-id: a1a6ae8572271eb2898b748c8216ea68e362f06a
-
由 akankshamahajan 提交于
Summary: Fix seg fault in auto_readahead_size ``` db_stress: internal_repo_rocksdb/repo/table/block_based/partitioned_index_iterator.h:70: virtual rocksdb::IndexValue rocksdb::PartitionedIndexIterator::value() const: Assertion `Valid()' failed. ``` During seek, after calculating readahead_size, db_stress can inject IOError resulting in failure to index_iter_->Seek and making index_iter_ invalid. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11761 Test Plan: Reproducible locally and passed with this fix Reviewed By: anand1976 Differential Revision: D48696248 Pulled By: akankshamahajan15 fbshipit-source-id: 2be43bf56ad0fc2f95f9093c19c9a1b15a716091
-
- 25 8月, 2023 3 次提交
-
-
由 Peter Dillinger 提交于
Summary: * Add some options to cache_bench to use JemallocNodumpAllocator * Make num_shard_bits option use and report cache-specific defaults * Add a usleep option to sleep between operations, for simulating a workload with more CPU idle/wait time. * Use const& for JemallocAllocatorOptions, to improve API usability (e.g. can bind to temporary `{}`) * InstallStackTraceHandler() Pull Request resolved: https://github.com/facebook/rocksdb/pull/11758 Test Plan: manual Reviewed By: jowlyzhang Differential Revision: D48668479 Pulled By: pdillinger fbshipit-source-id: b6032fbe09444cdb8f1443a5e017d2eea4f6205a
-
由 Akanksha Mahajan 提交于
Summary: Same as title Pull Request resolved: https://github.com/facebook/rocksdb/pull/11729 Test Plan: make crash_test -j32 Reviewed By: anand1976 Differential Revision: D48534820 Pulled By: akankshamahajan15 fbshipit-source-id: 3a2a28af98dfad164b82ddaaf9fddb94c53a652e
-
由 Hui Xiao 提交于
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11755 Reviewed By: anand1976 Differential Revision: D48656627 Pulled By: hx235 fbshipit-source-id: 568fa7749cbf6ecf65102b4513fa3af975fd91b8
-
- 24 8月, 2023 2 次提交
-
-
由 Fuat Basik 提交于
Summary: In blackbox tests, db_stress command always run with timeout. Timeout can happen during validation, leaving some of the keys not checked. Since key validation is done in order, it is quite likely that keys those are towards to the end of the set are never validated. This PR adds a final execution, without timeout, to ensure validation is executed for all keys, at least once. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11592 Reviewed By: cbi42 Differential Revision: D48003998 Pulled By: hx235 fbshipit-source-id: 72543475a932f12cf0f57534b7e3b6e07e87080f
-
由 Hui Xiao 提交于
Summary: **Context/Summary:** Same intention as https://github.com/facebook/rocksdb/pull/2693 - basically we now pick from the last sorted run and expand forward till we can't Pull Request resolved: https://github.com/facebook/rocksdb/pull/11740 Test Plan: Existing UT Stress test Reviewed By: ajkr Differential Revision: D48586475 Pulled By: hx235 fbshipit-source-id: 3eb3c3ee1d5f7e0b0d6d649baaeb8c6990fee398
-
- 23 8月, 2023 3 次提交
-
-
由 Alexander Bulimov 提交于
Summary: Add a bunch of C API functions to expose new `WaitForCompact` function and related options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11737 Test Plan: unit tests Reviewed By: jaykorean Differential Revision: D48568239 Pulled By: abulimov fbshipit-source-id: 1ff35972d7abacd7e1e17fe2ada1e20cdc88d8de
-
由 chuhao zeng 提交于
Summary: Fix https://github.com/facebook/rocksdb/issues/6470 Ensure TransactionLogIter being initialized correctly with SYNC_POINT API when calling `GetSortedWALFiles`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11725 Reviewed By: akankshamahajan15 Differential Revision: D48529411 Pulled By: ajkr fbshipit-source-id: 970ca1a6259ed996c6d87f7fcd40f95acf441517
-
由 Changyu Bi 提交于
Summary: Currently the stress test does not support restoring expected state (to a specific sequence number) when there is unsynced data loss during the reopen phase. This causes a few internal stress test failure with errors like inconsistent value. This PR disables dropping unsynced data during reopen to avoid failures due to this issue. We can re-enable later after we decide to support unsynced data loss during DB reopen in stress test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11731 Test Plan: * Running this test a few times can fail for inconsistent value before this change ``` ./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=20.57166126835524 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --enable_thread_tracking=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=6 --index_type=3 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --long_running_snapshots=1 --manual_wal_flush_one_in=1000000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=5 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=20 --ribbon_starting_level=0 --secondary_cache_fault_one_in=32 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=3 --sync=0 --sync_fault_injection=1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35``` Reviewed By: hx235 Differential Revision: D48537494 Pulled By: cbi42 fbshipit-source-id: ddae21b9bb6ee8d67229121f58513e95f7ef6d8d
-
- 22 8月, 2023 5 次提交
-
-
由 Yu Zhang 提交于
Summary: For some ldb commands that doesn't need to open the DB, it's still useful to use the DB's existing OPTIONS file if it's available. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11721 Reviewed By: pdillinger Differential Revision: D48485540 Pulled By: jowlyzhang fbshipit-source-id: 2d2db837523044066f1a2c4b59a5c03f6cd35e6b
-
由 anand76 提交于
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11728 Reviewed By: jaykorean, jowlyzhang Differential Revision: D48527100 Pulled By: anand1976 fbshipit-source-id: c48baa44e538fb6bfd3fe7f19046746d3540763f
-
由 Jay Huh 提交于
Summary: As the new API to wait for compaction is available (https://github.com/facebook/rocksdb/issues/11436), we can now replace the existing logic of waiting in db_bench_tool with the new API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11727 Test Plan: ``` ./db_bench --benchmarks="fillrandom,compactall,waitforcompaction,readrandom" ``` **Before change** ``` Set seed to 1692635571470041 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags Integrated BlobDB: blob cache disabled RocksDB: version 8.6.0 Date: Mon Aug 21 09:33:40 2023 CPU: 80 * Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz CPUCache: 28160 KB Keys: 16 bytes each (+ 0 bytes user-defined timestamp) Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 WARNING: Optimization is disabled: benchmarks unnecessarily slow WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags Integrated BlobDB: blob cache disabled DB path: [/tmp/rocksdbtest-226125/dbbench] fillrandom : 51.826 micros/op 19295 ops/sec 51.826 seconds 1000000 operations; 2.1 MB/s waitforcompaction(/tmp/rocksdbtest-226125/dbbench): started waitforcompaction(/tmp/rocksdbtest-226125/dbbench): finished waitforcompaction(/tmp/rocksdbtest-226125/dbbench): started waitforcompaction(/tmp/rocksdbtest-226125/dbbench): finished DB path: [/tmp/rocksdbtest-226125/dbbench] readrandom : 39.042 micros/op 25613 ops/sec 39.042 seconds 1000000 operations; 1.8 MB/s (632886 of 1000000 found) ``` **After change** ``` Set seed to 1692636574431745 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags Integrated BlobDB: blob cache disabled RocksDB: version 8.6.0 Date: Mon Aug 21 09:49:34 2023 CPU: 80 * Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz CPUCache: 28160 KB Keys: 16 bytes each (+ 0 bytes user-defined timestamp) Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 WARNING: Optimization is disabled: benchmarks unnecessarily slow WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags Integrated BlobDB: blob cache disabled DB path: [/tmp/rocksdbtest-226125/dbbench] fillrandom : 51.271 micros/op 19504 ops/sec 51.271 seconds 1000000 operations; 2.2 MB/s waitforcompaction(/tmp/rocksdbtest-226125/dbbench): started waitforcompaction(/tmp/rocksdbtest-226125/dbbench): finished with status (OK) DB path: [/tmp/rocksdbtest-226125/dbbench] readrandom : 39.264 micros/op 25468 ops/sec 39.264 seconds 1000000 operations; 1.8 MB/s (632921 of 1000000 found) ``` Reviewed By: ajkr Differential Revision: D48524667 Pulled By: jaykorean fbshipit-source-id: 1052a15b2ed79a35165ec4d9998d0454b2552ef4
-
由 Yu Zhang 提交于
Summary: This piggy back the existing last level file temperature statistics test to test the default temperature becoming effective. While adding this unit test, I found that the approach to swap out and use default temperature in `VersionBuilder::LoadTableHandlers` will miss the L0 files created from flush, and only work for existing SST files, SST files created by compaction. So this PR moves that logic to `TableCache::GetTableReader`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11722 Test Plan: ``` ./db_test2 --gtest_filter="*LastLevelStatistics*" make all check ``` Reviewed By: pdillinger Differential Revision: D48489171 Pulled By: jowlyzhang fbshipit-source-id: ac29f7d484916f3218729594c5bb35c4f2979ac2
-
由 Levi Tamasi 提交于
Summary: [draft] this PR is created in order to test CI changes Closes: https://github.com/facebook/rocksdb/pull/11543 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11633 Reviewed By: akankshamahajan15 Differential Revision: D48525552 Pulled By: cbi42 fbshipit-source-id: 758d57f248304213228af459789459cc2f0bf419
-