1. Oct 19, 2022 (1 commit)
    • Enable a multi-level db to smoothly migrate to FIFO via DB::Open (#10348) · e267909e
      Authored by Yueh-Hsuan Chiang
      Summary:
      In principle, FIFO compaction should be able to open a DB previously
      written with any compaction style. However, the current code only allows
      FIFO compaction to open a DB that has a single level.
      
      This PR relaxes the limitation of FIFO compaction and allows it to open a
      DB with multiple levels.  Below is the read / write / compaction behavior:
      
      * The read behavior is untouched, and it works like a regular rocksdb instance.
      * The write behavior is untouched as well.  When a FIFO compacted DB
      is opened with multiple levels, all new files will still be in level 0, and no files
      will be moved to a different level.
      * Compaction logic is extended.  It will first identify the bottom-most non-empty level.
      Then, it will delete the oldest file in that level.
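The extended compaction logic above can be modeled with a small sketch (illustrative C++ only, not the RocksDB implementation; `FileModel`, `PickFifoDeletionLevel`, and friends are made-up names): find the bottom-most non-empty level, then drop its oldest file.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical model: each level holds files ordered oldest-first. FIFO
// compaction on a multi-level DB deletes the oldest file in the
// bottom-most non-empty level.
struct FileModel {
  uint64_t file_number;  // smaller number == older file
};

using LevelsModel = std::vector<std::deque<FileModel>>;

// Returns the level whose oldest file should be deleted next, or -1 if
// the DB is empty. Mirrors the "bottom-most non-empty level" rule above.
int PickFifoDeletionLevel(const LevelsModel& levels) {
  for (int level = static_cast<int>(levels.size()) - 1; level >= 0; --level) {
    if (!levels[level].empty()) return level;
  }
  return -1;
}

// Deletes the oldest file per the FIFO rule; returns true if a file was
// deleted.
bool FifoDeleteOldest(LevelsModel& levels) {
  int level = PickFifoDeletionLevel(levels);
  if (level < 0) return false;
  levels[level].pop_front();  // the oldest file sits at the front
  return true;
}
```

Note that new writes still land in level 0 (per the write behavior above), so over time the lower levels drain out and the DB converges to the single-level shape FIFO normally uses.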
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348
      
      Test Plan:
      Added a new test to verify the migration from level to FIFO where the db has multiple levels.
      Extended existing test cases in db_test and db_basic_test to also verify
      all entries of a key after reopening the DB with FIFO compaction.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D40233744
      
      fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1
  2. Oct 18, 2022 (3 commits)
    • Print stack traces on frozen tests in CI (#10828) · e466173d
      Authored by Peter Dillinger
      Summary:
      Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.
      
      For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)
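The Yama work-around mentioned above can be sketched as follows (Linux-only, illustrative; the real `InstallStackTraceHandler()` does more than this). With `/proc/sys/kernel/yama/ptrace_scope == 1`, a process must opt in before a non-parent debugger can attach; `PR_SET_PTRACER_ANY` allows any process of the same user to attach.

```cpp
#include <cassert>
#include <sys/prctl.h>

// Illustrative helper: opt in to being debugged by any same-user
// process, working around Yama's ptrace_scope restriction. Returns
// true on success (prctl may fail, e.g. on kernels without Yama).
bool AllowSameUserDebuggers() {
  return prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, 0, 0, 0) == 0;
}
```

A test binary would call something like this early in `main()`, so that a watchdog running `pstack`/`gdb` against a frozen process is not blocked by the kernel.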
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828
      
      Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI
      
      Reviewed By: hx235
      
      Differential Revision: D40447634
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
    • Improve / refactor anonymous mmap capabilities (#10810) · 8367f0d2
      Authored by Peter Dillinger
      Summary:
      The motivation for this change is a planned feature (related to HyperClockCache) that will depend on a large array that can essentially grow automatically, up to some bound, without the pointer address changing and with guaranteed zero-initialization of the data. Anonymous mmaps provide such functionality, and this change provides an internal API for that.
      
      The other existing use of anonymous mmap in RocksDB is for allocating in huge pages. That code and other related Arena code used some awkward non-RAII and pre-C++11 idioms, so I cleaned up much of that as well, with RAII, move semantics, constexpr, etc.
      
      More specifics:
      * Minimize conditional compilation
      * Add Windows support for anonymous mmaps
      * Use std::deque instead of std::vector for more efficient bag
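A minimal sketch of the kind of RAII anonymous-mmap owner described above (POSIX-only, illustrative; `AnonMmap` is a made-up name, not the class added by this PR). The kernel guarantees the pages are zero-initialized, and the base address never changes for the lifetime of the mapping:

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Illustrative RAII owner for an anonymous mmap: zero-initialized
// memory at a stable address, released in the destructor.
class AnonMmap {
 public:
  explicit AnonMmap(size_t length) : length_(length) {
    addr_ = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, /*fd=*/-1, /*offset=*/0);
    if (addr_ == MAP_FAILED) addr_ = nullptr;
  }
  // Move-only, matching the RAII / move-semantics clean-up described above.
  AnonMmap(AnonMmap&& other) noexcept
      : addr_(other.addr_), length_(other.length_) {
    other.addr_ = nullptr;
    other.length_ = 0;
  }
  AnonMmap(const AnonMmap&) = delete;
  AnonMmap& operator=(const AnonMmap&) = delete;
  ~AnonMmap() {
    if (addr_ != nullptr) munmap(addr_, length_);
  }
  void* addr() const { return addr_; }
  size_t length() const { return length_; }

 private:
  void* addr_ = nullptr;
  size_t length_ = 0;
};
```

Because the mapping is reserved up front but pages are only materialized on first touch, such an array can "grow automatically, up to some bound" without the pointer moving, which is the property the planned HyperClockCache feature depends on.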
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10810
      
      Test Plan: unit test added for new functionality
      
      Reviewed By: riversand963
      
      Differential Revision: D40347204
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ca83fcc47e50fabf7595069380edd2954f4f879c
    • Do not adjust test_batches_snapshots to avoid mixing runs (#10830) · 11c0d131
      Authored by Levi Tamasi
      Summary:
      This is a small follow-up to https://github.com/facebook/rocksdb/pull/10821. The goal of that PR was to hold `test_batches_snapshots` fixed across all `db_stress` invocations; however, that patch didn't address the case when `test_batches_snapshots` is unset due to a conflicting `enable_compaction_filter` or `prefix_size` setting. This PR updates the logic so the other parameter is sanitized instead in the case of such conflicts.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10830
      
      Reviewed By: riversand963
      
      Differential Revision: D40444548
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 0331265704904b729262adec37139292fcbb7805
  3. Oct 17, 2022 (2 commits)
  4. Oct 15, 2022 (1 commit)
  5. Oct 14, 2022 (4 commits)
  6. Oct 13, 2022 (2 commits)
    • Several small improvements (#10803) · 6ff0c204
      Authored by Mark Callaghan
      Summary:
      This has several small improvements.
      
      benchmark.sh
      * add BYTES_PER_SYNC as an env variable
      * use --prepopulate_block_cache when O_DIRECT is used
      * use --undefok to list options that don't work for all 7.x releases
      * print "failure" in report.tsv when a benchmark fails
      * parse the slightly different throughput line used by db_bench for multireadrandom
      * remove the trailing comma for BlobDB size before printing it in report.tsv
      * use the last line of the output from /bin/time as there can be more than one line when db_bench has a non-zero exit
      * fix more bash lint warnings
      * add ",stats" to the --benchmark=... lines to get stats at the end of each benchmark
      
      benchmark_compare.sh
      * run revrange immediately after fillseq to let compaction debt get removed
      * add --multiread_batched when --benchmarks=multireadrandom is used
      * use --benchmarks=overwriteandwait when supported to get a more accurate measure of write-amp
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10803
      
      Test Plan: Run it for leveled, universal and BlobDB
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D40278315
      
      Pulled By: mdcallag
      
      fbshipit-source-id: 793134ddc7d48d05a07436cd8942c375a23983a7
    • Check columns in CfConsistencyStressTest::VerifyDb (#10804) · 23b7dc2f
      Authored by Levi Tamasi
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10804
      
      Reviewed By: riversand963
      
      Differential Revision: D40279057
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 9efc3dae7f5eaab162d55a41c58c2535b0a53054
  7. Oct 12, 2022 (1 commit)
    • Consider wide columns when checksumming in the stress tests (#10788) · 85399b14
      Authored by Levi Tamasi
      Summary:
      There are two places in the stress test code where we compute the CRC
      for a range of KVs for the purposes of checking consistency, namely in the
      CF consistency test (to make sure CFs contain the same data), and when
      performing `CompactRange` (to make sure the pre- and post-compaction
      states are equivalent). The patch extends the logic so that wide columns
      are also considered in both cases.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10788
      
      Test Plan: Tested using some simple blackbox crash test runs.
      
      Reviewed By: riversand963
      
      Differential Revision: D40191134
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 542c21cac9077c6d225780deb210319bb5eee955
  8. Oct 11, 2022 (9 commits)
  9. Oct 08, 2022 (4 commits)
    • Add option `preserve_internal_time_seconds` to preserve the time info (#10747) · c401f285
      Authored by Jay Zhuang
      Summary:
      Add option `preserve_internal_time_seconds` to preserve the internal
      time information.
      It is mostly intended for migrating existing data to tiered storage
      (`preclude_last_level_data_seconds`). When the tiering feature is first
      enabled, the existing data has no time information to decide whether it
      is hot or cold. Enabling this option starts collecting and preserving
      the time information for newly written data.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747
      
      Reviewed By: siying
      
      Differential Revision: D39910141
      
      Pulled By: siying
      
      fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138
    • Blog post for asynchronous IO (#10789) · f366f90b
      Authored by anand76
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10789
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D40198988
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 5db74f12dd8854f6288fbbf8775c8e759778c307
    • Exclude timestamp when checking compaction boundaries (#10787) · 11943e8b
      Authored by Yanqin Jin
      Summary:
      When checking whether a range [start, end) overlaps with a compaction whose range is [start1, end1), always exclude the timestamp from start, end, start1, and end1; otherwise, some versions of a user key may be compacted to the bottommost level while other versions of the same key remain in the original level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10787
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D40187672
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 81226267fd3e33ffa79665c62abadf2ebec45496
    • Verify wide columns during prefix scan in stress tests (#10786) · 7af47c53
      Authored by Levi Tamasi
      Summary:
      The patch adds checks to the
      `{NonBatchedOps,BatchedOps,CfConsistency}StressTest::TestPrefixScan` methods
      to make sure the wide columns exposed by the iterators are as expected (based on
      the value base encoded into the iterator value). It also makes some code hygiene
      improvements in these methods.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10786
      
      Test Plan:
      Ran some simple blackbox tests in the various modes (non-batched, batched,
      CF consistency).
      
      Reviewed By: riversand963
      
      Differential Revision: D40163623
      
      Pulled By: riversand963
      
      fbshipit-source-id: 72f4c3b51063e48c15f974c4ec64d751d3ed0a83
  10. Oct 07, 2022 (4 commits)
    • Expand stress test coverage for min_write_buffer_number_to_merge (#10785) · 943247b7
      Authored by Yanqin Jin
      Summary:
      As title.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10785
      
      Test Plan: CI
      
      Reviewed By: ltamasi
      
      Differential Revision: D40162583
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 4e01f9b682f397130e286cf5d82190b7973fa3c1
    • Use `sstableKeyCompare()` for compaction output boundary check (#10763) · 23fa5b77
      Authored by Jay Zhuang
      Summary:
      To make it consistent with the compaction picker which uses the `sstableKeyCompare()` to pick the overlap files. For example, without this change, it may cut L1 files like:
      ```
       L1: [2-21]  [22-30]
       L2: [1-10] [21-30]
      ```
      Because "21" on L1 compares as smaller than "21" on L2. But for compaction purposes, these 2 files are overlapped.
      `sstableKeyCompare()` also takes range deletions into consideration, which may cut a file at the same user key.
      It also makes the `max_compaction_bytes` calculation more accurate for cases like the above, where the overlapped bytes were underestimated. Also make sure 2 versions of the same key won't be split into 2 files because of reaching `max_compaction_bytes`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10763
      
      Reviewed By: cbi42
      
      Differential Revision: D39971904
      
      Pulled By: cbi42
      
      fbshipit-source-id: bcc309e9c3dc61a8f50667a6f633e6132c0154a8
    • Verify columns in NonBatchedOpsStressTest::VerifyDb (#10783) · d6d8c007
      Authored by Levi Tamasi
      Summary:
      As the first step of covering the wide-column functionality of iterators
      in our stress tests, the patch adds verification logic to
      `NonBatchedOpsStressTest::VerifyDb` that checks whether the
      iterator's value and columns are in sync. Note: I plan to update the other
      types of stress tests and add similar verification for prefix scans etc.
      in separate PRs.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10783
      
      Test Plan: Ran some simple blackbox crash tests.
      
      Reviewed By: riversand963
      
      Differential Revision: D40152370
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8f9d17d7af5da58ccf1bd2057cab53cc9645ac35
    • Fix bug in HyperClockCache ApplyToEntries; cleanup (#10768) · b205c6d0
      Authored by Peter Dillinger
      Summary:
      We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.)
      
      This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768
      
      Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test)
      
      Reviewed By: anand1976
      
      Differential Revision: D40034747
      
      Pulled By: anand1976
      
      fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68
  11. Oct 06, 2022 (3 commits)
  12. Oct 05, 2022 (5 commits)
  13. Oct 04, 2022 (1 commit)
    • Some clean-up of secondary cache (#10730) · 5f4391dd
      Authored by Peter Dillinger
      Summary:
      This is intended as a step toward possibly separating secondary cache integration from the
      Cache implementation as much as possible, to (hopefully) minimize code duplication in
      adding secondary cache support to HyperClockCache.
      * Major clarifications to API docs of secondary cache compatible parts of Cache. For example, previously the docs seemed to suggest that Wait() was not needed if IsReady()==true. And it wasn't clear what operations were actually supported on pending handles.
      * Add some assertions related to these requirements, such as that we don't Release() before Wait() (which would leak a secondary cache handle).
      * Fix a leaky abstraction with dummy handles, which are supposed to be internal to the Cache. Previously, these just used value=nullptr to indicate dummy handle, which meant that they could be confused with legitimate value=nullptr cases like cache reservations. Also fixed blob_source_test which was relying on this leaky abstraction.
      * Drop "incomplete" terminology, which was another name for "pending".
      * Split handle flags into "mutable" ones requiring mutex and "immutable" ones which do not. Because of single-threaded access to pending handles, the "Is Pending" flag can be in the "immutable" set. This allows removal of a TSAN work-around and removing a mutex acquire-release in IsReady().
      * Remove some unnecessary handling of charges on handles of failed lookups. Keeping total_charge=0 means no special handling needed. (Removed one unnecessary mutex acquire/release.)
      * Simplify handling of dummy handle in Lookup(). There is no need to explicitly Ref & Release w/Erase if we generally overwrite the dummy anyway. (Removed one mutex acquire/release, a call to Release().)
      
      Intended follow-up:
      * Clarify APIs in secondary_cache.h
        * Doesn't SecondaryCacheResultHandle transfer ownership of the Value() on success (implementations should not release the value in destructor)?
        * Does Wait() need to be called if IsReady() == true? (This would be different from Cache.)
        * Do Value() and Size() have undefined behavior if IsReady() == false?
        * Why have a custom API for what is essentially a std::future<std::pair<void*, size_t>>?
      * Improve unit testing of standalone handle case
      * Apparent null `e` bug in `free_standalone_handle` case
      * Clean up secondary cache testing in lru_cache_test
        * Why does TestSecondaryCacheResultHandle hold on to a Cache::Handle?
        * Why does TestSecondaryCacheResultHandle::Wait() do nothing? Shouldn't it establish the post-condition IsReady() == true?
        * (Assuming that is sorted out...) Shouldn't TestSecondaryCache::WaitAll simply wait on each handle in order (no casting required)? How about making that the default implementation?
        * Why does TestSecondaryCacheResultHandle::Size() check Value() first? If the API is intended to be returning 0 before IsReady(), then that is weird but should at least be documented. Otherwise, if it's intended to be undefined behavior, we should assert IsReady().
      * Consider replacing "standalone" and "dummy" entries with a single kind of "weak" entry that deletes its value when it reaches zero refs. Suppose you are using compressed secondary cache and have two iterators at similar places. It will probably be common for one iterator to have standalone results pinned (out of cache) when the second iterator needs those same blocks and has to re-load them from secondary cache and duplicate the memory. Combining the dummy and the standalone should fix this.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10730
      
      Test Plan:
      existing tests (minor update), and crash test with sanitizers and secondary cache
      
      Performance test for any regressions in LRUCache (primary only):
      Create DB with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
      ```
      Test before & after (run at same time) with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X100] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=233000000 -duration 30 -threads=16
      ```
      Before: readrandom [AVG    100 runs] : 22234 (± 63) ops/sec;    1.6 (± 0.0) MB/sec
      After: readrandom [AVG    100 runs] : 22197 (± 64) ops/sec;    1.6 (± 0.0) MB/sec
      That's within 0.2%, which is not significant by the confidence intervals.
      
      Reviewed By: anand1976
      
      Differential Revision: D39826010
      
      Pulled By: anand1976
      
      fbshipit-source-id: 3202b4a91f673231c97648ae070e502ae16b0f44