1. 18 Mar 2023 (1 commit)
    • Ignore async_io ReadOption if FileSystem doesn't support it (#11296) · eac6b6d0
      Committed by anand76
      Summary:
      In PosixFileSystem, IO uring support is opt-in. If the support is not enabled by the user, then ignore the async_io ReadOption in MultiGet and iteration at the top level, rather than following the async_io codepath and transparently switching to sync IO at the FileSystem layer.
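The described gating can be sketched as follows. This is a simplified Python sketch, not RocksDB's actual C++ interfaces; `use_async_io` and the dict-based options are assumed names for illustration:

```python
# Hypothetical sketch of the top-level gating: if the FileSystem reports no
# async read support, drop the async_io flag up front instead of falling
# back to sync IO deep inside the FileSystem layer.
class FileSystem:
    def use_async_io(self) -> bool:
        # assumed capability probe, e.g. PosixFileSystem without io_uring
        return False

def effective_read_options(read_options: dict, fs: FileSystem) -> dict:
    opts = dict(read_options)
    if opts.get("async_io") and not fs.use_async_io():
        opts["async_io"] = False  # stay on the plain sync codepath
    return opts
```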
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11296
      
      Test Plan: Add new unit tests
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D44045776
      
      Pulled By: anand1976
      
      fbshipit-source-id: a0881bf763ca2fde50b84063d0068bb521edd8b9
  2. 03 Nov 2022 (1 commit)
    • Ran clang-format on db/ directory (#10910) · 5cf6ab6f
      Committed by Andrew Kryczka
      Summary:
      Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under db/ directory. Included minor manual touchups mentioned in PR commit history.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910
      
      Reviewed By: riversand963
      
      Differential Revision: D40880683
      
      Pulled By: ajkr
      
      fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
  3. 05 Oct 2022 (1 commit)
  4. 27 Sep 2022 (1 commit)
    • Fix segfault in Iterator::Refresh() (#10739) · df492791
      Committed by Changyu Bi
      Summary:
      When a new internal iterator is constructed during iterator refresh, the pointer to the previous memtable range tombstone iterator was not cleared. This could cause a segfault for future `Refresh()` calls when they try to free the memtable range tombstones. This PR fixes the issue.
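The bug pattern can be modeled with a minimal Python sketch (names are illustrative; in the real C++ code a stale pointer means a double free rather than an assertion):

```python
class TombstoneIter:
    def __init__(self):
        self.freed = False

    def free(self):
        assert not self.freed, "double free (would segfault in C++)"
        self.freed = True

class DBIter:
    def __init__(self):
        self.memtable_tombstone_iter = None

    def refresh(self, new_tombstone_iter):
        # free the tombstone iterator owned by the previous internal iterator
        if self.memtable_tombstone_iter is not None:
            self.memtable_tombstone_iter.free()
        # the fix: unconditionally overwrite the pointer; the buggy version
        # left the stale pointer in place when the new memtable had no range
        # tombstones, so the next refresh() freed the same iterator again
        self.memtable_tombstone_iter = new_tombstone_iter
```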
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10739
      
      Test Plan: Added a unit test in db_range_del_test.cc to reproduce this issue.
      
      Reviewed By: ajkr, riversand963
      
      Differential Revision: D39825283
      
      Pulled By: cbi42
      
      fbshipit-source-id: 3b59a2b73865aed39e28cdd5c1b57eed7991b94c
  5. 22 Sep 2022 (2 commits)
  6. 03 Sep 2022 (1 commit)
    • Skip swaths of range tombstone covered keys in merging iterator (2022 edition) (#10449) · 30bc495c
      Committed by Changyu Bi
      Summary:
      Delete range logic is moved from `DBIter` to `MergingIterator`, and `MergingIterator` will seek to the end of a range deletion if possible instead of scanning through each key and checking it against `RangeDelAggregator`.
      
      With the invariant that a key in level L (consider memtable as the first level, each immutable and L0 as a separate level) has a larger sequence number than all keys in any level >L, a range tombstone `[start, end)` from level L covers all keys in its range in any level >L. This property motivates optimizations in iterator:
      - in `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do Seek() on `end` instead of `target` to skip some range tombstone covered keys.
      - in `Next()/Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can do `Seek` to `end` for all levels > L.
      
      This PR implements the above optimizations in `MergingIterator`. As all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to https://github.com/facebook/rocksdb/issues/7317, but this PR leaves `InternalIterator` interface mostly unchanged. **Credit**: the cascading seek optimization and the sentinel key (discussed below) are inspired by [Pebble](https://github.com/cockroachdb/pebble/blob/master/merging_iter.go) and suggested by ajkr in https://github.com/facebook/rocksdb/issues/7317. The two optimizations are mostly implemented in `SeekImpl()/SeekForPrevImpl()` and `IsNextDeleted()/IsPrevDeleted()` in `merging_iterator.cc`. See comments for each method for more detail.
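The cascading seek idea can be sketched in a few lines of Python (illustrative only; the actual logic lives in `SeekImpl()`/`SeekForPrevImpl()` in merging_iterator.cc):

```python
# Sketch: given a seek target and at most one covering tombstone per level
# (ordered from newest to oldest), compute the effective seek key for each
# level. A tombstone [start, end) in level L lets all levels below L seek
# to `end` instead of the original target.
def cascading_seek_targets(target, tombstones_per_level):
    """tombstones_per_level[i] is an optional (start, end) tombstone in
    level i. Returns the effective seek key used at each level."""
    seek_key = target
    targets = []
    for tomb in tombstones_per_level:
        targets.append(seek_key)
        if tomb is not None:
            start, end = tomb
            if start <= seek_key < end:
                # keys < end in all lower levels are covered by this tombstone
                seek_key = max(seek_key, end)
    return targets
```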
      
      One notable change is that the minHeap/maxHeap used by `MergingIterator` now contains range tombstone end keys besides point key iterators. This helps to reduce the number of key comparisons. For example, for a range tombstone `[start, end)`, a `start` and an `end` `HeapItem` are inserted into the heap. When a `HeapItem` for a range tombstone start key is popped from the minHeap, we know this range tombstone becomes "active" in the sense that, until the range tombstone's end key is popped from the minHeap, all the keys popped from this heap are covered by the range tombstone's internal key range `[start, end)`.
      
      Another major change, *delete range sentinel key*, is made to `LevelIterator`. Before this PR, when all point keys in an SST file had been iterated through in `MergingIterator`, a level iterator would advance to the next SST file in its level. In the case when an SST file has a range tombstone that covers keys beyond the SST file's last point key, advancing to the next SST file would lose this range tombstone. Consequently, `MergingIterator` could return keys that should have been deleted by some range tombstone. We prevent this by pretending that file boundaries in each SST file are sentinel keys. A `LevelIterator` now only advances the file iterator once the sentinel key is processed.
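The sentinel-key behavior can be sketched with a simplified Python model (not the actual `LevelIterator`):

```python
# Sketch: a level iterator emits the file's largest boundary key as a
# sentinel after the file's point keys, so a range tombstone reaching the
# file boundary stays "active" in the merge until the whole file, not just
# its last point key, has been consumed.
class LevelIterator:
    def __init__(self, files):
        # files: list of (point_keys, largest_boundary_key)
        self.files = files
        self.file_idx = 0
        self.key_idx = 0

    def next_key(self):
        while self.file_idx < len(self.files):
            keys, boundary = self.files[self.file_idx]
            if self.key_idx < len(keys):
                self.key_idx += 1
                return keys[self.key_idx - 1]
            # emit the sentinel; only advance to the next file afterwards
            self.file_idx += 1
            self.key_idx = 0
            return ("sentinel", boundary)
        return None
```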
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10449
      
      Test Plan:
      - Added many unit tests in db_range_del_test
      - Stress test: `./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 -delpercent=10 --iterpercent=44 --delrangepercent=2`
      - An additional iterator stress test is added to verify iterators against expected state: https://github.com/facebook/rocksdb/issues/10538. This is based on ajkr's previous attempt https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913.
      
      ```
      python3 ./tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --max_background_compactions=8 --value_size_mult=33 --max_key=5000000 --interval=10 --duration=7200 --delrangepercent=3 --delpercent=9 --iterpercent=25 --writepercent=60 --readpercent=3 --prefixpercent=0 --num_iterations=1000 --range_deletion_width=100 --verify_iterator_with_expected_state_one_in=1
      ```
      
      - Performance benchmark: I used a similar setup as in the blog [post](http://rocksdb.org/blog/2018/11/21/delete-range.html) that introduced DeleteRange, "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written".  As expected, the performance with this PR depends on the range tombstone width.
      ```
      # Setup:
      TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000
      TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50
      
      # Scan entire DB
      TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true
      
      # Short range scan (10 Next())
      TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true
      
      # Long range scan(1000 Next())
      TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true
      ```
      Avg over 10 runs (some slower tests had fewer runs):
      
      For the first column (tombstone width), 0 means no range tombstone, 100-10000 means the width of each of the 10k range tombstones, and 1 means there is a single range tombstone in the entire DB (width 1000). The 1-tombstone case tests for regression when there are very few range tombstones in the DB, since the no-range-tombstone case likely takes a different code path than the cases with range tombstones.
      
      - Scan entire DB
      
      | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
      | ------------- | ------------- | ------------- |  ------------- |
      | 0 range tombstone    |2525600 (± 43564)    |2486917 (± 33698)    |-1.53%               |
      | 100   |1853835 (± 24736)    |2073884 (± 32176)    |+11.87%              |
      | 1000  |422415 (± 7466)      |1115801 (± 22781)    |+164.15%             |
      | 10000 |22384 (± 227)        |227919 (± 6647)      |+918.22%             |
      | 1 range tombstone      |2176540 (± 39050)    |2434954 (± 24563)    |+11.87%              |
      - Short range scan
      
      | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
      | ------------- | ------------- | ------------- |  ------------- |
      | 0  range tombstone   |35398 (± 533)        |35338 (± 569)        |-0.17%               |
      | 100   |28276 (± 664)        |31684 (± 331)        |+12.05%              |
      | 1000  |7637 (± 77)          |25422 (± 277)        |+232.88%             |
      | 10000 |1367                 |28667                |+1997.07%            |
      | 1 range tombstone      |32618 (± 581)        |32748 (± 506)        |+0.4%                |
      
      - Long range scan
      
      | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
      | ------------- | ------------- | ------------- |  ------------- |
      | 0 range tombstone     |2262 (± 33)          |2353 (± 20)          |+4.02%               |
      | 100   |1696 (± 26)          |1926 (± 18)          |+13.56%              |
      | 1000  |410 (± 6)            |1255 (± 29)          |+206.1%              |
      | 10000 |25                   |414                  |+1556.0%             |
      | 1 range tombstone   |1957 (± 30)          |2185 (± 44)          |+11.65%              |
      
      - Microbench does not show significant regression: https://gist.github.com/cbi42/59f280f85a59b678e7e5d8561e693b61
      
      Reviewed By: ajkr
      
      Differential Revision: D38450331
      
      Pulled By: cbi42
      
      fbshipit-source-id: b5ef12e8d8c289ed2e163ccdf277f5039b511fca
  7. 06 Aug 2022 (1 commit)
    • Fragment memtable range tombstone in the write path (#10380) · 9d77bf8f
      Committed by Changyu Bi
      Summary:
      - Right now each read fragments the memtable range tombstones (https://github.com/facebook/rocksdb/issues/4808). This PR explores the idea of fragmenting memtable range tombstones in the write path, so reads can just use the cached fragmented tombstones without any fragmenting cost. This PR only does the caching for immutable memtables, and does so right before a memtable is added to the immutable memtable list. The fragmentation is done without holding the mutex to minimize its performance impact.
      - db_bench is updated to print out the number of range deletions executed if there is any.
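A rough Python sketch of what fragmentation produces (illustrative; the real fragmented-tombstone structure in RocksDB is more involved):

```python
# Sketch: split possibly overlapping [start, end) @ seqno tombstones into
# non-overlapping fragments, each tagged with the newest covering seqno.
# Done once when the memtable becomes immutable, so every later read
# reuses the cached fragment list instead of re-fragmenting.
def fragment_tombstones(tombstones):
    """tombstones: list of (start, end, seqno). Returns sorted,
    non-overlapping fragments (start, end, max_seqno)."""
    points = sorted({p for s, e, _ in tombstones for p in (s, e)})
    fragments = []
    for lo, hi in zip(points, points[1:]):
        seqnos = [q for s, e, q in tombstones if s <= lo and hi <= e]
        if seqnos:
            fragments.append((lo, hi, max(seqnos)))
    return fragments
```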
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380
      
      Test Plan:
      - CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
      - Benchmark: as this PR only optimizes the immutable memtable path, the number of writes in the benchmark is chosen such that an immutable memtable is created and range tombstones are in that memtable.
      
      ```
      single thread:
      ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100
      
      multi_thread
      ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
      ```
      Commit 99cdf16464a057ca44de2f747541dedf651bae9e is included in the benchmark results. It was an earlier attempt where tombstones are fragmented for each write operation; reader threads share the list through a shared_ptr, which slows down multi-threaded read performance, as seen in the benchmark results.
      Results are averaged over 5 runs.
      
      Single thread result:
      | Max # tombstones  | main fillrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR | main readrandom micros/op |  99cdf16464a057ca44de2f747541dedf651bae9e | Post PR |
      | ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
      | 0    |6.68     |6.57     |6.72     |4.72     |4.79     |4.54     |
      | 1    |6.67     |6.58     |6.62     |5.41     |4.74     |4.72     |
      | 10   |6.59     |6.5      |6.56     |7.83     |4.69     |4.59     |
      | 100  |6.62     |6.75     |6.58     |29.57    |5.04     |5.09     |
      | 1000 |6.54     |6.82     |6.61     |320.33   |5.22     |5.21     |
      
      32-thread result: note that "Max # tombstones" is per thread.
      | Max # tombstones  | main fillrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR | main readrandom micros/op |  99cdf16464a057ca44de2f747541dedf651bae9e | Post PR |
      | ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
      | 0    |234.52   |260.25   |239.42   |5.06     |5.38     |5.09     |
      | 1    |236.46   |262.0    |231.1    |19.57    |22.14    |5.45     |
      | 10   |236.95   |263.84   |251.49   |151.73   |21.61    |5.73     |
      | 100  |268.16   |296.8    |280.13   |2308.52  |22.27    |6.57     |
      
      Reviewed By: ajkr
      
      Differential Revision: D37916564
      
      Pulled By: cbi42
      
      fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
  8. 07 May 2022 (1 commit)
    • Remove own ToString() (#9955) · 736a7b54
      Committed by sdong
      Summary:
      ToString() was created because some platforms didn't support std::to_string(). However, we've already been using std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just removes ToString().
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
      
      Test Plan: Watch CI tests
      
      Reviewed By: riversand963
      
      Differential Revision: D36176799
      
      fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
  9. 16 Mar 2022 (1 commit)
  10. 17 Jun 2021 (1 commit)
    • Rename ImmutableOptions variables (#8409) · d5bd0039
      Committed by mrambacher
      Summary:
      This is the next part of the ImmutableOptions cleanup. After changing the use of ImmutableCFOptions to ImmutableOptions, there were places in the code that did something like "ImmutableOptions* immutable_cf_options", where "cf" referred to the "old" type.
      
      This change simply renames the variables to match the current type.  No new functionality is introduced.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8409
      
      Reviewed By: pdillinger
      
      Differential Revision: D29166248
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 96de97f8e743f5c5160f02246e3ed8269556dc6f
  11. 06 May 2021 (1 commit)
    • Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262) · 8948dc85
      Committed by mrambacher
      Summary:
      The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions.  This change cleans that up by introducing an ImmutableOptions struct.  Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).
      
      Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR.  All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.
      
      Follow-on PRs may do a further clean-up of the code, such as renaming variables (e.g. "ImmutableOptions cf_options") and potentially eliminating unneeded function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
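The resulting layout can be shown in rough Python analogy (the class names mirror the C++ types; the fields here are made up for illustration):

```python
# Sketch of the Options-struct pattern: ImmutableOptions inherits from both
# the DB-level and CF-level immutable option structs, so one object carries
# both kinds of fields.
class ImmutableDBOptions:
    def __init__(self):
        self.env = "default_env"  # hypothetical DB-level field

class ImmutableCFOptions:
    def __init__(self):
        self.comparator = "bytewise"  # hypothetical CF-level field

class ImmutableOptions(ImmutableDBOptions, ImmutableCFOptions):
    def __init__(self):
        ImmutableDBOptions.__init__(self)
        ImmutableCFOptions.__init__(self)
```

A function that previously took both an ImmutableDBOptions and an ImmutableCFOptions can now take a single ImmutableOptions.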
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262
      
      Reviewed By: pdillinger
      
      Differential Revision: D28226540
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
  12. 05 Dec 2020 (1 commit)
    • Add blob support to DBIter (#7731) · 61932cdf
      Committed by Levi Tamasi
      Summary:
      The patch adds iterator support to the integrated BlobDB implementation.
      Whenever a blob reference is encountered during iteration, the corresponding
      blob is retrieved by calling `Version::GetBlob`, assuming the `expose_blob_index`
      (formerly `allow_blob`) flag is *not* set. (Note: the flag is set by the old stacked
      BlobDB implementation, which has its own blob file handling/blob retrieval logic.)
      
      In addition, `DBIter` now uniformly returns `Status::NotSupported` with the error
      message `"BlobDB does not support merge operator."` when encountering a
      blob reference while performing a merge (instead of potentially returning a
      message that implies the database should be opened using the stacked BlobDB's
      `Open`.)
      
      TODO: We can implement support for lazily retrieving the blob value (or in other
      words, bypassing the retrieval of blob values based on key) by extending the `Iterator`
      API with a new `PrepareValue` method (similarly to `InternalIterator`, which already
      supports lazy values).
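The retrieval step can be sketched like this (a Python toy model; `get_blob` stands in for `Version::GetBlob`, and the tuple-based entries are made up):

```python
# Sketch: when iteration hits a blob reference, resolve it through a
# GetBlob-style lookup before exposing the value; with expose_blob_index
# set, the raw reference is returned instead (old stacked BlobDB behavior).
def iterate(entries, get_blob, expose_blob_index=False):
    """entries: list of (key, value, is_blob_ref)."""
    for key, value, is_blob_ref in entries:
        if is_blob_ref and not expose_blob_index:
            value = get_blob(value)  # stands in for Version::GetBlob
        yield key, value
```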
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7731
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D25256293
      
      Pulled By: ltamasi
      
      fbshipit-source-id: c39cd782011495a526cdff99c16f5fca400c4811
  13. 04 Aug 2020 (1 commit)
    • dedup ReadOptions in iterator hierarchy (#7210) · a4a4a2da
      Committed by Andrew Kryczka
      Summary:
      Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator`
      and every `LevelIterator`. This redundancy consumes extra memory,
      resulting in the `Arena` making more allocations, and iteration
      observing worse cache performance.
      
      This PR migrates callers of `NewInternalIterator()` and
      `MakeInputIterator()` to provide a `ReadOptions` object guaranteed to
      outlive the returned iterator. When the iterator's lifetime will be managed by the
      user, this lifetime guarantee is achieved by storing the `ReadOptions`
      value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and
      `MakeInputIterator()` can hold a reference-to-const `ReadOptions`.
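The ownership pattern can be sketched in Python (illustrative; in the C++ code the sub-iterators hold a reference-to-const while the arena-wrapped iterator owns the value):

```python
# Sketch: one ReadOptions object owned at the top, shared by reference with
# every sub-iterator, instead of each sub-iterator storing its own copy.
class ReadOptions:
    def __init__(self, fill_cache=True):
        self.fill_cache = fill_cache

class SubIterator:
    def __init__(self, read_options):
        self.read_options = read_options  # reference, not a copy

class ArenaWrappedDBIter:
    def __init__(self, read_options, num_children):
        # owned here, so it outlives all the children below
        self.read_options = read_options
        self.children = [SubIterator(self.read_options)
                         for _ in range(num_children)]
```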
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210
      
      Test Plan:
      - `make check` under ASAN and valgrind
      - benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes.
      
      Reviewed By: anand1976
      
      Differential Revision: D22861323
      
      Pulled By: ajkr
      
      fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce
  14. 19 Jun 2020 (1 commit)
  15. 16 Apr 2020 (1 commit)
    • Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621) · e45673de
      Committed by Mike Kolupaev
      Summary:
      Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report an error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and the kBinarySearchWithFirstKey implementation was considered a prototype.
      
      Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
      
      It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
      
      Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
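A toy Python model of the opt-in deferred loading (the names mirror the description above; the block-loading details are made up):

```python
# Sketch: with allow_unprepared_value, value() is only valid after a
# successful prepare_value(), which is where an IO error can surface.
# Without the flag, the value is loaded eagerly during seek.
class SstIterator:
    def __init__(self, load_block, allow_unprepared_value=False):
        self._load_block = load_block  # may raise IOError
        self._allow_unprepared = allow_unprepared_value
        self._value = None
        self._prepared = False

    def seek(self, key):
        self._value, self._prepared = None, False
        if not self._allow_unprepared:
            self.prepare_value()  # eager: IO errors surface in seek

    def prepare_value(self):
        if not self._prepared:
            self._value = self._load_block()
            self._prepared = True

    def value(self):
        assert self._prepared, "call prepare_value() before value()"
        return self._value
```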
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
      
      Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
      
      Reviewed By: siying
      
      Differential Revision: D20786930
      
      Pulled By: al13n321
      
      fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
  16. 21 Feb 2020 (1 commit)
    • Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      Committed by sdong
      Summary:
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for users to solve the problem, the RocksDB namespace is changed to a flag which can be overridden at build time.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE set to another value.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
  17. 18 Dec 2019 (1 commit)
    • delete superversions in BackgroundCallPurge (#6146) · 39fcaf82
      Committed by 解轶伦
      Summary:
      I found that CleanupSuperVersion() may block Get() for 30ms+ (each MemTable is 256MB).
      
      Then I found "delete sv" in ~SuperVersion() takes the time.
      
      The backtrace looks like this
      
      DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() ->
      DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion()
      
      I think it's better to delete it in a background thread; please review it.
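The idea of the fix can be sketched as a simplified Python model of handing the expensive delete to a background thread:

```python
# Sketch: the foreground path enqueues the SuperVersion in O(1) and a
# background purge thread does the slow destruction, so Get() no longer
# stalls for the 30ms+ teardown.
import queue
import threading

purge_queue = queue.Queue()
purged = []

def background_purge_worker():
    while True:
        sv = purge_queue.get()
        if sv is None:  # shutdown signal
            return
        purged.append(sv)  # stands in for the slow `delete sv`

def cleanup_super_version(sv):
    purge_queue.put(sv)  # foreground: cheap, returns immediately

worker = threading.Thread(target=background_purge_worker)
worker.start()
cleanup_super_version("sv1")
purge_queue.put(None)
worker.join()
```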
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146
      
      Differential Revision: D18972066
      
      fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c
  18. 20 Sep 2019 (1 commit)
  19. 14 Sep 2019 (1 commit)