1. 20 Nov 2021, 1 commit
    • Support readahead during compaction for blob files (#9187) · dc5de45a
      Committed by Levi Tamasi
      Summary:
      The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
      that can be used to enable prefetching data from blob files during compaction.
      This is important when using storage with higher latencies like HDDs or remote filesystems.
      If enabled, prefetching is used for all cases when blobs are read during compaction,
      namely garbage collection, compaction filters (when the existing value has to be read from
      a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
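      
      For illustration (not part of the original commit message), a minimal sketch of
      enabling the new option, assuming an otherwise default blob configuration:
      
      ```c++
      #include <rocksdb/options.h>
      
      rocksdb::Options MakeBlobReadaheadOptions() {
        rocksdb::Options options;
        options.enable_blob_files = true;
        // A nonzero value enables prefetching of blob data during compaction reads
        // (GC, compaction filters, Merge); the 2 MB value here is illustrative.
        options.blob_compaction_readahead_size = 2 * 1024 * 1024;
        return options;
      }
      ```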
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187
      
      Test Plan: Ran `make check` and the stress/crash test.
      
      Reviewed By: riversand963
      
      Differential Revision: D32565512
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
      dc5de45a
  2. 12 Oct 2021, 1 commit
    • Make it possible to force the garbage collection of the oldest blob files (#8994) · 3e1bf771
      Committed by Levi Tamasi
      Summary:
      The current BlobDB garbage collection logic works by relocating the valid
      blobs from the oldest blob files as they are encountered during compaction,
      and cleaning up blob files once they contain nothing but garbage. However,
      with sufficiently skewed workloads, it is theoretically possible to end up in a
      situation when few or no compactions get scheduled for the SST files that contain
      references to the oldest blob files, which can lead to increased space amp due
      to the lack of GC.
      
      In order to efficiently handle such workloads, the patch adds a new BlobDB
      configuration option called `blob_garbage_collection_force_threshold`,
      which signals to BlobDB to schedule targeted compactions for the SST files
      that keep alive the oldest batch of blob files if the overall ratio of garbage in
      the given blob files meets the threshold *and* all the given blob files are
      eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
      if the new option is set to 0.9, targeted compactions will get scheduled if the
      sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
      oldest blob files, assuming all affected blob files are below the age-based cutoff.)
      The net result of these targeted compactions is that the valid blobs in the oldest
      blob files are relocated and the oldest blob files themselves cleaned up (since
      *all* SST files that rely on them get compacted away).
      
      These targeted compactions are similar to periodic compactions in the sense
      that they force certain SST files that otherwise would not get picked up to undergo
      compaction and also in the sense that instead of merging files from multiple levels,
      they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need for a "clean cut" boundary, but they never
      include any files from any other level.)
      
      This functionality is currently only supported with the leveled compaction style
      and is inactive by default (since the default value is set to 1.0, i.e. 100%).
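      
      As an illustrative sketch (not from the commit itself), assuming leveled
      compaction and otherwise default blob settings:
      
      ```c++
      #include <rocksdb/options.h>
      
      rocksdb::Options MakeForcedBlobGcOptions() {
        rocksdb::Options options;
        options.enable_blob_files = true;
        options.enable_blob_garbage_collection = true;
        // Only the oldest 25% of blob files (by age) are eligible for GC.
        options.blob_garbage_collection_age_cutoff = 0.25;
        // New option: if >= 90% of the bytes in those eligible oldest blob files
        // are garbage, schedule targeted compactions for the SSTs referencing them.
        options.blob_garbage_collection_force_threshold = 0.9;
        return options;
      }
      ```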
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994
      
      Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.
      
      Reviewed By: riversand963
      
      Differential Revision: D31489850
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
      3e1bf771
  3. 21 Aug 2021, 1 commit
    • Add Bloom/Ribbon hybrid API support (#8679) · 2a383f21
      Committed by Peter Dillinger
      Summary:
      This is essentially resurrection and fixing of the part of
      https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
      when configuring Ribbon filter, you can specify an LSM level before which
      Bloom will be used instead of Ribbon. But Bloom is only considered for
      Leveled and Universal compaction styles and file going into a known LSM
      level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
      you would expect with NewRibbonFilterPolicy.
      
      So that this can be controlled with a single int value and so that flushes
      can be distinguished from intra-L0, we consider flush to go to level -1 for
      the purposes of this option. (Explained in API comment.)
      
      I also expect the most common and recommended Ribbon configuration to
      use Bloom during flush, to minimize slowing down writes and because according
      to my estimates, Ribbon only pays off if the structure lives in memory for
      more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
      to be this mild hybrid configuration. I don't really want to add something like
      NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
      flush, Ribbon otherwise) should be considered a natural choice.
      
      C APIs also updated, but because they don't support overloading,
      rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
      rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
      configuration. While touching C API, I changed bits per key options from
      int to double.
      
      BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
      unused fields from BloomFilterPolicy.
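      
      For illustration (not from the commit itself), a minimal sketch of the hybrid
      configuration via the C++ API; the value 10 (Bloom-equivalent bits per key) is
      illustrative:
      
      ```c++
      #include <rocksdb/filter_policy.h>
      #include <rocksdb/table.h>
      
      rocksdb::BlockBasedTableOptions MakeHybridFilterTableOptions() {
        rocksdb::BlockBasedTableOptions table_options;
        // bloom_before_level = 0: flushes (treated as level -1) use Bloom,
        // while compaction outputs (level >= 0) use Ribbon.
        table_options.filter_policy.reset(rocksdb::NewRibbonFilterPolicy(10, 0));
        return table_options;
      }
      ```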
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679
      
      Test Plan: new + updated tests, including crash test
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D30445797
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
      2a383f21
  4. 11 Aug 2021, 1 commit
    • Memtable sampling for mempurge heuristic. (#8628) · e3a96c48
      Committed by Baptiste Lemaire
      Summary:
      Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then comparing this useful payload estimate with the `double experimental_mempurge_threshold` value.
      Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value other than 0.0 or `DBL_MAX` is only supported with the `SkipList` memtable representation.
      Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
      The unit tests have been readapted to support this new API.
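      
      A minimal sketch (not part of the original commit message) of the new
      threshold-based API, assuming the SkipList memtable representation:
      
      ```c++
      #include <rocksdb/options.h>
      
      rocksdb::Options MakeMemPurgeOptions() {
        rocksdb::Options options;
        // 0.0 (default) disables mempurge, DBL_MAX always purges; values in
        // between are compared against the sampled useful-payload estimate.
        options.experimental_mempurge_threshold = 1.0;  // illustrative value
        return options;
      }
      ```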
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628
      
      Reviewed By: pdillinger
      
      Differential Revision: D30149315
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
      e3a96c48
  5. 07 Aug 2021, 1 commit
  6. 10 Jul 2021, 1 commit
  7. 02 Jul 2021, 1 commit
    • Memtable "MemPurge" prototype (#8454) · 9dc887ec
      Committed by Baptiste Lemaire
      Summary:
Implement an experimental feature called "MemPurge", which consists of purging "garbage" bytes out of a memtable and reusing the memtable struct instead of making it immutable and eventually flushing its contents to storage.
The prototype is deactivated by default and is not intended for use; it is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. At this early stage, when the allow_mempurge flag is set to `true`, all flush operations are rerouted to perform a MemPurge. This is a temporary design decision that will give us time to explore meaningful heuristics for using MemPurge at the right time for relevant workloads. Moreover, the current MemPurge operation only supports `Put`, `Delete`, and `DeleteRange` operations, and handles `Iterator`s as well as `CompactionFilter`s that are invoked at flush time.
Three unit tests are added to `db_flush_test.cc` to check that MemPurge works correctly and that the previously mentioned operations are fully supported and thoroughly tested.
One notable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically improving the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.
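      
      For illustration only, a sketch of switching the prototype on as described above;
      note that a later commit in this log replaces this bool with
      `experimental_mempurge_threshold`:
      
      ```c++
      #include <rocksdb/options.h>
      
      rocksdb::Options MakeMemPurgePrototypeOptions() {
        rocksdb::Options options;
        // In this prototype, setting the flag reroutes every flush through a
        // MemPurge; intended for correctness/validation testing only.
        options.experimental_allow_mempurge = true;
        return options;
      }
      ```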
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454
      
      Reviewed By: anand1976
      
      Differential Revision: D29433971
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
      9dc887ec
  8. 18 May 2021, 1 commit
  9. 28 Apr 2021, 2 commits
  10. 23 Apr 2021, 1 commit
  11. 16 Apr 2021, 1 commit
    • Add Blob Options to C API (#8148) · 4c41e51c
      Committed by mrambacher
      Summary:
Added the Blob option settings from AdvancedColumnFamilyOptions to the C API.
      
There are no tests for getting/setting options in the C API currently, hence no specific test plans. Should there be some?
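      
      As a sketch only (the setter names below are assumptions based on the existing
      `rocksdb_options_set_*` naming pattern; see the PR for the exact additions):
      
      ```c++
      #include "rocksdb/c.h"
      
      void ConfigureBlobOptions(rocksdb_options_t* opts) {
        // Assumed C API setters mirroring the AdvancedColumnFamilyOptions blob fields.
        rocksdb_options_set_enable_blob_files(opts, 1);
        rocksdb_options_set_min_blob_size(opts, 4096);
        rocksdb_options_set_blob_file_size(opts, 256 * 1024 * 1024);
        rocksdb_options_set_enable_blob_gc(opts, 1);
      }
      ```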
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8148
      
      Reviewed By: ltamasi
      
      Differential Revision: D27568495
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 3a52b784467ea2c4bc58be5f75c5d41f0a5c55d6
      4c41e51c
  12. 10 Apr 2021, 1 commit
  13. 27 Mar 2021, 1 commit
  14. 20 Mar 2021, 2 commits
    • Include C++ standard library headers instead of C compatibility headers (#8068) · d9be6556
      Committed by storagezhang
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8068
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D27147685
      
      Pulled By: riversand963
      
      fbshipit-source-id: 5428b1c0142ecae17c977fba31a6d49b52983d1c
      d9be6556
    • Add default in switch (#8065) · c7063242
      Committed by storagezhang
      Summary:
The switch may not cover all branches in `db/c.cc`:
      
      ```c++
      void rocksdb_options_set_access_hint_on_compaction_start(
          rocksdb_options_t* opt, int v) {
        switch(v) {
          case 0:
            opt->rep.access_hint_on_compaction_start =
                ROCKSDB_NAMESPACE::Options::NONE;
            break;
          case 1:
            opt->rep.access_hint_on_compaction_start =
                ROCKSDB_NAMESPACE::Options::NORMAL;
            break;
          case 2:
            opt->rep.access_hint_on_compaction_start =
                ROCKSDB_NAMESPACE::Options::SEQUENTIAL;
            break;
          case 3:
            opt->rep.access_hint_on_compaction_start =
                ROCKSDB_NAMESPACE::Options::WILLNEED;
            break;
        }
      }
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8065
      
      Reviewed By: riversand963
      
      Differential Revision: D27102892
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: ad1d20d192712878e61597311ba75b55df0066d7
      c7063242
  15. 20 Feb 2021, 1 commit
    • Limit buffering for collecting samples for compression dictionary (#7970) · d904233d
      Committed by Andrew Kryczka
      Summary:
      For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.
      
However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data is buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). The limit is not strict since we currently buffer more than just data blocks; keys are buffered as well. But it does take a step towards giving users predictable memory usage.
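      
      A minimal sketch (not from the commit itself) of capping the sampling buffer,
      with illustrative sizes:
      
      ```c++
      #include <rocksdb/options.h>
      
      rocksdb::Options MakeDictCompressionOptions() {
        rocksdb::Options options;
        options.compression = rocksdb::kZSTD;
        options.compression_opts.max_dict_bytes = 16 * 1024;              // target dictionary size
        options.compression_opts.zstd_max_train_bytes = 100 * 16 * 1024;  // train a ZSTD dictionary
        // New option: switch to unbuffered mode once this much data has been
        // buffered for sampling (0 keeps the previous unlimited behavior).
        options.compression_opts.max_dict_buffer_bytes = 64 * 1024 * 1024;
        return options;
      }
      ```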
      
      Related changes include:
      
      - Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
      - Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
      - Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970
      
      Test Plan:
      - updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
      - looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.
      
      Reviewed By: pdillinger
      
      Differential Revision: D26467994
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
      d904233d
  16. 05 Feb 2021, 1 commit
  17. 07 Jan 2021, 1 commit
    • Add more tests to ASSERT_STATUS_CHECKED (3), API change (#7715) · 6e0f62f2
      Committed by Adam Retter
      Summary:
      Third batch of adding more tests to ASSERT_STATUS_CHECKED.
      
      * db_compaction_filter_test
      * db_compaction_test
      * db_dynamic_level_test
      * db_inplace_update_test
      * db_sst_test
      * db_tailing_iter_test
      * db_io_failure_test
      
      Also update GetApproximateSizes APIs to all return Status.
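      
      A minimal sketch of the updated API, which now returns a Status that callers
      should check (the wrapper name below is illustrative, not from the PR):
      
      ```c++
      #include <rocksdb/db.h>
      
      rocksdb::Status ApproximateRangeSize(rocksdb::DB* db,
                                           const rocksdb::Slice& start,
                                           const rocksdb::Slice& end,
                                           uint64_t* size) {
        rocksdb::Range range(start, end);
        // Previously returned void; after this change the Status must be handled.
        return db->GetApproximateSizes(db->DefaultColumnFamily(), &range, 1, size);
      }
      ```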
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7715
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D25806896
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262
      6e0f62f2
  18. 29 Oct 2020, 1 commit
    • Remove unused includes (#7604) · 394210f2
      Committed by Yanqin Jin
      Summary:
      This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D24579392
      
      Pulled By: riversand963
      
      fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
      394210f2
  19. 17 Oct 2020, 1 commit
  20. 15 Oct 2020, 1 commit
  21. 18 Sep 2020, 1 commit
  22. 10 Sep 2020, 1 commit
  23. 04 Sep 2020, 1 commit
  24. 21 Aug 2020, 1 commit
  25. 30 Jul 2020, 1 commit
  26. 29 Jul 2020, 1 commit
  27. 11 Jul 2020, 1 commit
  28. 09 Jul 2020, 1 commit
  29. 30 Jun 2020, 1 commit
  30. 26 Jun 2020, 1 commit
  31. 04 Jun 2020, 2 commits
  32. 03 Jun 2020, 1 commit
  33. 13 May 2020, 1 commit
  34. 08 Apr 2020, 1 commit
  35. 22 Feb 2020, 1 commit
  36. 21 Feb 2020, 1 commit
    • Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      Committed by sdong
      Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide users with a tool to solve the problem, the RocksDB namespace is changed to a macro that can be overridden at build time.
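      
      For illustration, the default header (`rocksdb/rocksdb_namespace.h`) boils down to
      the snippet below; overriding it means building with something like
      `-DROCKSDB_NAMESPACE=my_vendored_rocksdb`:
      
      ```c++
      // Default namespace unless the build overrides the macro.
      #ifndef ROCKSDB_NAMESPACE
      #define ROCKSDB_NAMESPACE rocksdb
      #endif
      
      // Library and application code refer to the configurable namespace:
      namespace ROCKSDB_NAMESPACE {
      class DB;  // all RocksDB symbols live under this namespace
      }  // namespace ROCKSDB_NAMESPACE
      ```
      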
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
Test Plan: Build the release, all, and jtest targets. Try building with ROCKSDB_NAMESPACE overridden to another value.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
      fdf882de
  37. 04 Feb 2020, 1 commit
    • Add an option to prevent DB::Open() from querying sizes of all sst files (#6353) · 637e64b9
      Committed by Mike Kolupaev
      Summary:
When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness: if the filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and RocksDB doesn't check contents on open.
      
If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of the local filesystem, it can be *really* slow and overload the remote storage service. This PR adds an option to skip GetFileSize(); instead it does GetChildren() on the parent directory to check that all the expected sst files are at least present, but doesn't check their sizes.
      
We can't just disable paranoid_checks instead, because paranoid_checks does a few other important things: making the DB read-only on write errors, printing error messages on read errors, etc.
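      
      A minimal sketch, assuming the option added here is
      `DBOptions::skip_checking_sst_file_sizes_on_db_open`:
      
      ```c++
      #include <rocksdb/options.h>
      
      rocksdb::Options MakeFastOpenOptions() {
        rocksdb::Options options;
        options.paranoid_checks = true;  // keep the other paranoid behaviors
        // Skip the per-file GetFileSize() calls during DB::Open(); presence of the
        // expected SST files is still verified via a directory listing.
        options.skip_checking_sst_file_sizes_on_db_open = true;
        return options;
      }
      ```
      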
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353
      
      Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble.
      
      Differential Revision: D19656425
      
      Pulled By: al13n321
      
      fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82
      637e64b9