1. 03 August 2023, 2 commits
    • Add some more bit operations to internal APIs (#11660) · f4e4039f
      Peter Dillinger committed
      Summary:
      BottomNBits() - there is a single fast instruction for this on x86 since BMI2, but testing with godbolt indicates you need at least GCC 10 for the compiler to choose that instruction from the obvious C++ code. https://godbolt.org/z/5a7Ysd41h
      
      BitwiseAnd() - this is a convenience function that works around the language flaw that the type of the result of x & y is the larger of the two input types, when it should be the smaller. This can save some ugly static_cast.
      
      I expect to use both of these in coming HyperClockCache developments, and have applied them in a couple of places in existing code.
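      
      A minimal sketch of what such helpers can look like; the exact RocksDB signatures may differ, so treat the names and constraints here as assumptions based on the summary above.
      
      ```cpp
      #include <cstdint>
      #include <type_traits>
      
      // Keep only the lowest nbits bits of an unsigned value. With BMI2, recent
      // compilers (GCC 10+) can lower this pattern to a single BZHI instruction.
      template <typename T>
      inline T BottomNBits(T v, int nbits) {
        static_assert(std::is_unsigned_v<T>, "expects an unsigned integer type");
        if (nbits >= static_cast<int>(8 * sizeof(T))) {
          return v;
        }
        return v & ((T{1} << nbits) - T{1});
      }
      
      // Return x & y typed as the *smaller* of the two input types, avoiding the
      // static_cast noise that the built-in promotion rules would otherwise force.
      template <typename A, typename B>
      inline auto BitwiseAnd(A x, B y) {
        using Smaller = std::conditional_t<(sizeof(A) < sizeof(B)), A, B>;
        return static_cast<Smaller>(x & y);
      }
      
      // Example: mask a 64-bit hash down to a 32-bit bucket index.
      // uint32_t bucket = BitwiseAnd(hash64, uint32_t{0xFFFF});
      ```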
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11660
      
      Test Plan: unit tests added
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D47935531
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d148c43a1e51df4a1c549b93aaf2725a3f8d3bd6
    • Expand Statistics support in the C API (#11263) · 946d1009
      amatveev-cf committed
      Summary:
      Adds a few missing features to the C API:
      1) Statistics level
      2) Getting individual values instead of a serialized string
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11263
      
      Test Plan: unit tests
      
      Reviewed By: ajkr
      
      Differential Revision: D47309963
      
      Pulled By: hx235
      
      fbshipit-source-id: 84df59db4045fc0fb3ea4aec451bc5c2afd2a248
  2. 02 August 2023, 1 commit
    • Use C++17 [[fallthrough]] in transaction_test.cc (#11663) · 9a2a6db2
      Jay Huh committed
      Summary:
      (Copied from https://www.internalfb.com/diff/D46606060)
      
      This diff makes its files safe for use with -Wimplicit-fallthrough. Now that we're using C++20, there's no reason not to use this C++17 feature to make our code safer.
      It's currently possible to write code like this:
      ```
      switch(x){
        case 1:
          foo1();
        case 2:
          foo2();
          break;
        case 3:
          foo3();
      }
      ```
      But that's scary because we don't know whether the fallthrough from case 1 was intentional or not.
      The -Wimplicit-fallthrough flag will make this an error. The solution is to either fix the bug by inserting break or indicate the intention by using [[fallthrough]]; (from C++17).
      ```
      switch(x){
        case 1:
          foo1();
          [[fallthrough]]; // Solution if we intended to fallthrough
          break;           // Solution if we did not intend to fallthrough
        case 2:
          foo2();
          break;
        case 3:
          foo3();
      }
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11663
      
      Test Plan: Existing tests
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D47961248
      
      Pulled By: jaykorean
      
      fbshipit-source-id: 0d374c721bf1b328c14949dc5c17693da7311d03
  3. 31 July 2023, 2 commits
    • db_stress: Reinstate Transaction::Rollback() calls before destruction (#11656) · bb8fcc00
      Peter Dillinger committed
      Summary:
      https://github.com/facebook/rocksdb/issues/11653 broke some crash tests.
      Apparently these Rollbacks are needed for pessimistic transaction cases. (I'm still not sure if the API makes any sense with regard to safe usage. It's certainly not documented. Will consider in follow-up PRs.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11656
      
      Test Plan: manual crash test runs, crash_test_with_multiops_wc_txn and crash_test_with_multiops_wp_txn
      
      Reviewed By: cbi42
      
      Differential Revision: D47906280
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d058a01b6dbb47a4f08d199e335364168304f81b
    • format_version=6 and context-aware block checksums (#9058) · 7a1b0207
      Peter Dillinger committed
      Summary:
      ## Context checksum
      All RocksDB checksums currently use 32 bits of checking
      power, which should be 1 in 4 billion false negative (FN) probability (failing to
      detect corruption). This is true for random corruptions, and in some cases
      small corruptions are guaranteed to be detected. But some possible
      corruptions, such as in storage metadata rather than storage payload data,
      would have a much higher FN rate. For example:
      * Data larger than one SST block is replaced by data from elsewhere in
      the same or another SST file. Especially with block_align=true, the
      probability of an exact block size match is probably around 1 in 100, making
      the FN probability around the same. Without `block_align=true` the
      probability of same block start location is probably around 1 in 10,000,
      for FN probability around 1 in a million.
      
      To solve this problem in new format_version=6, we add "context awareness"
      to block checksum checks. The stored and expected checksum value is
      modified based on the block's position in the file and which file it is in. The
      modifications are cleverly chosen so that, for example
      * blocks within about 4GB of each other are guaranteed to use different context
      * blocks that are offset by exactly some multiple of 4GiB are guaranteed to use
      different context
      * files generated by the same process are guaranteed to use different context
      for the same offsets, until wrap-around after 2^32 - 1 files
      
      Thus, with format_version=6, if a valid SST block and checksum is misplaced,
      its checksum FN probability should be essentially ideal, 1 in 4B.
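      
      As an illustration only (not RocksDB's actual formula), the "context awareness" idea can be sketched as mixing the block's offset and a per-file salt into the stored checksum value:
      
      ```cpp
      #include <cstdint>
      
      // Illustrative sketch: derive a modifier from the block's offset in the file
      // and a per-file salt, so a checksum copied to another offset or file no
      // longer verifies. The real format_version=6 scheme differs in detail.
      inline uint32_t ContextAwareChecksum(uint32_t base_checksum,
                                           uint64_t block_offset,
                                           uint32_t file_salt) {
        uint32_t modifier =
            (static_cast<uint32_t>(block_offset) * 0x9E3779B9u) ^ file_salt;
        return base_checksum + modifier;  // store this; verification recomputes it
      }
      ```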
      
      ## Footer checksum
      This change also adds checksum protection to the SST footer (with
      format_version=6), for the first time without relying on whole file checksum.
      To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) from
      defeating the footer checksum, we change much of the footer data format
      including an "extended magic number" in format_version 6 that would be
      interpreted as empty index and metaindex block handles in older footer
      versions. We also change the encoding of handles to free up space for
      other new data in footer.
      
      ## More detail: making space in footer
      In order to keep footer the same size in format_version=6 (avoid change to IO
      patterns), we have to free up some space for new data. We do this two ways:
      * Metaindex block handle is encoded down to 4 bytes (from 10) by assuming
      it immediately precedes the footer, and by assuming it is < 4GB.
      * Index block handle is moved into metaindex. (I don't know why it was
      in footer to begin with.)
      
      ## Performance
      In case of a small performance penalty, I've made a "pay as you go" optimization
      to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep
      with the only field used in that structure after construction: `prefix_extractor`.
      This makes the PR an overall performance improvement (results below).
      
      Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6,
      even including that improvement for both. That's based on extreme case table
      write performance testing, many files with many blocks. This is relatively
      checksum intensive (small blocks) and salt generation intensive (small files).
      
      ```
      (for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out
      awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out
      ```
      
      Each value below is ops/s averaged over 100 runs, run simultaneously with competing
      configuration for load fairness
      
      Before -> after (both fv=5): 483530 -> 483673 (negligible)
      Re-run 1: 480733 -> 485427 (1.0% faster)
      Re-run 2: 483821 -> 484541 (0.1% faster)
      Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
      Re-run 1: 482212 -> 485075 (0.6% faster)
      Re-run 2: 483590 -> 484073 (0.1% faster)
      After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
      Re-run 1: 485331 -> 483385 (0.4% slower)
      Re-run 2: 485283 -> 483435 (0.4% slower)
      Re-run 3: 483647 -> 486109 (0.5% faster)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058
      
      Test Plan:
      unit tests included (table_test, db_properties_test, salt in env_test). General DB tests
      and crash test updated to test new format_version.
      
      Also temporarily updated the default format version to 6 and saw some test failures. Almost all
      were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum,
      though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block
      checksum, just assuming it was verified in opening the index reader (probably *usually* true but
      probably not always true). Some other concerns about VerifyChecksum are left in FIXME
      comments. The only remaining test failure on change of default (in block_fetcher_test) now
      has a comment about how to upgrade the test.
      
      The format compatibility test does not need updating because we have not updated the default
      format_version.
      
      Reviewed By: ajkr, mrambacher
      
      Differential Revision: D33100915
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee
  4. 29 July 2023, 3 commits
    • Allow TryAgain in db_stress with optimistic txn, and refactoring (#11653) · b3c54186
      Peter Dillinger committed
      Summary:
      In rare cases, optimistic transaction commit returns TryAgain. This change tolerates that intentional behavior in db_stress, up to a small limit in a row. This way, we don't miss a possible regression with excessive TryAgain, and trying again (rolling back the transaction) should have a well renewed chance of success as the writes will be associated with fresh sequence numbers.
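      
      A minimal sketch of the retry pattern described here, using hypothetical helper names (the stress test's actual logic lives in its new ExecuteTransaction helper and differs in detail):
      
      ```cpp
      #include "rocksdb/status.h"
      #include "rocksdb/utilities/transaction.h"
      
      // Hypothetical sketch: retry an optimistic commit a bounded number of times.
      // Re-applying the writes after Rollback() is left to a callback, because a
      // plain repeated Commit() cannot succeed, per the API notes above.
      template <typename RedoFn>
      rocksdb::Status CommitWithRetry(rocksdb::Transaction* txn, RedoFn redo_writes,
                                      int max_tries = 5) {
        rocksdb::Status s;
        for (int i = 0; i < max_tries; ++i) {
          s = txn->Commit();
          if (!s.IsTryAgain()) {
            return s;  // success, or an error that retrying cannot fix
          }
          s = txn->Rollback();  // discard state bound to the old DB snapshot
          if (!s.ok()) {
            return s;
          }
          s = redo_writes(txn);  // redo writes so they get fresh sequence numbers
          if (!s.ok()) {
            return s;
          }
        }
        return s;  // still TryAgain after max_tries
      }
      ```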
      
      Also, some of the APIs were not clear about Transaction semantics, so I have clarified:
      * (Best I can tell....) Destroying a Transaction is safe without calling Rollback() (or at least should be). I don't know why it's a common pattern in our test code and examples to rollback before unconditional destruction. Stress test updated not to call Rollback unnecessarily (to test safe destruction).
      * Despite essentially doing what is asked, simply trying Commit() again when it returns TryAgain does not have a chance of success, because of the transaction being bound to the DB state at the time of operations before Commit. Similar logic applies to Busy AFAIK. Commit() API comments updated, and expanded unit test in optimistic_transaction_test.
      
      Also also, because I can't stop myself, I refactored a good portion of the transaction handling code in db_stress.
      * Avoid existing and new copy-paste for most transaction interactions with a new ExecuteTransaction (higher-order) function.
      * Use unique_ptr (nicely complements removing unnecessary Rollbacks)
      * Abstract out a pattern for safely calling std::terminate() and use it in more places. (The TryAgain errors we saw did not have stack traces because of "terminate called recursively".)
      
      Intended follow-up: resurrect use of `FLAGS_rollback_one_in` but also include non-trivial cases
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11653
      
      Test Plan:
      this is the test :)
      
      Also, temporarily bypassed the new retry logic and boosted the chance of hitting TryAgain. Quickly reproduced the TryAgain error. Then re-enabled the new retry logic, and was not able to hit the error after running for tens of minutes, even with the boosted chances.
      
      Reviewed By: cbi42
      
      Differential Revision: D47882995
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 21eadb1525423340dbf28d17cf166b9583311a0d
    • Strip leading and trailing whitespace for unreleased_history entries (#11652) · c205a217
      Peter Dillinger committed
      Summary:
      Some trailing whitespace has leaked into HISTORY.md entries. This can lead to unexpected changes when modifying HISTORY.md with hygienic editors (e.g. for a patch release). This change should protect against future cases.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11652
      
      Test Plan: manual
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D47882814
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 148c3746d3b298cb6e1f655f0416d46619969086
    • Compare the number of input keys and processed keys for compactions (#11571) · 6a0f6376
      Changyu Bi committed
      Summary:
      ... to improve data integrity validation during compaction.
      
      A new option `compaction_verify_record_count` is introduced for this verification and is enabled by default. One exception when the verification is not done is when a compaction filter returns kRemoveAndSkipUntil, which can cause CompactionIterator to seek until some key and hence be unable to keep track of the number of keys processed.
      
      For expected number of input keys, we sum over the number of total keys - number of range tombstones across compaction input files (`CompactionJob::UpdateCompactionStats()`). Table properties are consulted if `FileMetaData` is not initialized for some input file. Since table properties for all input files were also constructed during `DBImpl::NotifyOnCompactionBegin()`, `Compaction::GetTableProperties()` is introduced to reduce duplicated code.
      
      For actual number of keys processed, each subcompaction will record its number of keys processed to `sub_compact->compaction_job_stats.num_input_records` and aggregated when all subcompactions finish (`CompactionJob::AggregateCompactionStats()`). In the case when some subcompaction encountered kRemoveAndSkipUntil from compaction filter and does not have accurate count, it propagates this information through `sub_compact->compaction_job_stats.has_num_input_records`.
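      
      A simplified, self-contained sketch of the check (the struct and function here are stand-ins; the real logic lives in CompactionJob and CompactionJobStats):
      
      ```cpp
      #include <cstdint>
      #include <vector>
      
      // Hypothetical, simplified stand-ins for the real RocksDB structures.
      struct InputFileStats {
        uint64_t num_entries;
        uint64_t num_range_deletions;
      };
      
      // Returns true if the verification passes or cannot be performed.
      bool VerifyCompactionRecordCount(const std::vector<InputFileStats>& inputs,
                                       uint64_t num_input_records_processed,
                                       bool has_accurate_count) {
        if (!has_accurate_count) {
          // A compaction filter returned kRemoveAndSkipUntil somewhere, so the
          // processed count is not reliable; skip the check.
          return true;
        }
        uint64_t expected = 0;
        for (const auto& f : inputs) {
          // expected input keys = total keys minus range tombstones, per file
          expected += f.num_entries - f.num_range_deletions;
        }
        return expected == num_input_records_processed;
      }
      ```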
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11571
      
      Test Plan:
      * Add a new unit test `DBCompactionTest.VerifyRecordCount` for the corruption case.
      * All other unit tests for non-corrupted case.
      * Ran crash test for a few hours: `python3 ./tools/db_crashtest.py whitebox --simple`
      
      Reviewed By: ajkr
      
      Differential Revision: D47131965
      
      Pulled By: cbi42
      
      fbshipit-source-id: cc8e94565dd526c4347e9d3843ecf32f6727af92
  5. 28 July 2023, 2 commits
    • Add a UDT comparator for ReverseBytewiseComparator to object library (#11647) · 5dd8c114
      Yu Zhang committed
      Summary:
      Add a built-in comparator that supports uint64_t style user-defined timestamps for ReverseBytewiseComparator.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11647
      
      Test Plan:
      Added a test wrapper for retrieving this comparator from registry and used it in this test:
      `./udt_util_test`
      
      Reviewed By: ltamasi
      
      Differential Revision: D47848303
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 5af5534a8c2d9195997d0308c8e194c1c797548c
    • Fix use_after_free bug when underlying FS enables kFSBuffer (#11645) · 63a5125a
      akankshamahajan committed
      Summary:
      Fix a use_after_free bug in async_io MultiReads when the underlying FS enables kFSBuffer. kFSBuffer is when the underlying FS passes its own buffer instead of using the RocksDB scratch in FSReadRequest.
      Since it's an experimental feature, added a hack for now to fix the bug.
      Planning to make a public API change to remove const from the callback, as it doesn't make sense to use const.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11645
      
      Test Plan: tested locally
      
      Reviewed By: ltamasi
      
      Differential Revision: D47819907
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 1faf5ef795bf27e2b3a60960374d91274931df8d
  6. 27 July 2023, 3 commits
    • Support switching on / off UDT together with in-Memtable-only feature (#11623) · c24ef26c
      Yu Zhang committed
      Summary:
      Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature.
      
      To do this, this PR includes:
      1) Log the `persist_user_defined_timestamps` option per column family in Manifest to facilitate detecting an attempt to enable / disable UDT. This entry is enforced to be logged in the same VersionEdit as the user comparator name entry.
      
      2) User-defined timestamp related options are validated when re-opening a column family, including the user comparator name and the `persist_user_defined_timestamps` flag. These kinds of settings and settings changes are considered valid (a condensed sketch follows this list):
           a) no user comparator change and no effective `persist_user_defined_timestamp` flag change.
           b) switch user comparator to enable UDT provided the immediately effective `persist_user_defined_timestamps` flag
               is false.
           c) switch user comparator to disable UDT provided that the before-change `persist_user_defined_timestamps` is
               already false.
      3) when an attempt to enable UDT is detected, we mark all the column family's existing SST files as "having no UDT" by setting each file's `FileMetaData.user_defined_timestamps_persisted` flag to false, and handle their file boundaries `FileMetaData.smallest`, `FileMetaData.largest` by padding a min timestamp.
      
      4) while enabling / disabling the UDT feature, timestamp size inconsistency in existing WAL logs is handled to make them compatible with the running user comparator.
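      
      A hypothetical, condensed form of the validation rules from item 2, using timestamp sizes as a proxy for the comparator change (the real checks during column family open are more involved):
      
      ```cpp
      #include <cstddef>
      
      // Illustrative interpretation of rules (a)-(c); names are not RocksDB's.
      bool IsValidUdtSettingChange(size_t old_ts_sz, size_t new_ts_sz,
                                   bool old_persist_udt, bool new_persist_udt) {
        if (old_ts_sz == new_ts_sz) {
          // (a) no comparator change: the effective persist flag must not change
          // (the flag only takes effect when a timestamp is in use).
          return old_ts_sz == 0 || old_persist_udt == new_persist_udt;
        }
        if (old_ts_sz == 0 && new_ts_sz > 0) {
          // (b) enabling UDT: only allowed if the newly effective flag is false.
          return new_persist_udt == false;
        }
        if (old_ts_sz > 0 && new_ts_sz == 0) {
          // (c) disabling UDT: only allowed if the flag was already false before.
          return old_persist_udt == false;
        }
        return false;  // changing between two different non-zero timestamp sizes
      }
      ```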
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11623
      
      Test Plan:
      ```
      make all check
      ./db_with_timestamp_basic_test --gtest_filter="*EnableDisableUDT*"
      ./db_wal_test --gtest_filter="*EnableDisableUDT*"
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D47636862
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: dcd19f67292da3c3cc9584c09ad00331c9ab9322
    • Respect cutoff timestamp during flush (#11599) · 4ea7b796
      Yu Zhang committed
      Summary:
      Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead rescheduling the same `FlushRequest`, so a follow-up flush job can continue with the check after some interval.
      
      This approach doesn't work well for atomic flush, so this feature is currently not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp, so it is currently limited to comparators that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`.
      
      For non-atomic flush requests, at any single time a column family can have at most one FlushRequest for it in the `flush_queue_`. Deduplication is done at `FlushRequest` enqueueing (`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and when the same FlushRequest gets rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` that would block us from re-enqueueing the same `FlushRequest`.
      
      Flush is continued nevertheless if there is a risk of entering write stall mode were the flush to be postponed, e.g. due to accumulation of write buffers exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables needs to be tracked, and we use it to increase `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to or newer than it.
      
      Test Plan:
      ```
      ./column_family_test --gtest_filter="*RetainUDT*"
      ./memtable_list_test --gtest_filter="*WithTimestamp*"
      ./flush_job_test --gtest_filter="*WithTimestamp*"
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11599
      
      Reviewed By: ajkr
      
      Differential Revision: D47561586
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89
    • Clarify usage for options `ttl` and `periodic_compaction_seconds` for universal compaction (#11552) · 5c2a063c
      Changyu Bi committed
      Summary:
      This is stacked on https://github.com/facebook/rocksdb/issues/11550 to further clarify the usage of these two options for universal compaction. Similar to FIFO, the two options have the same meaning for universal compaction, which can be confusing to use. For example, for universal compaction, dynamically changing the value of `ttl` has no impact on periodic compactions; users should dynamically change `periodic_compaction_seconds` instead. From the feature matrix (https://fburl.com/daiquery/5s647hwh), there are instances where users set `ttl` to a non-zero value and `periodic_compaction_seconds` to 0. For backward compatibility reasons, instead of deprecating `ttl`, comments are added to mention that `periodic_compaction_seconds` is preferred. In `SanitizeOptions()`, we update the value of `periodic_compaction_seconds` to take into account the value of `ttl`. The logic is documented in the relevant option comment.
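      
      A small usage sketch of the recommendation above for universal compaction; the option name and SetOptions call are standard RocksDB APIs, the one-day value is only an example:
      
      ```cpp
      #include "rocksdb/db.h"
      
      // Prefer periodic_compaction_seconds for universal compaction; dynamically
      // changing ttl has no effect on periodic compactions.
      rocksdb::Status SetDailyPeriodicCompaction(rocksdb::DB* db,
                                                 rocksdb::ColumnFamilyHandle* cf) {
        return db->SetOptions(cf, {{"periodic_compaction_seconds", "86400"}});
      }
      ```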
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11552
      
      Test Plan: * updated existing unit test `DBTestUniversalCompaction2.PeriodicCompactionDefault`
      
      Reviewed By: ajkr
      
      Differential Revision: D47381434
      
      Pulled By: cbi42
      
      fbshipit-source-id: bc41f29f77318bae9a96be84dd89bf5617c7fd57
  7. 25 July 2023, 2 commits
    • Fix comment in WriteBatchWithIndex::NewIteratorWithBase (#11636) · 9cc0986a
      ywave committed
      Summary:
      Remove obsolete comment.
      
      Support for WriteBatchWithIndex::NewIteratorWithBase when overwrite_key=false was added in https://github.com/facebook/rocksdb/pull/8135, as noted in HISTORY.md.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11636
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D47722955
      
      Pulled By: ajkr
      
      fbshipit-source-id: 4fa44a309d9708e9f4a1530918a9aaf7114c9032
    • Even more HyperClockCache refactoring (#11630) · c41122b1
      Peter Dillinger committed
      Summary:
      ... ahead of dynamic variant.
      
      * Introduce an Unref function for a common pattern (see the sketch after this list). Cases that were previously using std::memory_order_acq_rel were doing so because we were saving the pre-updated value in case it might be used. Now we are explicitly throwing away the pre-updated value, so we do not need the acquire semantics, just release.
      * Introduce a reusable EvictionData struct and TrackAndReleaseEvictedEntry() function.
      * Based on a linter suggestion, use a const Func& parameter type instead of Func for templated callable parameters.
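      
      An illustrative sketch of the Unref point from the first bullet (not the actual HyperClockCache code): when the pre-update value is deliberately discarded, a release-only decrement is enough.
      
      ```cpp
      #include <atomic>
      #include <cstdint>
      
      // Decrement a reference/meta word without reading back the previous value.
      // Discarding fetch_sub's return value is what lets us drop the acquire half
      // of acq_rel and keep only release semantics.
      inline void Unref(std::atomic<uint64_t>& meta, uint64_t delta) {
        meta.fetch_sub(delta, std::memory_order_release);
      }
      ```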
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11630
      
      Test Plan: existing tests, and performance test with release build of cache_bench. Getting 1-2% difference between before & after from run to run, but inconsistent about which one is faster.
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D47657334
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5cf2377c0d47a39143b04be6735f98c550e8bdc3
  8. 22 July 2023, 2 commits
    • Add missing table properties in plaintable GetTableProperties() (#11267) · 1567108f
      zhangyuxiang.ax committed
      Summary:
      PlainTable misses table properties in GetTableProperties().
      It should behave like BlockBasedTable here.
      Here is a unit test to reproduce this bug.
      
      ```
      #include <gflags/gflags.h>
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"
      #include "rocksdb/slice_transform.h"
      #include <chrono>    // std::chrono::seconds
      #include <iostream>
      #include <memory>    // std::shared_ptr
      #include <string>    // std::string, std::to_string
      #include <thread>
      #include <vector>    // std::vector
      #include <csignal>
      const std::string kKey = "key";
      
      DEFINE_bool(use_plaintable, true, "use plain table");
      DEFINE_string(db_path, "/dev/shm/test_zyx_path", "db_path");
      
      rocksdb::DB* db = nullptr;
      
      class NoopTransform : public rocksdb::SliceTransform {
      public:
          explicit NoopTransform() {
          }
      
          virtual const char* Name() const override {
              return "rocksdb.Noop";
          }
      
          virtual rocksdb::Slice Transform(const rocksdb::Slice& src) const override {
              return src;
          }
      
          virtual bool InDomain(const rocksdb::Slice& src) const override {
              return true;
          }
      
          virtual bool InRange(const rocksdb::Slice& dst) const override {
              return true;
          }
      
          virtual bool SameResultWhenAppended(const rocksdb::Slice& prefix) const override {
              return false;
          }
      };
      
      class TestPropertiesCollector : public ::rocksdb::TablePropertiesCollector {
      public:
          explicit TestPropertiesCollector() {
          }
      
      private:
          ::rocksdb::Status AddUserKey(const ::rocksdb::Slice& key, const ::rocksdb::Slice& value, ::rocksdb::EntryType type,
                                       ::rocksdb::SequenceNumber seq, uint64_t file_size) override {
              count++;
              return ::rocksdb::Status::OK();
          }
      
          ::rocksdb::Status Finish(::rocksdb::UserCollectedProperties* properties) override {
              properties->insert({kKey, std::to_string(count)});
              return ::rocksdb::Status::OK();
          }
      
          ::rocksdb::UserCollectedProperties GetReadableProperties() const override {
              ::rocksdb::UserCollectedProperties properties;
              properties.insert({kKey, std::to_string(count)});
              return properties;
          }
      
          const char* Name() const override {
              return "TestPropertiesCollector";
          }
          int count = 0;
      };
      
      class TestTablePropertiesCollectorFactory : public ::rocksdb::TablePropertiesCollectorFactory {
      public:
          explicit TestTablePropertiesCollectorFactory() {
          }
      
      private:
          ::rocksdb::TablePropertiesCollector* CreateTablePropertiesCollector(
                  ::rocksdb::TablePropertiesCollectorFactory::Context context) override {
              return new TestPropertiesCollector();
          }
      
          const char* Name() const override {
              return "test.TablePropertiesCollectorFactory";
          }
      };
      
      class TestFlushListener : public rocksdb::EventListener {
      public:
          const char* Name() const override {
              return "TestFlushListener";
          }
          void OnFlushCompleted(rocksdb::DB* /*db*/, const rocksdb::FlushJobInfo& flush_job_info) override {
              if (flush_job_info.table_properties.user_collected_properties.find(kKey) ==
                  flush_job_info.table_properties.user_collected_properties.end()) {
                  std::cerr << "OnFlushCompleted: properties not found" << std::endl;
                  return;
              }
              std::cerr << "OnFlushCompleted: properties found "
                        << flush_job_info.table_properties.user_collected_properties.at(kKey) << std::endl;
          }
          explicit TestFlushListener() {
          }
      };
      
      int main(int argc, char* argv[]) {
          gflags::ParseCommandLineFlags(&argc, &argv, true);
          rocksdb::DBOptions rocksdb_options;
          std::shared_ptr<rocksdb::EventListener> flush_offset;
          rocksdb_options.create_if_missing = true;
          rocksdb_options.create_missing_column_families = true;
          std::shared_ptr<::rocksdb::TablePropertiesCollectorFactory> properties_collector(
                  new TestTablePropertiesCollectorFactory());
          rocksdb::ColumnFamilyOptions cfoptions;
          cfoptions.table_properties_collector_factories.emplace_back(properties_collector);
          // Public inheritance above makes the cast unnecessary.
          std::shared_ptr<rocksdb::EventListener> test_cleaner(new TestFlushListener());
          rocksdb_options.listeners.emplace_back(test_cleaner);
      
          cfoptions.prefix_extractor.reset(new NoopTransform());
          if (FLAGS_use_plaintable) {
              cfoptions.table_factory.reset(rocksdb::NewPlainTableFactory());
              std::cerr << "use plaintable" << std::endl;
          } else {
              cfoptions.table_factory.reset(rocksdb::NewBlockBasedTableFactory());
              std::cerr << "use blockbasedtable" << std::endl;
          }
      
          // Build the column family descriptor only after cfoptions is fully
          // configured, so the chosen table factory and prefix extractor apply.
          std::vector<rocksdb::ColumnFamilyDescriptor> cf_desc_;
          cf_desc_.emplace_back(rocksdb::kDefaultColumnFamilyName, cfoptions);
          std::vector<rocksdb::ColumnFamilyHandle*> cfhs;
      
          auto s = rocksdb::DB::Open(rocksdb_options, FLAGS_db_path, cf_desc_, &cfhs, &db);
          if (s.ok()) {
              rocksdb::WriteOptions wops;
              wops.disableWAL = true;
              for (int i = 0; i < 1000000; i++) {
                  auto status = db->Put(wops, std::to_string(i), std::string(1024, '3'));
                  if (!status.ok()) {
                      std::cerr << "write fail " << status.getState() << std::endl;
                  }
              }
          } else {
              std::cerr << "open rocksdb failed" << s.getState() << std::endl;
          }
          std::this_thread::sleep_for(std::chrono::seconds(1000));
          delete db;
      }
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11267
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D47689943
      
      Pulled By: hx235
      
      fbshipit-source-id: 585589cc48f8b26c7dd2323fc7ac4a0c3d4df6bb
    • Move prefetching responsibility to page cache for compaction read under non directIO usecase (#11631) · 629605d6
      Hui Xiao committed
      
      Summary:
      **Context/Summary**
      As titled. The benefit of doing so is to explicitly call readahead() instead of relying on page cache behavior for compaction reads, when we know that we most likely need readahead since compaction reads are sequential.
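      
      An illustrative sketch of the idea at the POSIX level (RocksDB routes this through its FileSystem/Env layer rather than calling the syscall directly):
      
      ```cpp
      #include <fcntl.h>
      
      #include <cstdint>
      
      // Ask the kernel to start readahead of an upcoming sequential range so the
      // page cache, not a RocksDB-owned buffer, holds the prefetched data. On
      // Linux this is roughly what readahead() / POSIX_FADV_WILLNEED do.
      void HintCompactionReadahead(int fd, uint64_t offset, uint64_t length) {
        (void)posix_fadvise(fd, static_cast<off_t>(offset),
                            static_cast<off_t>(length), POSIX_FADV_WILLNEED);
      }
      ```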
      
      **Test**
      Extended the existing UT to cover compaction read case
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11631
      
      Reviewed By: ajkr
      
      Differential Revision: D47681437
      
      Pulled By: hx235
      
      fbshipit-source-id: 78792f64985c4dc44aa8f2a9c41ab3e8bbc0bc90
  9. 21 July 2023, 2 commits
  10. 20 July 2023, 7 commits
  11. 19 July 2023, 4 commits
  12. 15 July 2023, 2 commits
    • Remove reallocation of AlignedBuffer in direct_io sync reads if already aligned (#11600) · 749b179c
      akankshamahajan committed
      Summary:
      Remove reallocation of AlignedBuffer in direct_io sync reads in RandomAccessFileReader::Read if the buffer passed in is already aligned.
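      
      A minimal sketch of the alignment check this relies on (illustrative only; the actual logic is inside RandomAccessFileReader::Read and AlignedBuffer):
      
      ```cpp
      #include <cstddef>
      #include <cstdint>
      
      // Direct I/O can read straight into the caller's scratch buffer only when
      // both the pointer and the length satisfy the required alignment; otherwise
      // a separate AlignedBuffer (and a copy) is needed.
      inline bool CanReadIntoCallerBuffer(const void* scratch, size_t len,
                                          size_t alignment) {
        return reinterpret_cast<uintptr_t>(scratch) % alignment == 0 &&
               len % alignment == 0;
      }
      ```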
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11600
      
      Test Plan:
      Setup: `TEST_TMPDIR=./tmp-db/ ./db_bench -benchmarks=filluniquerandom -disable_auto_compactions=true -target_file_size_base=1048576 -write_buffer_size=1048576 -compression_type=none`
      Benchmark: `TEST_TMPDIR=./tmp-db/ perf record ./db_bench --cache_size=8388608 --use_existing_db=true --disable_auto_compactions=true --benchmarks=seekrandom --use_direct_reads=true -use_direct_io_for_flush_and_compaction=true -reads=1000 -seek_nexts=1 -max_auto_readahead_size=131072 -initial_auto_readahead_size=16384 -adaptive_readahead=true -num_file_reads_for_auto_readahead=0`
      
      Perf profile-
      Before:
      ```
      8.73% db_bench libc.so.6 [.] __memmove_evex_unaligned_erms
      3.34% db_bench [kernel.vmlinux] [k] filemap_get_read_batch
      ```
      
      After:
      ```
      2.50% db_bench [kernel.vmlinux] [k] filemap_get_read_batch
      2.29% db_bench libc.so.6 [.] __memmove_evex_unaligned_erms
      ```
      
      `make crash_test -j` with direct_io enabled completed successfully locally.
      
      Ran a few benchmarks with direct_io, with seek_nexts varying from 912 to 327680 and different readahead_size parameters, and they showed no regression so far.
      
      Reviewed By: ajkr
      
      Differential Revision: D47478598
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 6a48e21cb34696f5d09c22a6311a3a1cb5f9cf33
      749b179c
    • P
      Some small improvements to HyperClockCache (#11601) · b1b6f87f
      Peter Dillinger 提交于
      Summary:
      Stacked on https://github.com/facebook/rocksdb/issues/11572
      * Minimize use of std::function and lambdas to minimize chances of
      compiler heap-allocating closures (unnecessary stress on allocator). It
      appears that converting FindSlot to a template enables inlining the
      lambda parameters, avoiding heap allocations (see the sketch after this list).
      * Clean up some logic with FindSlot (FIXMEs from https://github.com/facebook/rocksdb/issues/11572)
      * Fix handling of rare case of probing all slots, with new unit test.
      (Previously Insert would not roll back displacements in that case, which
      would kill performance if it were to happen.)
      * Add an -early_exit option to cache_bench for gathering memory stats
      before deallocation.
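      
      An illustrative contrast for the first bullet (not the real FindSlot signature): passing the callable as a template parameter lets the compiler inline the lambda instead of wrapping it in a std::function that may heap-allocate its closure.
      
      ```cpp
      #include <cstddef>
      
      // Linear-probe for a slot matching `match_fn`, returning table_size if every
      // slot was probed without a match. The callable is a template parameter, so
      // a lambda passed here is inlined rather than type-erased. Assumes
      // table_size > 0.
      template <typename MatchFn>
      size_t FindSlotTemplated(size_t start, size_t table_size,
                               const MatchFn& match_fn) {
        size_t idx = start % table_size;
        for (size_t probes = 0; probes < table_size; ++probes) {
          if (match_fn(idx)) {
            return idx;
          }
          idx = (idx + 1) % table_size;
        }
        return table_size;  // probed all slots, no match
      }
      ```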
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11601
      
      Test Plan:
      unit test added for probing all slots
      
      ## Seeing heap allocations
      Run `MALLOC_CONF="stats_print:true" ./cache_bench -cache_type=hyper_clock_cache`
      before https://github.com/facebook/rocksdb/issues/11572 vs. after this change. Before, we see this in the
      interesting bin statistics:
      
      ```
      size  nrequests
      ----  ---------
        32     578460
        64      24340
      8192     578460
      ```
      And after:
      ```
      size  nrequests
      ----  ---------
        32  (insignificant)
        64      24370
      8192     579130
      ```
      
      ## Performance test
      Build with `make USE_CLANG=1 PORTABLE=0 DEBUG_LEVEL=0 -j32 cache_bench`
      
      Run `./cache_bench -cache_type=hyper_clock_cache -ops_per_thread=5000000`
      in before and after configurations, simultaneously:
      
      ```
      Before: Complete in 33.244 s; Rough parallel ops/sec = 2406442
      After:  Complete in 32.773 s; Rough parallel ops/sec = 2441019
      ```
      
      Reviewed By: jowlyzhang
      
      Differential Revision: D47375092
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 46f0f57257ddb374290a0a38c651764ea60ba410
  13. 14 July 2023, 1 commit
  14. 13 July 2023, 1 commit
  15. 11 July 2023, 3 commits
    • Improve error message when an SST file in MANIFEST is not found (#11573) · 854eb76a
      Changyu Bi committed
      Summary:
      I got the following error message when an SST file is recorded in MANIFEST but is missing from the db folder.
      It's confusing in two ways:
      1. It mentions the file "./074837.ldb", which RocksDB will attempt to open only after ./074837.sst is not found.
      2. The last part, "No such file or directory in file ./MANIFEST-074507", sounds like `074837.ldb` is not found in the manifest.
      
      ```
      ldb --hex --db=. get some_key
      
      Failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: ./074837.ldb: No such file or directory in file ./MANIFEST-074507
      ```
      
      Improving the error message a little bit:
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11573
      
      Test Plan:
      run the same command after this PR
      ```
      Failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: ./074837.sst: No such file or directory  The file ./MANIFEST-074507 may be corrupted.
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D47192056
      
      Pulled By: cbi42
      
      fbshipit-source-id: 06863f376cc4455803cffb2250c41399b4c39467
    • fix: std::optional value() build error on older macOS SDK (#11574) · 1a7c7419
      weedge committed
      Summary:
      `PORTABLE=1 USE_SSE=1 USE_PCLMUL=1 WITH_JEMALLOC_FLAG=1 JEMALLOC=1 make static_lib` on macOS
      
      clang --version:
      
      Apple clang version 12.0.0 (clang-1200.0.32.29)
      Target: x86_64-apple-darwin22.4.0
      Thread model: posix
      InstalledDir: /Library/Developer/CommandLineTools/usr/bin
      
      The compile errors look like this:
      
      util/udt_util.cc:39:39: error: 'value' is unavailable: introduced in macOS 10.14
        if (running_ts_sz != recorded_ts_sz.value()) {
                                            ^
      /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/optional:944:33: note: 'value' has been explicitly marked
            unavailable here
          constexpr value_type const& value() const&
                                      ^
      util/udt_util.cc:217:62: error: 'value' is unavailable: introduced in macOS 10.14
            *new_key = StripTimestampFromUserKey(key, record_ts_sz.value());
                                                                   ^
      /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/optional:953:27: note: 'value' has been explicitly marked
            unavailable here
          constexpr value_type& value() &
                                ^
      2 errors generated.
      make: *** [util/udt_util.o] Error 1
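      
      A common workaround for this availability error is to avoid std::optional::value(), which is availability-gated on Apple SDKs because it can throw, and instead dereference the optional after checking has_value(); for example:
      
      ```cpp
      #include <cstddef>
      #include <optional>
      
      // Equivalent of `recorded_ts_sz.value()` without touching the
      // availability-gated accessor: check has_value() and use operator*.
      size_t RecordedTsSizeOr(const std::optional<size_t>& recorded_ts_sz,
                              size_t fallback) {
        return recorded_ts_sz.has_value() ? *recorded_ts_sz : fallback;
      }
      ```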
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11574
      
      Reviewed By: ajkr
      
      Differential Revision: D47269519
      
      Pulled By: cbi42
      
      fbshipit-source-id: da49d90cdf00a0af519f91c0cf7d257401eb395f
    • Handle file boundaries when timestamps should not be persisted (#11578) · f7452634
      Yu Zhang committed
      Summary:
      Handle file boundaries `FileMetaData.smallest`, `FileMetaData.largest` for when `persist_user_defined_timestamps` is false:
          1) on the manifest write path, the original user-defined timestamps in file boundaries are stripped. This stripping is done during `VersionEdit::Encode` to limit the effect of the stripping to only the persisted version of the file boundaries.
          2) on the manifest read path during DB open, a min timestamp is padded to the file boundaries (see the sketch below). Ideally, this padding should happen during `VersionEdit::Decode` so that all in-memory file boundaries have a user key format compatible with the running user comparator. However, the user-defined timestamp size information is not available at that time, so this change is added to `VersionEditHandler::OnNonCfOperation` instead.
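      
      A minimal sketch of the padding in item 2 (names are illustrative, and it assumes the common case where the minimum timestamp encodes as ts_sz zero bytes appended to the user key):
      
      ```cpp
      #include <cstddef>
      #include <string>
      
      // Append a minimum timestamp to a stripped user key so the in-memory file
      // boundary matches the running comparator's key format.
      std::string PadWithMinTimestamp(const std::string& stripped_user_key,
                                      size_t ts_sz) {
        std::string padded = stripped_user_key;
        padded.append(ts_sz, '\0');  // all-zero bytes as the minimum timestamp
        return padded;
      }
      ```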
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11578
      
      Test Plan:
      ```
      make all check
      ./version_edit_test --gtest_filter="*EncodeDecodeNewFile4HandleFileBoundary*".
      ./db_with_timestamp_basic_test --gtest_filter="*HandleFileBoundariesTest*"
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D47309399
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 21b4d54d2089a62826b31d779094a39cb2bbbd51
  16. 08 July 2023, 2 commits
    • Fix a unit test hole for recovering UDTs with WAL files (#11577) · baf37a0e
      Yu Zhang committed
      Summary:
      Thanks pdillinger for pointing out this test hole. The test `DBWALTestWithTimestamp.Recover`, which is intended to test recovery from WAL including user-defined timestamps, doesn't achieve its promised coverage. Specifically, after https://github.com/facebook/rocksdb/issues/11557, timestamps will be removed during flush, and RocksDB by default flushes memtables during recovery since `avoid_flush_during_recovery` defaults to false. This test didn't fail even though all the timestamps are quickly lost due to the default flush behavior.
      
      This PR renamed test `Recover` to `RecoverAndNoFlush`, and updated it to verify timestamps are successfully recovered from WAL with some time-travel reads. `avoid_flush_during_recovery` is set to true to help do this verification.
      
      On the other hand, for test `DBWALTestWithTimestamp.RecoverAndFlush`, flushing on reopen is the DB's default behavior, so setting the flags `max_write_buffer` and `arena_block_size` is not really what enforces the flush; these flags are removed.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11577
      
      Test Plan: ./db_wal_test
      
      Reviewed By: pdillinger
      
      Differential Revision: D47142892
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 9465e278806faa5885b541b4e32d99e698edef7d
    • Make `rocksdb_options_add_compact_on_deletion_collector_factory` backward compatible (#11593) · 1f410ff9
      Changyu Bi committed
      Summary:
      https://github.com/facebook/rocksdb/issues/11542 added a parameter to the C API `rocksdb_options_add_compact_on_deletion_collector_factory`, which causes some internal builds to fail. External users of this API would also require a code change. This makes the API backward compatible by restoring the old C API and adding the parameter to a new C API, `rocksdb_options_add_compact_on_deletion_collector_factory_del_ratio`.
      
      Also updated change log for 8.4 and will backport this change to 8.4 branch once landed.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11593
      
      Test Plan: `make c_test && ./c_test`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D47299555
      
      Pulled By: cbi42
      
      fbshipit-source-id: 517dc093ef4cf02cac2fe4af4f1af13754bbda63
  17. 06 July 2023, 1 commit
    • Deprecate option `periodic_compaction_seconds` for FIFO compaction (#11550) · df082c8d
      Changyu Bi committed
      Summary:
      Both options `ttl` and `periodic_compaction_seconds` have the same meaning for FIFO compaction, which is redundant and can be confusing. For example, setting TTL to 0 does not disable TTL: the user needs to also set periodic_compaction_seconds to 0. Another example is that dynamically setting `periodic_compaction_seconds` (surprisingly) has no effect on TTL compaction. This is because the FIFO compaction picker internally only looks at the value of `ttl`. The value of `ttl` is adjusted in `SanitizeOptions()`, which takes into account the value of `periodic_compaction_seconds`, but dynamically setting an option does not invoke this method.
      
      This PR clarifies the usage of both options for FIFO compaction: only `ttl` should be used, `periodic_compaction_seconds` will not have any effect on FIFO compaction.
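      
      A small usage sketch reflecting the clarified guidance (standard RocksDB options; the 30-day value is only an example):
      
      ```cpp
      #include "rocksdb/options.h"
      
      // For FIFO compaction, ttl is the option that matters;
      // periodic_compaction_seconds has no effect on FIFO after this change.
      rocksdb::Options MakeFifoOptions() {
        rocksdb::Options options;
        options.compaction_style = rocksdb::kCompactionStyleFIFO;
        options.ttl = 30 * 24 * 60 * 60;  // drop files older than ~30 days
        return options;
      }
      ```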
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11550
      
      Test Plan:
      - updated existing unit test `DBOptionsTest.SanitizeFIFOPeriodicCompaction`
      - checked existing values of both options in feature matrix: https://fburl.com/daiquery/xxd0gs9w. All current uses cases either have `periodic_compaction_seconds = 0` or have `periodic_compaction_seconds > ttl`, so should not cause change of behavior.
      
      Reviewed By: ajkr
      
      Differential Revision: D46902959
      
      Pulled By: cbi42
      
      fbshipit-source-id: a9ede235b276783b4906aaec443551fa62ceff4c