1. Jan 31, 2023 (2 commits)
    • DB Stress to fix a false assertion (#11164) · 36174d89
      Authored by sdong
      Summary:
      Seeing this error in stress tests:
      
      db_stress: internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:2459: void rocksdb::StressTest::Open(rocksdb::SharedState *): Assertion `txn_db_ == nullptr' failed. Received signal 6 (Aborted)
      ......
      
      It doesn't appear that txn_db_ is set to nullptr at all. We now set it there.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11164
      
      Test Plan: Run db_stress in transaction and non-transaction modes with a low kill rate and verify that it restarts without hitting the assertion.
      
      Reviewed By: ajkr
      
      Differential Revision: D42855662
      
      fbshipit-source-id: 06816d37cce9c94a81cb54ab238fb73aa102ed46
    • Use user key on sst file for blob verification for Get and MultiGet (#11105) · 24ac53d8
      Authored by Yu Zhang
      Summary:
      Use the user key on sst file for blob verification for `Get` and `MultiGet` instead of the user key passed from caller.
      
      Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105
      
      Test Plan:
      make V=1 db_blob_basic_test
      ./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*"
      
      Reviewed By: ltamasi
      
      Differential Revision: D42716487
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a
  2. Jan 28, 2023 (3 commits)
    • Move ExternalSSTTestEnv to FileSystemWrapper (#11139) · 79e57a39
      Authored by akankshamahajan
      Summary:
      Migrate ExternalSSTTestEnv to FileSystemWrapper
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11139
      
      Reviewed By: anand1976
      
      Differential Revision: D42780180
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 9a4448c9fe5186b518235fe11e1a34dcad897cdd
    • Remove RocksDB LITE (#11147) · 4720ba43
      Authored by sdong
      Summary:
      We haven't been actively maintaining RocksDB LITE recently, and its size must have gone up significantly. We are removing the support.
      
      Most of the changes were made with the following command:
      
      unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
      
      by Peter Dillinger. Other changes were applied manually to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and the file db_stress_test_base.cc.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
      
      Test Plan: See CI
      
      Reviewed By: pdillinger
      
      Differential Revision: D42796341
      
      fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
    • Remove deprecated util functions in options_util.h (#11126) · 6943ff6e
      Authored by Yu Zhang
      Summary:
      Remove the util functions in options_util.h that have previously been marked deprecated.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11126
      
      Test Plan: `make check`
      
      Reviewed By: ltamasi
      
      Differential Revision: D42757496
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 2a138a3c207d0e0e0bbb4d99548cf2cadb44bcfb
  3. Jan 27, 2023 (6 commits)
  4. Jan 26, 2023 (7 commits)
    • Migrate TestEnv in listener_test.cc to FileSystemWrapper (#11125) · 986c5b9d
      Authored by akankshamahajan
      Summary:
      Migrate derived classes from EnvWrapper to FileSystemWrapper so we can eventually deprecate the storage methods in Env.
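      The shape of such a migration can be sketched with simplified stand-in types. This is illustrative only: the real interfaces live in include/rocksdb/file_system.h, and `PosixFileSystem`/`TestFileSystem` below are hypothetical stand-ins, not the RocksDB classes.

      ```cpp
      #include <cassert>
      #include <memory>
      #include <string>

      // Highly simplified stand-ins for rocksdb::FileSystem and
      // rocksdb::FileSystemWrapper.
      struct FileSystem {
        virtual ~FileSystem() = default;
        virtual std::string Name() const = 0;
      };

      struct FileSystemWrapper : public FileSystem {
        explicit FileSystemWrapper(std::shared_ptr<FileSystem> t)
            : target_(std::move(t)) {}
        std::string Name() const override { return target_->Name(); }
        std::shared_ptr<FileSystem> target_;
      };

      struct PosixFileSystem : public FileSystem {
        std::string Name() const override { return "posix"; }
      };

      // The migration pattern: a test helper that used to override storage
      // methods on an EnvWrapper instead derives from FileSystemWrapper, so
      // the storage methods on Env can eventually be deprecated.
      struct TestFileSystem : public FileSystemWrapper {
        using FileSystemWrapper::FileSystemWrapper;
        std::string Name() const override {
          return "test-fs(" + target_->Name() + ")";
        }
      };
      ```

      The derived class forwards everything to its wrapped target by default and overrides only what the test needs.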
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11125
      
      Test Plan: CircleCI jobs
      
      Reviewed By: anand1976
      
      Differential Revision: D42732241
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: c89a70a79fcfb13e158bf8919b1a87a9de133222
    • Remove Stats related to compressed block cache (#11135) · e808858a
      Authored by sdong
      Summary:
      Since the compressed block cache is removed, those stats are no longer needed. They are removed in a separate PR in case there is a problem with the removal. The stats are removed in the same way as in https://github.com/facebook/rocksdb/pull/11131/. HISTORY.md was already updated by mistake; it will be correct after this PR is merged.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11135
      
      Test Plan: Watch CI
      
      Reviewed By: ltamasi
      
      Differential Revision: D42757616
      
      fbshipit-source-id: bd7cb782585c8535ce5784295225c376f3011f35
    • Remove more obsolete statistics (#11131) · 6da2e20d
      Authored by Levi Tamasi
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11131
      
      Test Plan: `make check`
      
      Reviewed By: pdillinger
      
      Differential Revision: D42753997
      
      Pulled By: ltamasi
      
      fbshipit-source-id: ce8b84c1e55374257e93ed74fd255c9b759723ce
    • Fix build with gcc 13 by including <cstdint> (#11118) · 88edfbfb
      Authored by Heiko Becker
      Summary:
      Like other versions before it, gcc 13 moved some includes around, and as a result <cstdint> is no longer transitively included [1]. Explicitly include it for uint{32,64}_t.
      
      [1] https://gcc.gnu.org/gcc-13/porting_to.html#header-dep-changes
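      A minimal illustration of the breakage and fix (the helper function below is hypothetical, not RocksDB code): under gcc 13, fixed-width integer types require the explicit include.

      ```cpp
      #include <cassert>
      #include <cstdint>  // required explicitly under gcc 13 for uint32_t/uint64_t

      // Hypothetical helper using fixed-width types; without the <cstdint>
      // include above, gcc 13 rejects uint32_t/uint64_t as undeclared,
      // even though older gcc versions often pulled them in transitively.
      uint32_t PackVersion(uint16_t major, uint16_t minor) {
        // Combine two 16-bit components into one 32-bit value.
        return (static_cast<uint32_t>(major) << 16) | minor;
      }
      ```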
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11118
      
      Reviewed By: cbi42
      
      Differential Revision: D42711356
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5ea257b85b7017f40fd8fdbce965336da95c55b2
    • Support PutEntity in trace analyzer (#11127) · 6a5071ce
      Authored by Andrew Kryczka
      Summary:
      Add the most basic support such that trace_analyzer commands no longer fail with
      ```
      Cannot process the write batch in the trace
      Cannot process the TraceRecord
      PutEntityCF not implemented
      Cannot process the trace
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11127
      
      Reviewed By: cbi42
      
      Differential Revision: D42732319
      
      Pulled By: ajkr
      
      fbshipit-source-id: 162d8a31318672a46539b1b042ec25f69b25c4ed
    • Fix DelayWrite() calls for two_write_queues (#11130) · 546e213c
      Authored by Peter Dillinger
      Summary:
      PR https://github.com/facebook/rocksdb/issues/11020 fixed a case where it was easy to deadlock the DB with LockWAL() but introduced a bug showing up as a rare assertion failure in the stress test. Specifically, `assert(w->state == STATE_INIT)` in `WriteThread::LinkOne()` called from `BeginWriteStall()`, `DelayWrite()`, `WriteImplWALOnly()`. I haven't been able to generate a unit test that reproduces this failure, but I believe the root cause is that DelayWrite() was never meant to be re-entrant, only called from the DB's write_thread_ leader. https://github.com/facebook/rocksdb/issues/11020 introduced a call to DelayWrite() from the nonmem_write_thread_ group leader.
      
      This fix is to make DelayWrite() apply to the specific write queue that it is being called from (inject a dummy write stall entry to the head of the appropriate write queue). WriteController is re-entrant, based on polling and state changes signalled with bg_cv_, so can manage stalling two queues. The only anticipated complication (called out by Andrew in previous PR) is that we don't want timed write delays being injected in parallel for the two queues, because that diminishes the intended throttling effect. Thus, we only allow timed delays for the primary write queue.
      
      HISTORY not updated because this is intended for the same release where the bug was introduced.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11130
      
      Test Plan:
      Although I was not able to reproduce the assertion failure, I was able to reproduce a distinct flaw with what I believe is the same root cause: a kind of deadlock if both write queues need to wake up from stopped writes. Only one will be waiting on bg_cv_ (the other waiting in `LinkOne()` for the write queue to open up), so a single SignalAll() will only unblock one of the queues, with the other re-instating the stop until another signal on bg_cv_. A simple unit test is added for this case.
      
      Will also run crash_test_with_multiops_wc_txn for a while looking for issues.
      
      Reviewed By: ajkr
      
      Differential Revision: D42749330
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 4317dd899a93d57c26fd5af7143038f82d4d4d1b
    • Remove deprecated Env::LoadEnv() (#11121) · 9afa0f05
      Authored by Peter Dillinger
      Summary:
      Can use Env::CreateFromString() instead
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11121
      
      Test Plan: unit tests updated
      
      Reviewed By: cbi42
      
      Differential Revision: D42723813
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5d4b5b10225dfdaf662f5f8049ee965a05d3edc9
  5. Jan 25, 2023 (6 commits)
    • Remove some deprecated/obsolete statistics from the API (#11123) · 99e55953
      Authored by Levi Tamasi
      Summary:
      These tickers/histograms have been obsolete (and not populated) for a long time.
      The patch removes them from the API completely. Note that this means that the
      numeric values of the remaining tickers change in the C++ code as they get shifted up.
      This should be OK: the values of some existing tickers have changed many times
      over the years as items have been added in the middle. (In contrast, the convention
      in the Java bindings is to keep the ids, which are not guaranteed to be the same
      as the ids on the C++ side, the same across releases.)
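      Why the numeric values shift can be seen with a toy enum (hypothetical enumerator names, not the actual Tickers enum): implicit enumerator values renumber when a middle entry is removed.

      ```cpp
      #include <cassert>

      // Toy stand-ins for a tickers enum before and after removing an
      // obsolete entry; implicit values shift every later ticker down,
      // which is why persisted or cross-language numeric ids cannot be
      // assumed stable across releases.
      enum OldTickers { OLD_BLOCK_HIT, OLD_OBSOLETE, OLD_BLOCK_MISS };  // OLD_BLOCK_MISS == 2
      enum NewTickers { NEW_BLOCK_HIT, NEW_BLOCK_MISS };                // NEW_BLOCK_MISS == 1
      ```

      This is also why the Java bindings keep their own stable ids rather than mirroring the C++ values.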
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11123
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D42727793
      
      Pulled By: ltamasi
      
      fbshipit-source-id: e058a155a20b05b45f53e67ee380aece1b43b6c5
    • Migrate ErrorEnv from EnvWrapper to FileSystemWrapper (#11124) · bcbab59c
      Authored by anand76
      Summary:
      Migrate ErrorEnv from EnvWrapper to FileSystemWrapper so we can eventually deprecate the storage methods in Env.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11124
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D42727791
      
      Pulled By: anand1976
      
      fbshipit-source-id: e8362ad624dc28e55c99fc35eda12866755f62c6
    • Remove compressed block cache (#11117) · 2800aa06
      Authored by sdong
      Summary:
      Compressed block cache is replaced by compressed secondary cache. Remove the feature.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117
      
      Test Plan: See CI passes
      
      Reviewed By: pdillinger
      
      Differential Revision: D42700164
      
      fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d
    • A better contract for best_efforts_recovery (#11085) · 4a918534
      Authored by Peter Dillinger
      Summary:
      Capture more of the original intent at a high level, without getting bogged down in low-level details.
      
      The old text made some weak promises about handling of LOCK files. There should be no specific concern for LOCK files, because we already rely on LockFile() to create the file if it's not present already. And the lock file is generally size 0, so we don't have to worry about truncation. Added a unit test.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11085
      
      Test Plan: existing tests, and a new one.
      
      Reviewed By: siying
      
      Differential Revision: D42713233
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 2fce7c974d35fac065037c9c4c7326a59c9fe340
    • Improve documentation for `allow_ingest_behind` (#11119) · e0ea0dc6
      Authored by Changyu Bi
      Summary:
      Update the documentation to mention that only universal compaction is supported.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11119
      
      Reviewed By: ajkr
      
      Differential Revision: D42715986
      
      Pulled By: cbi42
      
      fbshipit-source-id: 91b145d3318334cb92857c5c0ffc0efed6fa4363
    • Fix data race on `ColumnFamilyData::flush_reason` by letting FlushRequest/Job own flush_reason instead of CFD (#11111) · 86fa2592
      Authored by Hui Xiao
      
      Summary:
      **Context:**
      Concurrent flushes on the same CF can set `ColumnFamilyData::flush_reason` before each other finishes. A symptom is that one CF has a different flush_reason from the others even though all of them are in the same atomic flush: `db_stress: db/db_impl/db_impl_compaction_flush.cc:423: rocksdb::Status rocksdb::DBImpl::AtomicFlushMemTablesToOutputFiles(const rocksdb::autovector<rocksdb::DBImpl::BGFlushArg>&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority): Assertion cfd->GetFlushReason() == cfds[0]->GetFlushReason() failed.`
      
      **Summary:**
      As suggested by ltamasi, we now refactor and let FlushRequest/Job own flush_reason, as there is no good way to define `ColumnFamilyData::flush_reason` in the face of concurrent flushes on the same CF (which wasn't the case long ago when `ColumnFamilyData::flush_reason` was first introduced).
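      The ownership change can be sketched with simplified stand-in structs (illustrative only; the real types are `ColumnFamilyData` and `FlushRequest` under db/): per-operation state moves off the shared object and into the request/job that performs the operation.

      ```cpp
      #include <cassert>
      #include <string>

      enum class FlushReason { kManual, kWriteBufferFull, kWalFull };

      // Before: the shared per-CF object carried the reason, so two
      // concurrent flushes on the same CF raced on this one field.
      struct ColumnFamilyDataBefore {
        FlushReason flush_reason;  // racy: overwritten by concurrent flushes
      };

      // After: each flush request/job owns its own reason, so concurrent
      // flushes on the same CF no longer share mutable state.
      struct FlushRequestAfter {
        std::string cf_name;
        FlushReason flush_reason;  // owned per request, no race
      };

      FlushRequestAfter MakeRequest(const std::string& cf, FlushReason reason) {
        return FlushRequestAfter{cf, reason};
      }
      ```

      With per-request ownership, two concurrent flushes on the same CF each keep their own reason instead of clobbering a shared field.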
      
      **Tests:**
      - new unit test
      - make check
      - aggressive crash test rehearsal
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11111
      
      Reviewed By: ajkr
      
      Differential Revision: D42644600
      
      Pulled By: hx235
      
      fbshipit-source-id: 8589c8184869d3415e5b780c887f877818a5ebaf
  6. Jan 24, 2023 (1 commit)
  7. Jan 21, 2023 (2 commits)
    • Add API to limit blast radius of merge operator failure (#11092) · b7fbcefd
      Authored by Andrew Kryczka
      Summary:
      Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage.
      
      To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands.
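      A sketch of the intended control flow, using simplified stand-in types rather than the real headers (the actual enum and struct live in include/rocksdb/merge_operator.h; the function names below are hypothetical):

      ```cpp
      #include <cassert>
      #include <string>
      #include <vector>

      // Simplified stand-ins for the RocksDB types.
      enum class OpFailureScope { kDefault, kTryMerge, kMustMerge };

      struct MergeOperationOutput {
        std::string new_value;
        OpFailureScope op_failure_scope = OpFailureScope::kDefault;
      };

      // A merge operator that fails but limits the blast radius: reads that
      // must merge (Get/MultiGet/iterators) still see the failure, while
      // flushes/compactions may keep the unmerged operands.
      bool FailingFullMergeV2(const std::vector<std::string>& operands,
                              MergeOperationOutput* out) {
        (void)operands;
        out->op_failure_scope = OpFailureScope::kMustMerge;
        return false;  // failure, but scoped
      }

      // Caller-side sketch: a compaction can tolerate a kMustMerge failure
      // by writing the input operands through unmerged.
      bool CompactionCanContinue(const MergeOperationOutput& out, bool merge_ok) {
        return merge_ok || out.op_failure_scope == OpFailureScope::kMustMerge;
      }
      ```

      Under `kDefault`/`kTryMerge` the compaction would fail as before; only `kMustMerge` lets it survive by emitting the operands unmerged.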
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092
      
      Reviewed By: siying
      
      Differential Revision: D42525673
      
      Pulled By: ajkr
      
      fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee
    • Enhance async scan prefetch unit tests (#11087) · bde65052
      Authored by akankshamahajan
      Summary:
      Add more coverage in unit tests for async scan. The added unit test fails without PR https://github.com/facebook/rocksdb/pull/10939.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11087
      
      Test Plan: CircleCI jobs status for new unit tests.
      
      Reviewed By: anand1976
      
      Differential Revision: D42487931
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: d59ed7666599bd0d2733ac5d76bd70984b54c5a9
  8. Jan 20, 2023 (3 commits)
    • Fix error maybe-uninitialized #11100 (#11101) · f4a5446c
      Authored by qq_39353189
      Summary:
      In issue [11100](https://github.com/facebook/rocksdb/issues/11100), I tried to upgrade the dependencies and toolchain of [BaikalDB](https://github.com/baidu/BaikalDB) to gcc 12. I found that when building rocksdb v6.26.0 (newer versions may apply too), the compiler reported "error: maybe-uninitialized" in trace_replay/trace_replay.cc. It can be fixed very easily, so I made this pull request.
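      The warning class and the usual fix look like this (illustrative functions, not the actual trace_replay.cc code):

      ```cpp
      #include <cassert>
      #include <cstdint>

      // gcc 12/13 with -Wmaybe-uninitialized flags patterns where a
      // variable is only conditionally assigned before being read:
      //
      //   uint64_t ts;                 // warning: 'ts' may be used uninitialized
      //   if (has_ts) { ts = encoded; }
      //   return ts;                   // read on the has_ts == false path
      //
      // The usual one-line fix is a definite initial value on every path:
      uint64_t DecodeTimestampFixed(bool has_ts, uint64_t encoded) {
        uint64_t ts = 0;  // initialized regardless of the branch taken
        if (has_ts) {
          ts = encoded;
        }
        return ts;
      }
      ```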
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11101
      
      Reviewed By: ajkr
      
      Differential Revision: D42583031
      
      Pulled By: cbi42
      
      fbshipit-source-id: 7f399f09441a30fe88b83cec5e2fd9885bad5c06
    • remove unused InternalIteratorBase::is_mutable_ (#11104) · a5bcbcd8
      Authored by leipeng
      Summary:
      `InternalIteratorBase::is_mutable_` is not used any more, remove it.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11104
      
      Reviewed By: ajkr
      
      Differential Revision: D42582747
      
      Pulled By: cbi42
      
      fbshipit-source-id: d30bf75151fc8414df0ae112a6ec4943b5b7330b
    • Upgrade xxhash.h to latest dev (#11098) · fd911f96
      Authored by Peter Dillinger
      Summary:
      Upgrading xxhash.h to the latest dev version as of 1/17/2023, which is d7197ddea81364a539051f116ca77926100fc77f. This should improve performance on some ARM machines.
      
      I allowed some of our RocksDB-specific changes to be made obsolete where it seemed appropriate, for example
      * xxhash.h has its own fallthrough marker (which I hope works for us)
      * As in https://github.com/Cyan4973/xxHash/pull/549
      
      Merging and resolving conflicts one way or the other was all that went into this diff, except that I had to mix the two sides around `defined(__loongarch64)`.
      
      How I did the upgrade (for future reference), so that I could use usual merge conflict resolution:
      ```
      # New branch to help with merging
      git checkout -b xxh_merge_base
      # Check out RocksDB revision before last xxhash.h upgrade
      git reset --hard 22161b75^
      # Create a commit with the raw base version from xxHash repo (from xxHash repo)
      git show 2c611a76f914828bed675f0f342d6c4199ffee1e:xxhash.h > ../rocksdb/util/xxhash.h
      # In RocksDB repo
      git commit -a
      # Merge in the last xxhash.h upgrade
      git merge 22161b75
      # Resolve conflict using committed version
      git show 22161b75:util/xxhash.h > util/xxhash.h
      git commit -a
      # Catch up to upstream
      git merge upstream/main
      
      # Create a different branch for applying raw upgrade
      git checkout -b xxh_upgrade_2023
      # Find the RocksDB commit we made for the raw base version from xxHash
      git log main..HEAD
      # Rewind to it
      git reset --hard 2428b727
      # Copy in latest raw version (from xxHash repo)
      cat xxhash.h > ../rocksdb/util/xxhash.h
      # Merge in RocksDB changes, use typical tools for conflict resolution
      git merge xxh_merge_base
      ```
      
      Branch https://github.com/facebook/rocksdb/tree/xxhash_merge_base can be used as a base for future xxhash merges.
      
      Fixes https://github.com/facebook/rocksdb/issues/11073
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11098
      
      Test Plan:
      existing tests (e.g. Bloom filter schema stability tests)
      
      Also seems to include a small performance boost on my Intel dev machine, using `./db_bench --benchmarks=xxh3[-X50] 2>&1 | egrep -o 'operations;.*' | sort`
      
      Fastest out of 50 runs, before: 15477.3 MB/s
      Fastest out of 50 runs, after: 15850.7 MB/s, and 11 more runs faster than the "before" number
      
      Slowest out of 50 runs, before: 12267.5 MB/s
      Slowest out of 50 runs, after: 13897.1 MB/s
      
      More repetitions show the distinction is repeatable
      
      Reviewed By: hx235
      
      Differential Revision: D42560010
      
      Pulled By: pdillinger
      
      fbshipit-source-id: c43ee52f1c5fe0ba3d6d6e4eebb22ded5f5492ea
  9. Jan 19, 2023 (2 commits)
  10. Jan 18, 2023 (2 commits)
    • Consider TTL compaction file cutting earlier to prevent small output file (#11075) · 4d0f9a99
      Authored by Changyu Bi
      Summary:
      in `CompactionOutputs::ShouldStopBefore()`, TTL-related states, `cur_files_to_cut_for_ttl_` and `next_files_to_cut_for_ttl_`, are not updated if the function returns early. This can cause unnecessary compaction output file cuttings and hence produce smaller output files, which may hurt write amp. See the example in the unit test for how this "unnecessary file cutting" can happen. This PR fixes this issue by moving the code for updating TTL states earlier in `CompactionOutputs::ShouldStopBefore()` so that the states are updated for each key.
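      The bug pattern and fix can be sketched generically (the `Cutter` struct below is illustrative, not the actual CompactionOutputs code): per-key bookkeeping placed after an early return silently goes stale, so the fix moves the state update before any early return.

      ```cpp
      #include <cassert>

      struct Cutter {
        int keys_seen = 0;
        // Buggy shape: the early return skips the state update, so the
        // bookkeeping drifts behind the keys actually processed.
        bool ShouldStopBeforeBuggy(bool early_exit) {
          if (early_exit) return false;  // state update below is skipped!
          ++keys_seen;
          return keys_seen % 4 == 0;
        }
        // Fixed shape: state advances for every key, then the early
        // return can happen.
        bool ShouldStopBeforeFixed(bool early_exit) {
          ++keys_seen;
          if (early_exit) return false;
          return keys_seen % 4 == 0;
        }
      };
      ```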
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11075
      
      Test Plan: - Added new unit test.
      
      Reviewed By: hx235
      
      Differential Revision: D42398739
      
      Pulled By: cbi42
      
      fbshipit-source-id: 09fab66679c1a734abcfc31bcea33dd9aeb9dbc7
    • Avoid counting extra range tombstone compensated size in `AddRangeDels()` (#11091) · 6a82b687
      Authored by Changyu Bi
      Summary:
      in `CompactionOutputs::AddRangeDels()`, range tombstones with the same start and end key but different sequence numbers all contribute to compensated range tombstone size. This PR removes this redundancy. This PR also includes a fix from https://github.com/facebook/rocksdb/issues/11067 where a range tombstone that is not within a file's range was being added to the file. This fixes an assertion failure for `icmp.Compare(start, end) <= 0` in VersionSet::ApproximateSize() when calculating compensated range tombstone size. Assertions and a comment/essay were added to reason that no such range tombstone will be added after this fix.
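      The deduplication idea can be sketched as counting compensated size once per unique (start, end) pair while ignoring sequence numbers (simplified stand-in types and a hypothetical per-range estimate, not the real code):

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <set>
      #include <string>
      #include <utility>
      #include <vector>

      struct RangeTombstone {
        std::string start, end;
        uint64_t seqno;
      };

      // Tombstones that differ only in sequence number should contribute
      // compensated size once, not once per seqno.
      uint64_t CompensatedSizeOnce(const std::vector<RangeTombstone>& ts,
                                   uint64_t per_range_estimate) {
        std::set<std::pair<std::string, std::string>> seen;
        uint64_t total = 0;
        for (const auto& t : ts) {
          if (seen.insert({t.start, t.end}).second) {
            total += per_range_estimate;  // counted once per unique [start, end)
          }
        }
        return total;
      }
      ```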
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11091
      
      Test Plan:
      - Added unit tests
      - Stress test with small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10`
      
      Reviewed By: ajkr
      
      Differential Revision: D42521588
      
      Pulled By: cbi42
      
      fbshipit-source-id: 5bda3fe38997995314e1f7592319af12b69bc4f8
  11. Jan 14, 2023 (3 commits)
  12. Jan 13, 2023 (1 commit)
  13. Jan 12, 2023 (1 commit)
    • Major Cache refactoring, CPU efficiency improvement (#10975) · 9f7801c5
      Authored by Peter Dillinger
      Summary:
      This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to the Cache class, and it becomes more of a low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because it rather effectively decouples the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
      
      The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
      
      * static_cast lines of code +29 -35 (net removed 6)
      * reinterpret_cast lines of code +6 -32 (net removed 26)
      
      ## cache.h and secondary_cache.h
      * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
        * Simpler for implementations to deal with just one Insert and one Lookup.
        * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
        * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
        * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
        * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
        * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
      * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
      * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
      * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
      * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
      * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
      * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
      
      ## typed_cache.h
      Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
      
      The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
      * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
      * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
      * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
      * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
      
      These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
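      A toy sketch of the "typed facade over an untyped cache" idea (not the actual typed_cache.h, which integrates with Cache handles, CacheItemHelper, and roles): the single cast lives in one place, and call sites stay cast-free.

      ```cpp
      #include <cassert>
      #include <memory>
      #include <string>
      #include <unordered_map>

      // Toy untyped cache storing void*, loosely in the spirit of
      // Cache::ObjectPtr; ownership/cleanup is omitted for brevity.
      using ObjectPtr = void*;
      struct UntypedCache {
        std::unordered_map<std::string, ObjectPtr> map;
      };

      // Typed facade: callers insert and look up TValue* with no casts at
      // the call site; the casts are confined to this wrapper.
      template <typename TValue>
      struct BasicTypedCacheInterface {
        explicit BasicTypedCacheInterface(UntypedCache* c) : cache_(c) {}
        void Insert(const std::string& key, std::unique_ptr<TValue> value) {
          cache_->map[key] = static_cast<ObjectPtr>(value.release());
        }
        TValue* Lookup(const std::string& key) {
          auto it = cache_->map.find(key);
          return it == cache_->map.end() ? nullptr
                                         : static_cast<TValue*>(it->second);
        }
        UntypedCache* cache_;
      };
      ```

      The real interfaces additionally carry charge, deleters via CacheItemHelper, and secondary-cache persistence hooks; this sketch shows only the type-safety layering.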
      
      Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
      
      ## block_cache.h
      This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
      
      ## block_based_table_reader.cc
      Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
      
      The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
      
      ## block_based_table_builder.cc, cache_dump_load_impl.cc
      Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
      
      ## Everything else
      Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
      
      Test Plan:
      tests updated
      
      Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
      
      34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
      34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
      34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
      34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
      34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
      34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
      34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
      34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
      233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
      233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
      233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
      233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
      233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
      233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
      233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
      233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
      1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
      1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
      1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
      1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
      1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
      1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
      1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
      1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
      
      Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
      
      Reviewed By: anand1976
      
      Differential Revision: D42417818
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
  14. Jan 06, 2023 (1 commit)