1. Mar 18, 2023 · 1 commit
    • Ignore async_io ReadOption if FileSystem doesn't support it (#11296) · eac6b6d0
      Committed by anand76
      Summary:
      In PosixFileSystem, IO uring support is opt-in. If the support is not enabled by the user, then ignore the async_io ReadOption in MultiGet and iteration at the top, rather than follow the async_io codepath and transparently switch to sync IO at the FileSystem layer.
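
      The sanitization described above can be sketched as follows; the types and the `SanitizeAsyncIo` helper are illustrative stand-ins, not the actual RocksDB API:

      ```cpp
      #include <cassert>

      // Hypothetical stand-ins, not the real RocksDB types: just enough to show
      // the sanitization done at the top of MultiGet/iteration.
      struct FileSystemCaps {
        bool io_uring_enabled = false;  // opt-in in PosixFileSystem
      };

      struct ReadOpts {
        bool async_io = false;
      };

      // If the FileSystem cannot honor async_io, drop the hint up front instead
      // of following the async codepath and silently falling back to sync IO.
      inline ReadOpts SanitizeAsyncIo(ReadOpts opts, const FileSystemCaps& fs) {
        if (opts.async_io && !fs.io_uring_enabled) {
          opts.async_io = false;
        }
        return opts;
      }
      ```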
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11296
      
      Test Plan: Add new unit tests
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D44045776
      
      Pulled By: anand1976
      
      fbshipit-source-id: a0881bf763ca2fde50b84063d0068bb521edd8b9
  2. Feb 3, 2023 · 1 commit
    • Return any errors returned by ReadAsync to the MultiGet caller (#11171) · 63da9cfa
      Committed by anand76
      Summary:
      Currently, we incorrectly return a Status::Corruption to the MultiGet caller if the file system ReadAsync cannot issue a read and returns an error for some reason, such as IOStatus::NotSupported(). In this PR, we copy the ReadAsync error to the request status so it can be returned to the user.
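
      A minimal sketch of the status propagation, using toy stand-ins for IOStatus and FSReadRequest (only the field names are meant to be suggestive):

      ```cpp
      #include <cassert>
      #include <string>

      // Toy status and request types; the point is only where the error ends up.
      struct ToyStatus {
        bool ok_flag = true;
        std::string msg;
        bool ok() const { return ok_flag; }
      };

      struct ReadRequest {
        ToyStatus status;  // what the MultiGet caller eventually sees
      };

      // Fixed behavior: a failed ReadAsync copies its real error into the request
      // status, instead of the request surfacing a generic Status::Corruption.
      inline void PropagateReadAsyncError(const ToyStatus& read_async_status,
                                          ReadRequest* req) {
        if (!read_async_status.ok()) {
          req->status = read_async_status;
        }
      }
      ```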
      
      Tests:
      Update existing unit tests and add a new one for this scenario
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11171
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D42950057
      
      Pulled By: anand1976
      
      fbshipit-source-id: 85ffcb015fa6c064c311f8a28488fec78c487869
  3. Jan 28, 2023 · 1 commit
    • Remove RocksDB LITE (#11147) · 4720ba43
      Committed by sdong
      Summary:
We haven't been actively maintaining RocksDB LITE recently, and its size must have gone up significantly. We are removing the support.
      
      Most of the changes were made with the following command:
      
      unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
      
      run by Peter Dillinger. Other changes were applied manually to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and the file db_stress_test_base.cc.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
      
      Test Plan: See CI
      
      Reviewed By: pdillinger
      
      Differential Revision: D42796341
      
      fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
  4. Jan 25, 2023 · 2 commits
    • Remove compressed block cache (#11117) · 2800aa06
      Committed by sdong
      Summary:
      Compressed block cache is replaced by compressed secondary cache. Remove the feature.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117
      
      Test Plan: See CI passes
      
      Reviewed By: pdillinger
      
      Differential Revision: D42700164
      
      fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d
    • A better contract for best_efforts_recovery (#11085) · 4a918534
      Committed by Peter Dillinger
      Summary:
      Capture more of the original intent at a high level, without getting bogged down in low-level details.
      
The old text made some weak promises about handling of LOCK files. There should be no specific concern for LOCK files, because we already rely on LockFile() to create the file if it's not present already. And the lock file is generally size 0, so we don't have to worry about truncation. Added a unit test.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11085
      
      Test Plan: existing tests, and a new one.
      
      Reviewed By: siying
      
      Differential Revision: D42713233
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 2fce7c974d35fac065037c9c4c7326a59c9fe340
  5. Jan 12, 2023 · 1 commit
    • Major Cache refactoring, CPU efficiency improvement (#10975) · 9f7801c5
      Committed by Peter Dillinger
      Summary:
      This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
      
      The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
      
      * static_cast lines of code +29 -35 (net removed 6)
      * reinterpret_cast lines of code +6 -32 (net removed 26)
      
      ## cache.h and secondary_cache.h
      * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
        * Simpler for implementations to deal with just one Insert and one Lookup.
        * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
        * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
        * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
        * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
        * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
      * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
      * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
      * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
      * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
      * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
      * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
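
      The helper-per-entry-kind idea above can be sketched as follows; the names mirror the description but the signatures are simplified, not the real API:

      ```cpp
      #include <cassert>
      #include <string>

      // One static helper per entry kind bundles the deleter and the entry's
      // role, replacing per-entry bare deleters.
      using ObjectPtr = void*;

      enum class CacheEntryRole { kDataBlock, kFilterBlock, kMisc };

      struct CacheItemHelper {
        void (*del_cb)(ObjectPtr obj);  // note: no key parameter, per the change
        CacheEntryRole role;            // replaces classifying entries by deleter
      };

      struct BlockLike {
        std::string contents;
      };

      static int g_blocks_deleted = 0;

      inline void DeleteBlockLike(ObjectPtr obj) {
        delete static_cast<BlockLike*>(obj);
        ++g_blocks_deleted;
      }

      // Shared by every data-block entry, so the cache never has to track which
      // entries carry a helper and which carry a plain deleter.
      static const CacheItemHelper kDataBlockHelper{&DeleteBlockLike,
                                                    CacheEntryRole::kDataBlock};
      ```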
      
      ## typed_cache.h
      Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
      
      The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
      * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
      * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
      * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
      * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
      
      These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
      
      Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
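
      A toy, map-backed sketch of the typed-interface layering described above, assuming an untyped void*-based store underneath; none of this is the real Cache class:

      ```cpp
      #include <cassert>
      #include <memory>
      #include <string>
      #include <unordered_map>

      // Untyped store: holds void* plus a deleter, like a low-level block cache.
      class UntypedCache {
       public:
        using ObjectPtr = void*;
        using DeleterFn = void (*)(ObjectPtr);
        void Insert(const std::string& key, ObjectPtr obj, DeleterFn del) {
          Erase(key);
          map_[key] = Entry{obj, del};
        }
        ObjectPtr Lookup(const std::string& key) const {
          auto it = map_.find(key);
          return it == map_.end() ? nullptr : it->second.obj;
        }
        void Erase(const std::string& key) {
          auto it = map_.find(key);
          if (it != map_.end()) {
            it->second.del(it->second.obj);
            map_.erase(it);
          }
        }
        ~UntypedCache() {
          for (auto& kv : map_) kv.second.del(kv.second.obj);
        }
       private:
        struct Entry { ObjectPtr obj; DeleterFn del; };
        std::unordered_map<std::string, Entry> map_;
      };

      // Thin typed layer: restores type safety and supplies the default deleter,
      // so callers never cast or write deleters by hand.
      template <typename TValue>
      class BasicTypedCacheInterface {
       public:
        explicit BasicTypedCacheInterface(UntypedCache* c) : cache_(c) {}
        void Insert(const std::string& key, std::unique_ptr<TValue> v) {
          cache_->Insert(key, v.release(), [](UntypedCache::ObjectPtr p) {
            delete static_cast<TValue*>(p);
          });
        }
        TValue* Lookup(const std::string& key) const {
          return static_cast<TValue*>(cache_->Lookup(key));
        }
       private:
        UntypedCache* cache_;  // raw pointer, like the non-"Shared" variants
      };
      ```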
      
      ## block_cache.h
      This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
      
      ## block_based_table_reader.cc
      Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
      
      The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
      
      ## block_based_table_builder.cc, cache_dump_load_impl.cc
      Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
      
      ## Everything else
      Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
      
      Test Plan:
      tests updated
      
      Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
      
      34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
      34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
      34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
      34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
      34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
      34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
      34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
      34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
      233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
      233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
      233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
      233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
      233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
      233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
      233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
      233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
      1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
      1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
      1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
      1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
      1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
      1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
      1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
      1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
      
      Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
      
      Reviewed By: anand1976
      
      Differential Revision: D42417818
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
  6. Dec 5, 2022 · 1 commit
    • Fix table cache leak in MultiGet with async_io (#10997) · 8ffabdc2
      Committed by anand76
      Summary:
      When MultiGet with the async_io option encounters an IO error in TableCache::GetTableReader, it may result in leakage of table cache handles due to queued coroutines being abandoned. This PR fixes it by ensuring any queued coroutines are run before aborting the MultiGet.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10997
      
      Test Plan:
      1. New unit test in db_basic_test
      2. asan_crash
      
      Reviewed By: pdillinger
      
      Differential Revision: D41587244
      
      Pulled By: anand1976
      
      fbshipit-source-id: 900920cd3fba47cb0fc744a62facc5ffe2eccb64
  7. Nov 3, 2022 · 1 commit
    • Ran clang-format on db/ directory (#10910) · 5cf6ab6f
      Committed by Andrew Kryczka
      Summary:
      Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under db/ directory. Included minor manual touchups mentioned in PR commit history.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910
      
      Reviewed By: riversand963
      
      Differential Revision: D40880683
      
      Pulled By: ajkr
      
      fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
  8. Sep 15, 2022 · 1 commit
  9. Sep 1, 2022 · 1 commit
  10. Aug 30, 2022 · 1 commit
    • Don't wait for indirect flush in read-only DB (#10569) · c5afbbfe
      Committed by Peter Dillinger
      Summary:
      Some APIs for getting live files, which are used by Checkpoint
      and BackupEngine, can optionally trigger and wait for a flush. These
      would deadlock when used on a read-only DB. Here we fix that by assuming
      the user wants the overall operation to succeed and is OK without
      flushing (because the DB is read-only).
      
      Follow-up work: the same or other issues can be hit by directly invoking
      some DB functions that are clearly not appropriate for read-only
      instance, but are not covered by overrides in DBImplReadOnly and
      CompactedDBImpl. These should be fixed to avoid similar problems on
      accidental misuse. (Long term, it would be nice to have a DBReadOnly
      class without those members, like BackupEngineReadOnly.)
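
      The behavior change can be modeled with a toy DB struct (illustrative names only, not the real live-files API):

      ```cpp
      #include <cassert>

      // Toy model: on a read-only DB, a requested flush becomes an OK no-op
      // instead of a deadlock.
      struct ToyDb {
        bool read_only = false;
        bool flushed = false;
      };

      // Stand-in for the live-files APIs used by Checkpoint/BackupEngine.
      inline bool GetLiveFilesOk(ToyDb* db, bool flush_memtable) {
        if (flush_memtable && !db->read_only) {
          db->flushed = true;  // writable DB: trigger and wait for the flush
        }
        return true;  // read-only DB: succeed without flushing
      }
      ```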
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10569
      
      Test Plan: tests updated to catch regression (hang before the fix)
      
      Reviewed By: riversand963
      
      Differential Revision: D38995759
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f5f8bc7123e13cb45bd393dd974d7d6eda20bc68
  11. Aug 20, 2022 · 1 commit
    • MultiGet async IO across multiple levels (#10535) · 35cdd3e7
      Committed by anand76
      Summary:
      This PR exploits parallelism in MultiGet across levels. It applies only to the coroutine version of MultiGet. Previously, MultiGet file reads from SST files in the same level were parallelized. With this PR, MultiGet batches with keys distributed across multiple levels are read in parallel. This is accomplished by splitting the keys not present in a level (determined by bloom filtering) into a separate batch, and processing the new batch in parallel with the original batch.
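
      The batch-splitting step can be sketched as a simple partition by a may-exist predicate standing in for the bloom filter check (the real code works on MultiGet ranges, not string vectors):

      ```cpp
      #include <cassert>
      #include <string>
      #include <vector>

      // Keys the level's filters say are absent are peeled into a second batch
      // that proceeds to the next level, in parallel with the remaining lookups.
      struct SplitResult {
        std::vector<std::string> stay;  // possibly present in this level
        std::vector<std::string> next;  // definitely absent; go to next level
      };

      template <typename MayExistFn>
      SplitResult SplitByFilter(const std::vector<std::string>& batch,
                                MayExistFn may_exist) {
        SplitResult r;
        for (const auto& k : batch) {
          (may_exist(k) ? r.stay : r.next).push_back(k);
        }
        return r;
      }
      ```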
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10535
      
      Test Plan:
      1. Ensure existing MultiGet unit tests pass, updating them as necessary
      2. New unit tests - TODO
      3. Run stress test - TODO
      
      No noticeable regression (<1%) without async IO -
      Without PR: `multireadrandom :       7.261 micros/op 1101724 ops/sec 60.007 seconds 66110936 operations;  571.6 MB/s (8168992 of 8168992 found)`
      With PR: `multireadrandom :       7.305 micros/op 1095167 ops/sec 60.007 seconds 65717936 operations;  568.2 MB/s (8271992 of 8271992 found)`
      
      For a fully cached DB, but with async IO option on, no regression observed (<1%) -
      Without PR: `multireadrandom :       5.201 micros/op 1538027 ops/sec 60.005 seconds 92288936 operations;  797.9 MB/s (11540992 of 11540992 found) `
      With PR: `multireadrandom :       5.249 micros/op 1524097 ops/sec 60.005 seconds 91452936 operations;  790.7 MB/s (11649992 of 11649992 found) `
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38774009
      
      Pulled By: anand1976
      
      fbshipit-source-id: c955e259749f1c091590ade73105b3ee46cd0007
  12. Aug 18, 2022 · 1 commit
    • Fix range deletion handling in async MultiGet (#10534) · 65814a4a
      Committed by anand76
      Summary:
The fix in https://github.com/facebook/rocksdb/issues/10513 was not complete w.r.t. range deletion handling. It didn't handle the case where a file with a range tombstone covering a key also overlapped another key in the batch. In that case, `mget_range` would be non-empty. However, `mget_range` would only have the second key and, therefore, the first key would be skipped when iterating through the range tombstones in `TableCache::MultiGet`.
      
      Test plan -
      1. Add a unit test
      2. Run stress tests
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10534
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38773880
      
      Pulled By: anand1976
      
      fbshipit-source-id: dae491dbe52e18bbce5179b77b63f20771a66c00
  13. Aug 10, 2022 · 1 commit
    • Fix MultiGet range deletion handling and a memory leak (#10513) · 0b02960d
      Committed by anand76
      Summary:
      This PR fixes 2 bugs introduced in https://github.com/facebook/rocksdb/issues/10432 -
      1. If the bloom filter returned a negative result for all MultiGet keys in a file, the range tombstones in that file were being ignored, resulting in incorrect results if those tombstones covered a key in a higher level.
      2. If all the keys in a file were filtered out in `TableCache::MultiGetFilter`, the table cache handle was not being released.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10513
      
      Test Plan: Add a new unit test that fails without this fix
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38548739
      
      Pulled By: anand1976
      
      fbshipit-source-id: a741a1e25d2e991d63f038100f126c2dc404a87c
  14. Aug 5, 2022 · 1 commit
    • Break TableReader MultiGet into filter and lookup stages (#10432) · bf4532eb
      Committed by anand76
      Summary:
      This PR is the first step in enhancing the coroutines MultiGet to be able to lookup a batch in parallel across levels. By having a separate TableReader function for probing the bloom filters, we can quickly figure out which overlapping keys from a batch are definitely not in the file and can move on to the next level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10432
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38245910
      
      Pulled By: anand1976
      
      fbshipit-source-id: 3d20db2350378c3fe6f086f0c7ba5ff01d7f04de
  15. Aug 3, 2022 · 1 commit
    • Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460) · 27f3af59
      Committed by Peter Dillinger
      Summary:
      TL;DR: due to a recent change, if you drop a column family,
      often that DB will no longer fsync after writing new SST files
      to remaining or new column families, which could lead to data
      loss on power loss.
      
      More bug detail:
      The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at
      DB::Close time rather than waiting for DB object destruction.
      Unfortunately, it also closes shared FSDirectory objects on
      DropColumnFamily (& destroy remaining handles), which can lead
      to use-after-Close on FSDirectory shared with remaining column
      families. Those "uses" are only Fsyncs (or redundant Closes). In
      the default Posix filesystem, an Fsync on a closed FSDirectory is a
quiet no-op. Consequently, under most configurations, if you drop
      a column family, that DB will no longer fsync after writing new SST
      files to column families sharing the same directory.
      
      More fix detail:
      Basically, this removes unnecessary Close ops on destroying
      ColumnFamilyData. We let `shared_ptr` take care of calling the
      destructor at the right time. If the intent was to require Close be
      called before destroying FSDirectory, that was not made clear by the
      author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which
      could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did
      not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow
      timely destruction of FSDirectory to suffice as Close (in
      CountedFileSystem). Added a TODO to revisit that.
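
      A toy model of the bug and the fix, with illustrative names; the key point is that dropping one owner must not Close a shared directory:

      ```cpp
      #include <cassert>
      #include <memory>

      struct ToyDir {
        bool closed = false;
        int fsyncs = 0;
        void Close() { closed = true; }
        void Fsync() {
          if (!closed) ++fsyncs;  // on a closed dir: quiet no-op, i.e. data loss
        }
      };

      struct ToyColumnFamily {
        std::shared_ptr<ToyDir> dir;
      };

      // Buggy: Close on drop, even though the directory may still be shared.
      inline void DropColumnFamilyBuggy(ToyColumnFamily* cf) {
        cf->dir->Close();
        cf->dir.reset();
      }

      // Fixed: just release the reference; cleanup happens when the last
      // shared_ptr owner goes away.
      inline void DropColumnFamilyFixed(ToyColumnFamily* cf) { cf->dir.reset(); }
      ```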
      
      Also in this PR:
      * Added a TODO to share FSDirectory instances between DB and its column
      families. (Already shared among column families.)
      * Made DB::Close attempt to close all its open FSDirectory objects even
      if there is a failure in closing one. Also code clean-up around this
      logic.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460
      
      Test Plan:
      add an assert to check for use-after-Close. With that
      existing tests can detect the misuse. With fix, tests pass (except noted
      relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049)
      
      Reviewed By: ajkr
      
      Differential Revision: D38357922
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137
  16. Jul 16, 2022 · 1 commit
  17. Jul 13, 2022 · 1 commit
    • Stop tracking syncing live WAL for performance (#10330) · b283f041
      Committed by Yanqin Jin
      Summary:
With https://github.com/facebook/rocksdb/issues/10087, applications calling `SyncWAL()` or writing with `WriteOptions::sync=true` could suffer
      a performance regression. This PR reverts to the original behavior of tracking only the syncing of closed WALs.
      With the old behavior restored, recovery, whether kPointInTime or kAbsoluteConsistency, may fail to
      detect corruption if the corruption is in the live WAL.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10330
      
      Test Plan:
      make check
      
      Before https://github.com/facebook/rocksdb/issues/10087
      ```bash
      fillsync     :     750.269 micros/op 1332 ops/sec 75.027 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync     :     776.492 micros/op 1287 ops/sec 77.649 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync [AVG 2 runs] : 1310 (± 44) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync     :     805.625 micros/op 1241 ops/sec 80.563 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync [AVG 3 runs] : 1287 (± 51) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync [AVG    3 runs] : 1287 (± 51) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync [MEDIAN 3 runs] : 1287 ops/sec;    0.1 MB/sec
      ```
      
      Before this PR and after https://github.com/facebook/rocksdb/issues/10087
      ```bash
      fillsync     :    1479.601 micros/op 675 ops/sec 147.960 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync     :    1626.080 micros/op 614 ops/sec 162.608 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync [AVG 2 runs] : 645 (± 59) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync     :    1588.402 micros/op 629 ops/sec 158.840 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync [AVG 3 runs] : 640 (± 35) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync [AVG    3 runs] : 640 (± 35) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync [MEDIAN 3 runs] : 629 ops/sec;    0.1 MB/sec
      ```
      
      After this PR
      ```bash
      fillsync     :     749.621 micros/op 1334 ops/sec 74.962 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync     :     865.577 micros/op 1155 ops/sec 86.558 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync [AVG 2 runs] : 1244 (± 175) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync     :     845.837 micros/op 1182 ops/sec 84.584 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync [AVG 3 runs] : 1223 (± 109) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync [AVG    3 runs] : 1223 (± 109) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync [MEDIAN 3 runs] : 1182 ops/sec;    0.1 MB/sec
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D37725212
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8fa7d13b3c7662be5d56351c42caf3266af937ae
  18. Jun 24, 2022 · 1 commit
    • Don't count no prefix as Bloom hit (#10244) · f81ea75d
      Committed by Peter Dillinger
      Summary:
      When a key is "out of domain" for the prefix_extractor (no
      prefix assigned) then the Bloom filter is not queried. PerfContext
      was counting this as a Bloom "hit" while Statistics doesn't count this
      as a prefix Bloom checked. I think it's more accurate to call it neither
      hit nor miss, so changing the counting to make it PerfContext coounting
      more like Statistics.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10244
      
      Test Plan:
      tests updates and expanded (Get and MultiGet). Iterator test
      coverage of the change will come in next PR
      
      Reviewed By: bjlemaire
      
      Differential Revision: D37371297
      
      Pulled By: pdillinger
      
      fbshipit-source-id: fed132fba6a92b2314ab898d449fce2d1586c157
  19. Jun 22, 2022 · 1 commit
    • Expose the initial logger creation error (#10223) · 9586dcf1
      Committed by Yanqin Jin
      Summary:
https://github.com/facebook/rocksdb/issues/9984 changes the behavior of RocksDB: if logger creation fails during `SanitizeOptions()`,
      `DB::Open()` will fail. However, since `SanitizeOptions()` is called in `DBImpl::DBImpl()`, we cannot
      directly expose the error to the caller without some additional work.
      This is a first-version proposal which:
      - Adds a new member `init_logger_creation_s` to `DBImpl` to store the result of initial logger creation
      - Checks the error during `DB::Open()` and returns it to the caller if non-OK
      
      This is not ideal. We could alternatively move the logger creation logic out of `SanitizeOptions()`.
      Since `SanitizeOptions()` is used in other places, we would need to check whether this change breaks anything,
      in case other callers of `SanitizeOptions()` assume that a logger should be created.
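
      The proposed flow can be sketched with toy types (the member name matches the description; everything else is illustrative):

      ```cpp
      #include <cassert>
      #include <string>
      #include <utility>

      struct ToyStatus {
        bool ok_flag = true;
        std::string msg;
        bool ok() const { return ok_flag; }
      };

      struct ToyDbImpl {
        ToyStatus init_logger_creation_s;  // set while sanitizing options in ctor
        explicit ToyDbImpl(ToyStatus s) : init_logger_creation_s(std::move(s)) {}
      };

      // Open checks the stored status and surfaces it instead of swallowing it.
      inline ToyStatus OpenToyDb(ToyStatus logger_status, ToyDbImpl** out) {
        auto* db = new ToyDbImpl(std::move(logger_status));
        if (!db->init_logger_creation_s.ok()) {
          ToyStatus s = db->init_logger_creation_s;
          delete db;
          *out = nullptr;
          return s;  // Open fails with the logger-creation error
        }
        *out = db;
        return ToyStatus{};
      }
      ```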
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10223
      
      Test Plan: make check
      
      Reviewed By: pdillinger
      
      Differential Revision: D37321717
      
      Pulled By: riversand963
      
      fbshipit-source-id: 58042358a86369d606549dd9938933dd47591c4b
  20. Jun 17, 2022 · 1 commit
    • Update stats to help users estimate MultiGet async IO impact (#10182) · a6691d0f
      Committed by anand76
      Summary:
      Add a couple of stats to help users estimate the impact of potential MultiGet perf improvements -
      1. NUM_LEVEL_READ_PER_MULTIGET - A histogram stat for number of levels that required MultiGet to read from a file
      2. MULTIGET_COROUTINE_COUNT - A ticker stat to count the number of times the coroutine version of MultiGetFromSST was used
      
      The NUM_DATA_BLOCKS_READ_PER_LEVEL stat is obsoleted as it doesn't provide useful information for MultiGet optimization.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10182
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D37213296
      
      Pulled By: anand1976
      
      fbshipit-source-id: 5d2b7708017c0e278578ae4bffac3926f6530efb
      a6691d0f
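The two stats can be pictured with a toy model (names and structures here are hypothetical, not RocksDB's Statistics API): per MultiGet, a histogram records how many levels forced a file read, and a ticker counts how often the coroutine read path was taken.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Toy stand-ins for a histogram (bucketed by value) and a ticker.
struct Stats {
  std::map<uint64_t, uint64_t> num_level_read_per_multiget;  // histogram
  uint64_t multiget_coroutine_count = 0;                     // ticker
};

// One entry per level: true if that level required reading from a file.
void RecordMultiGet(Stats& stats, const std::vector<bool>& level_needed_read,
                    bool used_coroutines) {
  uint64_t levels_read = 0;
  for (bool read : level_needed_read) {
    if (read) ++levels_read;
  }
  ++stats.num_level_read_per_multiget[levels_read];
  if (used_coroutines) ++stats.multiget_coroutine_count;
}
```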
  21. 16 6月, 2022 1 次提交
    • P
      Fix handling of accidental truncation of IDENTITY file (#10173) · 3d358a7e
      Peter Dillinger 提交于
      Summary:
      A consequence of https://github.com/facebook/rocksdb/issues/9990 was requiring a non-empty DB ID to generate
      new SST files. But if the DB ID is not tracked in the manifest and the IDENTITY file
is somehow truncated to 0 bytes, then an empty DB ID would be assigned, leading
to a crash. This change ensures a non-empty DB ID is assigned and set in the
      IDENTITY file.
      
      Also,
      * Some light refactoring to clean up the logic
      * (I/O efficiency) If the ID is tracked in the manifest and already matches the
      IDENTITY file, don't needlessly overwrite the file.
      * (Debugging) Log the DB ID to info log on open, because sometimes IDENTITY
      can change if DB is moved around (though it would be unusual for info log to
      be copied/moved without IDENTITY file)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10173
      
      Test Plan: unit tests expanded/updated
      
      Reviewed By: ajkr
      
      Differential Revision: D37176545
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a9b414cd35bfa33de48af322a36c24538d50bef1
      3d358a7e
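The invariant this commit establishes can be sketched as follows (function and parameter names are illustrative; `GenerateRawUuid` stands in for RocksDB's actual UUID generation): never accept an empty DB ID, and only rewrite IDENTITY when it disagrees with the chosen ID.

```cpp
#include <cassert>
#include <string>

std::string GenerateRawUuid() { return "8f4e2b-example-uuid"; }  // placeholder

// Picks the DB ID, preferring the manifest-tracked one, and reports whether
// the IDENTITY file needs to be (re)written.
std::string ResolveDbId(const std::string& id_from_identity_file,
                        const std::string& id_from_manifest,
                        bool& needs_identity_rewrite) {
  std::string db_id =
      !id_from_manifest.empty() ? id_from_manifest : id_from_identity_file;
  if (db_id.empty()) {
    db_id = GenerateRawUuid();  // never allow an empty DB ID (truncated file)
  }
  // I/O efficiency: skip the write when IDENTITY already matches.
  needs_identity_rewrite = (db_id != id_from_identity_file);
  return db_id;
}
```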
  22. 04 6月, 2022 1 次提交
  23. 02 6月, 2022 1 次提交
    • Z
      Explicitly closing all directory file descriptors (#10049) · 65893ad9
      Zichen Zhu 提交于
      Summary:
Currently, the DB directory file descriptor is left open until destruction (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (lines 512-515 in ldb_cmd.cc) to leak the `db_` object, build the `ldb` tool, and run
      ```
      strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
      ```
      There is one directory file descriptor that is not closed in the strace log.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049
      
Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: open a database with a separate WAL directory and three different data directories, and verify that all directory file descriptors are closed after calling Close(). Explicitly call Close() once a directory file descriptor is no longer used, so that the counts of directory opens and closes match.
      
      Reviewed By: ajkr, hx235
      
      Differential Revision: D36722135
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
      65893ad9
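The fix amounts to releasing the descriptor eagerly in `Close()` rather than waiting for the destructor. A POSIX sketch (class and member names are illustrative, not RocksDB's directory abstraction):

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

class DirFd {
 public:
  bool Open(const char* path) {
    fd_ = ::open(path, O_RDONLY | O_DIRECTORY);
    return fd_ >= 0;
  }
  // Close() is idempotent and releases the fd as soon as it is unused.
  void Close() {
    if (fd_ >= 0) {
      ::close(fd_);
      fd_ = -1;
    }
  }
  ~DirFd() { Close(); }  // backstop only; Close() should already have run
  int fd() const { return fd_; }

 private:
  int fd_ = -1;
};
```

With this shape, an `strace` of a well-behaved run shows a matching `close` for every directory `open`, which is what the new unit test checks.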
  24. 01 6月, 2022 1 次提交
  25. 27 5月, 2022 1 次提交
    • Y
      Fail DB::Open() if logger cannot be created (#9984) · 514f0b09
      Yanqin Jin 提交于
      Summary:
For both regular and secondary DB instances, we now return an error and refuse to open the DB if logger creation fails.
      
Our current code allows this, but it is really difficult to debug because
there will be no LOG file. The same goes for the OPTIONS file, which will be explored in another PR.
      
Furthermore, `Arena::AllocateAligned(size_t bytes, size_t huge_page_size, Logger* logger)` has the
following assertion:
      
      ```cpp
      #ifdef MAP_HUGETLB
      if (huge_page_size > 0 && bytes > 0) {
        assert(logger != nullptr);
      }
      #endif
      ```
      
      It can be removed.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9984
      
      Test Plan: make check
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D36347754
      
      Pulled By: riversand963
      
      fbshipit-source-id: 529798c0511d2eaa2f0fd40cf7e61c4cbc6bc57e
      514f0b09
  26. 20 5月, 2022 1 次提交
    • A
      Multi file concurrency in MultiGet using coroutines and async IO (#9968) · 57997dda
      anand76 提交于
      Summary:
      This PR implements a coroutine version of batched MultiGet in order to concurrently read from multiple SST files in a level using async IO, thus reducing the latency of the MultiGet. The API from the user perspective is still synchronous and single threaded, with the RocksDB part of the processing happening in the context of the caller's thread. In Version::MultiGet, the decision is made whether to call synchronous or coroutine code.
      
      A good way to review this PR is to review the first 4 commits in order - de773b3, 70c2f70, 10b50e1, and 377a597 - before reviewing the rest.
      
      TODO:
      1. Figure out how to build it in CircleCI (requires some dependencies to be installed)
      2. Do some stress testing with coroutines enabled
      
      No regression in synchronous MultiGet between this branch and main -
      ```
      ./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=64 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=true -threads=16 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -statistics
      ```
      Branch - ```multireadrandom :       4.025 micros/op 3975111 ops/sec 60.001 seconds 238509056 operations; 2062.3 MB/s (14767808 of 14767808 found)```
      
      Main - ```multireadrandom :       3.987 micros/op 4013216 ops/sec 60.001 seconds 240795392 operations; 2082.1 MB/s (15231040 of 15231040 found)```
      
      More benchmarks in various scenarios are given below. The measurements were taken with ```async_io=false``` (no coroutines) and ```async_io=true``` (use coroutines). For an IO bound workload (with every key requiring an IO), the coroutines version shows a clear benefit, being ~2.6X faster. For CPU bound workloads, the coroutines version has ~6-15% higher CPU utilization, depending on how many keys overlap an SST file.
      
      1. Single thread IO bound workload on remote storage with sparse MultiGet batch keys (~1 key overlap/file) -
      No coroutines - ```multireadrandom :     831.774 micros/op 1202 ops/sec 60.001 seconds 72136 operations;    0.6 MB/s (72136 of 72136 found)```
      Using coroutines - ```multireadrandom :     318.742 micros/op 3137 ops/sec 60.003 seconds 188248 operations;    1.6 MB/s (188248 of 188248 found)```
      
      2. Single thread CPU bound workload (all data cached) with ~1 key overlap/file -
      No coroutines - ```multireadrandom :       4.127 micros/op 242322 ops/sec 60.000 seconds 14539384 operations;  125.7 MB/s (14539384 of 14539384 found)```
      Using coroutines - ```multireadrandom :       4.741 micros/op 210935 ops/sec 60.000 seconds 12656176 operations;  109.4 MB/s (12656176 of 12656176 found)```
      
      3. Single thread CPU bound workload with ~2 key overlap/file -
      No coroutines - ```multireadrandom :       3.717 micros/op 269000 ops/sec 60.000 seconds 16140024 operations;  139.6 MB/s (16140024 of 16140024 found)```
      Using coroutines - ```multireadrandom :       4.146 micros/op 241204 ops/sec 60.000 seconds 14472296 operations;  125.1 MB/s (14472296 of 14472296 found)```
      
      4. CPU bound multi-threaded (16 threads) with ~4 key overlap/file -
      No coroutines - ```multireadrandom :       4.534 micros/op 3528792 ops/sec 60.000 seconds 211728728 operations; 1830.7 MB/s (12737024 of 12737024 found) ```
      Using coroutines - ```multireadrandom :       4.872 micros/op 3283812 ops/sec 60.000 seconds 197030096 operations; 1703.6 MB/s (12548032 of 12548032 found) ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9968
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D36348563
      
      Pulled By: anand1976
      
      fbshipit-source-id: c0ce85a505fd26ebfbb09786cbd7f25202038696
      57997dda
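The real implementation uses C++ coroutines plus async IO, but the core idea, keeping one suspended lookup "in flight" per file and interleaving them on the caller's thread instead of finishing one file before starting the next, can be sketched with a hand-rolled resumable state machine (all names here are illustrative):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using File = std::map<std::string, std::string>;  // toy SST file

// Stand-in for a coroutine: Step() does one unit of work per "resume".
struct FileLookup {
  const File* file;
  const std::vector<std::string>* keys;
  size_t next_key = 0;
  std::vector<std::string> found;
  bool Step() {  // returns true once all keys have been probed
    if (next_key < keys->size()) {
      auto it = file->find((*keys)[next_key]);
      found.push_back(it == file->end() ? "" : it->second);
      ++next_key;
    }
    return next_key == keys->size();
  }
};

std::vector<std::string> MultiGet(const std::vector<File>& files,
                                  const std::vector<std::string>& keys) {
  std::vector<FileLookup> lookups;
  for (const File& f : files) lookups.push_back({&f, &keys});
  // Round-robin "scheduler": resume each suspended lookup in turn, the way
  // async IO lets several per-file reads overlap under one caller thread.
  bool all_done = false;
  while (!all_done) {
    all_done = true;
    for (auto& l : lookups) {
      if (!l.Step()) all_done = false;
    }
  }
  // Merge: the newest file (index 0) wins, mimicking level precedence.
  std::vector<std::string> result(keys.size());
  for (auto& l : lookups) {
    for (size_t i = 0; i < keys.size(); ++i) {
      if (result[i].empty()) result[i] = l.found[i];
    }
  }
  return result;
}
```

The API stays synchronous from the caller's point of view, exactly as the summary notes; only the internal scheduling changes.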
  27. 07 5月, 2022 1 次提交
    • S
      Remove own ToString() (#9955) · 736a7b54
      sdong 提交于
      Summary:
ToString() was created because some platforms don't support std::to_string(). However, we've already been using std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit simply removes ToString().
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
      
      Test Plan: Watch CI tests
      
      Reviewed By: riversand963
      
      Differential Revision: D36176799
      
      fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
      736a7b54
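For illustration only: the legacy helper below is a rough sketch of what a hand-rolled ToString() looks like (not RocksDB's exact code), and `std::to_string()` is its drop-in replacement, which is why the removal is mechanical.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Sketch of a snprintf-based integer formatter of the kind ToString() was.
std::string LegacyToString(long long value) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%lld", value);
  return std::string(buf);
}
```

For integral inputs the two produce identical results, e.g. `LegacyToString(16)` and `std::to_string(16)` both yield `"16"`.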
  28. 12 4月, 2022 1 次提交
    • A
Remove corrupted WAL files in kPointRecoveryMode with avoid_flush_during_recovery set true (#9634) · ae82d914
      Akanksha Mahajan 提交于
      Summary:
1) In the case of a non-TransactionDB with avoid_flush_during_recovery = true, RocksDB won't
flush the data from the WAL to L0 for all column families if possible. As a
result, not all column families can advance their log_numbers, and
min_log_number_to_keep won't change.
2) For a transaction DB (allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions, so min_log_number_to_keep won't change.
      
      If we persist a new MANIFEST with
      advanced log_numbers for some column families, then during a second
      crash after persisting the MANIFEST, RocksDB will see some column
      families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail.
      
As a solution,
1. the WAL files whose numbers are larger than the
corrupted WAL and smaller than the new WAL will be moved to the archive folder.
2. Currently, RocksDB DB::Open() may create and write two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes a new MANIFEST only after recovery is successful.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9634
      
      Test Plan:
1. Added new unit tests
2. make crash_test -j
      
      Reviewed By: riversand963
      
      Differential Revision: D34463666
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: e233d3af0ed4e2028ca0cf051e5a334a0fdc9d19
      ae82d914
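The archiving rule in point 1 above can be sketched as a simple range filter (function and parameter names are illustrative): WAL files numbered strictly between the corrupted WAL and the newly created WAL are moved to the archive folder rather than deleted, since they may still hold data of uncommitted transactions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the WAL numbers that should be moved to archive/ (not deleted).
std::vector<uint64_t> WalsToArchive(const std::vector<uint64_t>& wal_numbers,
                                    uint64_t corrupted_wal,
                                    uint64_t new_wal) {
  std::vector<uint64_t> to_archive;
  for (uint64_t n : wal_numbers) {
    if (n > corrupted_wal && n < new_wal) {
      to_archive.push_back(n);
    }
  }
  return to_archive;
}
```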
  29. 18 2月, 2022 1 次提交
    • A
      Fix some MultiGet batching stats (#9583) · 627deb7c
      anand76 提交于
      Summary:
      The NUM_INDEX_AND_FILTER_BLOCKS_READ_PER_LEVEL, NUM_DATA_BLOCKS_READ_PER_LEVEL, and NUM_SST_READ_PER_LEVEL stats were being recorded only when the last file in a level happened to have hits. They are supposed to be updated for every level. Also, there was some overcounting of GetContextStats. This PR fixes both the problems.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9583
      
      Test Plan: Update the unit test in db_basic_test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D34308044
      
      Pulled By: anand1976
      
      fbshipit-source-id: b3b36020fda26ba91bc6e0e47d52d58f4d7f656e
      627deb7c
  30. 28 1月, 2022 1 次提交
    • Y
      Disallow a combination of options (#9348) · dd203ed6
      Yanqin Jin 提交于
      Summary:
      Disallow `immutable_db_opts.use_direct_io_for_flush_and_compaction == true` and
      `mutable_db_opts.writable_file_max_buffer_size == 0`, since it causes `WritableFileWriter::Append()`
      to loop forever and does not make much sense in direct IO.
      
      This combination of options itself does not make much sense: asking RocksDB to do direct IO but not allowing
      RocksDB to allocate a buffer. We should detect this false combination and warn user early, no matter whether
      the application is running on a platform that supports direct IO or not. In the case of platform **not** supporting
      direct IO, it's ok if the user learns about this and then finds that direct IO is not supported.
      
One tricky thing: the constructor of `WritableFileWriter` is used in our unit tests, and it's impossible
to return a status code from a constructor. Since we do not throw, I put in an assertion for now. Fortunately,
the constructor is not exposed to external applications.
      
      Closing https://github.com/facebook/rocksdb/issues/7109
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9348
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D33371924
      
      Pulled By: riversand963
      
      fbshipit-source-id: 2a3701ab541cee23bffda8a36cdf37b2d235edfa
      dd203ed6
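A minimal sketch of the early validation this commit adds (struct and function names are simplified stand-ins for the options named in the summary): the false combination is detected up front and rejected with an InvalidArgument-style error, regardless of platform.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct SanitizeResult {
  bool ok;
  std::string msg;
};

// Direct IO requires RocksDB to own an aligned buffer, so a zero buffer
// size makes the combination invalid.
SanitizeResult ValidateOptions(bool use_direct_io_for_flush_and_compaction,
                               size_t writable_file_max_buffer_size) {
  if (use_direct_io_for_flush_and_compaction &&
      writable_file_max_buffer_size == 0) {
    return {false,
            "writable_file_max_buffer_size must be nonzero when "
            "use_direct_io_for_flush_and_compaction is true"};
  }
  return {true, ""};
}
```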
  31. 12 1月, 2022 1 次提交
  32. 06 1月, 2022 1 次提交
  33. 05 1月, 2022 1 次提交
  34. 11 12月, 2021 2 次提交
    • Y
      Add commit marker with timestamp (#9266) · bd513fd0
      Yanqin Jin 提交于
      Summary:
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266
      
      This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
      this tag to WAL, thus it is unavailable to users.
      This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also stipulates that all column families that may potentially participate in the same
transaction must either disable timestamps or use the same timestamp format, since the
`CommitWithTimestamp` tag is followed by a single byte array denoting the commit
timestamp of the transaction. We will enforce this check in a future diff; for now we keep this
diff small.
      
      Reviewed By: ltamasi
      
      Differential Revision: D31721350
      
      fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
      bd513fd0
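The record layout described above, a tag byte followed by a byte array holding the commit timestamp, can be sketched as below. The tag value and the one-byte length prefix are illustrative choices for this sketch, not RocksDB's actual WAL encoding.

```cpp
#include <assert.h>
#include <stdint.h>
#include <string>

constexpr uint8_t kCommitWithTimestamp = 0x14;  // hypothetical tag value

std::string EncodeCommitWithTimestamp(const std::string& ts) {
  std::string rec;
  rec.push_back(static_cast<char>(kCommitWithTimestamp));
  rec.push_back(static_cast<char>(ts.size()));  // 1-byte length, for brevity
  rec += ts;
  return rec;
}

bool DecodeCommitWithTimestamp(const std::string& rec, std::string* ts) {
  if (rec.size() < 2 ||
      static_cast<uint8_t>(rec[0]) != kCommitWithTimestamp) {
    return false;
  }
  size_t len = static_cast<uint8_t>(rec[1]);
  if (rec.size() != 2 + len) return false;
  *ts = rec.substr(2, len);
  return true;
}
```

Because every participating column family must agree on the timestamp format, the decoder can treat the byte array as opaque, which is what keeps this diff small.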
    • P
      More refactoring ahead of footer & meta changes (#9240) · 653c392e
      Peter Dillinger 提交于
      Summary:
      I'm working on a new format_version=6 to support context
      checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test
      updates to support that change.
      
      Test coverage data and manual inspection agree on dead code in
      block_based_table_reader.cc (removed).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240
      
      Test Plan:
      tests enhanced to cover more cases etc.
      
      Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible.
      
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true}
      
      (Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
      Before w/ wal: 454512
      After w/ wal: 444820 (-2.1%)
      Before w/o wal: 1004560
      After w/o wal: 998897 (-0.6%)
      
      Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case.
      
      This regression will be corrected in a follow-up PR.
      
      Reviewed By: ajkr
      
      Differential Revision: D32813769
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f
      653c392e
  35. 09 11月, 2021 1 次提交
    • A
      Enable a few unit tests to use custom Env objects (#9087) · dddb791c
      anand76 提交于
      Summary:
      Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087
      
      Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc.
      
      Reviewed By: riversand963
      
      Differential Revision: D32007222
      
      Pulled By: anand1976
      
      fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792
      dddb791c
  36. 03 11月, 2021 1 次提交
  37. 29 10月, 2021 1 次提交
    • P
      Implement XXH3 block checksum type (#9069) · a7d4bea4
      Peter Dillinger 提交于
      Summary:
XXH3 is the latest hash function in the xxHash family and is extremely fast on large
data, easily faster than crc32c on most any x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API (which would add extra data
movement and active code size because of the hash function's complexity). This
approach got a thumbs-up from Yann Collet.
      
      Existing functionality change:
      * reject bad ChecksumType in options with InvalidArgument
      
      This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
      likely to be handled through different configuration than ChecksumType.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069
      
      Test Plan:
      tests updated, and substantially expanded. Unit tests now check
      that we don't accidentally change the values generated by the checksum
      algorithms ("schema test") and that we properly handle
      invalid/unrecognized checksum types in options or in file footer.
      
      DBTestBase::ChangeOptions (etc.) updated from two to one configuration
      changing from default CRC32c ChecksumType. The point of this test code
      is to detect possible interactions among features, and the likelihood of
      some bad interaction being detected by including configurations other
      than XXH3 and CRC32c--and then not detected by stress/crash test--is
      extremely low.
      
      Stress/crash test also updated (manual run long enough to see it accepts
      new checksum type). db_bench also updated for microbenchmarking
      checksums.
      
       ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
      
      ./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
      crc32c       :       0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
      xxhash       :       0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
      xxhash64     :       0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
      xxh3         :       0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
      crc32c       :       0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
      xxhash       :       0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
      xxhash64     :       0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
      xxh3         :       0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
      crc32c       :       0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
      xxhash       :       0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
      xxhash64     :       0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
      xxh3         :       0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)
      
      As you can see, especially once warmed up, xxh3 is fastest.
      
       ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
      
      Test
      
          for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done
      
      Results (ops/sec)
      
          for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
      
      results-0 252118 # kNoChecksum
      results-1 251588 # kCRC32c
      results-2 251863 # kxxHash
      results-3 252016 # kxxHash64
      results-4 252038 # kXXH3
      
      Reviewed By: mrambacher
      
      Differential Revision: D31905249
      
      Pulled By: pdillinger
      
      fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
      a7d4bea4
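The "reject bad ChecksumType" behavior change can be sketched as a simple range check: any value outside the known enum is refused with an InvalidArgument-style error instead of being silently accepted. The enum values below mirror the `-checksum_type` labels in the macrobenchmark above (kNoChecksum = 0 through kXXH3 = 4); the validation helper itself is illustrative.

```cpp
#include <cassert>

enum ChecksumType : int {
  kNoChecksum = 0,
  kCRC32c = 1,
  kxxHash = 2,
  kxxHash64 = 3,
  kXXH3 = 4,
};

// Options sanitization rejects anything outside the known range.
bool IsSupportedChecksumType(int t) {
  return t >= kNoChecksum && t <= kXXH3;
}
```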
  38. 19 10月, 2021 1 次提交
    • J
      Make `DB::Close()` thread-safe (#8970) · 314de7e7
      Jay Zhuang 提交于
      Summary:
If `DB::Close()` is called in a multi-threaded environment, resources
could be double-released, causing an exception or assertion failure.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8970
      
      Test Plan:
Test with a multi-threaded benchmark, with each thread trying to
close the DB at the end.
      
      Reviewed By: pdillinger
      
      Differential Revision: D31242042
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: a61276b1b61e07732e375554106946aea86a23eb
      314de7e7
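One standard shape for the fix, sketched below with an illustrative class (not RocksDB's `DBImpl`): run teardown under a mutex and at most once, so concurrent callers cannot double-release resources.

```cpp
#include <cassert>
#include <mutex>

class ClosableDb {
 public:
  void Close() {
    std::lock_guard<std::mutex> guard(close_mutex_);
    if (closed_) return;  // a racing second caller becomes a no-op
    closed_ = true;
    ++release_count_;     // stands in for releasing FDs, threads, etc.
  }
  int release_count() const { return release_count_; }

 private:
  std::mutex close_mutex_;
  bool closed_ = false;
  int release_count_ = 0;
};
```

This also makes `Close()` idempotent, so a destructor that calls `Close()` as a backstop stays safe.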