1. 19 Mar 2022 (1 commit)
    • New backup meta schema, with file temperatures (#9660) · cff0d1e8
      Peter Dillinger committed
      Summary:
      The primary goal of this change is to add support for backing up and
      restoring (applying on restore) file temperature metadata, without
      committing to either the DB manifest or the FS reported "current"
      temperatures being exclusive "source of truth".
      
      To achieve this goal, we need to add temperature information to backup
      metadata, which requires updated backup meta schema. Fortunately I
      prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version
      6.19.0 for this kind of schema update. (Previously, backup meta schema
      was not extensible! Making this schema update public will allow some
      other "nice to have" features like taking backups with hard links, and
      avoiding crc32c checksum computation when another checksum is already
      available.) While schema version 2 is newly public, the default schema
      version is still 1. Until we change the default, users will need to set
      it to 2 to enable features like temperature data backup+restore.
      Versions since 6.19.0 but before this change will ignore new metadata
      such as temperature information, with a warning. The metadata is
      considered ignorable because a functioning DB can be restored without
      it.
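
      A minimal sketch of opting in, assuming the schema_version field on
      BackupEngineOptions that this change introduces (not code from the PR):

      ```
      #include "rocksdb/utilities/backup_engine.h"

      using namespace ROCKSDB_NAMESPACE;

      // Hedged sketch: open a BackupEngine that writes schema version 2
      // metadata, enabling backup+restore of file temperatures.
      IOStatus OpenBackupEngineV2(Env* env, const std::string& backup_dir,
                                  BackupEngine** backup_engine) {
        BackupEngineOptions options(backup_dir);
        options.schema_version = 2;  // assumed field name; default is 1
        return BackupEngine::Open(options, env, backup_engine);
      }
      ```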
      
      Some detail:
      * Some renaming because "future schema" is now just public schema 2.
      * Initialize some atomics in TestFs (linter reported)
      * Add temperature hint support to SstFileDumper (used by BackupEngine)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660
      
      Test Plan:
      Related unit tests substantially updated for the new functionality,
      including some shared testing support for tracking temperatures in an FS.
      
      Some other tests and testing hooks into production code also updated for
      making the backup meta schema change public.
      
      Reviewed By: ajkr
      
      Differential Revision: D34686968
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a
  2. 05 Mar 2022 (1 commit)
    • Test refactoring for Backups+Temperatures (#9655) · ce60d0cb
      Peter Dillinger committed
      Summary:
      In preparation for more support for file Temperatures in BackupEngine,
      this change does some test refactoring:
      * Move DBTest2::BackupFileTemperature test to
      BackupEngineTest::FileTemperatures, with some updates to make it work
      in the new home. This test will soon be expanded for deeper backup work.
      * Move FileTemperatureTestFS from db_test2.cc to db_test_util.h, to
      support sharing because of above moved test, but split off the "no link"
      part to the test needing it.
      * Use custom FileSystems in backupable_db_test rather than custom Envs,
      because going through Env file interfaces doesn't support temperatures.
      * Fix RemapFileSystem to map the DirFsyncOptions::renamed_new_name
      parameter to FsyncWithDirOptions. This was required because the omission
      caused a crash only after moving to the higher-fidelity FileSystem
      interface (vs. LegacyDirectoryWrapper, which throws away some parameter
      details).
      * `backupable_options_` -> `engine_options_` as part of the ongoing
      work to get rid of the obsolete "backupable" naming.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9655
      
      Test Plan: test code updates only
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D34622183
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f24b7a596a89b9e089e960f4e5d772575513e93f
  3. 29 Jan 2022 (1 commit)
  4. 05 Jan 2022 (1 commit)
  5. 11 Dec 2021 (1 commit)
    • More refactoring ahead of footer & meta changes (#9240) · 653c392e
      Peter Dillinger committed
      Summary:
      I'm working on a new format_version=6 to support context
      checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test
      updates to support that change.
      
      Test coverage data and manual inspection agree on dead code in
      block_based_table_reader.cc (removed).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240
      
      Test Plan:
      tests enhanced to cover more cases etc.
      
      Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible.
      
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true}
      
      (Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
      Before w/ wal: 454512
      After w/ wal: 444820 (-2.1%)
      Before w/o wal: 1004560
      After w/o wal: 998897 (-0.6%)
      
      Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case.
      
      This regression will be corrected in a follow-up PR.
      
      Reviewed By: ajkr
      
      Differential Revision: D32813769
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f
  6. 29 Oct 2021 (1 commit)
    • Implement XXH3 block checksum type (#9069) · a7d4bea4
      Peter Dillinger committed
      Summary:
      XXH3 is the latest xxHash function, extremely fast on large data and
      easily faster than crc32c on nearly any x86_64 hardware. In
      integrating this hash function, I have handled the compression type byte
      in a non-standard way to avoid using the streaming API (extra data
      movement and active code size because of hash function complexity). This
      approach got a thumbs-up from Yann Collet.
      
      Existing functionality change:
      * reject bad ChecksumType in options with InvalidArgument
      
      This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
      likely to be handled through different configuration than ChecksumType.
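
      A hedged sketch of selecting the new checksum type via
      BlockBasedTableOptions (kXXH3 is the enum value this change adds):

      ```
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      using namespace ROCKSDB_NAMESPACE;

      Options MakeXXH3Options() {
        BlockBasedTableOptions table_options;
        table_options.checksum = kXXH3;  // default remains kCRC32c
        Options options;
        options.table_factory.reset(NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```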
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069
      
      Test Plan:
      tests updated, and substantially expanded. Unit tests now check
      that we don't accidentally change the values generated by the checksum
      algorithms ("schema test") and that we properly handle
      invalid/unrecognized checksum types in options or in file footer.
      
      DBTestBase::ChangeOptions (etc.) updated from two to one configuration
      changing from default CRC32c ChecksumType. The point of this test code
      is to detect possible interactions among features, and the likelihood of
      some bad interaction being detected by including configurations other
      than XXH3 and CRC32c--and then not detected by stress/crash test--is
      extremely low.
      
      Stress/crash test also updated (manual run long enough to see it accepts
      new checksum type). db_bench also updated for microbenchmarking
      checksums.
      
      ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
      
      ./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
      crc32c       :       0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
      xxhash       :       0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
      xxhash64     :       0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
      xxh3         :       0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
      crc32c       :       0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
      xxhash       :       0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
      xxhash64     :       0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
      xxh3         :       0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
      crc32c       :       0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
      xxhash       :       0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
      xxhash64     :       0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
      xxh3         :       0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)
      
      As you can see, especially once warmed up, xxh3 is fastest.
      
      ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
      
      Test
      
          for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done
      
      Results (ops/sec)
      
          for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
      
      results-0 252118 # kNoChecksum
      results-1 251588 # kCRC32c
      results-2 251863 # kxxHash
      results-3 252016 # kxxHash64
      results-4 252038 # kXXH3
      
      Reviewed By: mrambacher
      
      Differential Revision: D31905249
      
      Pulled By: pdillinger
      
      fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
  7. 19 Oct 2021 (1 commit)
    • Experimental support for SST unique IDs (#8990) · ad5325a7
      Peter Dillinger committed
      Summary:
      * New public header unique_id.h and function GetUniqueIdFromTableProperties
      which computes a universally unique identifier based on table properties
      of table files from recent RocksDB versions.
      * Generation of DB session IDs is refactored so that they are
      guaranteed unique in the lifetime of a process running RocksDB.
      (SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
      this enables SST unique IDs to be guaranteed unique among SSTs generated
      in a single process, and "better than random" between processes.
      See https://github.com/pdillinger/unique_id
      * In addition to public API producing 'external' unique IDs, there is a function
      for producing 'internal' unique IDs, with functions for converting between the
      two. In short, the external ID is "safe" for things people might do with it, and
      the internal ID enables more "power user" features for the future. Specifically,
      the external ID goes through a hashing layer so that any subset of bits in the
      external ID can be used as a hash of the full ID, while also preserving
      uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
      and on full 192 bits).
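
      As a usage illustration of GetUniqueIdFromTableProperties above, a
      hedged sketch (GetPropertiesOfAllTables is the standard way to obtain
      TableProperties; UniqueIdToHumanString is assumed to render the ID):

      ```
      #include <cstdio>

      #include "rocksdb/db.h"
      #include "rocksdb/unique_id.h"

      using namespace ROCKSDB_NAMESPACE;

      // Print the external unique ID of every live SST file.
      void PrintSstUniqueIds(DB* db) {
        TablePropertiesCollection all_props;
        if (!db->GetPropertiesOfAllTables(&all_props).ok()) {
          return;
        }
        for (const auto& file_and_props : all_props) {
          std::string id;
          if (GetUniqueIdFromTableProperties(*file_and_props.second, &id).ok()) {
            // Render the raw ID bytes in a printable form
            std::printf("%s -> %s\n", file_and_props.first.c_str(),
                        UniqueIdToHumanString(id).c_str());
          }
        }
      }
      ```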
      
      Intended follow-up:
      * Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
      the third 64-bit value of the unique ID.)
      * Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990
      
      Test Plan:
      Unit tests added, and checking of unique ids in stress test.
      NOTE in stress test we do not generate nearly enough files to thoroughly
      stress uniqueness, but the test trims off pieces of the ID to check for
      uniqueness so that we can infer (with some assumptions) stronger
      properties in the aggregate.
      
      Reviewed By: zhichao-cao, mrambacher
      
      Differential Revision: D31582865
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243
  8. 29 Sep 2021 (1 commit)
    • Cleanup includes in dbformat.h (#8930) · 13ae16c3
      mrambacher committed
      Summary:
      This header file was including everything and the kitchen sink when it did not need to.  This resulted in many places including this header when they needed other pieces instead.
      
      Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.
      
      Hopefully, this sort of code hygiene cleanup will speed up the builds...
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930
      
      Reviewed By: pdillinger
      
      Differential Revision: D31142788
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
  9. 08 Sep 2021 (1 commit)
    • Make MemTableRepFactory into a Customizable class (#8419) · beed8647
      mrambacher committed
      Summary:
      This PR does the following:
      -> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
      -> Makes the existing implementations compatible with configurations
      -> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
      
      New tests were added to validate the functionality and all existing tests pass.  db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
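
      A hedged sketch of the new creation path (the shared_ptr overload and
      the "vector" id are assumptions based on the built-in registrations):

      ```
      #include "rocksdb/convenience.h"
      #include "rocksdb/memtablerep.h"
      #include "rocksdb/options.h"

      using namespace ROCKSDB_NAMESPACE;

      Status UseVectorMemtable(Options* options) {
        ConfigOptions config_options;
        std::shared_ptr<MemTableRepFactory> factory;
        // Built-in factories register by name, e.g. "skip_list" (the default)
        Status s = MemTableRepFactory::CreateFromString(config_options,
                                                        "vector", &factory);
        if (s.ok()) {
          options->memtable_factory = factory;
        }
        return s;
      }
      ```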
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D29558961
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
  10. 16 Aug 2021 (1 commit)
  11. 05 Aug 2021 (1 commit)
    • Do not attempt to rename non-existent info log (#8622) · a685a701
      Andrew Kryczka committed
      Summary:
      Previously we attempted to rename "LOG" to "LOG.old.*" without checking
      its existence first. "LOG" had no reason to exist in a new DB.
      
      Errors in renaming a non-existent "LOG" were swallowed via
      `PermitUncheckedError()` so things worked. However the storage service's
      error monitoring was detecting all these benign rename failures. So it
      is better to fix it. Also, with this PR we can now distinguish rename
      failures that have other causes and return them to the caller.
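
      The gist of the fix, as a hedged sketch (paths are placeholders):

      ```
      #include "rocksdb/env.h"

      using namespace ROCKSDB_NAMESPACE;

      Status RenameInfoLogIfExists(Env* env, const std::string& old_path,
                                   const std::string& new_path) {
        // Probe for the old info log first so a missing "LOG" in a fresh DB
        // is not reported as a rename failure; real failures still propagate.
        Status s = env->FileExists(old_path);
        if (s.ok()) {
          s = env->RenameFile(old_path, new_path);
        } else if (s.IsNotFound()) {
          s = Status::OK();  // nothing to rename
        }
        return s;
      }
      ```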
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8622
      
      Test Plan: new unit test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D30115189
      
      Pulled By: ajkr
      
      fbshipit-source-id: e2f337ffb2bd171be0203172abc8e16e7809b170
  12. 27 Jul 2021 (1 commit)
    • Make EventListener into a Customizable Class (#8473) · 3aee4fbd
      mrambacher committed
      Summary:
      - Added Type/CreateFromString
      - Added ability to load EventListeners to DBOptions
      - Since EventListeners did not previously have a Name(), it defaults to "".  If there is no name, the listener cannot be loaded from the ObjectRegistry. (A sketch of the new creation path follows.)
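
      A hedged sketch of loading a listener by name ("CustomListenerName"
      stands for a listener registered in the ObjectRegistry):

      ```
      #include "rocksdb/convenience.h"
      #include "rocksdb/listener.h"
      #include "rocksdb/options.h"

      using namespace ROCKSDB_NAMESPACE;

      Status AddListener(DBOptions* db_options) {
        ConfigOptions config_options;
        std::shared_ptr<EventListener> listener;
        Status s = EventListener::CreateFromString(
            config_options, "CustomListenerName", &listener);
        if (s.ok()) {
          db_options->listeners.push_back(listener);
        }
        return s;
      }
      ```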
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D29901488
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
  13. 18 Jun 2021 (1 commit)
    • Cache warming data blocks during flush (#8242) · 5ba1b6e5
      Akanksha Mahajan committed
      Summary:
      This PR prepopulates warm/hot data blocks that are already in memory
      into the block cache at the time of flush. On a flush, the data blocks
      that are in memory (in memtables) get flushed to the device. If using
      Direct IO, additional IO is incurred to read this data back into memory
      again, which is avoided by enabling the newly added option.

      Right now, this is enabled only for data blocks during flush. We plan
      to expand this option to cover compactions and other types of blocks in
      the future.
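
      A hedged sketch of enabling the option (the prepopulate_block_cache
      field and its enum are assumed names for the option this PR adds):

      ```
      #include "rocksdb/cache.h"
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      using namespace ROCKSDB_NAMESPACE;

      Options MakeCacheWarmingOptions() {
        BlockBasedTableOptions table_options;
        table_options.block_cache = NewLRUCache(512 * 1024 * 1024);
        // Populate the block cache with data blocks written by flush, avoiding
        // the read-back that Direct IO would otherwise incur on first access.
        table_options.prepopulate_block_cache =
            BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly;
        Options options;
        options.table_factory.reset(NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```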
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242
      
      Test Plan: Add new unit test
      
      Reviewed By: anand1976
      
      Differential Revision: D28521703
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05
  14. 20 May 2021 (1 commit)
    • Use deleters to label cache entries and collect stats (#8297) · 311a544c
      Peter Dillinger committed
      Summary:
      This change gathers and publishes statistics about the
      kinds of items in block cache. This is especially important for
      profiling relative usage of cache by index vs. filter vs. data blocks.
      It works by iterating over the cache during periodic stats dump
      (InternalStats, stats_dump_period_sec) or on demand when
      DB::Get(Map)Property(kBlockCacheEntryStats), except that for
      efficiency and sharing among column families, saved data from
      the last scan is used when the data is not considered too old.
      
      The new information can be seen in info LOG, for example:
      
          Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
          Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)
      
      And also through DB::GetProperty and GetMapProperty (here using
      ldb just for demonstration):
      
          $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
          rocksdb.block-cache-entry-stats.bytes.data-block: 0
          rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
          rocksdb.block-cache-entry-stats.bytes.filter-block: 0
          rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
          rocksdb.block-cache-entry-stats.bytes.index-block: 178992
          rocksdb.block-cache-entry-stats.bytes.misc: 0
          rocksdb.block-cache-entry-stats.bytes.other-block: 0
          rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
          rocksdb.block-cache-entry-stats.capacity: 8388608
          rocksdb.block-cache-entry-stats.count.data-block: 0
          rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
          rocksdb.block-cache-entry-stats.count.filter-block: 0
          rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
          rocksdb.block-cache-entry-stats.count.index-block: 215
          rocksdb.block-cache-entry-stats.count.misc: 1
          rocksdb.block-cache-entry-stats.count.other-block: 0
          rocksdb.block-cache-entry-stats.count.write-buffer: 0
          rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
          rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
          rocksdb.block-cache-entry-stats.percent.misc: 0.000000
          rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
          rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
          rocksdb.block-cache-entry-stats.secs_since_last_collection: 0
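
      The same data is available programmatically; a hedged sketch using
      GetMapProperty:

      ```
      #include <cstdio>
      #include <map>
      #include <string>

      #include "rocksdb/db.h"

      using namespace ROCKSDB_NAMESPACE;

      void DumpBlockCacheEntryStats(DB* db) {
        std::map<std::string, std::string> stats;
        if (db->GetMapProperty(DB::Properties::kBlockCacheEntryStats, &stats)) {
          for (const auto& name_and_value : stats) {
            std::printf("%s: %s\n", name_and_value.first.c_str(),
                        name_and_value.second.c_str());
          }
        }
      }
      ```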
      
      Solution detail - We need some way to flag what kind of blocks each
      entry belongs to, preferably without changing the Cache API.
      One of the complications is that Cache is a general interface that could
      have other users that don't adhere to whichever convention we decide
      on for keys and values. Or we would pay for an extra field in the Handle
      that would only be used for this purpose.
      
      This change uses a back-door approach, the deleter, to indicate the
      "role" of a Cache entry (in addition to the value type, implicitly).
      This has the added benefit of ensuring proper code origin whenever we
      recognize a particular role for a cache entry; if the entry came from
      some other part of the code, it will use an unrecognized deleter, which
      we simply attribute to the "Misc" role.
      
      An internal API makes for simple instantiation and automatic
      registration of Cache deleters for a given value type and "role".
      
      Another internal API, CacheEntryStatsCollector, solves the problem of
      caching the results of a scan and sharing them, to ensure scans are
      neither excessive nor redundant so as not to harm Cache performance.
      
      Because code is added to BlocklikeTraits, it is pulled out of
      block_based_table_reader.cc into its own file.
      
      This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
      (could still be added), and with actual stat gathering.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297
      
      Test Plan: manual testing with db_bench, and a couple of basic unit tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D28488721
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
  15. 14 May 2021 (1 commit)
    • Initial support for secondary cache in LRUCache (#8271) · feb06e83
      anand76 committed
      Summary:
      Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and is evicted from memory, it is inserted into the secondary cache. On an LRU cache miss and a successful lookup in the secondary cache, the item is promoted to the LRU cache. Only synchronous lookup is currently supported. The secondary cache could be used to implement a persistent (flash) cache or a compressed cache.
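
      A hedged sketch of wiring one in (my_secondary_cache stands for any
      SecondaryCache implementation):

      ```
      #include "rocksdb/cache.h"
      #include "rocksdb/secondary_cache.h"

      using namespace ROCKSDB_NAMESPACE;

      std::shared_ptr<Cache> MakeTieredCache(
          const std::shared_ptr<SecondaryCache>& my_secondary_cache) {
        LRUCacheOptions lru_opts;
        lru_opts.capacity = 1 << 30;  // 1 GiB primary (LRU) tier
        lru_opts.secondary_cache = my_secondary_cache;  // field added here
        return NewLRUCache(lru_opts);
      }
      ```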
      
      Tests:
      Results from cache_bench and db_bench don't show any regression due to these changes.
      
      cache_bench results before and after this change -
      Command
      ```./cache_bench -ops_per_thread=10000000 -threads=1```
      Before
      ```Complete in 40.688 s; QPS = 245774```
      ```Complete in 40.486 s; QPS = 246996```
      ```Complete in 42.019 s; QPS = 237989```
      After
      ```Complete in 40.672 s; QPS = 245869```
      ```Complete in 44.622 s; QPS = 224107```
      ```Complete in 42.445 s; QPS = 235599```
      
      db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 -
      Commands
      ```./db_bench  --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true```
      
      ```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300```
      Before
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      80.702 micros/op 198104 ops/sec;   54.4 MB/s (3708999 of 3708999 found)
      ```
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      87.124 micros/op 183625 ops/sec;   50.4 MB/s (3439999 of 3439999 found)
      ```
      After
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      77.653 micros/op 206025 ops/sec;   56.6 MB/s (3866999 of 3866999 found)
      ```
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      84.962 micros/op 188299 ops/sec;   51.7 MB/s (3535999 of 3535999 found)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D28357511
      
      Pulled By: anand1976
      
      fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2
  16. 12 May 2021 (1 commit)
    • New Cache API for gathering statistics (#8225) · 78a309bf
      Peter Dillinger committed
      Summary:
      Adds a new Cache::ApplyToAllEntries API that we expect to use
      (in follow-up PRs) for efficiently gathering block cache statistics.
      Notable features vs. old ApplyToAllCacheEntries:
      
      * Includes key and deleter (in addition to value and charge). We could
      have passed in a Handle but then more virtual function calls would be
      needed to get the "fields" of each entry. We expect to use the 'deleter'
      to identify the origin of entries, perhaps even more.
      * Heavily tuned to minimize latency impact on operating cache. It
      does this by iterating over small sections of each cache shard while
      cycling through the shards.
      * Supports tuning roughly how many entries to operate on for each
      lock acquire and release, to control the impact on the latency of other
      operations without excessive lock acquire & release. The right balance
      can depend on the cost of the callback. Good default seems to be
      around 256.
      * There should be no need to disable thread safety. (I would expect
      uncontended locks to be sufficiently fast.)
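
      A hedged sketch of calling the new API described above (the nested
      options struct and its default of ~256 entries per lock are assumed):

      ```
      #include "rocksdb/cache.h"

      using namespace ROCKSDB_NAMESPACE;

      // Sum cache usage by walking all entries; the deleter argument could
      // further classify entries by origin.
      size_t SumCharges(Cache* cache) {
        size_t total_charge = 0;
        cache->ApplyToAllEntries(
            [&total_charge](const Slice& /*key*/, void* /*value*/,
                            size_t charge, Cache::DeleterFn /*deleter*/) {
              total_charge += charge;
            },
            Cache::ApplyToAllEntriesOptions());
        return total_charge;
      }
      ```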
      
      I have enhanced cache_bench to validate this approach:
      
      * Reports a histogram of ns per operation, so we can look at the
      distribution of times, not just throughput (average).
      * Can add a thread for simulated "gather stats" which calls
      ApplyToAllEntries at a specified interval. We also generate a histogram
      of time to run ApplyToAllEntries.
      
      To make the iteration over some entries of each shard work as cleanly as
      possible, even with resize between next set of entries, I have
      re-arranged which hash bits are used for sharding and which for indexing
      within a shard.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225
      
      Test Plan:
      A couple of unit tests are added, but primary validation is manual, as
      the primary risk is to performance.
      
      The primary validation is using cache_bench to ensure that neither
      the minor hashing changes nor the simulated stats gathering
      significantly impact QPS or latency distribution. Note that adding op
      latency histogram seriously impacts the benchmark QPS, so for a
      fair baseline, we need the cache_bench changes (except remove simulated
      stat gathering to make it compile). In short, we don't see any
      reproducible difference in ops/sec or op latency unless we are gathering
      stats nearly continuously. Test uses 10GB block cache with
      8KB values to be somewhat realistic in the number of items to iterate
      over.
      
      Baseline typical output:
      
      ```
      Complete in 92.017 s; Rough parallel ops/sec = 869401
      Thread ops/sec = 54662
      
      Operation latency (ns):
      Count: 80000000 Average: 11223.9494  StdDev: 29.61
      Min: 0  Median: 7759.3973  Max: 9620500
      Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
      ------------------------------------------------------
      [       0,       1 ]       68   0.000%   0.000%
      (    2900,    4400 ]       89   0.000%   0.000%
      (    4400,    6600 ] 33630240  42.038%  42.038% ########
      (    6600,    9900 ] 18129842  22.662%  64.700% #####
      (    9900,   14000 ]  7877533   9.847%  74.547% ##
      (   14000,   22000 ] 15193238  18.992%  93.539% ####
      (   22000,   33000 ]  3037061   3.796%  97.335% #
      (   33000,   50000 ]  1626316   2.033%  99.368%
      (   50000,   75000 ]   421532   0.527%  99.895%
      (   75000,  110000 ]    56910   0.071%  99.966%
      (  110000,  170000 ]    16134   0.020%  99.986%
      (  170000,  250000 ]     5166   0.006%  99.993%
      (  250000,  380000 ]     3017   0.004%  99.996%
      (  380000,  570000 ]     1337   0.002%  99.998%
      (  570000,  860000 ]      805   0.001%  99.999%
      (  860000, 1200000 ]      319   0.000% 100.000%
      ( 1200000, 1900000 ]      231   0.000% 100.000%
      ( 1900000, 2900000 ]      100   0.000% 100.000%
      ( 2900000, 4300000 ]       39   0.000% 100.000%
      ( 4300000, 6500000 ]       16   0.000% 100.000%
      ( 6500000, 9800000 ]        7   0.000% 100.000%
      ```
      
      New, gather_stats=false. Median thread ops/sec of 5 runs:
      
      ```
      Complete in 92.030 s; Rough parallel ops/sec = 869285
      Thread ops/sec = 54458
      
      Operation latency (ns):
      Count: 80000000 Average: 11298.1027  StdDev: 42.18
      Min: 0  Median: 7722.0822  Max: 6398720
      Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
      ------------------------------------------------------
      [       0,       1 ]      109   0.000%   0.000%
      (    2900,    4400 ]      793   0.001%   0.001%
      (    4400,    6600 ] 34054563  42.568%  42.569% #########
      (    6600,    9900 ] 17482646  21.853%  64.423% ####
      (    9900,   14000 ]  7908180   9.885%  74.308% ##
      (   14000,   22000 ] 15032072  18.790%  93.098% ####
      (   22000,   33000 ]  3237834   4.047%  97.145% #
      (   33000,   50000 ]  1736882   2.171%  99.316%
      (   50000,   75000 ]   446851   0.559%  99.875%
      (   75000,  110000 ]    68251   0.085%  99.960%
      (  110000,  170000 ]    18592   0.023%  99.983%
      (  170000,  250000 ]     7200   0.009%  99.992%
      (  250000,  380000 ]     3334   0.004%  99.997%
      (  380000,  570000 ]     1393   0.002%  99.998%
      (  570000,  860000 ]      700   0.001%  99.999%
      (  860000, 1200000 ]      293   0.000% 100.000%
      ( 1200000, 1900000 ]      196   0.000% 100.000%
      ( 1900000, 2900000 ]       69   0.000% 100.000%
      ( 2900000, 4300000 ]       32   0.000% 100.000%
      ( 4300000, 6500000 ]       10   0.000% 100.000%
      ```
      
      New, gather_stats=true, 1 second delay between scans. Scans take about
      1 second here so it's spending about 50% time scanning. Still the effect on
      ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:
      
      ```
      Complete in 91.890 s; Rough parallel ops/sec = 870608
      Thread ops/sec = 54551
      
      Operation latency (ns):
      Count: 80000000 Average: 11311.2629  StdDev: 45.28
      Min: 0  Median: 7686.5458  Max: 10018340
      Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
      ------------------------------------------------------
      [       0,       1 ]       71   0.000%   0.000%
      (    2900,    4400 ]      291   0.000%   0.000%
      (    4400,    6600 ] 34492060  43.115%  43.116% #########
      (    6600,    9900 ] 16727328  20.909%  64.025% ####
      (    9900,   14000 ]  7845828   9.807%  73.832% ##
      (   14000,   22000 ] 15510654  19.388%  93.220% ####
      (   22000,   33000 ]  3216533   4.021%  97.241% #
      (   33000,   50000 ]  1680859   2.101%  99.342%
      (   50000,   75000 ]   439059   0.549%  99.891%
      (   75000,  110000 ]    60540   0.076%  99.967%
      (  110000,  170000 ]    14649   0.018%  99.985%
      (  170000,  250000 ]     5242   0.007%  99.991%
      (  250000,  380000 ]     3260   0.004%  99.995%
      (  380000,  570000 ]     1599   0.002%  99.997%
      (  570000,  860000 ]     1043   0.001%  99.999%
      (  860000, 1200000 ]      471   0.001%  99.999%
      ( 1200000, 1900000 ]      275   0.000% 100.000%
      ( 1900000, 2900000 ]      143   0.000% 100.000%
      ( 2900000, 4300000 ]       60   0.000% 100.000%
      ( 4300000, 6500000 ]       27   0.000% 100.000%
      ( 6500000, 9800000 ]        7   0.000% 100.000%
      ( 9800000, 14000000 ]        1   0.000% 100.000%
      
      Gather stats latency (us):
      Count: 46 Average: 980387.5870  StdDev: 60911.18
      Min: 879155  Median: 1033777.7778  Max: 1261431
      Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
      ------------------------------------------------------
      (  860000, 1200000 ]       45  97.826%  97.826% ####################
      ( 1200000, 1900000 ]        1   2.174% 100.000%
      
      Most recent cache entry stats:
      Number of entries: 1295133
      Total charge: 9.88 GB
      Average key size: 23.4982
      Average charge: 8.00 KB
      Unique deleters: 3
      ```
      
      Reviewed By: mrambacher
      
      Differential Revision: D28295742
      
      Pulled By: pdillinger
      
      fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
  17. 23 Apr 2021 (1 commit)
    • Fix the false positive alert of CF consistency check in WAL recovery (#8207) · 09a9ec3a
      Zhichao Cao committed
      Summary:
      Currently, when recovering information from the WAL, we do a consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, this reports a false positive "SST file is ahead of WALs" alert when one CF's current log number is greater than the corrupted WAL number (i.e., the CF contains data beyond the corrupted WAL) because a new column family was created during a flush. In this case, a new (empty) WAL is created during the flush, and for some reason (e.g., a storage issue, or a crash before SyncCloseLog is called) the old WAL is corrupted. The new CF has no data, so it does not have the consistency issue.
      
      Fix: when checking cfd->GetLogNumber() > corrupted_wal_number, also check cfd->GetLiveSstFilesSize() > 0, so that CFs with no SST file data skip the check.
      
      Note a potential inconsistency that this fix ignores: an empty CF can also result from write+delete, in which case no SST files are generated after flush, yet the CF still has its log in the WAL. When that WAL is corrupted, the DB might be inconsistent.
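
      The condition, as a hedged sketch of the recovery-time check (cfd and
      corrupted_wal_number are the names used in the description above):

      ```
      // Only CFs that actually own SST data can legitimately be
      // "ahead of WALs".
      if (cfd->GetLogNumber() > corrupted_wal_number &&
          cfd->GetLiveSstFilesSize() > 0) {
        // report the "SST file is ahead of WALs" corruption
      }
      ```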
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207
      
      Test Plan: added unit test, make crash_test
      
      Reviewed By: riversand963
      
      Differential Revision: D27898839
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e
  18. 20 Apr 2021 (2 commits)
    • Fix unittest no space issue (#8204) · a89740fb
      Jay Zhuang committed
      Summary:
      Unit tests report "no space" from time to time, which can be reproduced on a small-memory machine with SHM. It's caused by large WAL files generated during the test, which are preallocated but not truncated during close(). This adds the missing APIs to set preallocation.
      It also moves the ARM test to the nightly build, as the test runs for more than 1 hour.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8204
      
      Test Plan: test on small memory arm machine
      
      Reviewed By: mrambacher
      
      Differential Revision: D27873145
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: f797c429d6bc13cbcc673bc03fcc72adda55f506
    • Handle rename() failure in non-local FS (#8192) · a376c220
      Yanqin Jin committed
      Summary:
      In a distributed environment, a file `rename()` operation can succeed on server (remote)
      side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
      network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
      can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
      new MANIFEST. We currently always delete the new MANIFEST if an error occurs.
      
      This is problematic in distributed world. If the server-side successfully updates the CURRENT
      file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.
      
      As a fix, we can track the execution result of IO operations on the new MANIFEST.
      - If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
        MANIFEST. Therefore, it is safe to remove the new MANIFEST.
      - If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
        code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
        POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
        new MANIFEST.) Therefore, we keep the new MANIFEST.
          - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
          - If process reopens the db immediately after the failure, then the CURRENT file can point
            to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
            succeed and ignore the other.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192
      
      Test Plan: make check
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D27804648
      
      Pulled By: riversand963
      
      fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
  19. 08 Apr 2021 (1 commit)
    • Fix flush reason attribution (#8150) · 48cd7a3a
      Giuseppe Ottaviano committed
      Summary:
      Current flush reason attribution is misleading or incorrect (depending on what the original intention was):
      
      - Flush due to WAL reaching its maximum size is attributed to `kWriteBufferManager`
      - Flushes due to full write buffer and write buffer manager are not distinguishable, both are attributed to `kWriteBufferFull`
      
      This changes the first to a new flush reason `kWALFull`, and splits the second between `kWriteBufferManager` and `kWriteBufferFull`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8150
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D27569645
      
      Pulled By: ot
      
      fbshipit-source-id: 7e3c8ca186a6e71976e6b8e937297eebd4b769cc
  20. 20 Mar 2021 (1 commit)
  21. 18 Mar 2021 (1 commit)
    • Use SST file manager to track blob files as well (#8037) · 27d57a03
      Akanksha Mahajan committed
      Summary:
      Extend support to track blob files in the SST file manager.
      This PR notifies SstFileManager whenever a new blob file is created
      (via OnAddFile) and when an obsolete blob file is deleted (via
      OnDeleteFile), and schedules blob file deletions via ScheduleFileDeletion.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037
      
      Test Plan: Add new unit tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D26891237
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6
  22. 26 Jan 2021 (1 commit)
    • Add a SystemClock class to capture the time functions of an Env (#7858) · 12f11373
      mrambacher committed
      Summary:
      Introduces and uses a SystemClock class to RocksDB.  This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.
      
      Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
      
      There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
      
      Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
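
      A hedged sketch of the class in use (these members mirror the
      corresponding Env functions):

      ```
      #include "rocksdb/system_clock.h"

      using namespace ROCKSDB_NAMESPACE;

      uint64_t TimedSleep() {
        std::shared_ptr<SystemClock> clock = SystemClock::Default();
        uint64_t start_us = clock->NowMicros();
        clock->SleepForMicroseconds(1000);  // redirected from Env
        return clock->NowMicros() - start_us;
      }
      ```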
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
      
      Reviewed By: pdillinger
      
      Differential Revision: D26006406
      
      Pulled By: mrambacher
      
      fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
  23. 07 Jan 2021 (1 commit)
    • Add more tests to ASSERT_STATUS_CHECKED (3), API change (#7715) · 6e0f62f2
      Adam Retter committed
      Summary:
      Third batch of adding more tests to ASSERT_STATUS_CHECKED.
      
      * db_compaction_filter_test
      * db_compaction_test
      * db_dynamic_level_test
      * db_inplace_update_test
      * db_sst_test
      * db_tailing_iter_test
      * db_io_failure_test
      
      Also update GetApproximateSizes APIs to all return Status.
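
      A hedged sketch of the updated API (callers must now consume the
      returned Status):

      ```
      #include "rocksdb/db.h"

      using namespace ROCKSDB_NAMESPACE;

      Status ApproximateSize(DB* db, uint64_t* size) {
        Range range("a", "z");
        // GetApproximateSizes now returns Status instead of void.
        return db->GetApproximateSizes(db->DefaultColumnFamily(), &range, 1,
                                       size);
      }
      ```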
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7715
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D25806896
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262
  24. 28 Oct 2020 (1 commit)
    • Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566) · f35f7f27
      mrambacher committed
      Summary:
      This PR does a few things:
      
      1.  The MockFileSystem class was split out from the MockEnv.  This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one).  The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
      
      2.  Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set.  To accomplish this, a few things were needed:
      - The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
      - The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
      
      3.  Updated the test framework to have a ROCKSDB_GTEST_SKIP macro.  This can be used to flag tests that are skipped.  Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
      
      I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV,  both, and neither under both MacOS and RedHat.  A few tests were disabled/skipped for the MEM/ENCRYPTED cases.  The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem.  (I will also push a change to disable those tests soon).  There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
      
      Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
      
      Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale.  I do not know how to do that, so if someone could write that job, it would be appreciated :)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D24408980
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
  25. 30 Sep 2020 (1 commit)
  26. 15 Sep 2020 (1 commit)
  27. 26 Aug 2020 (1 commit)
    • Get() to fail with underlying failures in PartitionIndexReader::CacheDependencies() (#7297) · 722814e3
      sdong committed
      Summary:
      Right now all I/O failures under PartitionIndexReader::CacheDependencies() are swallowed. This doesn't impact correctness, but we've made a decision that any I/O error in the read path should now be returned to users for awareness. Return errors in those cases instead.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297
      
      Test Plan: Add a new unit test that injects errors in this code path and verifies that Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes were attempted but could not trigger other pread paths; not sure whether other failure cases are even possible. Will rely on the continuous stress test to validate it.
      
      Reviewed By: anand1976
      
      Differential Revision: D23257950
      
      fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03
  28. 20 Aug 2020 (1 commit)
    • Fix a timer_test deadlock (#7277) · 3e422ce0
      Jay Zhuang committed
      Summary:
      There's a potential deadlock caused by the MockTimeEnv time value getting to a large number, which causes TimedWait() to wait forever. The test misuses microseconds as seconds, making this more likely to happen.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7277
      
      Reviewed By: pdillinger
      
      Differential Revision: D23183873
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 6fc38ebd40b4125a99551204b271f91a27e70086
  29. 18 Aug 2020 (1 commit)
  30. 15 Aug 2020 (1 commit)
    • Introduce a global StatsDumpScheduler for stats dumping (#7223) · 69760b4d
      Jay Zhuang committed
      Summary:
      Have a global StatsDumpScheduler for all DB instances' stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there were two dedicated threads for every DB instance, one for DumpStats() and one for PersistStats(), which could create lots of threads if there are hundreds of DB instances.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223
      
      Reviewed By: riversand963
      
      Differential Revision: D23056737
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba
  31. 12 Aug 2020 (1 commit)
    • Fix+clean up handling of mock sleeps (#7101) · 6ac1d25f
      Peter Dillinger committed
      Summary:
      We have a number of tests hanging on MacOS and Windows due to
      mishandling of code for mock sleeps. In addition, the code was in
      terrible shape because the same variable (addon_time_) would sometimes
      refer to microseconds and sometimes to seconds. One test even assumed it
      was nanoseconds but was written to pass anyway.
      
      This has been cleaned up so that DB tests generally use a SpecialEnv
      function to mock sleep, for either some number of microseconds or seconds
      depending on the function called. But to call one of these, the test must first
      call SetMockSleep (precondition enforced with assertion), which also turns
      sleeps in RocksDB into mock sleeps. To also remove accounting for actual
      clock time, call SetTimeElapseOnlySleepOnReopen, which implies
      SetMockSleep (on DB re-open). This latter setting only works by applying
      on DB re-open, otherwise havoc can ensue if Env goes back in time with
      DB open.
      
      More specifics:
      
      Removed some unused test classes, and updated comments on the general
      problem.
      
      Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead
      of mock time. For this we have the only modification to production code,
      inserting a sync point callback in flush_job.cc, which is not a change to
      production behavior.
      
      Removed unnecessary resetting of mock times to 0 in many tests. RocksDB
      deals in relative time. Any behaviors relying on absolute date/time are likely
      a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one
      clearly injecting a specific absolute time for actual testing convenience.) Just
      in case I misunderstood some test, I put this note in each replacement:
      // NOTE: Presumed unnecessary and removed: resetting mock time in env
      
      Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and
      FilterCompactionTimeTest in db_test.cc
      
      stats_history_test and blob_db_test are each their own beast, rather deeply
      dependent on MockTimeEnv. Each gets its own variant of a work-around for
      TimedWait in a mock time environment. (Reduces redundancy and
      inconsistency in stats_history_test.)
      
      Intended follow-up:
      
      Remove TimedWait from the public API of InstrumentedCondVar, and only
      make that accessible through Env by passing in an InstrumentedCondVar and
      a deadline. Then the Env implementations mocking time can fix this problem
      without using sync points. (Test infrastructure using sync points interferes
      with individual tests' control over sync points.)
      
      With that change, we can simplify/consolidate the scattered work-arounds.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101
      
      Test Plan: make check on Linux and MacOS
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D23032815
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b
  32. 10 Jul 2020 (1 commit)
    • More Makefile Cleanup (#7097) · c7c7b07f
      mrambacher committed
      Summary:
      Cleans up some of the dependencies on test code in the Makefile while building tools:
      - Moves the test::RandomString, DBBaseTest::RandomString into Random
      - Moves the test::RandomHumanReadableString into Random
      - Moves the DestroyDir method into file_utils
      - Moves the SetupSyncPointsToMockDirectIO into sync_point.
      - Moves the FaultInjection Env and FS classes under env
      
      These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies.  By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated.
      
      Tested both release and debug builds via Make and CMake for both static and shared libraries.
      
      More work remains to clean up how the tools are built and remove some unnecessary dependencies.  There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097
      
      Reviewed By: riversand963
      
      Differential Revision: D22463160
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2
  33. 03 Jul 2020 (2 commits)
  34. 02 Jul 2020 (1 commit)
  35. 30 Jun 2020 (1 commit)
    • Disable fsync in some tests to speed them up (#7036) · 58547e53
      sdong committed
      Summary:
      Fsyncing files does not provide more test coverage in many tests. Provide an option in SpecialEnv to turn it off to speed tests up, and enable this option in some tests with relatively long run times.
      Most of those tests could also be split into parameterized gtests. These two speed-up approaches are orthogonal and we can do both if needed.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7036
      
      Test Plan: Run all tests and make sure they pass.
      
      Reviewed By: ltamasi
      
      Differential Revision: D22268084
      
      fbshipit-source-id: 6d4a838a1b7328c13931a2a5d93de57aa02afaab
  36. 06 May 2020 (1 commit)
  37. 28 Apr 2020 (1 commit)
    • Stats for redundant insertions into block cache (#6681) · 249eff0f
      Peter Dillinger committed
      Summary:
      Since read threads do not coordinate on loading data into block
      cache, two threads between Lookup and Insert can end up loading and
      inserting the same data. This is particularly concerning with
      cache_index_and_filter_blocks since those are hot and more likely to
      be race targets if ejected from (or not pre-populated in) the cache.
      
      Particularly with moves toward disaggregated / network storage, the cost
      of redundant retrieval might be high, and we should at least have some
      hard statistics from which we can estimate impact.
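
      The new tickers can also be read programmatically; a hedged sketch:

      ```
      #include <cstdio>

      #include "rocksdb/options.h"
      #include "rocksdb/statistics.h"

      using namespace ROCKSDB_NAMESPACE;

      // After running a workload with options.statistics set:
      void ReportRedundantAdds(const Options& options) {
        uint64_t redundant =
            options.statistics->getTickerCount(BLOCK_CACHE_ADD_REDUNDANT);
        uint64_t total = options.statistics->getTickerCount(BLOCK_CACHE_ADD);
        std::printf("redundant block cache adds: %llu of %llu\n",
                    (unsigned long long)redundant, (unsigned long long)total);
      }
      ```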
      
      Example with full filter thrashing "cliff":
      
          $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10
          ...
          $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
          rocksdb.block.cache.add COUNT : 14181
          rocksdb.block.cache.add.failures COUNT : 0
          rocksdb.block.cache.add.redundant COUNT : 476
          rocksdb.block.cache.data.add COUNT : 12749
          rocksdb.block.cache.data.add.redundant COUNT : 18
          rocksdb.block.cache.filter.add COUNT : 1003
          rocksdb.block.cache.filter.add.redundant COUNT : 217
          rocksdb.block.cache.index.add COUNT : 429
          rocksdb.block.cache.index.add.redundant COUNT : 241
          $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
          rocksdb.block.cache.add COUNT : 1182223
          rocksdb.block.cache.add.failures COUNT : 0
          rocksdb.block.cache.add.redundant COUNT : 302728
          rocksdb.block.cache.data.add COUNT : 31425
          rocksdb.block.cache.data.add.redundant COUNT : 12
          rocksdb.block.cache.filter.add COUNT : 795455
          rocksdb.block.cache.filter.add.redundant COUNT : 130238
          rocksdb.block.cache.index.add COUNT : 355343
          rocksdb.block.cache.index.add.redundant COUNT : 172478
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681
      
      Test Plan: Some manual testing (above) and unit test covering key metrics is included
      
      Reviewed By: ltamasi
      
      Differential Revision: D21134113
      
      Pulled By: pdillinger
      
      fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87
  38. 21 Feb 2020 (1 commit)
    • Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      sdong committed
      Summary:
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To give users a tool to solve the problem, the RocksDB namespace is changed to a macro that can be overridden at build time.
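
      A hedged sketch of the mechanism (the macro defaults to rocksdb in
      rocksdb_namespace.h and is consumed wherever the namespace is spelled):

      ```
      // Essence of rocksdb/rocksdb_namespace.h:
      #ifndef ROCKSDB_NAMESPACE
      #define ROCKSDB_NAMESPACE rocksdb
      #endif

      // User code spells the namespace via the macro, so a build passing
      // -DROCKSDB_NAMESPACE=myrocks still compiles unchanged:
      #include "rocksdb/db.h"
      ROCKSDB_NAMESPACE::Options options;
      ```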
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Build release, all and jtest. Try building with ROCKSDB_NAMESPACE overridden to another value.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e