1. 20 May 2022 (1 commit)
    • Multi file concurrency in MultiGet using coroutines and async IO (#9968) · 57997dda
      Committed by anand76
      Summary:
      This PR implements a coroutine version of batched MultiGet in order to concurrently read from multiple SST files in a level using async IO, thus reducing the latency of the MultiGet. The API from the user perspective is still synchronous and single threaded, with the RocksDB part of the processing happening in the context of the caller's thread. In Version::MultiGet, the decision is made whether to call synchronous or coroutine code.
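
      A minimal usage sketch of the batched MultiGet path described above, with async IO opted in via ReadOptions (availability of that flag depends on the RocksDB version/build, so treat it as an assumption here):

      ```
      // Sketch only: batched MultiGet with async_io enabled where supported.
      #include <rocksdb/db.h>

      #include <vector>

      void BatchedMultiGet(rocksdb::DB* db,
                           const std::vector<rocksdb::Slice>& keys) {
        rocksdb::ReadOptions read_options;
        read_options.async_io = true;  // assumed flag; concurrent SST reads per level

        std::vector<rocksdb::PinnableSlice> values(keys.size());
        std::vector<rocksdb::Status> statuses(keys.size());

        // Still synchronous from the caller's point of view; any coroutine /
        // async IO processing happens inside RocksDB on the caller's thread.
        db->MultiGet(read_options, db->DefaultColumnFamily(), keys.size(),
                     keys.data(), values.data(), statuses.data());

        for (size_t i = 0; i < keys.size(); ++i) {
          if (statuses[i].ok()) {
            // consume values[i] ...
          }
        }
      }
      ```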
      
      A good way to review this PR is to review the first 4 commits in order - de773b3, 70c2f70, 10b50e1, and 377a597 - before reviewing the rest.
      
      TODO:
      1. Figure out how to build it in CircleCI (requires some dependencies to be installed)
      2. Do some stress testing with coroutines enabled
      
      No regression in synchronous MultiGet between this branch and main -
      ```
      ./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=64 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=true -threads=16 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -statistics
      ```
      Branch - ```multireadrandom :       4.025 micros/op 3975111 ops/sec 60.001 seconds 238509056 operations; 2062.3 MB/s (14767808 of 14767808 found)```
      
      Main - ```multireadrandom :       3.987 micros/op 4013216 ops/sec 60.001 seconds 240795392 operations; 2082.1 MB/s (15231040 of 15231040 found)```
      
      More benchmarks in various scenarios are given below. The measurements were taken with ```async_io=false``` (no coroutines) and ```async_io=true``` (use coroutines). For an IO bound workload (with every key requiring an IO), the coroutines version shows a clear benefit, being ~2.6X faster. For CPU bound workloads, the coroutines version has ~6-15% higher CPU utilization, depending on how many keys overlap an SST file.
      
      1. Single thread IO bound workload on remote storage with sparse MultiGet batch keys (~1 key overlap/file) -
      No coroutines - ```multireadrandom :     831.774 micros/op 1202 ops/sec 60.001 seconds 72136 operations;    0.6 MB/s (72136 of 72136 found)```
      Using coroutines - ```multireadrandom :     318.742 micros/op 3137 ops/sec 60.003 seconds 188248 operations;    1.6 MB/s (188248 of 188248 found)```
      
      2. Single thread CPU bound workload (all data cached) with ~1 key overlap/file -
      No coroutines - ```multireadrandom :       4.127 micros/op 242322 ops/sec 60.000 seconds 14539384 operations;  125.7 MB/s (14539384 of 14539384 found)```
      Using coroutines - ```multireadrandom :       4.741 micros/op 210935 ops/sec 60.000 seconds 12656176 operations;  109.4 MB/s (12656176 of 12656176 found)```
      
      3. Single thread CPU bound workload with ~2 key overlap/file -
      No coroutines - ```multireadrandom :       3.717 micros/op 269000 ops/sec 60.000 seconds 16140024 operations;  139.6 MB/s (16140024 of 16140024 found)```
      Using coroutines - ```multireadrandom :       4.146 micros/op 241204 ops/sec 60.000 seconds 14472296 operations;  125.1 MB/s (14472296 of 14472296 found)```
      
      4. CPU bound multi-threaded (16 threads) with ~4 key overlap/file -
      No coroutines - ```multireadrandom :       4.534 micros/op 3528792 ops/sec 60.000 seconds 211728728 operations; 1830.7 MB/s (12737024 of 12737024 found) ```
      Using coroutines - ```multireadrandom :       4.872 micros/op 3283812 ops/sec 60.000 seconds 197030096 operations; 1703.6 MB/s (12548032 of 12548032 found) ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9968
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D36348563
      
      Pulled By: anand1976
      
      fbshipit-source-id: c0ce85a505fd26ebfbb09786cbd7f25202038696
  2. 18 May 2022 (1 commit)
    • Rewrite memory-charging feature's option API (#9926) · 3573558e
      Committed by Hui Xiao
      Summary:
      **Context:**
      Previous PR https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, https://github.com/facebook/rocksdb/pull/8428 added separate flag for each charged memory area. Such API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache usage related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`.
      
      Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags.
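
      A rough configuration sketch of what this looks like for one charged area; the `CacheUsageOptions` / `CacheEntryRoleOptions` / `options_overrides` names follow this summary and should be verified against include/rocksdb/table.h for the release in use:

      ```
      // Sketch only: names assumed from this summary; check rocksdb/table.h.
      #include <rocksdb/cache.h>
      #include <rocksdb/table.h>

      rocksdb::BlockBasedTableOptions ChargeDictBuildingBuffer(
          std::shared_ptr<rocksdb::Cache> block_cache) {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.block_cache = block_cache;

        // Charge compression dictionary building buffers to the block cache;
        // other CacheEntryRoles keep their defaults.
        rocksdb::CacheEntryRoleOptions charged;
        charged.charged = rocksdb::CacheEntryRoleOptions::Decision::kEnabled;
        table_options.cache_usage_options.options_overrides.insert(
            {rocksdb::CacheEntryRole::kCompressionDictionaryBuildingBuffer,
             charged});
        return table_options;
      }
      ```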
      
      **Summary:**
      - Replaced old API references with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out, and added a unit test for that
      - Added missing db bench/stress test for some memory charging features
      - Renamed related test suite to indicate they are under the same theme of memory charging
      - Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication
      - Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" as the standard description of this feature.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926
      
      Test Plan:
      - New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)`
      - New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)`
      - CI
      - db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR  -charge_compression_dictionary_building_buffer=1(remove this for comparison)  -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
      
      #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) avg micros/op | std micros/op | change (%)
      -- | -- | -- | -- | -- | --
      10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
      20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465**
      40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078**
      
      - db_stress: `python3 tools/db_crashtest.py blackbox  -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal
      
      Reviewed By: ajkr
      
      Differential Revision: D36054712
      
      Pulled By: hx235
      
      fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af
  3. 27 April 2022 (1 commit)
    • Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) · 9d0cae71
      Committed by Peter Dillinger
      Summary:
      When MultiGet() determines that multiple query keys can be
      served by examining the same data block in block cache (one Lookup()),
      each PinnableSlice referring to data in that data block needs to hold
      on to the block in cache so that they can be released at arbitrary
      times by the API user. Historically this is accomplished with extra
      calls to Ref() on the Handle from Lookup(), with each PinnableSlice
      cleanup calling Release() on the Handle, but this creates extra
      contention on the block cache for the extra Ref()s and Release()es,
      especially because they hit the same cache shard repeatedly.
      
      In the case of merge operands (possibly more cases?), the problem was
      compounded by doing an extra Ref()+eventual Release() for each merge
      operand for a key reusing a block (which could be the same key!), rather
      than one Ref() per key. (Note: the non-shared case with `biter` was
      already one per key.)
      
      This change optimizes MultiGet not to rely on these extra, contentious
      Ref()+Release() calls by instead, in the shared block case, wrapping
      the cache Release() cleanup in a refcounted object referenced by the
      PinnableSlices, such that after the last wrapped reference is released,
      the cache entry is Release()ed. Relaxed atomic refcounts should be
      much faster than mutex-guarded Ref() and Release(), and much less prone
      to a performance cliff when MultiGet() does a lot of block sharing.
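
      A simplified, generic sketch of that refcounting pattern (illustrative only, not the actual SharedCleanablePtr implementation): one refcounted object owns the single cache Release(), and each PinnableSlice only bumps the count.

      ```
      #include <atomic>
      #include <cstdint>
      #include <functional>
      #include <utility>

      // One of these per shared block; every PinnableSlice registers a cleanup
      // that calls Unref(), so cache->Release(handle) runs exactly once, after
      // the last user is done.
      class SharedCleanup {
       public:
        explicit SharedCleanup(std::function<void()> release_fn)
            : refs_(1), release_fn_(std::move(release_fn)) {}
        void Ref() { refs_.fetch_add(1, std::memory_order_relaxed); }
        void Unref() {
          // Relaxed/acq-rel atomics are much cheaper than taking the cache
          // shard mutex for every extra Ref()/Release() on the same handle.
          if (refs_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            release_fn_();  // e.g. block_cache->Release(handle)
            delete this;
          }
        }

       private:
        std::atomic<uint32_t> refs_;
        std::function<void()> release_fn_;
      };
      ```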
      
      Note that I did not use std::shared_ptr, because that would require an
      extra indirection object (shared_ptr itself new/delete) in order to
      associate a ref increment/decrement with a Cleanable cleanup entry. (If
      I assumed it was the size of two pointers, I could do some hackery to
      make it work without the extra indirection, but that's too fragile.)
      
      Some details:
      * Fixed (removed) extra block cache tracing entries in cases of cache
      entry reuse in MultiGet, but it's likely that in some other cases traces
      are missing (XXX comment inserted)
      * Moved existing implementations for cleanable.h from iterator.cc to
      new cleanable.cc
      * Improved API comments on Cleanable
      * Added a public SharedCleanablePtr class to cleanable.h in case others
      could benefit from the same pattern (potentially many Cleanables and/or
      smart pointers referencing a shared Cleanable)
      * Add a typedef for MultiGetContext::Mask
      * Some variable renaming for clarity
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9899
      
      Test Plan:
      Added unit tests for SharedCleanablePtr.
      
      Greatly enhanced ability of existing tests to detect cache use-after-free.
      * Release PinnableSlices from MultiGet as they are read rather than in
      bulk (in db_test_util wrapper).
      * In ASAN build, default to using a trivially small LRUCache for block_cache
      so that entries are immediately erased when unreferenced. (Updated two
      tests that depend on caching.) New ASAN testsuite running time seems
      OK to me.
      
      If I introduce a bug into my implementation where we skip the shared
      cleanups on block reuse, ASAN detects the bug in
      `db_basic_test *MultiGet*`. If I remove either of the above testing
      enhancements, the bug is not detected.
      
      Consider for follow-up work: manipulate or randomize ordering of
      PinnableSlice use and release from MultiGet db_test_util wrapper. But in
      typical cases, natural ordering gives pretty good functional coverage.
      
      Performance test:
      In the extreme (but possible) case of MultiGetting the same or adjacent keys
      in a batch, throughput can improve by an order of magnitude.
      `./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb -readonly -num=5 -duration=10 -threads=20 -multiread_batched -batch_size=200`
      Before ops/sec, num=5: 1,384,394
      Before ops/sec, num=500: 6,423,720
      After ops/sec, num=500: 10,658,794
      After ops/sec, num=5: 16,027,257
      
      Also note that previously, with high parallelism, having query keys
      concentrated in a single block was worse than spreading them out a bit. Now
      concentrated in a single block is faster than spread out, which is hopefully
      consistent with natural expectation.
      
      Random query performance: with num=1000000, over 999 x 10s runs running before & after simultaneously (each -threads=12):
      Before: multireadrandom [AVG    999 runs] : 1088699 (± 7344) ops/sec;  120.4 (± 0.8 ) MB/sec
      After: multireadrandom [AVG    999 runs] : 1090402 (± 7230) ops/sec;  120.6 (± 0.8 ) MB/sec
      Possibly better, possibly in the noise.
      
      Reviewed By: anand1976
      
      Differential Revision: D35907003
      
      Pulled By: pdillinger
      
      fbshipit-source-id: bbd244d703649a8ca12d476f2d03853ed9d1a17e
  4. 23 March 2022 (1 commit)
  5. 19 March 2022 (1 commit)
    • New backup meta schema, with file temperatures (#9660) · cff0d1e8
      Committed by Peter Dillinger
      Summary:
      The primary goal of this change is to add support for backing up and
      restoring (applying on restore) file temperature metadata, without
      committing to either the DB manifest or the FS reported "current"
      temperatures being exclusive "source of truth".
      
      To achieve this goal, we need to add temperature information to backup
      metadata, which requires updated backup meta schema. Fortunately I
      prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version
      6.19.0 for this kind of schema update. (Previously, backup meta schema
      was not extensible! Making this schema update public will allow some
      other "nice to have" features like taking backups with hard links, and
      avoiding crc32c checksum computation when another checksum is already
      available.) While schema version 2 is newly public, the default schema
      version is still 1. Until we change the default, users will need to set it
      to 2 to enable features like temperature data backup+restore. New
      metadata like temperature information will be ignored with a warning
      in versions before this change and since 6.19.0. The metadata is
      considered ignorable because a functioning DB can be restored without
      it.
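
      A minimal sketch of opting in, assuming the option lives on BackupEngineOptions as `schema_version` (verify against include/rocksdb/utilities/backup_engine.h for the release in use):

      ```
      #include <string>

      #include <rocksdb/env.h>
      #include <rocksdb/utilities/backup_engine.h>

      rocksdb::BackupEngine* OpenBackupEngineWithSchema2(
          rocksdb::Env* env, const std::string& backup_dir) {
        rocksdb::BackupEngineOptions options(backup_dir);
        options.schema_version = 2;  // assumed field; default remains 1
        rocksdb::BackupEngine* backup_engine = nullptr;
        auto s = rocksdb::BackupEngine::Open(options, env, &backup_engine);
        return s.ok() ? backup_engine : nullptr;
      }
      ```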
      
      Some detail:
      * Some renaming because "future schema" is now just public schema 2.
      * Initialize some atomics in TestFs (linter reported)
      * Add temperature hint support to SstFileDumper (used by BackupEngine)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660
      
      Test Plan:
      related unit test majorly updated for the new functionality,
      including some shared testing support for tracking temperatures in a FS.
      
      Some other tests and testing hooks into production code also updated for
      making the backup meta schema change public.
      
      Reviewed By: ajkr
      
      Differential Revision: D34686968
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a
  6. 05 March 2022 (1 commit)
    • Test refactoring for Backups+Temperatures (#9655) · ce60d0cb
      Committed by Peter Dillinger
      Summary:
      In preparation for more support for file Temperatures in BackupEngine,
      this change does some test refactoring:
      * Move DBTest2::BackupFileTemperature test to
      BackupEngineTest::FileTemperatures, with some updates to make it work
      in the new home. This test will soon be expanded for deeper backup work.
      * Move FileTemperatureTestFS from db_test2.cc to db_test_util.h, to
      support sharing because of above moved test, but split off the "no link"
      part to the test needing it.
      * Use custom FileSystems in backupable_db_test rather than custom Envs,
      because going through Env file interfaces doesn't support temperatures.
      * Fix RemapFileSystem to map DirFsyncOptions::renamed_new_name
      parameter to FsyncWithDirOptions, which was required because this
      limitation caused a crash only after moving to higher fidelity of
      FileSystem interface (vs. LegacyDirectoryWrapper throwing away some
      parameter details)
      * `backupable_options_` -> `engine_options_` as part of the ongoing
      work to get rid of the obsolete "backupable" naming.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9655
      
      Test Plan: test code updates only
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D34622183
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f24b7a596a89b9e089e960f4e5d772575513e93f
  7. 29 January 2022 (1 commit)
  8. 05 January 2022 (1 commit)
  9. 11 December 2021 (1 commit)
    • More refactoring ahead of footer & meta changes (#9240) · 653c392e
      Committed by Peter Dillinger
      Summary:
      I'm working on a new format_version=6 to support context
      checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test
      updates to support that change.
      
      Test coverage data and manual inspection agree on dead code in
      block_based_table_reader.cc (removed).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240
      
      Test Plan:
      tests enhanced to cover more cases etc.
      
      Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible.
      
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true}
      
      (Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
      Before w/ wal: 454512
      After w/ wal: 444820 (-2.1%)
      Before w/o wal: 1004560
      After w/o wal: 998897 (-0.6%)
      
      Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case.
      
      This regression will be corrected in a follow-up PR.
      
      Reviewed By: ajkr
      
      Differential Revision: D32813769
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f
  10. 29 October 2021 (1 commit)
    • Implement XXH3 block checksum type (#9069) · a7d4bea4
      Committed by Peter Dillinger
      Summary:
      XXH3 - latest hash function that is extremely fast on large
      data, easily faster than crc32c on most any x86_64 hardware. In
      integrating this hash function, I have handled the compression type byte
      in a non-standard way to avoid using the streaming API (extra data
      movement and active code size because of hash function complexity). This
      approach got a thumbs-up from Yann Collet.
      
      Existing functionality change:
      * reject bad ChecksumType in options with InvalidArgument
      
      This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
      likely to be handled through different configuration than ChecksumType.
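
      Selecting the new checksum type is a one-line table option; as noted above, a bad ChecksumType in options is now rejected with InvalidArgument. A minimal sketch:

      ```
      #include <rocksdb/options.h>
      #include <rocksdb/table.h>

      rocksdb::Options MakeXXH3Options() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.checksum = rocksdb::kXXH3;  // applies to newly written blocks

        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```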
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069
      
      Test Plan:
      tests updated, and substantially expanded. Unit tests now check
      that we don't accidentally change the values generated by the checksum
      algorithms ("schema test") and that we properly handle
      invalid/unrecognized checksum types in options or in file footer.
      
      DBTestBase::ChangeOptions (etc.) updated from two to one configuration
      changing from default CRC32c ChecksumType. The point of this test code
      is to detect possible interactions among features, and the likelihood of
      some bad interaction being detected by including configurations other
      than XXH3 and CRC32c--and then not detected by stress/crash test--is
      extremely low.
      
      Stress/crash test also updated (manual run long enough to see it accepts
      new checksum type). db_bench also updated for microbenchmarking
      checksums.
      
       ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
      
      ./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
      crc32c       :       0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
      xxhash       :       0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
      xxhash64     :       0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
      xxh3         :       0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
      crc32c       :       0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
      xxhash       :       0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
      xxhash64     :       0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
      xxh3         :       0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
      crc32c       :       0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
      xxhash       :       0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
      xxhash64     :       0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
      xxh3         :       0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)
      
      As you can see, especially once warmed up, xxh3 is fastest.
      
       ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
      
      Test
      
          for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done
      
      Results (ops/sec)
      
          for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
      
      results-0 252118 # kNoChecksum
      results-1 251588 # kCRC32c
      results-2 251863 # kxxHash
      results-3 252016 # kxxHash64
      results-4 252038 # kXXH3
      
      Reviewed By: mrambacher
      
      Differential Revision: D31905249
      
      Pulled By: pdillinger
      
      fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
  11. 19 October 2021 (1 commit)
    • Experimental support for SST unique IDs (#8990) · ad5325a7
      Committed by Peter Dillinger
      Summary:
      * New public header unique_id.h and function GetUniqueIdFromTableProperties
      which computes a universally unique identifier based on table properties
      of table files from recent RocksDB versions.
      * Generation of DB session IDs is refactored so that they are
      guaranteed unique in the lifetime of a process running RocksDB.
      (SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
      this enables SST unique IDs to be guaranteed unique among SSTs generated
      in a single process, and "better than random" between processes.
      See https://github.com/pdillinger/unique_id
      * In addition to public API producing 'external' unique IDs, there is a function
      for producing 'internal' unique IDs, with functions for converting between the
      two. In short, the external ID is "safe" for things people might do with it, and
      the internal ID enables more "power user" features for the future. Specifically,
      the external ID goes through a hashing layer so that any subset of bits in the
      external ID can be used as a hash of the full ID, while also preserving
      uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
      and on full 192 bits).
      
      Intended follow-up:
      * Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
      the third 64-bit value of the unique ID.)
      * Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)
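
      A minimal usage sketch for the new GetUniqueIdFromTableProperties API described in the summary above (header path and exact signature assumed from this summary; see include/rocksdb/unique_id.h):

      ```
      // Sketch only: signature assumed from this summary.
      #include <cstdio>
      #include <string>

      #include <rocksdb/db.h>
      #include <rocksdb/table_properties.h>
      #include <rocksdb/unique_id.h>

      void PrintSstUniqueIds(rocksdb::DB* db) {
        rocksdb::TablePropertiesCollection all_props;
        if (!db->GetPropertiesOfAllTables(&all_props).ok()) {
          return;
        }
        for (const auto& file_and_props : all_props) {
          std::string unique_id;
          rocksdb::Status s = rocksdb::GetUniqueIdFromTableProperties(
              *file_and_props.second, &unique_id);
          if (s.ok()) {
            // The ID is raw bytes; hex-encode it for display.
            std::printf("%s -> %s\n", file_and_props.first.c_str(),
                        rocksdb::Slice(unique_id).ToString(/*hex=*/true).c_str());
          }
        }
      }
      ```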
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990
      
      Test Plan:
      Unit tests added, and checking of unique ids in stress test.
      NOTE in stress test we do not generate nearly enough files to thoroughly
      stress uniqueness, but the test trims off pieces of the ID to check for
      uniqueness so that we can infer (with some assumptions) stronger
      properties in the aggregate.
      
      Reviewed By: zhichao-cao, mrambacher
      
      Differential Revision: D31582865
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243
  12. 29 September 2021 (1 commit)
    • Cleanup includes in dbformat.h (#8930) · 13ae16c3
      Committed by mrambacher
      Summary:
      This header file was including everything and the kitchen sink when it did not need to.  This resulted in many places including this header when they needed other pieces instead.
      
      Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.
      
      Hopefully, this sort of code hygiene cleanup will speed up the builds...
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930
      
      Reviewed By: pdillinger
      
      Differential Revision: D31142788
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
  13. 08 September 2021 (1 commit)
    • Make MemTableRepFactory into a Customizable class (#8419) · beed8647
      Committed by mrambacher
      Summary:
      This PR does the following:
      -> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
      -> Makes the existing implementations compatible with configurations
      -> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
      
      New tests were added to validate the functionality and all existing tests pass.  db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
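
      For reference, the factory can still be constructed directly and assigned to the options as before; with this change the same factory is also creatable from a string ID through the Customizable/ObjectRegistry machinery (the exact CreateFromString helper signature varies by version, so only the direct form is sketched):

      ```
      #include <rocksdb/memtablerep.h>
      #include <rocksdb/options.h>

      rocksdb::Options UseSkipListMemtable() {
        rocksdb::Options options;
        // Equivalent to the default; shown only to illustrate where the
        // MemTableRepFactory plugs into the options.
        options.memtable_factory.reset(new rocksdb::SkipListFactory());
        return options;
      }
      ```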
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D29558961
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
  14. 16 August 2021 (1 commit)
  15. 05 August 2021 (1 commit)
    • Do not attempt to rename non-existent info log (#8622) · a685a701
      Committed by Andrew Kryczka
      Summary:
      Previously we attempted to rename "LOG" to "LOG.old.*" without checking
      its existence first. "LOG" had no reason to exist in a new DB.
      
      Errors in renaming a non-existent "LOG" were swallowed via
      `PermitUncheckedError()` so things worked. However the storage service's
      error monitoring was detecting all these benign rename failures. So it
      is better to fix it. Also with this PR we can now distinguish rename failure
      for other reasons and return them.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8622
      
      Test Plan: new unit test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D30115189
      
      Pulled By: ajkr
      
      fbshipit-source-id: e2f337ffb2bd171be0203172abc8e16e7809b170
  16. 27 July 2021 (1 commit)
    • Make EventListener into a Customizable Class (#8473) · 3aee4fbd
      Committed by mrambacher
      Summary:
      - Added Type/CreateFromString
      - Added ability to load EventListeners to DBOptions
      - Since EventListeners did not previously have a Name(), defaulted to "".  If there is no name, the listener cannot be loaded from the ObjectRegistry.
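
      A minimal sketch of a listener that provides a non-empty Name(), which is what allows it to be found/created through the ObjectRegistry as described:

      ```
      #include <memory>

      #include <rocksdb/db.h>
      #include <rocksdb/listener.h>
      #include <rocksdb/options.h>

      class FlushLogger : public rocksdb::EventListener {
       public:
        const char* Name() const override { return "FlushLogger"; }
        void OnFlushCompleted(rocksdb::DB* /*db*/,
                              const rocksdb::FlushJobInfo& info) override {
          // e.g. record info.cf_name and info.file_path somewhere
          (void)info;
        }
      };

      void RegisterListener(rocksdb::Options& options) {
        options.listeners.emplace_back(std::make_shared<FlushLogger>());
      }
      ```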
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D29901488
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
  17. 18 June 2021 (1 commit)
    • Cache warming data blocks during flush (#8242) · 5ba1b6e5
      Committed by Akanksha Mahajan
      Summary:
      This PR prepopulates warm/hot data blocks which are already in memory
      into the block cache at the time of flush. On a flush, the data blocks that
      are in memory (in memtables) get flushed to the device. If using Direct IO,
      additional IO is incurred to read this data back into memory again, which
      is avoided by enabling the newly added option.
      
      Right now, this is enabled only for data blocks during flush. We plan to
      expand this option to cover compactions and other types of blocks in the
      future.
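
      A sketch of enabling it, assuming the knob is the `prepopulate_block_cache` setting on BlockBasedTableOptions associated with this feature (confirm the exact name/enum in include/rocksdb/table.h for the release in use):

      ```
      // Sketch only: option/enum names assumed; see rocksdb/table.h.
      #include <rocksdb/cache.h>
      #include <rocksdb/options.h>
      #include <rocksdb/table.h>

      rocksdb::Options WarmCacheOnFlush(
          std::shared_ptr<rocksdb::Cache> block_cache) {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.block_cache = block_cache;
        table_options.prepopulate_block_cache =
            rocksdb::BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly;

        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```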
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242
      
      Test Plan: Add new unit test
      
      Reviewed By: anand1976
      
      Differential Revision: D28521703
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05
  18. 20 May 2021 (1 commit)
    • Use deleters to label cache entries and collect stats (#8297) · 311a544c
      Committed by Peter Dillinger
      Summary:
      This change gathers and publishes statistics about the
      kinds of items in block cache. This is especially important for
      profiling relative usage of cache by index vs. filter vs. data blocks.
      It works by iterating over the cache during periodic stats dump
      (InternalStats, stats_dump_period_sec) or on demand when
      DB::Get(Map)Property(kBlockCacheEntryStats), except that for
      efficiency and sharing among column families, saved data from
      the last scan is used when the data is not considered too old.
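
      A small sketch of the on-demand path, reading the per-role stats as a map property (kBlockCacheEntryStats corresponds to "rocksdb.block-cache-entry-stats"):

      ```
      #include <cstdio>
      #include <map>
      #include <string>

      #include <rocksdb/db.h>

      void DumpBlockCacheEntryStats(rocksdb::DB* db) {
        std::map<std::string, std::string> stats;
        if (db->GetMapProperty(rocksdb::DB::Properties::kBlockCacheEntryStats,
                               &stats)) {
          for (const auto& kv : stats) {
            std::printf("%s: %s\n", kv.first.c_str(), kv.second.c_str());
          }
        }
      }
      ```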
      
      The new information can be seen in info LOG, for example:
      
          Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
          Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)
      
      And also through DB::GetProperty and GetMapProperty (here using
      ldb just for demonstration):
      
          $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
          rocksdb.block-cache-entry-stats.bytes.data-block: 0
          rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
          rocksdb.block-cache-entry-stats.bytes.filter-block: 0
          rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
          rocksdb.block-cache-entry-stats.bytes.index-block: 178992
          rocksdb.block-cache-entry-stats.bytes.misc: 0
          rocksdb.block-cache-entry-stats.bytes.other-block: 0
          rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
          rocksdb.block-cache-entry-stats.capacity: 8388608
          rocksdb.block-cache-entry-stats.count.data-block: 0
          rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
          rocksdb.block-cache-entry-stats.count.filter-block: 0
          rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
          rocksdb.block-cache-entry-stats.count.index-block: 215
          rocksdb.block-cache-entry-stats.count.misc: 1
          rocksdb.block-cache-entry-stats.count.other-block: 0
          rocksdb.block-cache-entry-stats.count.write-buffer: 0
          rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
          rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
          rocksdb.block-cache-entry-stats.percent.misc: 0.000000
          rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
          rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
          rocksdb.block-cache-entry-stats.secs_since_last_collection: 0
      
      Solution detail - We need some way to flag what kind of blocks each
      entry belongs to, preferably without changing the Cache API.
      One of the complications is that Cache is a general interface that could
      have other users that don't adhere to whichever convention we decide
      on for keys and values. Or we would pay for an extra field in the Handle
      that would only be used for this purpose.
      
      This change uses a back-door approach, the deleter, to indicate the
      "role" of a Cache entry (in addition to the value type, implicitly).
      This has the added benefit of ensuring proper code origin whenever we
      recognize a particular role for a cache entry; if the entry came from
      some other part of the code, it will use an unrecognized deleter, which
      we simply attribute to the "Misc" role.
      
      An internal API makes for simple instantiation and automatic
      registration of Cache deleters for a given value type and "role".
      
      Another internal API, CacheEntryStatsCollector, solves the problem of
      caching the results of a scan and sharing them, to ensure scans are
      neither excessive nor redundant so as not to harm Cache performance.
      
      Because code is added to BlocklikeTraits, it is pulled out of
      block_based_table_reader.cc into its own file.
      
      This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
      (could still be added), and with actual stat gathering.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297
      
      Test Plan: manual testing with db_bench, and a couple of basic unit tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D28488721
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
  19. 14 May 2021 (1 commit)
    • Initial support for secondary cache in LRUCache (#8271) · feb06e83
      Committed by anand76
      Summary:
      Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and is evicted from memory, it is inserted into the secondary cache. On an LRU cache miss followed by a successful lookup in the secondary cache, the item is promoted to the LRU cache. Only synchronous lookup is supported currently. The secondary cache could be used to implement a persistent (flash) cache or a compressed cache.
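
      Wiring it up is then a matter of handing the SecondaryCache to LRUCacheOptions and using the resulting cache as the block cache; a small sketch (the SecondaryCache implementation itself is assumed to come from elsewhere):

      ```
      #include <rocksdb/cache.h>
      #include <rocksdb/secondary_cache.h>
      #include <rocksdb/table.h>

      rocksdb::BlockBasedTableOptions WireSecondaryCache(
          std::shared_ptr<rocksdb::SecondaryCache> secondary_cache) {
        rocksdb::LRUCacheOptions lru_opts;
        lru_opts.capacity = 1 << 30;  // 1 GB primary (in-memory) cache
        lru_opts.secondary_cache = secondary_cache;

        rocksdb::BlockBasedTableOptions table_options;
        table_options.block_cache = rocksdb::NewLRUCache(lru_opts);
        return table_options;
      }
      ```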
      
      Tests:
      Results from cache_bench and db_bench don't show any regression due to these changes.
      
      cache_bench results before and after this change -
      Command
      ```./cache_bench -ops_per_thread=10000000 -threads=1```
      Before
      ```Complete in 40.688 s; QPS = 245774```
      ```Complete in 40.486 s; QPS = 246996```
      ```Complete in 42.019 s; QPS = 237989```
      After
      ```Complete in 40.672 s; QPS = 245869```
      ```Complete in 44.622 s; QPS = 224107```
      ```Complete in 42.445 s; QPS = 235599```
      
      db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 -
      Commands
      ```./db_bench  --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true```
      
      ```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300```
      Before
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      80.702 micros/op 198104 ops/sec;   54.4 MB/s (3708999 of 3708999 found)
      ```
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      87.124 micros/op 183625 ops/sec;   50.4 MB/s (3439999 of 3439999 found)
      ```
      After
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      77.653 micros/op 206025 ops/sec;   56.6 MB/s (3866999 of 3866999 found)
      ```
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      84.962 micros/op 188299 ops/sec;   51.7 MB/s (3535999 of 3535999 found)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D28357511
      
      Pulled By: anand1976
      
      fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2
  20. 12 May 2021 (1 commit)
    • New Cache API for gathering statistics (#8225) · 78a309bf
      Committed by Peter Dillinger
      Summary:
      Adds a new Cache::ApplyToAllEntries API that we expect to use
      (in follow-up PRs) for efficiently gathering block cache statistics.
      Notable features vs. old ApplyToAllCacheEntries:
      
      * Includes key and deleter (in addition to value and charge). We could
      have passed in a Handle but then more virtual function calls would be
      needed to get the "fields" of each entry. We expect to use the 'deleter'
      to identify the origin of entries, perhaps even more.
      * Heavily tuned to minimize latency impact on operating cache. It
      does this by iterating over small sections of each cache shard while
      cycling through the shards.
      * Supports tuning roughly how many entries to operate on for each
      lock acquire and release, to control the impact on the latency of other
      operations without excessive lock acquire & release. The right balance
      can depend on the cost of the callback. Good default seems to be
      around 256.
      * There should be no need to disable thread safety. (I would expect
      uncontended locks to be sufficiently fast.)
      
      I have enhanced cache_bench to validate this approach:
      
      * Reports a histogram of ns per operation, so we can look at the
      ditribution of times, not just throughput (average).
      * Can add a thread for simulated "gather stats" which calls
      ApplyToAllEntries at a specified interval. We also generate a histogram
      of time to run ApplyToAllEntries.
      
      To make the iteration over some entries of each shard work as cleanly as
      possible, even with resize between next set of entries, I have
      re-arranged which hash bits are used for sharding and which for indexing
      within a shard.
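
      A usage sketch of ApplyToAllEntries with the callback shape described above (key, value, charge, deleter); the callback signature has evolved in later RocksDB releases, so treat the exact types here as assumptions:

      ```
      // Sketch only: callback signature as introduced by this change.
      #include <cstddef>

      #include <rocksdb/cache.h>

      size_t CountCacheEntries(rocksdb::Cache* cache) {
        size_t count = 0;
        size_t total_charge = 0;
        rocksdb::Cache::ApplyToAllEntriesOptions opts;
        opts.average_entries_per_lock = 256;  // entries per lock acquire/release
        cache->ApplyToAllEntries(
            [&](const rocksdb::Slice& /*key*/, void* /*value*/, size_t charge,
                rocksdb::Cache::DeleterFn /*deleter*/) {
              ++count;
              total_charge += charge;
            },
            opts);
        (void)total_charge;  // could be reported alongside the count
        return count;
      }
      ```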
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225
      
      Test Plan:
      A couple of unit tests are added, but primary validation is manual, as
      the primary risk is to performance.
      
      The primary validation is using cache_bench to ensure that neither
      the minor hashing changes nor the simulated stats gathering
      significantly impact QPS or latency distribution. Note that adding op
      latency histogram seriously impacts the benchmark QPS, so for a
      fair baseline, we need the cache_bench changes (except remove simulated
      stat gathering to make it compile). In short, we don't see any
      reproducible difference in ops/sec or op latency unless we are gathering
      stats nearly continuously. Test uses 10GB block cache with
      8KB values to be somewhat realistic in the number of items to iterate
      over.
      
      Baseline typical output:
      
      ```
      Complete in 92.017 s; Rough parallel ops/sec = 869401
      Thread ops/sec = 54662
      
      Operation latency (ns):
      Count: 80000000 Average: 11223.9494  StdDev: 29.61
      Min: 0  Median: 7759.3973  Max: 9620500
      Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
      ------------------------------------------------------
      [       0,       1 ]       68   0.000%   0.000%
      (    2900,    4400 ]       89   0.000%   0.000%
      (    4400,    6600 ] 33630240  42.038%  42.038% ########
      (    6600,    9900 ] 18129842  22.662%  64.700% #####
      (    9900,   14000 ]  7877533   9.847%  74.547% ##
      (   14000,   22000 ] 15193238  18.992%  93.539% ####
      (   22000,   33000 ]  3037061   3.796%  97.335% #
      (   33000,   50000 ]  1626316   2.033%  99.368%
      (   50000,   75000 ]   421532   0.527%  99.895%
      (   75000,  110000 ]    56910   0.071%  99.966%
      (  110000,  170000 ]    16134   0.020%  99.986%
      (  170000,  250000 ]     5166   0.006%  99.993%
      (  250000,  380000 ]     3017   0.004%  99.996%
      (  380000,  570000 ]     1337   0.002%  99.998%
      (  570000,  860000 ]      805   0.001%  99.999%
      (  860000, 1200000 ]      319   0.000% 100.000%
      ( 1200000, 1900000 ]      231   0.000% 100.000%
      ( 1900000, 2900000 ]      100   0.000% 100.000%
      ( 2900000, 4300000 ]       39   0.000% 100.000%
      ( 4300000, 6500000 ]       16   0.000% 100.000%
      ( 6500000, 9800000 ]        7   0.000% 100.000%
      ```
      
      New, gather_stats=false. Median thread ops/sec of 5 runs:
      
      ```
      Complete in 92.030 s; Rough parallel ops/sec = 869285
      Thread ops/sec = 54458
      
      Operation latency (ns):
      Count: 80000000 Average: 11298.1027  StdDev: 42.18
      Min: 0  Median: 7722.0822  Max: 6398720
      Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
      ------------------------------------------------------
      [       0,       1 ]      109   0.000%   0.000%
      (    2900,    4400 ]      793   0.001%   0.001%
      (    4400,    6600 ] 34054563  42.568%  42.569% #########
      (    6600,    9900 ] 17482646  21.853%  64.423% ####
      (    9900,   14000 ]  7908180   9.885%  74.308% ##
      (   14000,   22000 ] 15032072  18.790%  93.098% ####
      (   22000,   33000 ]  3237834   4.047%  97.145% #
      (   33000,   50000 ]  1736882   2.171%  99.316%
      (   50000,   75000 ]   446851   0.559%  99.875%
      (   75000,  110000 ]    68251   0.085%  99.960%
      (  110000,  170000 ]    18592   0.023%  99.983%
      (  170000,  250000 ]     7200   0.009%  99.992%
      (  250000,  380000 ]     3334   0.004%  99.997%
      (  380000,  570000 ]     1393   0.002%  99.998%
      (  570000,  860000 ]      700   0.001%  99.999%
      (  860000, 1200000 ]      293   0.000% 100.000%
      ( 1200000, 1900000 ]      196   0.000% 100.000%
      ( 1900000, 2900000 ]       69   0.000% 100.000%
      ( 2900000, 4300000 ]       32   0.000% 100.000%
      ( 4300000, 6500000 ]       10   0.000% 100.000%
      ```
      
      New, gather_stats=true, 1 second delay between scans. Scans take about
      1 second here so it's spending about 50% time scanning. Still the effect on
      ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:
      
      ```
      Complete in 91.890 s; Rough parallel ops/sec = 870608
      Thread ops/sec = 54551
      
      Operation latency (ns):
      Count: 80000000 Average: 11311.2629  StdDev: 45.28
      Min: 0  Median: 7686.5458  Max: 10018340
      Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
      ------------------------------------------------------
      [       0,       1 ]       71   0.000%   0.000%
      (    2900,    4400 ]      291   0.000%   0.000%
      (    4400,    6600 ] 34492060  43.115%  43.116% #########
      (    6600,    9900 ] 16727328  20.909%  64.025% ####
      (    9900,   14000 ]  7845828   9.807%  73.832% ##
      (   14000,   22000 ] 15510654  19.388%  93.220% ####
      (   22000,   33000 ]  3216533   4.021%  97.241% #
      (   33000,   50000 ]  1680859   2.101%  99.342%
      (   50000,   75000 ]   439059   0.549%  99.891%
      (   75000,  110000 ]    60540   0.076%  99.967%
      (  110000,  170000 ]    14649   0.018%  99.985%
      (  170000,  250000 ]     5242   0.007%  99.991%
      (  250000,  380000 ]     3260   0.004%  99.995%
      (  380000,  570000 ]     1599   0.002%  99.997%
      (  570000,  860000 ]     1043   0.001%  99.999%
      (  860000, 1200000 ]      471   0.001%  99.999%
      ( 1200000, 1900000 ]      275   0.000% 100.000%
      ( 1900000, 2900000 ]      143   0.000% 100.000%
      ( 2900000, 4300000 ]       60   0.000% 100.000%
      ( 4300000, 6500000 ]       27   0.000% 100.000%
      ( 6500000, 9800000 ]        7   0.000% 100.000%
      ( 9800000, 14000000 ]        1   0.000% 100.000%
      
      Gather stats latency (us):
      Count: 46 Average: 980387.5870  StdDev: 60911.18
      Min: 879155  Median: 1033777.7778  Max: 1261431
      Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
      ------------------------------------------------------
      (  860000, 1200000 ]       45  97.826%  97.826% ####################
      ( 1200000, 1900000 ]        1   2.174% 100.000%
      
      Most recent cache entry stats:
      Number of entries: 1295133
      Total charge: 9.88 GB
      Average key size: 23.4982
      Average charge: 8.00 KB
      Unique deleters: 3
      ```
      
      Reviewed By: mrambacher
      
      Differential Revision: D28295742
      
      Pulled By: pdillinger
      
      fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
  21. 23 April 2021 (1 commit)
    • Fix the false positive alert of CF consistency check in WAL recovery (#8207) · 09a9ec3a
      Committed by Zhichao Cao
      Summary:
      In current RocksDB, when recovering information from the WAL, we do a consistency check for each column family if one WAL file is corrupted and PointInTimeRecovery is set. However, it reports a false positive "SST file is ahead of WALs" alert when a CF's current log number is greater than the corrupted WAL number (i.e., the CF contains data beyond the corrupted WAL) due to a new column family being created during flush. In this case, a new (empty) WAL is created during the flush, and for some reason (e.g., a storage issue, or a crash before SyncCloseLog is called) the old WAL is corrupted. The new CF has no data, so it does not actually have a consistency issue.
      
      Fix: when checking cfd->GetLogNumber() > corrupted_wal_number, also check cfd->GetLiveSstFilesSize() > 0, so that CFs with no SST file data skip the check here.
      
      Note a potential inconsistency ignored due to this fix: an empty CF can also result from write+delete, in which case no SST files are generated by the flush. However, this CF still has its log in the WAL, so if that WAL is corrupted, the DB might be inconsistent.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207
      
      Test Plan: added unit test, make crash_test
      
      Reviewed By: riversand963
      
      Differential Revision: D27898839
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e
  22. 20 April 2021 (2 commits)
    • Fix unittest no space issue (#8204) · a89740fb
      Committed by Jay Zhuang
      Summary:
      The unit test reports "no space" from time to time, which can be reproduced on a small-memory machine with SHM. It's caused by large WAL files generated during the test, which are preallocated but not truncated during close(). This adds the missing APIs to set preallocation.
      It also adds the ARM test as a nightly build, since the test runs for more than 1 hour.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8204
      
      Test Plan: test on small memory arm machine
      
      Reviewed By: mrambacher
      
      Differential Revision: D27873145
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: f797c429d6bc13cbcc673bc03fcc72adda55f506
    • Handle rename() failure in non-local FS (#8192) · a376c220
      Committed by Yanqin Jin
      Summary:
      In a distributed environment, a file `rename()` operation can succeed on server (remote)
      side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
      network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
      can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
      new MANIFEST. We currently always delete the new MANIFEST if an error occurs.
      
      This is problematic in distributed world. If the server-side successfully updates the CURRENT
      file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.
      
      As a fix, we can track the execution result of IO operations on the new MANIFEST.
      - If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
        MANIFEST. Therefore, it is safe to remove the new MANIFEST.
      - If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
        code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
        POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
        new MANIFEST.) Therefore, we keep the new MANIFEST.
          - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
          - If process reopens the db immediately after the failure, then the CURRENT file can point
            to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
            succeed and ignore the other.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192
      
      Test Plan: make check
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D27804648
      
      Pulled By: riversand963
      
      fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
  23. 08 April 2021 (1 commit)
    • Fix flush reason attribution (#8150) · 48cd7a3a
      Committed by Giuseppe Ottaviano
      Summary:
      Current flush reason attribution is misleading or incorrect (depending on what the original intention was):
      
      - Flush due to WAL reaching its maximum size is attributed to `kWriteBufferManager`
      - Flushes due to full write buffer and write buffer manager are not distinguishable, both are attributed to `kWriteBufferFull`
      
      This changes the first to a new flush reason `kWALFull`, and splits the second between `kWriteBufferManager` and `kWriteBufferFull`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8150
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D27569645
      
      Pulled By: ot
      
      fbshipit-source-id: 7e3c8ca186a6e71976e6b8e937297eebd4b769cc
  24. 20 March 2021 (1 commit)
  25. 18 March 2021 (1 commit)
    • Use SST file manager to track blob files as well (#8037) · 27d57a03
      Committed by Akanksha Mahajan
      Summary:
      Extend support to track blob files in the SST file manager.
      This PR notifies the SstFileManager whenever a new blob file is created
      (via OnAddFile), notifies it when an obsolete blob file is deleted (via
      OnDeleteFile), and schedules blob file deletions via ScheduleFileDeletion.
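
      For context, the SstFileManager is attached to the DB through the existing option; with this change its accounting covers blob files too. A minimal sketch:

      ```
      #include <memory>

      #include <rocksdb/env.h>
      #include <rocksdb/options.h>
      #include <rocksdb/sst_file_manager.h>

      rocksdb::Options TrackFilesWithSstFileManager() {
        rocksdb::Options options;
        // NewSstFileManager returns a raw pointer; DBOptions holds a shared_ptr.
        options.sst_file_manager.reset(
            rocksdb::NewSstFileManager(rocksdb::Env::Default()));
        return options;
      }
      ```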
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037
      
      Test Plan: Add new unit tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D26891237
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6
  26. 26 January 2021 (1 commit)
    • Add a SystemClock class to capture the time functions of an Env (#7858) · 12f11373
      Committed by mrambacher
      Summary:
      Introduces and uses a SystemClock class to RocksDB.  This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.
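
      A small sketch of using the clock directly (SystemClock::Default() is the process-default clock):

      ```
      #include <cstdint>
      #include <functional>
      #include <memory>

      #include <rocksdb/system_clock.h>

      uint64_t TimeWorkMicros(const std::function<void()>& work) {
        std::shared_ptr<rocksdb::SystemClock> clock =
            rocksdb::SystemClock::Default();
        const uint64_t start = clock->NowMicros();
        work();
        return clock->NowMicros() - start;
      }
      ```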
      
      Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
      
      There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
      
      Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
      
      Reviewed By: pdillinger
      
      Differential Revision: D26006406
      
      Pulled By: mrambacher
      
      fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
  27. 07 January 2021 (1 commit)
    • Add more tests to ASSERT_STATUS_CHECKED (3), API change (#7715) · 6e0f62f2
      Committed by Adam Retter
      Summary:
      Third batch of adding more tests to ASSERT_STATUS_CHECKED.
      
      * db_compaction_filter_test
      * db_compaction_test
      * db_dynamic_level_test
      * db_inplace_update_test
      * db_sst_test
      * db_tailing_iter_test
      * db_io_failure_test
      
      Also update GetApproximateSizes APIs to all return Status.
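
      With that API change, callers are expected to check the returned Status; a minimal sketch:

      ```
      #include <cstdint>
      #include <cstdio>

      #include <rocksdb/db.h>

      rocksdb::Status PrintApproximateSize(rocksdb::DB* db,
                                           const rocksdb::Slice& start,
                                           const rocksdb::Slice& limit) {
        rocksdb::Range range(start, limit);
        uint64_t size = 0;
        rocksdb::Status s = db->GetApproximateSizes(&range, 1, &size);
        if (s.ok()) {
          std::printf("approximate size: %llu\n",
                      static_cast<unsigned long long>(size));
        }
        return s;
      }
      ```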
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7715
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D25806896
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262
  28. 28 October 2020 (1 commit)
    • Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566) · f35f7f27
      Committed by mrambacher
      Summary:
      This PR does a few things:
      
      1.  The MockFileSystem class was split out from the MockEnv.  This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one).  The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
      
      2.  Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set.  To accomplish this, a few things were needed:
      - The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
      - The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
      
      3.  Updated the test framework to have a ROCKSDB_GTEST_SKIP macro.  This can be used to flag tests that are skipped.  Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
      
      I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV,  both, and neither under both MacOS and RedHat.  A few tests were disabled/skipped for the MEM/ENCRYPTED cases.  The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem.  (I will also push a change to disable those tests soon).  There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
      
      Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
      
      Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale.  I do not know how to do that, so if someone could write that job, it would be appreciated :)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D24408980
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
      f35f7f27
  29. 30 9月, 2020 1 次提交
  30. 15 9月, 2020 1 次提交
  31. 26 8月, 2020 1 次提交
    • S
      Get() to fail with underlying failures in PartitionIndexReader::CacheDependencies() (#7297) · 722814e3
      Committed by sdong
      Summary:
      Right now, all I/O failures under PartitionIndexReader::CacheDependencies() are swallowed. This doesn't impact correctness, but we've made a decision that any I/O error in the read path should now be returned to users for awareness. Return errors in those cases instead.
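
      The shape of the fix, as a hedged sketch (the helper below is hypothetical and stands in for the actual partition-block read; it is not the RocksDB code itself):
      ```cpp
      #include "rocksdb/status.h"

      // Hypothetical helper standing in for reading the partitioned index blocks.
      rocksdb::Status ReadPartitionIndexBlocks(bool pin);

      // Hedged sketch: propagate the read status instead of dropping it, so an
      // I/O failure surfaces through the table reader and ultimately Get().
      rocksdb::Status CacheDependenciesSketch(bool pin) {
        rocksdb::Status s = ReadPartitionIndexBlocks(pin);
        // Previously the equivalent status was swallowed; now it is returned.
        return s;
      }
      ```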
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297
      
      Test Plan: Add a new unit test that ingests errors in this code path and verifies that Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes were attempted, but no other pread paths could be triggered. It is not clear whether other failure cases are even possible; we will rely on the continuous stress test to validate this.
      
      Reviewed By: anand1976
      
      Differential Revision: D23257950
      
      fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03
      722814e3
  32. 20 Aug, 2020 (1 commit)
    • J
      Fix a timer_test deadlock (#7277) · 3e422ce0
      Committed by Jay Zhuang
      Summary:
      There's a potential deadlock caused by the MockTimeEnv time value growing to a large number, which causes TimedWait() to wait forever. The test misuses microseconds as seconds, making this more likely to happen.
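
      The unit mix-up can be pictured with a small, purely illustrative sketch (not code from the test):
      ```cpp
      #include <cstdint>

      // Hedged illustration: feeding a microsecond count into an API that expects
      // seconds inflates the mock time by a factor of one million, so an absolute
      // wake-up deadline computed from it is effectively never reached and a
      // TimedWait()-style call blocks forever under mock time.
      uint64_t MockSecondsSketch(uint64_t now_micros) {
        uint64_t correct_seconds = now_micros / 1000000u;  // intended value
        uint64_t buggy_seconds = now_micros;               // micros treated as seconds
        (void)correct_seconds;
        return buggy_seconds;  // ~1,000,000x too large
      }
      ```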
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7277
      
      Reviewed By: pdillinger
      
      Differential Revision: D23183873
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 6fc38ebd40b4125a99551204b271f91a27e70086
      3e422ce0
  33. 18 Aug, 2020 (1 commit)
  34. 15 Aug, 2020 (1 commit)
    • J
      Introduce a global StatsDumpScheduler for stats dumping (#7223) · 69760b4d
      Committed by Jay Zhuang
      Summary:
      Have a global StatsDumpScheduler handle stats dumping for all DB instances, including `DumpStats()` and `PersistStats()`. Before this, there were 2 dedicated threads for every DB instance, one for DumpStats() and one for PersistStats(), which could create lots of threads when there are hundreds of DB instances.
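
      A hedged sketch of the idea (the class and method names below are illustrative, not the actual RocksDB implementation): one shared background thread services periodic callbacks registered by many DB instances instead of each instance owning two dedicated threads.
      ```cpp
      #include <chrono>
      #include <functional>
      #include <mutex>
      #include <thread>
      #include <vector>

      // Illustrative sketch only: a single shared thread runs periodic callbacks
      // (e.g. one DB's DumpStats() and PersistStats()) for any number of DBs.
      class SharedPeriodicRunner {
       public:
        SharedPeriodicRunner() : stop_(false), worker_([this] { Loop(); }) {}
        ~SharedPeriodicRunner() {
          { std::lock_guard<std::mutex> lk(mu_); stop_ = true; }
          worker_.join();
        }
        void Register(std::function<void()> fn, std::chrono::seconds period) {
          std::lock_guard<std::mutex> lk(mu_);
          tasks_.push_back(
              {std::move(fn), period, std::chrono::steady_clock::now() + period});
        }

       private:
        struct Task {
          std::function<void()> fn;
          std::chrono::seconds period;
          std::chrono::steady_clock::time_point next;
        };
        void Loop() {
          for (;;) {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));  // tick
            std::lock_guard<std::mutex> lk(mu_);
            if (stop_) return;
            auto now = std::chrono::steady_clock::now();
            for (auto& t : tasks_) {
              if (now >= t.next) {
                t.fn();
                t.next = now + t.period;
              }
            }
          }
        }
        std::mutex mu_;
        bool stop_;
        std::vector<Task> tasks_;
        std::thread worker_;
      };
      ```
      Each DB instance would then register its two callbacks with the shared runner when it opens, rather than spawning its own threads.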
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223
      
      Reviewed By: riversand963
      
      Differential Revision: D23056737
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba
      69760b4d
  35. 12 Aug, 2020 (1 commit)
    • P
      Fix+clean up handling of mock sleeps (#7101) · 6ac1d25f
      Committed by Peter Dillinger
      Summary:
      We have a number of tests hanging on MacOS and Windows due to
      mishandling of the mock sleep code. In addition, the code was in
      terrible shape because the same variable (addon_time_) would sometimes
      refer to microseconds and sometimes to seconds. One test even assumed it
      was nanoseconds but was written to pass anyway.
      
      This has been cleaned up so that DB tests generally use a SpecialEnv
      function to mock sleep, for either some number of microseconds or seconds
      depending on the function called. But to call one of these, the test must first
      call SetMockSleep (precondition enforced with assertion), which also turns
      sleeps in RocksDB into mock sleeps. To also remove accounting for actual
      clock time, call SetTimeElapseOnlySleepOnReopen, which implies
      SetMockSleep (on DB re-open). This latter setting only works by applying
      on DB re-open; otherwise, havoc can ensue if the Env goes back in time with
      the DB open.
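
      Under these conventions, a test's use of mock time looks roughly like the following (a hedged sketch assuming the db_test harness; the helper names follow the description above and exact signatures may differ):
      ```cpp
      // Hedged sketch of the test-side pattern; assumes DBTestBase's SpecialEnv
      // (env_) and gtest, with helper names as described in this commit message.
      TEST_F(DBTest, MockSleepSketch) {
        SetMockSleep();                 // precondition for the mock-sleep helpers
        Reopen(CurrentOptions());       // mock sleep takes effect on (re)open
        env_->MockSleepForSeconds(60);  // advances mock time; no real sleep
        // ... assertions that depend on ~60 seconds having "elapsed" ...
      }
      ```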
      
      More specifics:
      
      Removed some unused test classes, and updated comments on the general
      problem.
      
      Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead
      of mock time. For this we have the only modification to production code,
      inserting a sync point callback in flush_job.cc, which is not a change to
      production behavior.
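
      For reference, driving a test from a sync point callback looks roughly like this (a hedged sketch; the sync point name below is a placeholder, not the one added in flush_job.cc):
      ```cpp
      #include <atomic>
      #include "test_util/sync_point.h"

      // Hedged sketch: observe a flush via a sync point callback instead of mock
      // time. "FlushJob::SomeMarker" is a placeholder name for illustration only.
      void InstallFlushCallbackSketch(std::atomic<int>* flush_count) {
        ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
            "FlushJob::SomeMarker", [flush_count](void* /*arg*/) {
              flush_count->fetch_add(1);
            });
        ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
      }
      ```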
      
      Removed unnecessary resetting of mock times to 0 in many tests. RocksDB
      deals in relative time. Any behaviors relying on absolute date/time are likely
      a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one
      clearly injecting a specific absolute time for actual testing convenience.) Just
      in case I misunderstood some test, I put this note in each replacement:
      // NOTE: Presumed unnecessary and removed: resetting mock time in env
      
      Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and
      FilterCompactionTimeTest in db_test.cc
      
      stats_history_test and blob_db_test are each their own beast, rather deeply
      dependent on MockTimeEnv. Each gets its own variant of a work-around for
      TimedWait in a mock time environment. (Reduces redundancy and
      inconsistency in stats_history_test.)
      
      Intended follow-up:
      
      Remove TimedWait from the public API of InstrumentedCondVar, and only
      make that accessible through Env by passing in an InstrumentedCondVar and
      a deadline. Then the Env implementations mocking time can fix this problem
      without using sync points. (Test infrastructure using sync points interferes
      with individual tests' control over sync points.)
      
      With that change, we can simplify/consolidate the scattered work-arounds.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101
      
      Test Plan: make check on Linux and MacOS
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D23032815
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b
      6ac1d25f
  36. 10 Jul, 2020 (1 commit)
    • M
      More Makefile Cleanup (#7097) · c7c7b07f
      Committed by mrambacher
      Summary:
      Cleans up some of the dependencies on test code in the Makefile while building tools:
      - Moves test::RandomString and DBTestBase::RandomString into Random
      - Moves test::RandomHumanReadableString into Random
      - Moves the DestroyDir method into file_utils
      - Moves SetupSyncPointsToMockDirectIO into sync_point
      - Moves the FaultInjection Env and FS classes under env
      
      These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies.  By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated.
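
      After the move, tools and tests can get their random strings directly from the Random utility; a hedged sketch of the resulting usage (assuming the helpers moved as methods of Random, as described in the list above):
      ```cpp
      #include <string>
      #include "util/random.h"  // internal header in the RocksDB source tree

      // Hedged sketch: random strings now come from Random itself rather than
      // from test_util helpers (method names assumed from the list above).
      std::string MakeValue(ROCKSDB_NAMESPACE::Random* rnd) {
        return rnd->RandomString(512);        // formerly test::RandomString(rnd, 512)
      }
      std::string MakeReadableKey(ROCKSDB_NAMESPACE::Random* rnd) {
        return rnd->HumanReadableString(16);  // formerly test::RandomHumanReadableString
      }
      ```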
      
      Tested both release and debug builds via Make and CMake for both static and shared libraries.
      
      More work remains to clean up how the tools are built and remove some unnecessary dependencies.  There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097
      
      Reviewed By: riversand963
      
      Differential Revision: D22463160
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2
      c7c7b07f
  37. 03 Jul, 2020 (2 commits)
  38. 02 Jul, 2020 (1 commit)