1. May 20, 2021 (5 commits)
    • Use deleters to label cache entries and collect stats (#8297) · 311a544c
      Peter Dillinger committed
      Summary:
      This change gathers and publishes statistics about the
      kinds of items in block cache. This is especially important for
      profiling relative usage of cache by index vs. filter vs. data blocks.
      It works by iterating over the cache during periodic stats dump
      (InternalStats, stats_dump_period_sec) or on demand when
      DB::Get(Map)Property(kBlockCacheEntryStats), except that for
      efficiency and sharing among column families, saved data from
      the last scan is used when the data is not considered too old.
      
      The new information can be seen in info LOG, for example:
      
          Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
          Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)
      
      And also through DB::GetProperty and GetMapProperty (here using
      ldb just for demonstration):
      
          $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
          rocksdb.block-cache-entry-stats.bytes.data-block: 0
          rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
          rocksdb.block-cache-entry-stats.bytes.filter-block: 0
          rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
          rocksdb.block-cache-entry-stats.bytes.index-block: 178992
          rocksdb.block-cache-entry-stats.bytes.misc: 0
          rocksdb.block-cache-entry-stats.bytes.other-block: 0
          rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
          rocksdb.block-cache-entry-stats.capacity: 8388608
          rocksdb.block-cache-entry-stats.count.data-block: 0
          rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
          rocksdb.block-cache-entry-stats.count.filter-block: 0
          rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
          rocksdb.block-cache-entry-stats.count.index-block: 215
          rocksdb.block-cache-entry-stats.count.misc: 1
          rocksdb.block-cache-entry-stats.count.other-block: 0
          rocksdb.block-cache-entry-stats.count.write-buffer: 0
          rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
          rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
          rocksdb.block-cache-entry-stats.percent.misc: 0.000000
          rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
          rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
          rocksdb.block-cache-entry-stats.secs_since_last_collection: 0
      
      Solution detail - We need some way to flag what kind of block each
      entry belongs to, preferably without changing the Cache API.
      One of the complications is that Cache is a general interface that could
      have other users that don't adhere to whichever convention we decide
      on for keys and values. Alternatively, we would pay for an extra field in the
      Handle that would only be used for this purpose.
      
      This change uses a back-door approach, the deleter, to indicate the
      "role" of a Cache entry (in addition to the value type, implicitly).
      This has the added benefit of ensuring proper code origin whenever we
      recognize a particular role for a cache entry; if the entry came from
      some other part of the code, it will use an unrecognized deleter, which
      we simply attribute to the "Misc" role.
      
      An internal API makes for simple instantiation and automatic
      registration of Cache deleters for a given value type and "role".
      
      Another internal API, CacheEntryStatsCollector, solves the problem of
      caching the results of a scan and sharing them, to ensure scans are
      neither excessive nor redundant so as not to harm Cache performance.
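      The deleter-as-role-tag idea can be sketched roughly as follows. This is an illustrative reconstruction, not RocksDB's actual internal API: the names (`CacheEntryRole`, `TypedDeleter`, `GetDeleterForRole`, `RoleOf`) and the `std::string` stand-in for `Slice` are assumptions. The key point is that each (value type, role) pair instantiates a distinct deleter function, so the deleter's address identifies the role during a scan, and unrecognized deleters fall into "Misc".

      ```cpp
      #include <cassert>
      #include <string>

      enum class CacheEntryRole { kDataBlock, kFilterBlock, kIndexBlock, kMisc };

      using Slice = std::string;  // stand-in for rocksdb::Slice
      using DeleterFn = void (*)(const Slice& key, void* value);

      // One distinct function (and thus address) per value type and role.
      template <typename T, CacheEntryRole R>
      void TypedDeleter(const Slice& /*key*/, void* value) {
        delete static_cast<T*>(value);
      }

      struct DataBlock { int dummy; };
      struct IndexBlock { int dummy; };

      template <typename T, CacheEntryRole R>
      DeleterFn GetDeleterForRole() { return &TypedDeleter<T, R>; }

      // During a stats scan, the deleter address is mapped back to a role.
      CacheEntryRole RoleOf(DeleterFn fn) {
        if (fn == GetDeleterForRole<DataBlock, CacheEntryRole::kDataBlock>())
          return CacheEntryRole::kDataBlock;
        if (fn == GetDeleterForRole<IndexBlock, CacheEntryRole::kIndexBlock>())
          return CacheEntryRole::kIndexBlock;
        return CacheEntryRole::kMisc;  // unrecognized deleter -> "Misc" role
      }

      int main() {
        DeleterFn d = GetDeleterForRole<DataBlock, CacheEntryRole::kDataBlock>();
        assert(RoleOf(d) == CacheEntryRole::kDataBlock);
        DeleterFn unknown = [](const Slice&, void*) {};  // from "other" code
        assert(RoleOf(unknown) == CacheEntryRole::kMisc);
        return 0;
      }
      ```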
      
      Because code is added to BlocklikeTraits, it is pulled out of
      block_based_table_reader.cc into its own file.
      
      This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
      (could still be added), and with actual stat gathering.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297
      
      Test Plan: manual testing with db_bench, and a couple of basic unit tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D28488721
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
    • Add StartThread type checking wrapper (#8303) · 748e3acc
      Glebanister committed
      Summary:
      - Add class `FunctorWrapper` to invoke the function with given parameters
      - Implement `StartThreadTyped` which wraps `StartThread` with type checking cover
      - Demonstrate `StartThreadTyped` in test `util/thread_local_test.cc`
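      The type-checking wrapper idea can be sketched as below. This is a simplified reconstruction, not the PR's actual code: `FunctorWrapper` here is illustrative, and the synchronous `RunLegacy` stands in for the real `StartThread`. The templated front end rejects mismatched argument types at compile time, while the legacy `void(void*)` entry point is kept underneath.

      ```cpp
      #include <cassert>
      #include <functional>
      #include <tuple>
      #include <utility>

      // Holds a typed callable plus its arguments, invokable later.
      template <typename... Args>
      class FunctorWrapper {
       public:
        FunctorWrapper(std::function<void(Args...)> fn, Args... args)
            : fn_(std::move(fn)), args_(std::forward<Args>(args)...) {}
        void invoke() { std::apply(fn_, args_); }

       private:
        std::function<void(Args...)> fn_;
        std::tuple<Args...> args_;
      };

      // Untyped legacy entry point; stand-in for StartThread (runs synchronously here).
      using LegacyThreadFn = void (*)(void*);
      void RunLegacy(LegacyThreadFn fn, void* arg) { fn(arg); }

      // Typed cover: mismatched argument types fail at compile time.
      template <typename... Args>
      void StartThreadTyped(void (*fn)(Args...), Args... args) {
        auto* w = new FunctorWrapper<Args...>(std::function<void(Args...)>(fn),
                                              args...);
        RunLegacy(
            [](void* arg) {
              auto* wrapper = static_cast<FunctorWrapper<Args...>*>(arg);
              wrapper->invoke();
              delete wrapper;
            },
            w);
      }

      int result = 0;
      void SetResult(int v) { result = v; }

      int main() {
        StartThreadTyped(&SetResult, 42);      // compiles: types match
        // StartThreadTyped(&SetResult, "x");  // would fail to compile
        assert(result == 42);
        return 0;
      }
      ```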
      
      https://github.com/facebook/rocksdb/issues/8285
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8303
      
      Reviewed By: ajkr
      
      Differential Revision: D28539318
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 624789c236bde31163deda95c1e1471aee68933e
    • Allow cache_bench/db_bench to use a custom secondary cache (#8312) · 13232e11
      anand76 committed
      Summary:
      This PR adds a ```-secondary_cache_uri``` option to the cache_bench and db_bench tools to allow the user to specify a custom secondary cache URI. The object registry is used to create an instance of the ```SecondaryCache``` object of the type specified in the URI.
      
      The main cache_bench code is packaged into a separate library, similar to db_bench.
      
      An example invocation of db_bench with a secondary cache URI -
      ```db_bench --env_uri=ws://ws.flash_sandbox.vll1_2/ -db=anand/nvm_cache_2 -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=67108864 -cache_index_and_filter_blocks=true  -secondary_cache_uri='cachelibwrapper://filename=/home/anand76/nvm_cache/cache_file;size=2147483648;regionSize=16777216;admPolicy=random;admProbability=1.0;volatileSize=8388608;bktPower=20;lockPower=12' -partition_index_and_filters=true -duration=1800```
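      The general pattern behind a URI-selected object can be sketched as below. This is a simplified illustration, not RocksDB's `ObjectRegistry` API; `SecondaryCacheStub`, `Registry`, and `NewFromUri` are invented names. The prefix before `://` picks a registered factory, and the remainder is passed to it as configuration.

      ```cpp
      #include <cassert>
      #include <functional>
      #include <map>
      #include <memory>
      #include <string>

      struct SecondaryCacheStub {
        std::string config;  // raw options string, e.g. "filename=...;size=..."
      };

      using Factory =
          std::function<std::shared_ptr<SecondaryCacheStub>(const std::string&)>;

      std::map<std::string, Factory>& Registry() {
        static std::map<std::string, Factory> r;
        return r;
      }

      std::shared_ptr<SecondaryCacheStub> NewFromUri(const std::string& uri) {
        auto pos = uri.find("://");
        if (pos == std::string::npos) return nullptr;
        auto it = Registry().find(uri.substr(0, pos));  // scheme picks factory
        if (it == Registry().end()) return nullptr;
        return it->second(uri.substr(pos + 3));  // rest is the config string
      }

      int main() {
        Registry()["cachelibwrapper"] = [](const std::string& cfg) {
          return std::make_shared<SecondaryCacheStub>(SecondaryCacheStub{cfg});
        };
        auto cache = NewFromUri("cachelibwrapper://size=2147483648");
        assert(cache && cache->config == "size=2147483648");
        assert(NewFromUri("unknown://x") == nullptr);
        return 0;
      }
      ```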
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8312
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D28544325
      
      Pulled By: anand1976
      
      fbshipit-source-id: 8f209b9af900c459dc42daa7a610d5f00176eeed
    • Fix test issue in new env_test tests (#8319) · 871a2cb2
      sdong committed
      Summary:
      The two new tests added to env_test don't clear sync points, so if the tests are run in continuous mode rather than parallel mode, the next test will trigger the previous test's sync point and fail. Fix it.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8319
      
      Test Plan: Run the tests in continuous mode which used to fail and see them passing.
      
      Reviewed By: pdillinger
      
      Differential Revision: D28542562
      
      fbshipit-source-id: 4052d487635188fe68a2a9df4b03d97b23f96720
    • Minor improvements in env_test (#8317) · ce0fc71a
      sdong committed
      Summary:
      Fix typo in comments in env_test and add PermitUncheckedError() to two statuses.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8317
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28525093
      
      fbshipit-source-id: 7a1ed3e45b6f500b8d2ae19fa339c9368111e922
  2. May 19, 2021 (3 commits)
    • Sync ingested files only if reopen is supported by the FS (#8296) · 9d61a085
      anand76 committed
      Summary:
      Some file systems (especially distributed FS) do not support reopening a file for writing. The ExternalSstFileIngestionJob calls ReopenWritableFile in order to sync the ingested file, which typically makes sense only on a local file system with a page cache (i.e., POSIX). So this change syncs the ingested file only if ReopenWritableFile doesn't return Status::NotSupported().
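      The fallback logic can be sketched with a Status-like stand-in. `Code`, `SyncIngestedFile`, and the global flag below are illustrative, not the actual ExternalSstFileIngestionJob code: NotSupported skips the sync without failing ingestion, while real I/O errors still propagate.

      ```cpp
      #include <cassert>

      enum class Code { kOk, kNotSupported, kIOError };
      struct Status {
        Code code;
        bool ok() const { return code == Code::kOk; }
        bool IsNotSupported() const { return code == Code::kNotSupported; }
      };

      bool synced = false;  // records whether the sync step actually ran

      Status SyncIngestedFile(Status reopen_result) {
        if (reopen_result.IsNotSupported()) {
          return Status{Code::kOk};  // FS can't reopen for write: skip the sync
        }
        if (!reopen_result.ok()) {
          return reopen_result;      // real error: propagate to the caller
        }
        synced = true;               // reopen succeeded: sync the file (elided)
        return Status{Code::kOk};
      }

      int main() {
        assert(SyncIngestedFile(Status{Code::kNotSupported}).ok() && !synced);
        assert(SyncIngestedFile(Status{Code::kOk}).ok() && synced);
        assert(!SyncIngestedFile(Status{Code::kIOError}).ok());
        return 0;
      }
      ```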
      
      Tests:
      Add a new unit test in external_sst_file_basic_test
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8296
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28420865
      
      Pulled By: anand1976
      
      fbshipit-source-id: 380e7f5ff95324997f7a59864a9ac96ebbd0100c
    • Handle return code by io_uring_submit_and_wait() and io_uring_wait_cqe() (#8311) · 60e5af83
      sdong committed
      Summary:
      Right now, the return codes of io_uring_submit_and_wait() and io_uring_wait_cqe() are not handled, which is not good practice. Although these two functions are not supposed to return non-zero values during normal execution, they are suspected to return non-zero values when an interruption happens, in which case the code might hang.
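      The concern can be illustrated without liburing by a stand-in function that, like io_uring_submit_and_wait() and io_uring_wait_cqe(), reports failures as negative errno values. `FakeSubmitAndWait` and `SubmitWithRetry` are invented for this sketch; the retry-on-EINTR loop is a common pattern for handling such return codes, not necessarily the exact fix in the PR.

      ```cpp
      #include <cassert>
      #include <cerrno>

      int calls = 0;
      // Stand-in for io_uring_submit_and_wait(): fails once with -EINTR,
      // then reports 3 SQEs submitted.
      int FakeSubmitAndWait() { return (++calls == 1) ? -EINTR : 3; }

      int SubmitWithRetry() {
        while (true) {
          int ret = FakeSubmitAndWait();
          if (ret == -EINTR) continue;  // interrupted by a signal: retry
          return ret;                   // submitted count, or another -errno
        }
      }

      int main() {
        assert(SubmitWithRetry() == 3);
        assert(calls == 2);  // one interrupted attempt, then one success
        return 0;
      }
      ```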
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8311
      
      Test Plan: Make sure at least normal test cases still pass.
      
      Reviewed By: anand1976
      
      Differential Revision: D28500828
      
      fbshipit-source-id: 8a76cea9cafbd041102e0b6a8eef9d0bfed7c211
    • Fix MultiGet with PinnableSlices and Merge for WBWI (#8299) · 6b0a22a4
      mrambacher committed
      Summary:
      The MultiGetFromBatchAndDB would fail if the PinnableSlice value being returned was pinned. This could happen if the value was retrieved from the DB (not the memtable), or potentially if the values were reused (and a previous iteration returned a slice that was pinned).
      
      This change resets the pinnable value to clear it prior to attempting to use it, thereby eliminating the problem with the value already being pinned.
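      A toy model of the bug and the fix: `PinnableValue` and `Lookup` below are invented stand-ins, not RocksDB's PinnableSlice or MultiGetFromBatchAndDB. The point is that a holder still pinned from a previous lookup, if not Reset() first, short-circuits the new lookup and leaves a stale value.

      ```cpp
      #include <cassert>
      #include <string>

      class PinnableValue {
       public:
        void PinSelf(const std::string& v) { buf_ = v; pinned_ = true; }
        void Reset() { pinned_ = false; buf_.clear(); }
        bool pinned() const { return pinned_; }
        const std::string& value() const { return buf_; }

       private:
        std::string buf_;
        bool pinned_ = false;
      };

      void Lookup(const std::string& new_val, PinnableValue* out, bool with_fix) {
        if (with_fix) out->Reset();  // clear prior state before reuse (the fix)
        if (out->pinned()) return;   // stale pinned value wrongly kept as result
        out->PinSelf(new_val);
      }

      int main() {
        PinnableValue buggy;
        Lookup("v1", &buggy, /*with_fix=*/false);
        Lookup("v2", &buggy, /*with_fix=*/false);  // reused, as in MultiGet
        assert(buggy.value() == "v1");  // stale value survives: the bug

        PinnableValue fixed;
        Lookup("v1", &fixed, /*with_fix=*/true);
        Lookup("v2", &fixed, /*with_fix=*/true);
        assert(fixed.value() == "v2");  // Reset() before reuse yields new value
        return 0;
      }
      ```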
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8299
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28455426
      
      Pulled By: mrambacher
      
      fbshipit-source-id: a34d7d983ec9b6bb4c8a2b4892f72858d43e6972
  3. May 18, 2021 (3 commits)
    • Expose CompressionOptions::parallel_threads through C API (#8302) · 83d1a665
      Stanislav Tkach committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8302
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28499262
      
      Pulled By: ajkr
      
      fbshipit-source-id: 7b17b79af871d874dfca76db9bca0d640a6cd854
    • Make it possible to apply only a subrange of table property collectors (#8298) · d83542ca
      Levi Tamasi committed
      Summary:
      This patch does two things:
      1) Introduces some aliases in order to eliminate/prevent long-winded type names
      w/r/t the internal table property collectors (see e.g.
      `std::vector<std::unique_ptr<IntTblPropCollectorFactory>>`).
      2) Makes it possible to apply only a subrange of table property collectors during
      table building by turning `TableBuilderOptions::int_tbl_prop_collector_factories`
      from a pointer to a `vector` into a range (i.e. a pair of iterators).
      
      Rationale: I plan to introduce a BlobDB related table property collector, which
      should only be applied during table creation if blob storage is enabled at the moment
      (which can be changed dynamically). This change will make it possible to include/
      exclude the BlobDB related collector as needed without having to introduce
      a second `vector` of collectors in `ColumnFamilyData` with pretty much the same
      contents.
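      The pointer-to-vector versus iterator-pair change can be sketched as follows. `CollectorFactory`, `FactoryRange`, and `ApplyCollectors` are illustrative names, not the PR's actual types: with a range, a subrange (e.g. excluding a conditionally applied trailing collector) costs nothing and needs no second vector.

      ```cpp
      #include <cassert>
      #include <memory>
      #include <string>
      #include <vector>

      struct CollectorFactory { std::string name; };
      using FactoryVec = std::vector<std::unique_ptr<CollectorFactory>>;
      using FactoryRange =
          std::pair<FactoryVec::const_iterator, FactoryVec::const_iterator>;

      // The table builder walks only the factories inside the given range.
      std::vector<std::string> ApplyCollectors(const FactoryRange& range) {
        std::vector<std::string> applied;
        for (auto it = range.first; it != range.second; ++it) {
          applied.push_back((*it)->name);
        }
        return applied;
      }

      int main() {
        FactoryVec factories;
        factories.push_back(
            std::make_unique<CollectorFactory>(CollectorFactory{"core"}));
        factories.push_back(
            std::make_unique<CollectorFactory>(CollectorFactory{"blob"}));

        // Blob storage enabled: apply all collectors.
        auto all = ApplyCollectors({factories.cbegin(), factories.cend()});
        assert(all.size() == 2);

        // Blob storage disabled: exclude the trailing blob collector.
        auto subset = ApplyCollectors({factories.cbegin(), factories.cend() - 1});
        assert(subset.size() == 1 && subset[0] == "core");
        return 0;
      }
      ```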
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8298
      
      Test Plan: `make check`
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28430910
      
      Pulled By: ltamasi
      
      fbshipit-source-id: a81d28f2c59495865300f43deb2257d2e6977c8e
    • Write file temperature information to manifest (#8284) · 0ed8cb66
      sdong committed
      Summary:
      As a part of tiered storage, writing temperature information to the manifest is needed so that after DB recovery, RocksDB still has the tiering information, in order to implement some further necessary functionalities.
      
      Also fix some issues in simulated hybrid FS.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8284
      
      Test Plan: Add a new unit test to validate that the information is indeed written and read back.
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D28335801
      
      fbshipit-source-id: 56aeb2e6ea090be0200181dd968c8a7278037def
  4. May 14, 2021 (2 commits)
    • Initial support for secondary cache in LRUCache (#8271) · feb06e83
      anand76 committed
      Summary:
      Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and is evicted from memory, it is inserted into the secondary cache. On an LRU cache miss and a successful lookup in the secondary cache, the item is promoted to the LRU cache. Only synchronous lookup is currently supported. The secondary cache could be used to implement a persistent (flash) cache or a compressed cache.
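      The promotion flow can be sketched with maps standing in for the cache shards. `TwoTierCache` below is an invented illustration: real LRU ordering, charges, and asynchronous lookup are elided; only the evict-to-secondary / promote-on-hit shape is shown.

      ```cpp
      #include <cassert>
      #include <map>
      #include <string>

      struct TwoTierCache {
        std::map<std::string, std::string> primary, secondary;
        size_t primary_capacity = 1;

        void Insert(const std::string& k, const std::string& v) {
          if (primary.size() >= primary_capacity && !primary.count(k)) {
            auto victim = primary.begin();            // evict (LRU order elided)
            secondary[victim->first] = victim->second;  // spill into secondary
            primary.erase(victim);
          }
          primary[k] = v;
        }

        bool Lookup(const std::string& k, std::string* v) {
          if (auto it = primary.find(k); it != primary.end()) {
            *v = it->second;
            return true;
          }
          if (auto it = secondary.find(k); it != secondary.end()) {
            Insert(k, it->second);  // promote to the primary (LRU) cache
            secondary.erase(k);
            *v = primary[k];
            return true;
          }
          return false;  // miss in both tiers
        }
      };

      int main() {
        TwoTierCache c;
        c.Insert("a", "1");
        c.Insert("b", "2");  // evicts "a" into the secondary cache
        assert(c.secondary.count("a") == 1);
        std::string v;
        assert(c.Lookup("a", &v) && v == "1");  // secondary hit, promoted
        assert(c.primary.count("a") == 1 && c.secondary.count("a") == 0);
        return 0;
      }
      ```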
      
      Tests:
      Results from cache_bench and db_bench don't show any regression due to these changes.
      
      cache_bench results before and after this change -
      Command
      ```./cache_bench -ops_per_thread=10000000 -threads=1```
      Before
      ```Complete in 40.688 s; QPS = 245774```
      ```Complete in 40.486 s; QPS = 246996```
      ```Complete in 42.019 s; QPS = 237989```
      After
      ```Complete in 40.672 s; QPS = 245869```
      ```Complete in 44.622 s; QPS = 224107```
      ```Complete in 42.445 s; QPS = 235599```
      
      db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 -
      Commands
      ```./db_bench  --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true```
      
      ```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300```
      Before
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      80.702 micros/op 198104 ops/sec;   54.4 MB/s (3708999 of 3708999 found)
      ```
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      87.124 micros/op 183625 ops/sec;   50.4 MB/s (3439999 of 3439999 found)
      ```
      After
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      77.653 micros/op 206025 ops/sec;   56.6 MB/s (3866999 of 3866999 found)
      ```
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      84.962 micros/op 188299 ops/sec;   51.7 MB/s (3535999 of 3535999 found)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D28357511
      
      Pulled By: anand1976
      
      fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2
    • Refactor Option obj address from char* to void* (#8295) · d15fbae4
      Jay Zhuang committed
      Summary:
      And replace `reinterpret_cast` with `static_cast` or no cast.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8295
      
      Test Plan: `make check`
      
      Reviewed By: mrambacher
      
      Differential Revision: D28420303
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 645be123a0df624dc2bea37cd54a35403fc494fa
  5. May 13, 2021 (4 commits)
  6. May 12, 2021 (2 commits)
    • New Cache API for gathering statistics (#8225) · 78a309bf
      Peter Dillinger committed
      Summary:
      Adds a new Cache::ApplyToAllEntries API that we expect to use
      (in follow-up PRs) for efficiently gathering block cache statistics.
      Notable features vs. old ApplyToAllCacheEntries:
      
      * Includes key and deleter (in addition to value and charge). We could
      have passed in a Handle but then more virtual function calls would be
      needed to get the "fields" of each entry. We expect to use the 'deleter'
      to identify the origin of entries, perhaps even more.
      * Heavily tuned to minimize latency impact on operating cache. It
      does this by iterating over small sections of each cache shard while
      cycling through the shards.
      * Supports tuning roughly how many entries to operate on for each
      lock acquire and release, to control the impact on the latency of other
      operations without excessive lock acquire & release. The right balance
      can depend on the cost of the callback. Good default seems to be
      around 256.
      * There should be no need to disable thread safety. (I would expect
      uncontended locks to be sufficiently fast.)
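      The chunked-scan idea can be sketched for a single shard. `Shard` and `ApplyToEntries` below are simplified stand-ins, assuming a flat vector for the hash table; the real implementation also has to tolerate the table being resized between sections. `average_entries_per_lock` is the tunable mentioned above (default around 256): the mutex is re-acquired per section, so concurrent cache operations are only briefly blocked.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <functional>
      #include <mutex>
      #include <vector>

      struct Shard {
        std::mutex mu;
        std::vector<int> table;  // stand-in for the hash table's entries

        void ApplyToEntries(const std::function<void(int)>& callback,
                            size_t average_entries_per_lock) {
          size_t index = 0;
          while (true) {
            std::lock_guard<std::mutex> lock(mu);  // re-acquired per section
            size_t end = std::min(index + average_entries_per_lock, table.size());
            for (; index < end; ++index) callback(table[index]);
            if (index >= table.size()) break;
          }  // lock released here; other cache ops can run before next section
        }
      };

      int main() {
        Shard s;
        for (int i = 0; i < 1000; ++i) s.table.push_back(i);
        long sum = 0;
        s.ApplyToEntries([&](int v) { sum += v; }, 256);
        assert(sum == 999L * 1000 / 2);  // every entry visited exactly once
        return 0;
      }
      ```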
      
      I have enhanced cache_bench to validate this approach:
      
      * Reports a histogram of ns per operation, so we can look at the
      distribution of times, not just throughput (average).
      * Can add a thread for simulated "gather stats" which calls
      ApplyToAllEntries at a specified interval. We also generate a histogram
      of time to run ApplyToAllEntries.
      
      To make the iteration over some entries of each shard work as cleanly as
      possible, even with resize between next set of entries, I have
      re-arranged which hash bits are used for sharding and which for indexing
      within a shard.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225
      
      Test Plan:
      A couple of unit tests are added, but primary validation is manual, as
      the primary risk is to performance.
      
      The primary validation is using cache_bench to ensure that neither
      the minor hashing changes nor the simulated stats gathering
      significantly impact QPS or latency distribution. Note that adding op
      latency histogram seriously impacts the benchmark QPS, so for a
      fair baseline, we need the cache_bench changes (except removing the simulated
      stat gathering so that it compiles). In short, we don't see any
      reproducible difference in ops/sec or op latency unless we are gathering
      stats nearly continuously. Test uses 10GB block cache with
      8KB values to be somewhat realistic in the number of items to iterate
      over.
      
      Baseline typical output:
      
      ```
      Complete in 92.017 s; Rough parallel ops/sec = 869401
      Thread ops/sec = 54662
      
      Operation latency (ns):
      Count: 80000000 Average: 11223.9494  StdDev: 29.61
      Min: 0  Median: 7759.3973  Max: 9620500
      Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
      ------------------------------------------------------
      [       0,       1 ]       68   0.000%   0.000%
      (    2900,    4400 ]       89   0.000%   0.000%
      (    4400,    6600 ] 33630240  42.038%  42.038% ########
      (    6600,    9900 ] 18129842  22.662%  64.700% #####
      (    9900,   14000 ]  7877533   9.847%  74.547% ##
      (   14000,   22000 ] 15193238  18.992%  93.539% ####
      (   22000,   33000 ]  3037061   3.796%  97.335% #
      (   33000,   50000 ]  1626316   2.033%  99.368%
      (   50000,   75000 ]   421532   0.527%  99.895%
      (   75000,  110000 ]    56910   0.071%  99.966%
      (  110000,  170000 ]    16134   0.020%  99.986%
      (  170000,  250000 ]     5166   0.006%  99.993%
      (  250000,  380000 ]     3017   0.004%  99.996%
      (  380000,  570000 ]     1337   0.002%  99.998%
      (  570000,  860000 ]      805   0.001%  99.999%
      (  860000, 1200000 ]      319   0.000% 100.000%
      ( 1200000, 1900000 ]      231   0.000% 100.000%
      ( 1900000, 2900000 ]      100   0.000% 100.000%
      ( 2900000, 4300000 ]       39   0.000% 100.000%
      ( 4300000, 6500000 ]       16   0.000% 100.000%
      ( 6500000, 9800000 ]        7   0.000% 100.000%
      ```
      
      New, gather_stats=false. Median thread ops/sec of 5 runs:
      
      ```
      Complete in 92.030 s; Rough parallel ops/sec = 869285
      Thread ops/sec = 54458
      
      Operation latency (ns):
      Count: 80000000 Average: 11298.1027  StdDev: 42.18
      Min: 0  Median: 7722.0822  Max: 6398720
      Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
      ------------------------------------------------------
      [       0,       1 ]      109   0.000%   0.000%
      (    2900,    4400 ]      793   0.001%   0.001%
      (    4400,    6600 ] 34054563  42.568%  42.569% #########
      (    6600,    9900 ] 17482646  21.853%  64.423% ####
      (    9900,   14000 ]  7908180   9.885%  74.308% ##
      (   14000,   22000 ] 15032072  18.790%  93.098% ####
      (   22000,   33000 ]  3237834   4.047%  97.145% #
      (   33000,   50000 ]  1736882   2.171%  99.316%
      (   50000,   75000 ]   446851   0.559%  99.875%
      (   75000,  110000 ]    68251   0.085%  99.960%
      (  110000,  170000 ]    18592   0.023%  99.983%
      (  170000,  250000 ]     7200   0.009%  99.992%
      (  250000,  380000 ]     3334   0.004%  99.997%
      (  380000,  570000 ]     1393   0.002%  99.998%
      (  570000,  860000 ]      700   0.001%  99.999%
      (  860000, 1200000 ]      293   0.000% 100.000%
      ( 1200000, 1900000 ]      196   0.000% 100.000%
      ( 1900000, 2900000 ]       69   0.000% 100.000%
      ( 2900000, 4300000 ]       32   0.000% 100.000%
      ( 4300000, 6500000 ]       10   0.000% 100.000%
      ```
      
      New, gather_stats=true, 1 second delay between scans. Scans take about
      1 second here so it's spending about 50% time scanning. Still the effect on
      ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:
      
      ```
      Complete in 91.890 s; Rough parallel ops/sec = 870608
      Thread ops/sec = 54551
      
      Operation latency (ns):
      Count: 80000000 Average: 11311.2629  StdDev: 45.28
      Min: 0  Median: 7686.5458  Max: 10018340
      Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
      ------------------------------------------------------
      [       0,       1 ]       71   0.000%   0.000%
      (    2900,    4400 ]      291   0.000%   0.000%
      (    4400,    6600 ] 34492060  43.115%  43.116% #########
      (    6600,    9900 ] 16727328  20.909%  64.025% ####
      (    9900,   14000 ]  7845828   9.807%  73.832% ##
      (   14000,   22000 ] 15510654  19.388%  93.220% ####
      (   22000,   33000 ]  3216533   4.021%  97.241% #
      (   33000,   50000 ]  1680859   2.101%  99.342%
      (   50000,   75000 ]   439059   0.549%  99.891%
      (   75000,  110000 ]    60540   0.076%  99.967%
      (  110000,  170000 ]    14649   0.018%  99.985%
      (  170000,  250000 ]     5242   0.007%  99.991%
      (  250000,  380000 ]     3260   0.004%  99.995%
      (  380000,  570000 ]     1599   0.002%  99.997%
      (  570000,  860000 ]     1043   0.001%  99.999%
      (  860000, 1200000 ]      471   0.001%  99.999%
      ( 1200000, 1900000 ]      275   0.000% 100.000%
      ( 1900000, 2900000 ]      143   0.000% 100.000%
      ( 2900000, 4300000 ]       60   0.000% 100.000%
      ( 4300000, 6500000 ]       27   0.000% 100.000%
      ( 6500000, 9800000 ]        7   0.000% 100.000%
      ( 9800000, 14000000 ]        1   0.000% 100.000%
      
      Gather stats latency (us):
      Count: 46 Average: 980387.5870  StdDev: 60911.18
      Min: 879155  Median: 1033777.7778  Max: 1261431
      Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
      ------------------------------------------------------
      (  860000, 1200000 ]       45  97.826%  97.826% ####################
      ( 1200000, 1900000 ]        1   2.174% 100.000%
      
      Most recent cache entry stats:
      Number of entries: 1295133
      Total charge: 9.88 GB
      Average key size: 23.4982
      Average charge: 8.00 KB
      Unique deleters: 3
      ```
      
      Reviewed By: mrambacher
      
      Differential Revision: D28295742
      
      Pulled By: pdillinger
      
      fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
    • Added static methods for simple types to OptionTypeInfo (#8249) · 78e82410
      mrambacher committed
      Summary:
      Added ParseType, SerializeType, and TypesAreEqual methods to OptionTypeInfo.  These methods can be used for serialization and deserialization of basic types.
      
      Change the MutableCF/DB Options to use this format.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8249
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28351190
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 72a78643b804f2f0bf59c32ffefa63346672ad16
  7. May 11, 2021 (2 commits)
    • Add ObjectRegistry to ConfigOptions (#8166) · 9f2d255a
      mrambacher committed
      Summary:
      This change enables a couple of things:
      - Different ConfigOptions instances can have different registries/factories associated with them, thereby allowing things like a "Test" ConfigOptions versus a "Production" one
      - The ObjectRegistry is created fewer times and can be re-used
      
      The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D27657952
      
      Pulled By: mrambacher
      
      fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd
    • Add Merge Operator support to WriteBatchWithIndex (#8135) · ff463742
      mrambacher committed
      Summary:
      The WBWI has two differing modes of operation dependent on the value
      of the constructor parameter `overwrite_key`.
      Currently, regardless of the parameter, neither mode performs as
      expected when using Merge. This PR remedies this by correctly invoking
      the appropriate Merge Operator before returning results from the WBWI.
      
      Examples of issues that exist which are solved by this PR:
      
      ## Example 1 with `overwrite_key=false`
      Currently, from an empty database, the following sequence:
      ```
      Put('k1', 'v1')
      Merge('k1', 'v2')
      Get('k1')
      ```
      Incorrectly yields `v2`, that is to say that the Merge behaves like a Put.
      
      ## Example 2 with `overwrite_key=true`
      Currently, from an empty database, the following sequence:
      ```
      Put('k1', 'v1')
      Merge('k1', 'v2')
      Get('k1')
      ```
      Incorrectly yields `ERROR: kMergeInProgress`.
      
      ## Example 3 with `overwrite_key=false`
      Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
      ```
      Merge('k1', 'v2')
      GetFromBatchAndDB('k1')
      ```
      Incorrectly yields `v1,v2`
      
      ## Example 4 with `overwrite_key=true`
      Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
      ```
      Merge('k1', 'v1')
      GetFromBatchAndDB('k1')
      ```
      Incorrectly yields `ERROR: kMergeInProgress`.
      
      ## Example 5 with `overwrite_key=false`
      Currently, from an empty database, the following sequence:
      ```
      Put('k1', 'v1')
      Merge('k1', 'v2')
      GetFromBatchAndDB('k1')
      ```
      Incorrectly yields `v1,v2`
      
      ## Example 6 with `overwrite_key=true`
      Currently, from an empty database, the following sequence:
      ```
      Put('k1', 'v1')
      Merge('k1', 'v2')
      GetFromBatchAndDB('k1')
      ```
      Incorrectly yields `ERROR: kMergeInProgress`.
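      What "correctly invoking the appropriate Merge Operator" amounts to can be sketched with a comma-appending merge function standing in for a configured merge operator. `FullMerge` below is illustrative only; the correct result of the examples above depends on the operator actually in use. The base value from the DB (if any) is combined with the batch's merge operands, instead of returning raw operands or `kMergeInProgress`.

      ```cpp
      #include <cassert>
      #include <string>
      #include <vector>

      // Stand-in merge operator: append operands to the base, comma-separated
      // (similar in spirit to a string-append operator).
      std::string FullMerge(const std::string* base,
                            const std::vector<std::string>& operands) {
        std::string result = base ? *base : "";
        for (const auto& op : operands) {
          if (!result.empty()) result += ",";
          result += op;
        }
        return result;
      }

      int main() {
        // DB holds ('k1' -> 'v1'); the batch holds Merge('k1', 'v2'):
        std::string base = "v1";
        assert(FullMerge(&base, {"v2"}) == "v1,v2");
        // No base value: the result is just the combined operands.
        assert(FullMerge(nullptr, {"v2"}) == "v2");
        return 0;
      }
      ```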
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8135
      
      Reviewed By: pdillinger
      
      Differential Revision: D27657938
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 0fbda6bbc66bedeba96a84786d90141d776297df
  8. May 8, 2021 (5 commits)
  9. May 7, 2021 (1 commit)
  10. May 6, 2021 (5 commits)
    • Permit stdout "fail"/"error" in whitebox crash test (#8272) · b71b4597
      Andrew Kryczka committed
      Summary:
      In https://github.com/facebook/rocksdb/issues/8268, the `db_stress` stdout began containing both the strings
      "fail" and "error" (case-insensitive). The whitebox crash test
      failed upon seeing either of those strings.
      
      I checked that all other occurrences of "fail" and "error"
      (case-insensitive) that `db_stress` produces are printed to `stderr`. So
      this PR separates the handling of `db_stress`'s stdout and stderr, and
      only fails when one of those bad strings is found in stderr.
      
      The downside of this PR is `db_stress`'s original interleaving of stdout/stderr is not preserved in `db_crashtest.py`'s output.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8272
      
      Test Plan:
      run it; see it succeeds for several runs until encountering a real error
      
      ```
      $ python3 tools/db_crashtest.py whitebox --simple --random_kill_odd=8887 --max_key=1000000 --value_size_mult=33
      ...
      db_stress: cache/clock_cache.cc:483: bool rocksdb::{anonymous}::ClockCacheShard::Unref(rocksdb::{anonymous}::CacheHandle*, bool, rocksdb::{anonymous}::CleanupContext*): Assertion `CountRefs(flags) > 0' failed.
      
      TEST FAILED. Output has 'fail'!!!
      ```
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D28239233
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3b8602a0d570466a7e2c81bb9c49468f7716091e
    • db_stress: wait for compaction to finish after open with failure injection (#8270) · 7f3a0f5b
      sdong committed
      Summary:
      When injecting errors during DB open, an error can happen in background threads, causing DB open to succeed, but the DB is soon made read-only and subsequent writes will fail, which is not expected. To prevent this from happening, wait for compaction to finish before serving traffic. If there is a failure, reopen.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8270
      
      Test Plan: Run the test.
      
      Reviewed By: ajkr
      
      Differential Revision: D28230537
      
      fbshipit-source-id: e2e97888904f9b9bb50c35ccf95b88c2319ef5c3
    • S
      Refactor kill point (#8241) · e19908cb
      Committed by sdong
      Summary:
      Refactor the kill points into one single class, rather than several extern variables. The intention was to drop unflushed data before killing to simulate some job, and I tried to pass a pointer to the fault injection FS to the killing class, but it ended up being harder than I thought. Perhaps we'll need to do this in another way. But I thought the refactoring itself is good, so I'm sending it out.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8241
      
      Test Plan: make release and run crash test for a while.
      
      Reviewed By: anand1976
      
      Differential Revision: D28078486
      
      fbshipit-source-id: f9182c1455f52e6851c13f88a21bade63bcec45f
      e19908cb
    • M
      Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262) · 8948dc85
      Committed by mrambacher
      Summary:
      The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions.  This change cleans that up by introducing an ImmutableOptions struct.  Following the pattern of the Options struct, this class inherits from the DB and CF immutable option structs.
      
      Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR.  All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.
      
      Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating unneeded function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
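      The inheritance pattern described above can be sketched roughly as follows. This is a minimal illustration with placeholder members, not the actual RocksDB definitions:

      ```cpp
      #include <memory>
      #include <string>

      // Placeholder members; the real structs hold many more fields.
      struct ImmutableDBOptions {
        std::string db_log_dir;
      };

      struct ImmutableCFOptions {
        // Per this PR, `fs` is a shared_ptr rather than a raw pointer.
        std::shared_ptr<int> fs;  // stand-in for shared_ptr<FileSystem>
      };

      // Following the pattern of the Options struct: one struct inheriting
      // from both the DB and CF immutable option structs.
      struct ImmutableOptions : public ImmutableDBOptions,
                                public ImmutableCFOptions {};

      // A function that previously needed both an ImmutableDBOptions and an
      // ImmutableCFOptions can now accept the combined struct.
      std::string Describe(const ImmutableOptions& opts) {
        return opts.db_log_dir + (opts.fs ? "+fs" : "");
      }

      int main() {
        ImmutableOptions opts;
        opts.db_log_dir = "/tmp/logs";
        opts.fs = std::make_shared<int>(0);
        return Describe(opts) == "/tmp/logs+fs" ? 0 : 1;
      }
      ```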
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262
      
      Reviewed By: pdillinger
      
      Differential Revision: D28226540
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
      8948dc85
    • A
      Fix `GetLiveFiles()` returning OPTIONS-000000 (#8268) · 0f42e50f
      Committed by Andrew Kryczka
      Summary:
      See release note in HISTORY.md.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8268
      
      Test Plan: unit test repro
      
      Reviewed By: siying
      
      Differential Revision: D28227901
      
      Pulled By: ajkr
      
      fbshipit-source-id: faf61d13b9e43a761e3d5dcf8203923126b51339
      0f42e50f
  11. 05 May 2021: 4 commits
    • P
      Fix use-after-free threading bug in ClockCache (#8261) · 3b981eaa
      Committed by Peter Dillinger
      Summary:
      In testing for https://github.com/facebook/rocksdb/issues/8225 I found cache_bench would crash with
      -use_clock_cache, as well as db_bench -use_clock_cache, but not
      single-threaded. Smaller cache size hits failure much faster. ASAN
      reported the failure as calling malloc_usable_size on the `key` pointer
      of a ClockCache handle after it was reportedly freed. On detailed
      inspection I found this bad sequence of operations for a cache entry:
      
      state=InCache=1,refs=1
      [thread 1] Start ClockCacheShard::Unref (from Release, no mutex)
      [thread 1] Decrement ref count
      state=InCache=1,refs=0
      [thread 1] Suspend before CalcTotalCharge (no mutex)
      
      [thread 2] Start UnsetInCache (from Insert, mutex held)
      [thread 2] clear InCache bit
      state=InCache=0,refs=0
      [thread 2] Calls RecycleHandle (based on pre-updated state)
      [thread 2] Returns to Insert which calls Cleanup which deletes `key`
      
      [thread 1] Resume ClockCacheShard::Unref
      [thread 1] Read `key` in CalcTotalCharge
      
      To fix this, I've added a field to the handle to store the metadata
      charge so that we can efficiently remember everything we need from
      the handle in Unref. We must not read from the handle again if we
      decrement the count to zero with InCache=1, which means we don't own
      the entry and someone else could eject/overwrite it immediately.
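      The fix described above (capture everything needed from the handle before the reference count can reach zero) can be sketched in miniature; the names here are illustrative, not the actual ClockCache code:

      ```cpp
      #include <atomic>
      #include <cassert>
      #include <cstddef>
      #include <cstdint>

      // Simplified handle, loosely modeled on the fix: the metadata charge is
      // stored in the handle at insert time so Unref never needs to recompute
      // it from `key` after dropping its reference.
      struct Handle {
        std::atomic<uint32_t> refs{1};
        size_t charge = 0;
        uint32_t meta_charge = 0;  // the new field added by this change
      };

      // Returns the total charge released. All handle fields are read BEFORE
      // the decrement; once refs hits zero with InCache=1, another thread may
      // recycle or free the entry, so touching `h` afterwards would be a
      // use-after-free.
      size_t Unref(Handle* h) {
        size_t total_charge = h->charge + h->meta_charge;  // read first
        uint32_t prev = h->refs.fetch_sub(1, std::memory_order_acq_rel);
        assert(prev > 0);
        // Do not dereference `h` past this point.
        return total_charge;
      }

      int main() {
        Handle h;
        h.charge = 100;
        h.meta_charge = 56;
        assert(Unref(&h) == 156);
        return 0;
      }
      ```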
      
      Note before this change, on amd64 sizeof(Handle) == 56 even though there
      are only 48 bytes of data. Grouping together the uint32_t fields would
      cut it down to 48, but I've added another uint32_t, which takes it
      back up to 56. Not a big deal.
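      The size arithmetic above comes from alignment padding; a rough illustration, using placeholder structs rather than the real Handle layout:

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Pointer-sized fields interleaved with uint32_t fields force padding
      // on LP64 ABIs, since each pointer must be 8-byte aligned.
      struct Scattered {
        void* key;
        uint32_t hash;   // 4 bytes + 4 bytes padding before the next pointer
        void* value;
        uint32_t flags;  // another 4 bytes of padding
        void* deleter;
        uint32_t refs;   // trailing padding up to alignof(void*)
      };

      // Grouping the uint32_t fields lets them pack back-to-back, so adding
      // one more uint32_t (e.g. a stored metadata charge) can be "free".
      struct Grouped {
        void* key;
        void* value;
        void* deleter;
        uint32_t hash;
        uint32_t flags;
        uint32_t refs;
        uint32_t meta_charge;
      };

      int main() {
        // Holds on both 32- and 64-bit ABIs: the grouped layout plus one
        // extra field never costs more than the scattered layout plus it.
        assert(sizeof(Grouped) <= sizeof(Scattered) + sizeof(uint32_t));
        return 0;
      }
      ```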
      
      Also fixed DisownData to cooperate with ASAN as in LRUCache.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8261
      
      Test Plan:
      Manual + adding use_clock_cache to db_crashtest.py
      
      Base performance
      ./cache_bench -use_clock_cache
      Complete in 17.060 s; QPS = 2458513
      New performance
      ./cache_bench -use_clock_cache
      Complete in 17.052 s; QPS = 2459695
      
      Any difference is easily buried in small noise.
      
      The crash test shows still more bug(s) in ClockCache, so I'm expecting to
      disable ClockCache in production code in a follow-up PR (if we
      can't find and fix the bug(s))
      
      Reviewed By: mrambacher
      
      Differential Revision: D28207358
      
      Pulled By: pdillinger
      
      fbshipit-source-id: aa7a9322afc6f18f30e462c75dbbe4a1206eb294
      3b981eaa
    • A
      Fix ConcurrentTaskLimiter token release for shutdown (#8253) · c70bae1b
      Committed by Andrew Kryczka
      Summary:
      Previously the shutdown process did not properly wait for all
      `compaction_thread_limiter` tokens to be released before proceeding to
      delete the DB's C++ objects. When this happened, we saw tests like
      "DBCompactionTest.CompactionLimiter" flake with the following error:
      
      ```
      virtual
      rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl():
      Assertion `outstanding_tasks_ == 0' failed.
      ```
      
      There is a case where a token can still be alive even after the shutdown
      process has waited for BG work to complete. In particular, this happens
      because the shutdown process only waits for flush/compaction scheduled/unscheduled counters to all
      reach zero. These counters are decremented in `BackgroundCallCompaction()`
      functions. However, tokens are released in `BGWork*Compaction()` functions, which
      actually wrap the `BackgroundCallCompaction()` function.
      
      A simple sleep could repro the race condition:
      
      ```
      $ diff --git a/db/db_impl/db_impl_compaction_flush.cc
      b/db/db_impl/db_impl_compaction_flush.cc
      index 806bc548a..ba59efa89 100644
      --- a/db/db_impl/db_impl_compaction_flush.cc
      +++ b/db/db_impl/db_impl_compaction_flush.cc
      @@ -2442,6 +2442,7 @@ void DBImpl::BGWorkCompaction(void* arg) {
             static_cast<PrepickedCompaction*>(ca.prepicked_compaction);
         static_cast_with_check<DBImpl>(ca.db)->BackgroundCallCompaction(
             prepicked_compaction, Env::Priority::LOW);
      +  sleep(1);
         delete prepicked_compaction;
       }
      
      $ ./db_compaction_test --gtest_filter=DBCompactionTest.CompactionLimiter
      db_compaction_test: util/concurrent_task_limiter_impl.cc:24: virtual rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl(): Assertion `outstanding_tasks_ == 0' failed.
      Received signal 6 (Aborted)
      #0   /usr/local/fbcode/platform007/lib/libc.so.6(gsignal+0xcf) [0x7f02673c30ff] ??      ??:0
      #1   /usr/local/fbcode/platform007/lib/libc.so.6(abort+0x134) [0x7f02673ac934] ??       ??:0
      ...
      ```
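      The ordering requirement behind this fix is that the limiter token must be released before the completion signal that shutdown waits on. A minimal sketch with hypothetical names, not the actual RocksDB shutdown path:

      ```cpp
      #include <cassert>
      #include <condition_variable>
      #include <mutex>
      #include <thread>

      // Stand-in for ConcurrentTaskLimiter: counts outstanding tokens.
      class Limiter {
       public:
        void Acquire() {
          std::lock_guard<std::mutex> lk(mu_);
          ++outstanding_;
        }
        void Release() {
          std::lock_guard<std::mutex> lk(mu_);
          --outstanding_;
        }
        int outstanding() {
          std::lock_guard<std::mutex> lk(mu_);
          return outstanding_;
        }
       private:
        std::mutex mu_;
        int outstanding_ = 0;
      };

      int main() {
        Limiter limiter;
        std::mutex mu;
        std::condition_variable cv;
        bool done = false;

        limiter.Acquire();
        std::thread bg([&] {
          // Correct order: drop the token first, then signal completion, so
          // a shutdown thread woken by `cv` can never observe a live token.
          limiter.Release();
          {
            std::lock_guard<std::mutex> lk(mu);
            done = true;
          }
          cv.notify_one();
        });

        {
          std::unique_lock<std::mutex> lk(mu);
          cv.wait(lk, [&] { return done; });
        }
        assert(limiter.outstanding() == 0);  // safe to destroy the limiter
        bg.join();
        return 0;
      }
      ```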
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8253
      
      Test Plan: sleeps to expose race conditions
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D28168064
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9e5167c74398d323e7975980c5cc00f450631160
      c70bae1b
    • A
      Deflake DBTest.L0L1L2AndUpHitCounter (#8259) · c2a3424d
      Committed by Andrew Kryczka
      Summary:
      Previously we saw flakes on platforms like arm on CircleCI, such as the following:
      
      ```
      Note: Google Test filter = DBTest.L0L1L2AndUpHitCounter
      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from DBTest
      [ RUN      ] DBTest.L0L1L2AndUpHitCounter
      db/db_test.cc:5345: Failure
      Expected: (TestGetTickerCount(options, GET_HIT_L0)) > (100), actual: 30 vs 100
      [  FAILED  ] DBTest.L0L1L2AndUpHitCounter (150 ms)
      [----------] 1 test from DBTest (150 ms total)
      
      [----------] Global test environment tear-down
      [==========] 1 test from 1 test case ran. (150 ms total)
      [  PASSED  ] 0 tests.
      [  FAILED  ] 1 test, listed below:
      [  FAILED  ] DBTest.L0L1L2AndUpHitCounter
      ```
      
      The test was totally non-deterministic, e.g., flush/compaction timing would affect how many files ended up on each level. Furthermore, it depended heavily on platform-specific details, e.g., with a 32KB memtable, it could become full with a very different number of entries depending on the platform.
      
      This PR rewrites the test to build a deterministic LSM with one file per level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8259
      
      Reviewed By: mrambacher
      
      Differential Revision: D28178100
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0a03b26e8d23c29d8297c1bccb1b115dce33bdcd
      c2a3424d
    • J
      Update CircleCI MacOS Xcode version to 11.3.0 (#8256) · 8a92564a
      Committed by Jay Zhuang
      Summary:
      To fix a CircleCI pyenv installation failure.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8256
      
      Reviewed By: ajkr
      
      Differential Revision: D28191772
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 2bbb1d5ded473e510c11c8ed27884c4ad073973f
      8a92564a
  12. 04 May 2021: 1 commit
    • S
      Hint temperature of bottommost level files to FileSystem (#8222) · c3ff14e2
      Committed by sdong
      Summary:
      As the first part of the effort to place different files on different storage types, this change introduces several things:
      (1) An experimental interface in FileSystem that specifies a temperature for a newly created file.
      (2) A test FileSystemWrapper, SimulatedHybridFileSystem, that simulates HDD for a file of "warm" temperature.
      (3) A simple experimental feature, ColumnFamilyOptions.bottommost_temperature. RocksDB would pass this value to FileSystem when creating any bottommost file.
      (4) A db_bench parameter that applies (2) and (3) to db_bench.
      
      The motivation of the change is to introduce minimal changes that allow us to evolve tiered storage development.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8222
      
      Test Plan:
      ./db_bench --benchmarks=fillrandom --write_buffer_size=2000000 -max_bytes_for_level_base=20000000  -level_compaction_dynamic_level_bytes --reads=100 -compaction_readahead_size=20000000 --reads=100000 -num=10000000
      
      followed by
      
      ./db_bench --benchmarks=readrandom,stats --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -simulate_hybrid_fs_file=/tmp/warm_file_list -level_compaction_dynamic_level_bytes -compaction_readahead_size=20000000 --reads=500 --threads=16 -use_existing_db --num=10000000
      
      and see results as expected.
      
      Reviewed By: ajkr
      
      Differential Revision: D28003028
      
      fbshipit-source-id: 4724896d5205730227ba2f17c3fecb11261744ce
      c3ff14e2
  13. 01 May 2021: 1 commit
    • P
      Add more LSM info to FilterBuildingContext (#8246) · d2ca04e3
      Committed by Peter Dillinger
      Summary:
      Add `num_levels`, `is_bottommost`, and table file creation
      `reason` to `FilterBuildingContext`, in anticipation of more powerful
      Bloom-like filter support.
      
      To support this, added `is_bottommost` and `reason` to
      `TableBuilderOptions`, which allowed removing `reason` parameter from
      `rocksdb::BuildTable`.
      
      I attempted to remove `skip_filters` from `TableBuilderOptions`, because
      filter construction decisions should arise from options, not one-off
      parameters. I could not completely remove it because the public API for
      SstFileWriter takes a `skip_filters` parameter, and translating this
      into an option change would mean awkwardly replacing the table_factory,
      if it is BlockBasedTableFactory, with one configured with filter_policy=nullptr.
      I marked this public skip_filters option as deprecated because of this
      oddity. (skip_filters on the read side probably makes sense.)
      
      At least `skip_filters` is now largely hidden for users of
      `TableBuilderOptions` and is no longer used for implementing the
      optimize_filters_for_hits option. Bringing the logic for that option
      closer to handling of FilterBuildingContext makes it more obvious that
      these two are using the same notion of "bottommost." (Planned:
      configuration options for Bloom-like filters that generalize
      `optimize_filters_for_hits`)
      
      Recommended follow-up: Try to get away from "bottommost level" naming of
      things, which is inaccurate (see
      VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
      "bottommost run" or just "bottommost."
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246
      
      Test Plan:
      extended an existing unit test to exercise and check various
      filter building contexts. Also, existing tests for
      optimize_filters_for_hits validate some of the "bottommost" handling,
      which is now closely connected to FilterBuildingContext::is_bottommost
      through TableBuilderOptions::is_bottommost
      
      Reviewed By: mrambacher
      
      Differential Revision: D28099346
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
      d2ca04e3
  14. 29 April 2021: 2 commits
    • P
      Refactor: use TableBuilderOptions to reduce parameter lists (#8240) · 85becd94
      Committed by Peter Dillinger
      Summary:
      Greatly reduced the not-quite-copy-paste giant parameter lists
      of rocksdb::NewTableBuilder, rocksdb::BuildTable,
      BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor.
      
      Moved weird separate parameter `uint32_t column_family_id` of
      TableFactory::NewTableBuilder into TableBuilderOptions.
      
      Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t
      target_file_size` is not randomly placed between uint64_t timestamps
      (was easy to mix up).
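      Why adjacent same-typed parameters are hazardous can be shown with a small sketch (placeholder signatures, not the real TableBuilderOptions constructor):

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Before: a size parameter sandwiched between two uint64_t timestamps.
      // Nothing in the type system catches a call that transposes them.
      uint64_t PickSizeBefore(uint64_t creation_time, uint64_t target_file_size,
                              uint64_t oldest_key_time) {
        (void)creation_time;
        (void)oldest_key_time;
        return target_file_size;
      }

      // After: the timestamps are adjacent and the size stands apart, so a
      // transposition is easier to spot at the call site.
      uint64_t PickSizeAfter(uint64_t target_file_size, uint64_t creation_time,
                             uint64_t oldest_key_time) {
        (void)creation_time;
        (void)oldest_key_time;
        return target_file_size;
      }

      int main() {
        // A call with arguments in the wrong order compiles silently:
        uint64_t wrong = PickSizeBefore(/*creation_time=*/64u << 20,
                                        /*target_file_size=*/1620000000,
                                        /*oldest_key_time=*/1619000000);
        assert(wrong == 1620000000u);  // got a timestamp, not a file size
        uint64_t right = PickSizeAfter(64u << 20, 1620000000, 1619000000);
        assert(right == 64u << 20);
        return 0;
      }
      ```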
      
      Replaced a couple of fields of BlockBasedTableBuilder::Rep with a
      FilterBuildingContext. The motivation for this change is making it
      easier to pass along more data into new fields in FilterBuildingContext
      (follow-up PR).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240
      
      Test Plan: ASAN make check
      
      Reviewed By: mrambacher
      
      Differential Revision: D28075891
      
      Pulled By: pdillinger
      
      fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f
      85becd94
    • A
      Improve BlockPrefetcher to prefetch only for sequential scans (#7394) · a0e0feca
      Committed by Akanksha Mahajan
      Summary:
      BlockPrefetcher is used by iterators to prefetch data if they
      anticipate more data to be used in the future, which is valid for forward
      sequential scans. But BlockPrefetcher tracks only num_file_reads_ and not
      whether reads are sequential. This presents a problem for MultiGet with a
      large number of keys when it reseeks the index iterator and data block.
      FilePrefetchBuffer can end up doing large readahead for reseeks, as the
      readahead size increases exponentially once readahead is enabled. The same
      issue exists with BlockBasedTableIterator.
      
      Add the previous read's length and offset in BlockPrefetcher (which
      creates the FilePrefetchBuffer) and FilePrefetchBuffer (which does the
      prefetching of data) to determine whether reads are sequential, and only
      then prefetch.
      
      Update the last block read after a cache hit to take reads from the
      cache into account as well.
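      The heuristic described above can be sketched as follows. This is a minimal illustration, not RocksDB's actual FilePrefetchBuffer; the sizes and growth policy here are assumptions:

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <cstdint>

      // Remember the previous read's offset and length, and only grow the
      // readahead when the new read starts where the last one ended, i.e.
      // when the access pattern is sequential.
      class SequentialReadTracker {
       public:
        // Returns true if this read continues the previous one.
        bool OnRead(uint64_t offset, size_t len) {
          bool sequential = have_prev_ && offset == prev_offset_ + prev_len_;
          prev_offset_ = offset;
          prev_len_ = len;
          have_prev_ = true;
          if (sequential) {
            // Exponential growth, as the message notes, capped at a maximum.
            readahead_ = readahead_ < kMax ? readahead_ * 2 : kMax;
          } else {
            readahead_ = kInit;  // reseek: reset instead of prefetching big
          }
          return sequential;
        }
        size_t readahead() const { return readahead_; }

       private:
        static constexpr size_t kInit = 8 * 1024;
        static constexpr size_t kMax = 256 * 1024;
        bool have_prev_ = false;
        uint64_t prev_offset_ = 0;
        size_t prev_len_ = 0;
        size_t readahead_ = kInit;
      };

      int main() {
        SequentialReadTracker t;
        assert(!t.OnRead(0, 4096));          // first read: no history yet
        assert(t.OnRead(4096, 4096));        // continues previous: sequential
        assert(t.readahead() == 16 * 1024);  // grew from 8KB to 16KB
        assert(!t.OnRead(1 << 20, 4096));    // reseek far away: reset
        assert(t.readahead() == 8 * 1024);
        return 0;
      }
      ```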
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7394
      
      Test Plan: Add new unit test case
      
      Reviewed By: anand1976
      
      Differential Revision: D23737617
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 8e6917c25ed87b285ee495d1b68dc623d71205a3
      a0e0feca