1. 16 Sep 2022 · 1 commit
    • Revamp, optimize new experimental clock cache (#10626) · 57243486
      Committed by Peter Dillinger
      Summary:
      * Consolidates most metadata into a single word per slot so that more
      can be accomplished with a single atomic update. In the common case,
      Lookup was previously about 4 atomic updates, now just 1 atomic update.
      Common case Release was previously 1 atomic read + 1 atomic update,
      now just 1 atomic update.
      * Eliminate spins / waits / yields, which likely threaten some "lock free"
      benefits. Compare-exchange loops are only used in explicit Erase, and
      strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
      exchange.
      * Relaxes some aggressiveness and guarantees. For example,
        * Duplicate Inserts will sometimes go undetected and the shadow duplicate
          will age out with eviction.
        * In many cases, the older Inserted value for a given cache key will be kept
        (i.e. Insert does not support overwrite).
        * Entries explicitly erased (rather than evicted) might not be freed
        immediately in some rare cases.
        * With strict_capacity_limit=false, capacity limit is not tracked/enforced as
        precisely as LRUCache, but is self-correcting and should only deviate by a
        very small number of extra or fewer entries.
      * Use smaller "computed default" number of cache shards in many cases,
      because benefits to larger usage tracking / eviction pools outweigh the small
      cost of more lock-free atomic contention. The improvement in CPU and I/O
      is dramatic in some limit-memory cases.
      * Even without the sharding change, the eviction algorithm is likely more
      effective than LRU overall because it's more stateful, even though the
      "hot path" state tracking for it is essentially free with ref counting. It
      is like a generalized CLOCK with aging (see code comments). I don't have
      performance numbers showing a specific improvement, but in theory, for a
      Poisson access pattern to each block, keeping some state allows better
      estimation of time to next access (Poisson interval) than strict LRU. The
      bounded randomness in CLOCK can also reduce "cliff" effect for repeated
      range scans approaching and exceeding cache size.
      
      ## Hot path algorithm comparison
      Rough descriptions, focusing on number and kind of atomic operations:
      * Old `Lookup()` (2-5 atomic updates per probe):
      ```
      Loop:
        Increment internal ref count at slot
        If possible hit:
          Check flags atomic (and non-atomic fields)
          If cache hit:
            Three distinct updates to 'flags' atomic
            Increment refs for internal-to-external
            Return
        Decrement internal ref count
      while atomic read 'displacements' > 0
      ```
      * New `Lookup()` (1-2 atomic updates per probe):
      ```
      Loop:
        Increment acquire counter in meta word (optimistic)
        If visible entry (already read meta word):
          If match (read non-atomic fields):
            Return
          Else:
            Decrement acquire counter in meta word
        Else if invisible entry (rare, already read meta word):
          Decrement acquire counter in meta word
      while atomic read 'displacements' > 0
      ```
      * Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
      ```
      Read atomic ref count
      If last reference and invisible (rare):
        Use CAS etc. to remove
        Return
      Else:
        Decrement ref count
      ```
      * New `Release()` (1 unconditional atomic update, rarely more):
      ```
      Increment release counter in meta word
      If last reference and invisible (rare):
        Use CAS etc. to remove
        Return
      ```
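
      To make the counter mechanics above concrete, here is a minimal, hypothetical C++ sketch
      (not the actual RocksDB code; names, field widths, and overflow handling are illustrative
      only) of a packed meta word whose acquire/release counters are each bumped by a single
      unconditional fetch_add:
      ```cpp
      #include <atomic>
      #include <cstdint>

      // Hypothetical slot meta word: low bits hold an "acquire" counter, the next
      // bits a "release" counter; outstanding refs = acquires - releases.
      struct ClockSlotSketch {
        static constexpr uint64_t kCounterBits = 30;
        static constexpr uint64_t kCounterMask = (uint64_t{1} << kCounterBits) - 1;
        static constexpr uint64_t kAcquireIncrement = uint64_t{1};
        static constexpr uint64_t kReleaseIncrement = uint64_t{1} << kCounterBits;

        std::atomic<uint64_t> meta{0};

        // Lookup hot path: one unconditional atomic RMW to take a reference.
        uint64_t AcquireRef() {
          return meta.fetch_add(kAcquireIncrement, std::memory_order_acquire);
        }

        // Release hot path: one unconditional atomic RMW to drop the reference.
        uint64_t ReleaseRef() {
          return meta.fetch_add(kReleaseIncrement, std::memory_order_release);
        }

        // Outstanding references, derived from the two counters (counter overflow
        // and correction handling are omitted in this sketch).
        static uint64_t OutstandingRefs(uint64_t meta_word) {
          uint64_t acquires = meta_word & kCounterMask;
          uint64_t releases = (meta_word >> kCounterBits) & kCounterMask;
          return acquires - releases;
        }
      };
      ```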
      
      ## Performance test setup
      Build DB with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
      ```
      Test with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
      ```
      Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
      
      base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
      folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
      gt_clock: experimental ClockCache before this change
      new_clock: experimental ClockCache with this change
      
      ## Performance test results
      First test "hot path" read performance, with block cache large enough for whole DB:
      4181MB 1thread base -> kops/s: 47.761
      4181MB 1thread folly -> kops/s: 45.877
      4181MB 1thread gt_clock -> kops/s: 51.092
      4181MB 1thread new_clock -> kops/s: 53.944
      
      4181MB 16thread base -> kops/s: 284.567
      4181MB 16thread folly -> kops/s: 249.015
      4181MB 16thread gt_clock -> kops/s: 743.762
      4181MB 16thread new_clock -> kops/s: 861.821
      
      4181MB 24thread base -> kops/s: 303.415
      4181MB 24thread folly -> kops/s: 266.548
      4181MB 24thread gt_clock -> kops/s: 975.706
      4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
      
      4181MB 32thread base -> kops/s: 311.251
      4181MB 32thread folly -> kops/s: 274.952
      4181MB 32thread gt_clock -> kops/s: 1045.98
      4181MB 32thread new_clock -> kops/s: 1370.38
      
      4181MB 48thread base -> kops/s: 310.504
      4181MB 48thread folly -> kops/s: 268.322
      4181MB 48thread gt_clock -> kops/s: 1195.65
      4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
      
      4181MB 64thread base -> kops/s: 307.839
      4181MB 64thread folly -> kops/s: 272.172
      4181MB 64thread gt_clock -> kops/s: 1204.47
      4181MB 64thread new_clock -> kops/s: 1615.37
      
      4181MB 128thread base -> kops/s: 310.934
      4181MB 128thread folly -> kops/s: 267.468
      4181MB 128thread gt_clock -> kops/s: 1188.75
      4181MB 128thread new_clock -> kops/s: 1595.46
      
      Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
      
      Now test a large block cache with low miss ratio, but some eviction is required:
      1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
      1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
      1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
      1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
      
      1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
      1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
      1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
      1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
      
      1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
      1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
      1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
      1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
      
      610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
      610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
      610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
      610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
      
      610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
      610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
      610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
      610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
      
      610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
      610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
      610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
      610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
      
      The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
      
      Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
      
      233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
      233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
      233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
      233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
      
      233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
      233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
      233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
      233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
      
      233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
      233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
      233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
      233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
      
      89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
      89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
      89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
      89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
      
      89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
      89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
      89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
      89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
      
      89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
      89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
      89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
      89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
      
      ^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
      
      Even smaller cache size:
      34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
      34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
      34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
      34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
      
      34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
      34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
      34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
      34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
      
      34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
      34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
      34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
      34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
      
      As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
      
      13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
      13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
      13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
      13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
      
      13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
      13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
      13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
      13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
      
      13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
      13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
      13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
      13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
      
      gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation, but it is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, though still high enough that the eviction CPU time clearly offsets the reduced mutex contention:
      
      13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
      13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
      13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
      13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
      
      13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
      13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
      13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
      13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
      
      Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
      
      Reviewed By: anand1976
      
      Differential Revision: D39368406
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
  2. 14 Jul 2022 · 1 commit
  3. 29 Jun 2022 · 1 commit
  4. 14 Apr 2022 · 1 commit
  5. 13 Apr 2022 · 1 commit
    • Meta-internal folly integration with F14FastMap (#9546) · efd03516
      Committed by Peter Dillinger
      Summary:
      Especially after updating to C++17, I don't see a compelling case for
      *requiring* any folly components in RocksDB. I was able to purge the existing
      hard dependencies, and it can be quite difficult to strip out non-trivial components
      from folly for use in RocksDB. (The prospect of doing that on F14 has changed
      my mind on the best approach here.)
      
      But this change creates an optional integration where we can plug in
      components from folly at compile time, starting here with F14FastMap to replace
      std::unordered_map when possible (probably no public APIs for example). I have
      replaced the biggest CPU users of std::unordered_map with compile-time
      pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set.
      USE_FOLLY is always set in the Meta-internal buck build, and a simulation of
      that is in the Makefile for public CI testing. A full folly build is not needed, but
      checking out the full folly repo is much simpler for getting the dependency,
      and anything else we might want to optionally integrate in the future.
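
      As a rough illustration of the compile-time pluggable UnorderedMap described above, here is
      a hedged sketch (not the exact RocksDB header; only the names UnorderedMap and USE_FOLLY are
      taken from this description):
      ```cpp
      #include <functional>

      #ifdef USE_FOLLY
      #include <folly/container/F14Map.h>
      // With folly available, the alias resolves to the faster F14FastMap.
      template <typename K, typename V, typename Hash = std::hash<K>>
      using UnorderedMap = folly::F14FastMap<K, V, Hash>;
      #else
      #include <unordered_map>
      // Without folly, fall back to the standard library container.
      template <typename K, typename V, typename Hash = std::hash<K>>
      using UnorderedMap = std::unordered_map<K, V, Hash>;
      #endif
      ```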
      
      Some picky details:
      * I don't think the distributed mutex stuff is actually used, so it was easy to remove.
      * I implemented an alternative to `folly::constexpr_log2` (which is much easier
      in C++17 than C++11) so that I could pull out the hard dependencies on
      `ConstexprMath.h`
      * I had to add noexcept move constructors/operators to some types to make
      F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a
      macro to make that easier in some common cases.
      * Updated Meta-internal buck build to use folly F14Map (always)
      
      No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a
      production integration for open source users.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546
      
      Test Plan:
      CircleCI tests updated so that a couple of them use folly.
      
      Most internal unit & stress/crash tests updated to use Meta-internal latest folly.
      (Note: they should probably use buck but they currently use Makefile.)
      
      Example performance improvement: when filter partitions are pinned in cache,
      they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build
      a test that exercises that heavily. Build DB with
      
      ```
      TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters
      ```
      
      and test with (simultaneous runs with & without folly, ~20 times each to see
      convergence)
      
      ```
      TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache
      ```
      
      Average ops/s no folly: 26229.2
      Average ops/s with folly: 26853.3 (+2.4%)
      
      Reviewed By: ajkr
      
      Differential Revision: D34181736
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
  6. 21 Oct 2021 · 1 commit
    • Support `GetMapProperty()` with "rocksdb.dbstats" (#9057) · 4217d1bc
      Committed by Andrew Kryczka
      Summary:
      This PR supports querying `GetMapProperty()` with "rocksdb.dbstats" to get the DB-level stats in a map format. It only reports cumulative stats over the DB lifetime and, as such, does not update the baseline for interval stats. Like other map properties, the string keys are not (yet) exposed in the public API.
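
      A minimal usage sketch (assuming an already-open DB* `db`; DB::Properties::kDBStats is the
      existing constant for "rocksdb.dbstats"):
      ```cpp
      #include <iostream>
      #include <map>
      #include <string>

      #include "rocksdb/db.h"

      // Query the cumulative DB-level stats as a string->string map instead of the
      // formatted text returned by GetProperty("rocksdb.dbstats").
      void DumpDbStatsMap(rocksdb::DB* db) {
        std::map<std::string, std::string> stats;
        if (db->GetMapProperty(rocksdb::DB::Properties::kDBStats, &stats)) {
          for (const auto& kv : stats) {
            std::cout << kv.first << " = " << kv.second << "\n";
          }
        }
      }
      ```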
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9057
      
      Test Plan: new unit test
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D31781495
      
      Pulled By: ajkr
      
      fbshipit-source-id: 6f77d3aee8b4b1a015061b8c260a123859ceaf9b
  7. 09 Sep 2021 · 1 commit
    • Add DB properties for BlobDB (#8734) · 0cb0fc6f
      Committed by Zhiyi Zhang
      Summary:
      RocksDB exposes certain internal statistics via the DB property interface.
      However, there are currently no properties related to BlobDB.
      
      For starters, we would like to add the following BlobDB properties:
      `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree).
      `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version.
      `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions.
      `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version.
      `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version.
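
      A hedged usage sketch for reading a couple of these properties (the property name strings
      come from the list above; `db` is assumed to be an open DB):
      ```cpp
      #include <cstdint>
      #include <iostream>
      #include <string>

      #include "rocksdb/db.h"

      void PrintBlobProperties(rocksdb::DB* db) {
        uint64_t num_blob_files = 0;
        if (db->GetIntProperty("rocksdb.num-blob-files", &num_blob_files)) {
          std::cout << "num blob files: " << num_blob_files << "\n";
        }
        std::string blob_stats;
        if (db->GetProperty("rocksdb.blob-stats", &blob_stats)) {
          std::cout << blob_stats;  // total count/size and garbage, per the summary above
        }
      }
      ```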
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734
      
      Test Plan:
      ```
      ➜  rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test
      [==========] Running 16 tests from 2 test cases.
      [----------] Global test environment set-up.
      [----------] 10 tests from DBBlobBasicTest
      [ RUN      ] DBBlobBasicTest.GetBlob
      [       OK ] DBBlobBasicTest.GetBlob (12 ms)
      [ RUN      ] DBBlobBasicTest.MultiGetBlobs
      [       OK ] DBBlobBasicTest.MultiGetBlobs (11 ms)
      [ RUN      ] DBBlobBasicTest.GetBlob_CorruptIndex
      [       OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms)
      [ RUN      ] DBBlobBasicTest.GetBlob_InlinedTTLIndex
      [       OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms)
      [ RUN      ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber
      [       OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms)
      [ RUN      ] DBBlobBasicTest.GenerateIOTracing
      [       OK ] DBBlobBasicTest.GenerateIOTracing (11 ms)
      [ RUN      ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile
      [       OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms)
      [ RUN      ] DBBlobBasicTest.GetMergeBlobWithPut
      [       OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms)
      [ RUN      ] DBBlobBasicTest.MultiGetMergeBlobWithPut
      [       OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms)
      [ RUN      ] DBBlobBasicTest.BlobDBProperties
      [       OK ] DBBlobBasicTest.BlobDBProperties (21 ms)
      [----------] 10 tests from DBBlobBasicTest (124 ms total)
      
      [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms)
      [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total)
      
      [----------] Global test environment tear-down
      [==========] 16 tests from 2 test cases ran. (2190 ms total)
      [  PASSED  ] 16 tests.
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D30690849
      
      Pulled By: Zhiyi-Zhang
      
      fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
  8. 16 Aug 2021 · 1 commit
  9. 17 Jul 2021 · 1 commit
    • Don't hold DB mutex for block cache entry stat scans (#8538) · df5dc73b
      Committed by Peter Dillinger
      Summary:
      I previously didn't notice the DB mutex was being held during
      block cache entry stat scans, probably because I primarily checked for
      read performance regressions, because they require the block cache and
      are traditionally latency-sensitive.
      
      This change does some refactoring to avoid holding DB mutex and to
      avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
      Some tests have to be updated because now the stats collector is
      populated in the Cache aggressively on DB startup rather than lazily.
      (I hope to clean up some of this added complexity in the future.)
      
      This change also ensures proper treatment of need_out_of_mutex for
      non-int DB properties.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538
      
      Test Plan:
      Added unit test logic that uses sync points to fail if the DB mutex
      is held during a scan, covering the various ways that a scan might be
      triggered.
      
      Performance test - the known impact to holding the DB mutex is on
      TransactionDB, and the easiest way to see the impact is to hack the
      scan code to almost always miss and take an artificially long time
      scanning. Here I've injected an unconditional 5s sleep at the call to
      ApplyToAllEntries.
      
      Before (hacked):
      
          $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
          randomtransaction :     433.219 micros/op 2308 ops/sec;    0.1 MB/s ( transactions:78999 aborts:0)
          rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
          $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
          randomtransaction :     448.802 micros/op 2228 ops/sec;    0.1 MB/s ( transactions:75999 aborts:0)
          rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323
      
      Notice the 5s P100 write time.
      
      After (hacked):
      
          $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
          randomtransaction :     303.645 micros/op 3293 ops/sec;    0.1 MB/s ( transactions:98999 aborts:0)
          rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
          $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
          randomtransaction :     310.383 micros/op 3221 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
          rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918
      
      P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:
      
          $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
          randomtransaction :     311.365 micros/op 3211 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
          rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
          $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
          randomtransaction :     308.395 micros/op 3242 ops/sec;    0.1 MB/s ( transactions:97999 aborts:0)
          rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832
      
      No substantial difference.
      
      Reviewed By: siying
      
      Differential Revision: D29738847
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
  10. 14 Jun 2021 · 1 commit
    • Pin CacheEntryStatsCollector to fix performance bug (#8385) · d5a46c40
      Committed by Peter Dillinger
      Summary:
      If the block Cache is full with strict_capacity_limit=false,
      then our CacheEntryStatsCollector could be immediately evicted on
      release, so iterating through column families with shared block cache
      could trigger re-scan for each CF. This change fixes that problem by
      pinning the CacheEntryStatsCollector from InternalStats so that it's not
      evicted.
      
      I had originally thought that this object could participate in LRU like
      everything else, but even though a re-load+re-scan only touches memory,
      it can be orders of magnitude more expensive than other cache misses.
      One service in Facebook has scans that take ~20s over 100GB block cache
      that is mostly 4KB entries. (The up-side of this bug and https://github.com/facebook/rocksdb/issues/8369 is that
      we had a natural experiment on the effect on some service metrics even
      with block cache scans running continuously in the background--a kind
      of worst case scenario. Metrics like latency were not affected enough
      to trigger warnings.)
      
      Other smaller fixes:
      
      20s is already a sizable portion of 600s stats dump period, or 180s
      default max age to force re-scan, so added logic to ensure that (for
      each block cache) we don't spend more than 0.2% of our background thread
      time scanning it. Nevertheless, "foreground" requests for cache entry
      stats (calls to `db->GetMapProperty(DB::Properties::kBlockCacheEntryStats)`)
      are permitted to consume more CPU.
      
      Renamed field to cache_entry_stats_ to match code style.
      
      This change is intended for patching in 6.21 release.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8385
      
      Test Plan:
      unit test expanded to cover new logic (detect regression),
      some manual testing with db_bench
      
      Reviewed By: ajkr
      
      Differential Revision: D29042759
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 236faa902397f50038c618f50fbc8cf3f277308c
  11. 20 May 2021 · 1 commit
    • Use deleters to label cache entries and collect stats (#8297) · 311a544c
      Committed by Peter Dillinger
      Summary:
      This change gathers and publishes statistics about the
      kinds of items in block cache. This is especially important for
      profiling relative usage of cache by index vs. filter vs. data blocks.
      It works by iterating over the cache during periodic stats dump
      (InternalStats, stats_dump_period_sec) or on demand when
      DB::Get(Map)Property(kBlockCacheEntryStats), except that for
      efficiency and sharing among column families, saved data from
      the last scan is used when the data is not considered too old.
      
      The new information can be seen in info LOG, for example:
      
          Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
          Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)
      
      And also through DB::GetProperty and GetMapProperty (here using
      ldb just for demonstration):
      
          $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
          rocksdb.block-cache-entry-stats.bytes.data-block: 0
          rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
          rocksdb.block-cache-entry-stats.bytes.filter-block: 0
          rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
          rocksdb.block-cache-entry-stats.bytes.index-block: 178992
          rocksdb.block-cache-entry-stats.bytes.misc: 0
          rocksdb.block-cache-entry-stats.bytes.other-block: 0
          rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
          rocksdb.block-cache-entry-stats.capacity: 8388608
          rocksdb.block-cache-entry-stats.count.data-block: 0
          rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
          rocksdb.block-cache-entry-stats.count.filter-block: 0
          rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
          rocksdb.block-cache-entry-stats.count.index-block: 215
          rocksdb.block-cache-entry-stats.count.misc: 1
          rocksdb.block-cache-entry-stats.count.other-block: 0
          rocksdb.block-cache-entry-stats.count.write-buffer: 0
          rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
          rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
          rocksdb.block-cache-entry-stats.percent.misc: 0.000000
          rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
          rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
          rocksdb.block-cache-entry-stats.secs_since_last_collection: 0
      
      Solution detail - We need some way to flag what kind of blocks each
      entry belongs to, preferably without changing the Cache API.
      One of the complications is that Cache is a general interface that could
      have other users that don't adhere to whichever convention we decide
      on for keys and values. Or we would pay for an extra field in the Handle
      that would only be used for this purpose.
      
      This change uses a back-door approach, the deleter, to indicate the
      "role" of a Cache entry (in addition to the value type, implicitly).
      This has the added benefit of ensuring proper code origin whenever we
      recognize a particular role for a cache entry; if the entry came from
      some other part of the code, it will use an unrecognized deleter, which
      we simply attribute to the "Misc" role.
      
      An internal API makes for simple instantiation and automatic
      registration of Cache deleters for a given value type and "role".
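
      A rough sketch of the idea (hypothetical names and a simplified deleter signature, not the
      real internal API): one deleter is instantiated per value type and role, so the deleter's
      address identifies the role without changing the Cache API.
      ```cpp
      // One deleter function per (value type, role); the function's address labels the entry.
      enum class CacheEntryRole { kDataBlock, kFilterBlock, kIndexBlock, kMisc };

      using DeleterFn = void (*)(void* value);

      template <typename T, CacheEntryRole R>
      void RoleDeleter(void* value) {
        delete static_cast<T*>(value);
      }

      struct Block {};  // stand-in for a real block type

      // A stats scan can match an entry's deleter pointer against the registered
      // deleters to recover its role; unrecognized deleters are attributed to kMisc.
      CacheEntryRole RoleOf(DeleterFn fn) {
        if (fn == &RoleDeleter<Block, CacheEntryRole::kDataBlock>) {
          return CacheEntryRole::kDataBlock;
        }
        return CacheEntryRole::kMisc;
      }
      ```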
      
      Another internal API, CacheEntryStatsCollector, solves the problem of
      caching the results of a scan and sharing them, to ensure scans are
      neither excessive nor redundant so as not to harm Cache performance.
      
      Because code is added to BlocklikeTraits, it is pulled out of
      block_based_table_reader.cc into its own file.
      
      This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
      (could still be added), and with actual stat gathering.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297
      
      Test Plan: manual testing with db_bench, and a couple of basic unit tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D28488721
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
  12. 15 Mar 2021 · 1 commit
    • Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033) · 3dff28cf
      Committed by mrambacher
      Summary:
      For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared_ptr incurs some performance degradation on certain hardware classes.
      
      For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere.  For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it.  The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.
      
      There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold.  In those cases, the shared pointer was preserved.
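
      A minimal sketch of the "short cut" pattern described above (illustrative types, not the
      actual RocksDB classes): the owning object keeps the shared_ptr alive, while hot-path code
      holds only a raw pointer.
      ```cpp
      #include <cstdint>
      #include <memory>

      // Illustrative stand-in for SystemClock; not the real interface.
      class SystemClockSketch {
       public:
        virtual ~SystemClockSketch() = default;
        virtual uint64_t NowMicros() = 0;
      };

      // The options object owns the clock via shared_ptr, so its lifetime is covered...
      struct ImmutableDBOptionsSketch {
        std::shared_ptr<SystemClockSketch> clock;
      };

      // ...and lower-level code can hold a raw pointer "short cut", avoiding
      // shared_ptr copy/refcount costs on hot paths.
      class PerfStepTimerSketch {
       public:
        explicit PerfStepTimerSketch(const ImmutableDBOptionsSketch& opts)
            : clock_(opts.clock.get()) {}
        uint64_t Start() { return clock_->NowMicros(); }

       private:
        SystemClockSketch* clock_;  // not owned; the owner outlives this object
      };
      ```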
      
      Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:
      
      6.17: readrandom   :      28.046 micros/op 854902 ops/sec;   61.3 MB/s (355999 of 355999 found)
      6.18: readrandom   :      32.615 micros/op 735306 ops/sec;   52.7 MB/s (290999 of 290999 found)
      PR: readrandom   :      27.500 micros/op 871909 ops/sec;   62.5 MB/s (367999 of 367999 found)
      
      (Note that the times for 6.18 are prior to revert of the SystemClock).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033
      
      Reviewed By: pdillinger
      
      Differential Revision: D27014563
      
      Pulled By: mrambacher
      
      fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
  13. 04 Mar 2021 · 1 commit
    • Update compaction statistics to include the amount of data read from blob files (#8022) · cb25bc11
      Committed by Levi Tamasi
      Summary:
      The patch does the following:
      1) Exposes the amount of data (number of bytes) read from blob files from
      `BlobFileReader::GetBlob` / `Version::GetBlob`.
      2) Tracks the total number and size of blobs read from blob files during a
      compaction (due to garbage collection or compaction filter usage) in
      `CompactionIterationStats` and propagates this data to
      `InternalStats::CompactionStats` / `CompactionJobStats`.
      3) Updates the formulae for write amplification calculations to include the
      amount of data read from blob files.
      4) Extends the compaction stats dump with a new column `Rblob(GB)` and
      a new line containing the total number and size of blob files in the current
      `Version` to complement the information about the shape and size of the LSM tree
      that's already there.
      5) Updates `CompactionJobStats` so that the number of files and amount of data
      written by a compaction are broken down per file type (i.e. table/blob file).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022
      
      Test Plan: Ran `make check` and `db_bench`.
      
      Reviewed By: riversand963
      
      Differential Revision: D26801199
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2
  14. 03 Mar 2021 · 1 commit
    • Break down the amount of data written during flushes/compactions per file type (#8013) · a46f080c
      Committed by Levi Tamasi
      Summary:
      The patch breaks down the "bytes written" (as well as the "number of output files")
      compaction statistics into two, so the values are logged separately for table files
      and blob files in the info log, and are shown in separate columns (`Write(GB)` for table
      files, `Wblob(GB)` for blob files) when the compaction statistics are dumped.
      This will also come in handy for fixing the write amplification statistics, which currently
      do not consider the amount of data read from blob files during compaction. (This will
      be fixed by an upcoming patch.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013
      
      Test Plan: Ran `make check` and `db_bench`.
      
      Reviewed By: riversand963
      
      Differential Revision: D26742156
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5
  15. 26 Jan 2021 · 1 commit
    • Add a SystemClock class to capture the time functions of an Env (#7858) · 12f11373
      Committed by mrambacher
      Summary:
      This PR introduces a SystemClock class to RocksDB and starts using it. This class contains the time-related functions of an Env, and these functions can be redirected from the Env to the SystemClock.
      
      Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
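
      As a rough illustration of that redirection (hypothetical, simplified signatures; the actual
      SystemClock interface may differ):
      ```cpp
      #include <cstdint>
      #include <memory>
      #include <utility>

      // Simplified sketch: the time-related subset of Env factored into a clock interface.
      class SystemClockSketch {
       public:
        virtual ~SystemClockSketch() = default;
        virtual uint64_t NowMicros() = 0;
        virtual void SleepForMicroseconds(int micros) = 0;
      };

      class EnvSketch {
       public:
        explicit EnvSketch(std::shared_ptr<SystemClockSketch> clock)
            : clock_(std::move(clock)) {}

        // Env's time functions now just forward to the clock, so components like
        // Timer or RateLimiter can take the clock directly instead of a whole Env.
        uint64_t NowMicros() { return clock_->NowMicros(); }
        void SleepForMicroseconds(int micros) { clock_->SleepForMicroseconds(micros); }

       private:
        std::shared_ptr<SystemClockSketch> clock_;
      };
      ```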
      
      There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
      
      Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
      
      Reviewed By: pdillinger
      
      Differential Revision: D26006406
      
      Pulled By: mrambacher
      
      fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
  16. 20 Dec 2020 · 1 commit
    • aggregated-table-properties with GetMapProperty (#7779) · 4d1ac19e
      Committed by Peter Dillinger
      Summary:
      So that we can more easily get aggregate live table data such
      as total filter, index, and data sizes.
      
      Also adds ldb support for getting properties
      
      Also fixed some missing/inaccurate related comments in db.h
      
      For example:
      
          $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties
          rocksdb.aggregated-table-properties.data_size: 102871
          rocksdb.aggregated-table-properties.filter_size: 0
          rocksdb.aggregated-table-properties.index_partitions: 0
          rocksdb.aggregated-table-properties.index_size: 2232
          rocksdb.aggregated-table-properties.num_data_blocks: 100
          rocksdb.aggregated-table-properties.num_deletions: 0
          rocksdb.aggregated-table-properties.num_entries: 15000
          rocksdb.aggregated-table-properties.num_merge_operands: 0
          rocksdb.aggregated-table-properties.num_range_deletions: 0
          rocksdb.aggregated-table-properties.raw_key_size: 288890
          rocksdb.aggregated-table-properties.raw_value_size: 198890
          rocksdb.aggregated-table-properties.top_level_index_size: 0
          $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1
          rocksdb.aggregated-table-properties-at-level1.data_size: 80909
          rocksdb.aggregated-table-properties-at-level1.filter_size: 0
          rocksdb.aggregated-table-properties-at-level1.index_partitions: 0
          rocksdb.aggregated-table-properties-at-level1.index_size: 1787
          rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81
          rocksdb.aggregated-table-properties-at-level1.num_deletions: 0
          rocksdb.aggregated-table-properties-at-level1.num_entries: 12466
          rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0
          rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0
          rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210
          rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414
          rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0
          $
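
      The same data can also be fetched programmatically; a hedged C++ sketch using GetMapProperty
      (assuming DB::Properties::kAggregatedTableProperties names the property shown above):
      ```cpp
      #include <iostream>
      #include <map>
      #include <string>

      #include "rocksdb/db.h"

      // Read the aggregated table properties as a map rather than parsing the flat string.
      void PrintAggregatedTableProperties(rocksdb::DB* db) {
        std::map<std::string, std::string> props;
        if (db->GetMapProperty(rocksdb::DB::Properties::kAggregatedTableProperties, &props)) {
          for (const auto& kv : props) {
            std::cout << kv.first << ": " << kv.second << "\n";
          }
        }
      }
      ```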
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779
      
      Test Plan: Added a test to ldb_test.py
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D25653103
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
  17. 16 Oct 2020 · 1 commit
    • Introduce BlobFileCache and add support for blob files to Get() (#7540) · e8cb32ed
      Committed by Levi Tamasi
      Summary:
      The patch adds blob file support to the `Get` API by extending `Version` so that
      whenever a blob reference is read from a file, the blob is retrieved from the corresponding
      blob file and passed back to the caller. (This is assuming the blob reference is valid
      and the blob file is actually part of the given `Version`.) It also introduces a cache
      of `BlobFileReader`s called `BlobFileCache` that enables sharing `BlobFileReader`s
      between callers. `BlobFileCache` uses the same backing cache as `TableCache`, so
      `max_open_files` (if specified) limits the total number of open (table + blob) files.
      
      TODO: proactively open/cache blob files and pin the cache handles of the readers in the
      metadata objects similarly to what `VersionBuilder::LoadTableHandlers` does for
      table files.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7540
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D24260219
      
      Pulled By: ltamasi
      
      fbshipit-source-id: a8a2a4f11d3d04d6082201b52184bc4d7b0857ba
  18. 08 Oct 2020 · 1 commit
    • Introduce a blob file reader class (#7461) · 22655a39
      Committed by Levi Tamasi
      Summary:
      The patch adds a class called `BlobFileReader` that can be used to retrieve blobs
      using the information available in blob references (e.g. blob file number, offset, and
      size). This will come in handy when implementing blob support for `Get`, `MultiGet`,
      and iterators, and also for compaction/garbage collection.
      
      When a `BlobFileReader` object is created (using the factory method `Create`),
      it first checks whether the specified file is potentially valid by comparing the file
      size against the combined size of the blob file header and footer (files smaller than
      the threshold are considered malformed). Then, it opens the file, and reads and verifies
      the header and footer. The verification involves magic number/CRC checks
      as well as checking for unexpected header/footer fields, e.g. incorrect column family ID
      or TTL blob files.
      
      Blobs can be retrieved using `GetBlob`. `GetBlob` validates the offset and compression
      type passed by the caller (because of the presence of the header and footer, the
      specified offset cannot be too close to the start/end of the file; also, the compression type
      has to match the one in the blob file header), and retrieves and potentially verifies and
      uncompresses the blob. In particular, when `ReadOptions::verify_checksums` is set,
      `BlobFileReader` reads the blob record header as well (as opposed to just the blob itself)
      and verifies the key/value size, the key itself, as well as the CRC of the blob record header
      and the key/value pair.
      
      In addition, the patch exposes the compression type from `BlobIndex` (both using an
      accessor and via `DebugString`), and adds a blob file read latency histogram to
      `InternalStats` that can be used with `BlobFileReader`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7461
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D23999219
      
      Pulled By: ltamasi
      
      fbshipit-source-id: deb6b1160d251258b308d5156e2ec063c3e12e5e
  19. 15 Sep 2020 · 1 commit
    • Integrate blob file writing with the flush logic (#7345) · b0e78341
      Committed by Levi Tamasi
      Summary:
      The patch adds support for writing blob files during flush by integrating
      `BlobFileBuilder` with the flush logic, most importantly, `BuildTable` and
      `CompactionIterator`. If `enable_blob_files` is set, large values are extracted
      to blob files and replaced with references. The resulting blob files are then
      logged to the MANIFEST as part of the flush job's `VersionEdit` and
      added to the `Version`, similarly to table files. Errors related to writing
      blob files fail the flush, and any blob files written by such jobs are immediately
      deleted (again, similarly to how SST files are handled). In addition, the patch
      extends the logging and statistics around flushes to account for the presence
      of blob files (e.g. `InternalStats::CompactionStats::bytes_written`, which is
      used for calculating write amplification, now considers the blob files as well).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7345
      
      Test Plan: Tested using `make check` and `db_bench`.
      
      Reviewed By: riversand963
      
      Differential Revision: D23506369
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 646885f22dfbe063f650d38a1fedc132f499a159
  20. 21 Feb 2020 · 1 commit
    • Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      Committed by sdong
      Summary:
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for users to solve the problem, the RocksDB namespace is changed to a flag which can be overridden at build time.
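
      A hedged sketch of the pattern (the macro name comes from the title; the exact header that
      defaults it in RocksDB may differ):
      ```cpp
      // In a common header: default the namespace macro if the build didn't set it.
      #ifndef ROCKSDB_NAMESPACE
      #define ROCKSDB_NAMESPACE rocksdb
      #endif

      // Library code then opens the configurable namespace instead of `namespace rocksdb`.
      namespace ROCKSDB_NAMESPACE {
      class DBSketch {};  // placeholder for real classes
      }  // namespace ROCKSDB_NAMESPACE

      // A build can override it, e.g. -DROCKSDB_NAMESPACE=my_vendored_rocksdb, so two
      // differently built copies of RocksDB can be linked into one binary without symbol clashes.
      ```
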
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
  21. 08 Jan 2020 · 1 commit
  22. 07 Sep 2019 · 1 commit
  23. 13 Jul 2019 · 1 commit
  24. 20 Mar 2019 · 1 commit
    • Collect compaction stats by priority and dump to info LOG (#5050) · a291f3a1
      Committed by Zhongyi Xie
      Summary:
      In order to better understand compaction done by different priority thread pool, we now collect compaction stats by priority and also print them to info LOG through stats dump.
      
      ```
      ** Compaction Stats [default] **
      Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
      -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
       Low      0/0    0.00 KB   0.0     16.8    11.3      5.5       5.6      0.1       0.0   0.0    406.4    136.1     42.24             34.96        45    0.939     13M  8865K
      High      0/0    0.00 KB   0.0      0.0     0.0      0.0      11.4     11.4       0.0   0.0      0.0     76.2    153.00             35.74     12185    0.013       0      0
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5050
      
      Differential Revision: D14408583
      
      Pulled By: miasantreble
      
      fbshipit-source-id: e53746586ea27cb8abc9fec35805bd80ed30f608
  25. 30 Jan 2019 · 1 commit
  26. 06 Nov 2018 · 1 commit
    • Add DB property for SST files kept from deletion (#4618) · fffac43c
      Committed by Andrew Kryczka
      Summary:
      This property can help debug why SST files aren't being deleted. Previously we only had the property "rocksdb.is-file-deletions-enabled". However, even when that returned true, obsolete SSTs may still not be deleted due to the coarse-grained mechanism we use to prevent newly created SSTs from being accidentally deleted. That coarse-grained mechanism uses a lower bound file number for SSTs that should not be deleted, and this property exposes that lower bound.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4618
      
      Differential Revision: D12898179
      
      Pulled By: ajkr
      
      fbshipit-source-id: fe68acc041ddbcc9276bbd48976524d95aafc776
  27. 16 Jun 2018 · 1 commit
  28. 04 May 2018 · 1 commit
    • Skip deleted WALs during recovery · d5954929
      Committed by Siying Dong
      Summary:
      This patch records the min log number to keep in the manifest while flushing SST files, so that recovery can ignore that log and any older WAL. This is to avoid scenarios where there is a gap in the sequence of WAL files fed to the recovery procedure. Such a gap could happen due to, for example, out-of-order WAL deletion, and it could cause problems in 2PC recovery, where the prepare and commit entries may be placed in two separate WALs; a gap in the WALs could result in not processing the WAL with the commit entry and hence break the 2PC recovery logic.
      
      Before this commit, for the 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in the memtable. With this commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes, using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for the newly flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until the next flush because the commit entry will stay in the memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. That is not yet done anyway. Even if we did it, the only thing we would lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway because the obsolete-file clean-up function is only executed after a flush or compaction.)
      
      This min log number to keep is stored in the manifest using the safely-ignorable custom field of the AddFile entry, in order to guarantee that a DB generated using a newer release can still be opened by previous releases no older than 4.2.
      Closes https://github.com/facebook/rocksdb/pull/3765
      
      Differential Revision: D7747618
      
      Pulled By: siying
      
      fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
  29. 19 Apr 2018 · 1 commit
  30. 13 Apr 2018 · 1 commit
  31. 12 Apr 2018 · 1 commit
  32. 02 Mar 2018 · 1 commit
    • Add "rocksdb.live-sst-files-size" DB property · bf937cf1
      Committed by Yi Wu
      Summary:
      Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
      Closes https://github.com/facebook/rocksdb/pull/3548
      
      Differential Revision: D7116939
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
  33. 24 Oct 2017 · 1 commit
    • Add DB::Properties::kEstimateOldestKeyTime · 66a2c44e
      Committed by Yi Wu
      Summary:
      With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have a timestamp for each key in the DB. As an approximation, we expose the earliest "creation_time" table property among the SST files.
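
      A usage sketch (DB::Properties::kEstimateOldestKeyTime is the constant named in the title;
      treating the value as a Unix timestamp in seconds is my assumption):
      ```cpp
      #include <cstdint>
      #include <iostream>

      #include "rocksdb/db.h"

      // Read the estimated oldest key time (an approximation based on the earliest
      // SST "creation_time", per the summary above).
      void PrintOldestKeyTime(rocksdb::DB* db) {
        uint64_t oldest_key_time = 0;
        if (db->GetIntProperty(rocksdb::DB::Properties::kEstimateOldestKeyTime,
                               &oldest_key_time)) {
          std::cout << "estimated oldest key time: " << oldest_key_time << "\n";
        }
      }
      ```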
      
      My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
      Closes https://github.com/facebook/rocksdb/pull/2842
      
      Differential Revision: D5770600
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
  34. 08 Sep 2017 · 1 commit
  35. 31 Aug 2017 · 1 commit
    • Extend property map with compaction stats · 8a6708f5
      Committed by Artem Danilov
      Summary:
      This branch extends the existing property map, which kept values as doubles, to also keep values as strings, so that it can be used to provide a wider range of properties. The immediate need for that is to provide IO stall stats in an easily parseable way to MyRocks, which is also part of this branch.
      Closes https://github.com/facebook/rocksdb/pull/2794
      
      Differential Revision: D5717676
      
      Pulled By: Tema
      
      fbshipit-source-id: e34ba5b79ba774697f7b97ce1138d8fd55471b8a
  36. 16 Jul 2017 · 1 commit
  37. 01 Jul 2017 · 1 commit
  38. 30 Jun 2017 · 1 commit
    • Add a fetch_add variation to AddDBStats · e9f91a51
      Committed by Maysam Yabandeh
      Summary:
      AddDBStats is implemented as two steps, a load and a store, which is more efficient than fetch_add but is not thread-safe. Currently we have to protect concurrent access to AddDBStats with a mutex, which is less efficient than fetch_add.
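
      A minimal sketch of the two variants described (a hypothetical stand-in for the internal
      stats counters):
      ```cpp
      #include <atomic>
      #include <cstdint>

      // Per-DB stat counters stored in atomics (zero-initialized globals).
      std::atomic<uint64_t> db_stats_[16];

      // Two-step variant: a relaxed load followed by a store. Cheaper, but lost
      // updates are possible under concurrent writers, so it needs external locking.
      void AddDBStatsLoadStore(int idx, uint64_t value) {
        uint64_t cur = db_stats_[idx].load(std::memory_order_relaxed);
        db_stats_[idx].store(cur + value, std::memory_order_relaxed);
      }

      // fetch_add variant: a single atomic read-modify-write, safe under concurrent
      // writers without a mutex, at a modest extra cost.
      void AddDBStatsFetchAdd(int idx, uint64_t value) {
        db_stats_[idx].fetch_add(value, std::memory_order_relaxed);
      }
      ```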
      
      This patch adds the option to use fetch_add in AddDBStats. The results for my 2PC benchmark on sysbench are:
      - vanilla: 68618 tps
      - removing mutex on AddDBStats (unsafe): 69767 tps
      - fetch_add for all AddDBStats: 69200 tps
      - fetch_add only for concurrently accessed AddDBStats (this patch): 69579 tps
      Closes https://github.com/facebook/rocksdb/pull/2505
      
      Differential Revision: D5330656
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: af64d7bee135b0e86b4fac323a4f9d9113eaa383
  39. 28 Apr 2017 · 1 commit
  40. 22 Apr 2017 · 1 commit