1. 17 Sep, 2022 (3 commits)
    • Add enable_split_merge option for CompressedSecondaryCache (#10690) · 2cc5b395
      Committed by gitbw95
      Summary:
      `enable_custom_split_merge` is added for enabling the custom split and merge feature, which splits the compressed value into chunks so that they may better fit jemalloc bins.
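The chunking idea can be sketched roughly as follows (illustrative Python, not RocksDB's actual code; the bin sizes are assumptions standing in for jemalloc's size classes):

```python
# Hypothetical sketch: split a compressed value into chunks whose sizes
# match a set of assumed jemalloc bin sizes, so that each chunk's
# allocation wastes little slack space within its bin.
JEMALLOC_BINS = [64, 128, 256, 512, 1024, 2048, 4096]  # assumed bin sizes

def split_into_chunks(value: bytes, bins=JEMALLOC_BINS):
    """Greedily cut `value` into chunks, each sized to the largest bin
    that fits the remaining data; a tail smaller than the smallest bin
    is emitted at its exact remaining size."""
    chunks = []
    pos = 0
    while pos < len(value):
        remaining = len(value) - pos
        fitting = [b for b in bins if b <= remaining]
        size = max(fitting) if fitting else remaining
        chunks.append(value[pos:pos + size])
        pos += size
    return chunks

def merge_chunks(chunks):
    """Inverse of split_into_chunks: reassemble the original value."""
    return b"".join(chunks)
```

Splitting is lossless by construction: merging the chunks reproduces the original compressed value.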
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10690
      
      Test Plan:
      Unit Tests
      Stress Tests
      
      Reviewed By: anand1976
      
      Differential Revision: D39567604
      
      Pulled By: gitbw95
      
      fbshipit-source-id: f6d1d46200f365220055f793514601dcb0edc4b7
    • Fix an incorrect MultiGet assertion (#10695) · e053ccde
      Committed by anand76
      Summary:
      The assertion in ```FilePickerMultiGet::ReplaceRange()``` was incorrect. The function should only be called to replace the range after finishing the search in the current level, which is indicated by ```hit_file_ == nullptr```, i.e., no more overlapping files in this level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10695
      
      Reviewed By: gitbw95
      
      Differential Revision: D39583217
      
      Pulled By: anand1976
      
      fbshipit-source-id: d4cedfb2b62fb9f3a083e9848a403ae6342f0519
    • Call experimental new clock cache HyperClockCache (#10684) · 0f91c72a
      Committed by Peter Dillinger
      Summary:
      This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons:
      * We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache."
      * We want to highlight the key feature: it's fast (especially under parallel load)
      * Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements.
      * We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.)
      
      Some API detail:
      * To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`.
      * Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated.
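The `MakeSharedCache()` pattern above can be sketched like this (Python used for brevity; the class and method names mirror the commit's, but the shape is illustrative, not RocksDB's actual C++ API):

```python
from dataclasses import dataclass

# Sketch of the "options object owns the factory" pattern: the options
# struct constructs the cache itself, so adding a new option never
# requires touching a long New*Cache() parameter list. Field names are
# illustrative only.
@dataclass
class HyperClockCacheOptions:
    capacity: int
    estimated_entry_charge: int  # required estimate, unlike LRUCache
    num_shard_bits: int = -1     # -1: compute a default shard count

    def make_shared_cache(self):
        return HyperClockCache(self)

class HyperClockCache:
    def __init__(self, opts: HyperClockCacheOptions):
        self.opts = opts

opts = HyperClockCacheOptions(capacity=1 << 30, estimated_entry_charge=8 << 10)
cache = opts.make_shared_cache()
```

Note that the estimated entry charge is a required constructor argument here, reflecting the API difference from the old clock cache called out above.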
      
      For performance tests see https://github.com/facebook/rocksdb/pull/10626
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684
      
      Test Plan: no interesting functional changes; tests updated
      
      Reviewed By: anand1976
      
      Differential Revision: D39547800
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf
  2. 16 Sep, 2022 (7 commits)
    • Revamp, optimize new experimental clock cache (#10626) · 57243486
      Committed by Peter Dillinger
      Summary:
      * Consolidates most metadata into a single word per slot so that more
      can be accomplished with a single atomic update. In the common case,
      Lookup was previously about 4 atomic updates, now just 1 atomic update.
      Common case Release was previously 1 atomic read + 1 atomic update,
      now just 1 atomic update.
      * Eliminate spins / waits / yields, which likely threaten some "lock free"
      benefits. Compare-exchange loops are only used in explicit Erase, and
      strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
      exchange.
      * Relaxes some aggressiveness and guarantees. For example,
        * Duplicate Inserts will sometimes go undetected and the shadow duplicate
          will age out with eviction.
        * In many cases, the older Inserted value for a given cache key will be kept
          (i.e. Insert does not support overwrite).
        * Entries explicitly erased (rather than evicted) might not be freed
          immediately in some rare cases.
        * With strict_capacity_limit=false, capacity limit is not tracked/enforced as
          precisely as LRUCache, but is self-correcting and should only deviate by a
          very small number of extra or fewer entries.
      * Use smaller "computed default" number of cache shards in many cases,
      because benefits to larger usage tracking / eviction pools outweigh the small
      cost of more lock-free atomic contention. The improvement in CPU and I/O
      is dramatic in some limit-memory cases.
      * Even without the sharding change, the eviction algorithm is likely more
      effective than LRU overall because it's more stateful, even though the
      "hot path" state tracking for it is essentially free with ref counting. It
      is like a generalized CLOCK with aging (see code comments). I don't have
      performance numbers showing a specific improvement, but in theory, for a
      Poisson access pattern to each block, keeping some state allows better
      estimation of time to next access (Poisson interval) than strict LRU. The
      bounded randomness in CLOCK can also reduce "cliff" effect for repeated
      range scans approaching and exceeding cache size.
      
      ## Hot path algorithm comparison
      Rough descriptions, focusing on number and kind of atomic operations:
      * Old `Lookup()` (2-5 atomic updates per probe):
      ```
      Loop:
        Increment internal ref count at slot
        If possible hit:
          Check flags atomic (and non-atomic fields)
          If cache hit:
            Three distinct updates to 'flags' atomic
            Increment refs for internal-to-external
            Return
        Decrement internal ref count
      while atomic read 'displacements' > 0
      ```
      * New `Lookup()` (1-2 atomic updates per probe):
      ```
      Loop:
        Increment acquire counter in meta word (optimistic)
        If visible entry (already read meta word):
          If match (read non-atomic fields):
            Return
          Else:
            Decrement acquire counter in meta word
        Else if invisible entry (rare, already read meta word):
          Decrement acquire counter in meta word
      while atomic read 'displacements' > 0
      ```
      * Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
      ```
      Read atomic ref count
      If last reference and invisible (rare):
        Use CAS etc. to remove
        Return
      Else:
        Decrement ref count
      ```
      * New `Release()` (1 unconditional atomic update, rarely more):
      ```
      Increment release counter in meta word
      If last reference and invisible (rare):
        Use CAS etc. to remove
        Return
      ```
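As a rough illustration of the consolidated meta word, here is a minimal Python sketch of packing the entry state and the acquire/release counters into a single 64-bit value. The bit widths and field layout are assumptions; the real implementation uses a C++ std::atomic, and this sketch ignores concurrency entirely and only demonstrates the bit packing:

```python
# One word holds: acquire counter (bits 0-19), release counter (bits
# 20-39), and state bits (bit 40+). A Lookup or Release then needs only
# a single fetch_add on this word, as in the pseudocode above.
ACQUIRE_SHIFT, RELEASE_SHIFT, STATE_SHIFT = 0, 20, 40
COUNT_MASK = (1 << 20) - 1
STATE_VISIBLE = 1

def lookup_acquire(meta: int) -> int:
    # One (would-be atomic) increment of the acquire counter, optimistic.
    return meta + (1 << ACQUIRE_SHIFT)

def release(meta: int) -> int:
    # One unconditional increment of the release counter.
    return meta + (1 << RELEASE_SHIFT)

def refs(meta: int) -> int:
    # Outstanding references = acquires minus releases.
    acquires = (meta >> ACQUIRE_SHIFT) & COUNT_MASK
    releases = (meta >> RELEASE_SHIFT) & COUNT_MASK
    return acquires - releases

meta = STATE_VISIBLE << STATE_SHIFT
meta = lookup_acquire(meta)  # a reader takes a reference: 1 update
meta = release(meta)         # and gives it back: 1 update
```

Because both counters only ever increment, neither Lookup nor Release needs a compare-exchange loop in the common case; the reference count is recovered by subtraction.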
      
      ## Performance test setup
      Build DB with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
      ```
      Test with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
      ```
      Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
      
      base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
      folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
      gt_clock: experimental ClockCache before this change
      new_clock: experimental ClockCache with this change
      
      ## Performance test results
      First test "hot path" read performance, with block cache large enough for whole DB:
      4181MB 1thread base -> kops/s: 47.761
      4181MB 1thread folly -> kops/s: 45.877
      4181MB 1thread gt_clock -> kops/s: 51.092
      4181MB 1thread new_clock -> kops/s: 53.944
      
      4181MB 16thread base -> kops/s: 284.567
      4181MB 16thread folly -> kops/s: 249.015
      4181MB 16thread gt_clock -> kops/s: 743.762
      4181MB 16thread new_clock -> kops/s: 861.821
      
      4181MB 24thread base -> kops/s: 303.415
      4181MB 24thread folly -> kops/s: 266.548
      4181MB 24thread gt_clock -> kops/s: 975.706
      4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
      
      4181MB 32thread base -> kops/s: 311.251
      4181MB 32thread folly -> kops/s: 274.952
      4181MB 32thread gt_clock -> kops/s: 1045.98
      4181MB 32thread new_clock -> kops/s: 1370.38
      
      4181MB 48thread base -> kops/s: 310.504
      4181MB 48thread folly -> kops/s: 268.322
      4181MB 48thread gt_clock -> kops/s: 1195.65
      4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
      
      4181MB 64thread base -> kops/s: 307.839
      4181MB 64thread folly -> kops/s: 272.172
      4181MB 64thread gt_clock -> kops/s: 1204.47
      4181MB 64thread new_clock -> kops/s: 1615.37
      
      4181MB 128thread base -> kops/s: 310.934
      4181MB 128thread folly -> kops/s: 267.468
      4181MB 128thread gt_clock -> kops/s: 1188.75
      4181MB 128thread new_clock -> kops/s: 1595.46
      
      Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
      
      Now test a large block cache with low miss ratio, but some eviction is required:
      1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
      1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
      1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
      1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
      
      1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
      1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
      1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
      1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
      
      1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
      1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
      1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
      1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
      
      610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
      610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
      610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
      610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
      
      610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
      610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
      610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
      610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
      
      610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
      610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
      610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
      610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
      
      The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
      
      Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
      
      233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
      233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
      233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
      233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
      
      233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
      233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
      233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
      233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
      
      233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
      233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
      233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
      233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
      
      89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
      89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
      89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
      89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
      
      89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
      89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
      89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
      89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
      
      89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
      89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
      89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
      89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
      
      ^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
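The shard-imbalance effect described above can be illustrated with a toy simulation (not db_bench data; the entry count and seed are arbitrary assumptions):

```python
import random

# Toy illustration: when a small cache holds only a few hundred large
# entries (e.g. full filter blocks), hashing them across 64 shards
# leaves some shards far over or under their slice of capacity, while
# 2 shards stay comparatively balanced.
def max_shard_imbalance(num_entries, num_shards, seed=42):
    rng = random.Random(seed)
    loads = [0] * num_shards
    for _ in range(num_entries):
        loads[rng.randrange(num_shards)] += 1
    expected = num_entries / num_shards
    # Worst relative deviation of any shard from its fair share.
    return max(abs(load - expected) / expected for load in loads)

few_entries = 256  # assumed: a small cache of large filter blocks
imbalance_64 = max_shard_imbalance(few_entries, 64)
imbalance_2 = max_shard_imbalance(few_entries, 2)
```

With only a few hundred entries, the worst-case 64-shard deviation is typically several times the 2-shard deviation, which is the churn mechanism described above.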
      
      Even smaller cache size:
      34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
      34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
      34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
      34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
      
      34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
      34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
      34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
      34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
      
      34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
      34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
      34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
      34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
      
      As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (the 13% miss rate above is much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
      
      13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
      13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
      13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
      13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
      
      13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
      13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
      13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
      13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
      
      13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
      13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
      13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
      13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
      
      gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, though still high enough that eviction CPU time clearly offsets the gains from reduced mutex contention:
      
      13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
      13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
      13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
      13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
      
      13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
      13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
      13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
      13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
      
      Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
      
      Reviewed By: anand1976
      
      Differential Revision: D39368406
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
    • Fix some MultiGet stats (#10673) · 37b75e13
      Committed by anand76
      Summary:
      The stats were not accurate for the coroutine version of MultiGet. This PR fixes it.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10673
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D39492615
      
      Pulled By: anand1976
      
      fbshipit-source-id: b46c04e15ea27e66f4c31f00c66497aa283bf9d3
    • Re-enable user-defined timestamp and subcompactions (#10689) · 088b9844
      Committed by Yanqin Jin
      Summary:
      Hopefully, we can re-enable the combination of user-defined timestamp and subcompactions
      after https://github.com/facebook/rocksdb/issues/10658.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10689
      
      Test Plan:
      Make sure the following succeeds on devserver.
      make crash_test_with_ts
      
      Reviewed By: ltamasi
      
      Differential Revision: D39556558
      
      Pulled By: riversand963
      
      fbshipit-source-id: 4695f420b1bc9ebf3b24640b693746f4db82c149
    • Fix a MultiGet crash (#10688) · c206aebd
      Committed by anand76
      Summary:
      Fix a bug in the async IO/coroutine version of MultiGet that may cause a segfault or assertion failure due to accessing an invalid file index in a LevelFilesBrief. The bug is that when a MultiGetRange is split into two, we may re-process keys in the original range that were already marked to be skipped (in ```current_level_range_```) due to not overlapping the level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10688
      
      Reviewed By: gitbw95
      
      Differential Revision: D39556131
      
      Pulled By: anand1976
      
      fbshipit-source-id: 65e79438508a283cb19e64eca5c91d0714b81458
    • move db_stress locking to `StressTest::Test*()` functions (#10678) · 6ce782be
      Committed by Andrew Kryczka
      Summary:
      One problem with the previous strategy was that `NonBatchedOpsStressTest::TestIngestExternalFile()` could release the lock for `rand_keys[0]` in `rand_column_families[0]`, and then subsequent operations in the same loop iteration (e.g., `TestPut()`) would run without locking. This PR changes the strategy so that each `Test*()` function is responsible for acquiring and releasing its own locks.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10678
      
      Reviewed By: hx235
      
      Differential Revision: D39516401
      
      Pulled By: ajkr
      
      fbshipit-source-id: bf67f12ebbd293ba8c24fdf8754ff28737bcd758
    • Support JemallocNodumpAllocator for the block/blob cache in db_bench (#10685) · 7dad4852
      Committed by Levi Tamasi
      Summary:
      The patch makes it possible to use the `JemallocNodumpAllocator` with the
      block/blob caches in `db_bench`. In addition to its stated purpose of excluding
      cache contents from core dumps, `JemallocNodumpAllocator` also uses
      a dedicated arena and jemalloc tcaches for cache allocations, which can
      reduce fragmentation and thus memory usage.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10685
      
      Reviewed By: riversand963
      
      Differential Revision: D39552261
      
      Pulled By: ltamasi
      
      fbshipit-source-id: b5c58eab6b7c1baa9a307d9f1248df1d7a77d2b5
    • Disable PersistentCacheTierTest.BasicTest (#10683) · b418ace3
      Committed by Bo Wang
      Summary:
      Disable this flaky test since PersistentCache is not used.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10683
      
      Test Plan: Unit Tests
      
      Reviewed By: cbi42
      
      Differential Revision: D39545974
      
      Pulled By: gitbw95
      
      fbshipit-source-id: ac53e96f6ba880e7612e325eb5ff22ee2799efed
  3. 15 Sep, 2022 (7 commits)
    • Tiered Storage feature doesn't support BlobDB yet (#10681) · 1cdc8411
      Committed by Jay Zhuang
      Summary:
      Disable the tiered storage + BlobDB test.
      Also enable a different hot data setting for tiered compaction.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10681
      
      Reviewed By: ajkr
      
      Differential Revision: D39531941
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: aa0595eb38d03f17638d300d2e4cc9061429bf61
    • Refactor Compaction file cut `ShouldStopBefore()` (#10629) · 849cf1bf
      Committed by Jay Zhuang
      Summary:
      Consolidate the compaction output cut logic into `ShouldStopBefore()` and move
      it inside the CompactionOutputs class.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10629
      
      Reviewed By: cbi42
      
      Differential Revision: D39315536
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 7d81037babbd35c276bbaad02dbc2bb555fdac18
    • Fix a bug by setting up subcompaction bounds properly (#10658) · ce2c11d8
      Committed by Yanqin Jin
      Summary:
      When user-defined timestamp is enabled, subcompaction bounds should be set up properly. When creating InputIterator for the compaction, the `start` and `end` should have their timestamp portions set to kMaxTimestamp, which is the highest possible timestamp. This is similar to what we do with setting up their sequence numbers to `kMaxSequenceNumber`.
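A rough sketch of the bound construction (the 8-byte big-endian encoding and helper names are assumptions, not RocksDB's exact internal key layout):

```python
# Sketch of the fix's idea: a subcompaction boundary is a user key plus
# a timestamp, and to make the boundary cover every version of that user
# key, the timestamp portion is set to the maximum value (all 0xFF bytes
# here), analogous to using kMaxSequenceNumber for sequence numbers.
TS_SIZE = 8  # assumed user-defined timestamp width in bytes
K_MAX_TIMESTAMP = b"\xff" * TS_SIZE

def subcompaction_bound(user_key: bytes) -> bytes:
    # Append the max timestamp so that every (key, ts) entry for this
    # user key falls within the bound, regardless of its timestamp.
    return user_key + K_MAX_TIMESTAMP

def key_with_ts(user_key: bytes, ts: int) -> bytes:
    # A stored entry: user key followed by its (assumed) encoded timestamp.
    return user_key + ts.to_bytes(TS_SIZE, "big")
```

Without the max timestamp, a bound encoding some smaller timestamp would exclude newer versions of the same user key from the subcompaction's input range.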
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10658
      
      Test Plan:
      ```bash
      make check
      rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress \
        --allow_data_in_errors=True --clear_column_family_one_in=0 \
        --continuous_verification_interval=0 --data_block_index_type=1 \
        --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 \
        --delrangepercent=0 \
        --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected \
        --iterpercent=0 --max_background_compactions=20 \
        --max_bytes_for_level_base=10485760 --max_key=25000000 \
        --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 \
        --ops_per_thread=300000 --paranoid_file_checks=1 --partition_filters=0 \
        --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 \
        --snapshot_hold_ops=100000 --subcompactions=4 \
        --target_file_size_base=65536 --target_file_size_multiplier=2 \
        --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 \
        --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 \
        --write_buffer_size=65536 --writepercent=60 -disable_wal=1 \
        -column_families=1
      ```
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D39393402
      
      Pulled By: riversand963
      
      fbshipit-source-id: f276e35b19fce51a175c368a502fb0718d1f3871
    • Fix data race in accessing `cached_range_tombstone_` (#10680) · be04a3b6
      Committed by Changyu Bi
      Summary:
      Fix a data race introduced in https://github.com/facebook/rocksdb/issues/10547 (P5295241720), first reported by pdillinger. The race is between the `std::atomic_load_explicit` in NewRangeTombstoneIteratorInternal and the `std::atomic_store_explicit` in MemTable::Add() that operate on `cached_range_tombstone_`. P5295241720 shows that `atomic_store_explicit` initializes a mutex which `atomic_load_explicit` could be trying to call `lock()` on at the same time. This fix moves the initialization to the memtable constructor.
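The shape of the fix can be sketched as follows (illustrative Python; the real race involved C++ atomic operations on a shared_ptr, which this sketch does not reproduce, and it only shows eager construction versus lazy initialization):

```python
import threading

# Sketch: the race came from one thread's store initializing state (a
# mutex inside the shared structure) while another thread's load was
# already using it. Constructing that state eagerly in the owner's
# constructor guarantees it is fully built before any reader or writer
# can touch it. Names are illustrative, not RocksDB's.
class FragmentedTombstoneCache:
    def __init__(self):
        self.lock = threading.Lock()  # the state the race touched
        self.tombstones = []

class MemTable:
    def __init__(self):
        # Fix: initialize the cached object here, before the memtable
        # becomes visible to readers, instead of lazily inside Add().
        self.cached_range_tombstone = FragmentedTombstoneCache()

    def add_range_tombstone(self, start, end):
        cache = self.cached_range_tombstone
        with cache.lock:  # safe: the lock existed at construction time
            cache.tombstones.append((start, end))
```

Eager construction sidesteps the need to synchronize first-time initialization against concurrent readers at all.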
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10680
      
      Test Plan: `USE_CLANG=1 COMPILE_WITH_TSAN=1 make -j24 whitebox_crash_test`
      
      Reviewed By: ajkr
      
      Differential Revision: D39528696
      
      Pulled By: cbi42
      
      fbshipit-source-id: ee740841044438e18ad2b8ea567444dd542dd8e2
    • Reset pessimistic transaction's read/commit timestamps during Initialize() (#10677) · 832fd644
      Committed by Yanqin Jin
      Summary:
      RocksDB allows reusing old `Transaction` objects when creating new ones. Therefore, we need to
      reset the transaction's read and commit timestamps back to the default value `kMaxTxnTimestamp`.
      Otherwise, `CommitAndTryCreateSnapshot()` may fail with `Status::InvalidArgument("Different commit ts specified")`.
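The reuse-and-reset pattern can be sketched like this (hypothetical Python; `K_MAX_TXN_TIMESTAMP` stands in for RocksDB's sentinel and the commit check is simplified):

```python
K_MAX_TXN_TIMESTAMP = 2**64 - 1  # assumed sentinel meaning "not set"

# Sketch: when an old Transaction object is recycled, Initialize() must
# put the read/commit timestamps back to their defaults, or a stale
# commit_ts from the previous use would later trip the
# "Different commit ts specified" check. Names are illustrative.
class Transaction:
    def __init__(self):
        self.initialize()

    def initialize(self):
        # Reset per-use state; called both on creation and on reuse.
        self.read_ts = K_MAX_TXN_TIMESTAMP
        self.commit_ts = K_MAX_TXN_TIMESTAMP

    def commit_with_ts(self, ts):
        if self.commit_ts != K_MAX_TXN_TIMESTAMP and self.commit_ts != ts:
            raise ValueError("Different commit ts specified")
        self.commit_ts = ts

txn = Transaction()
txn.commit_with_ts(100)
txn.initialize()         # reuse the same object for a new transaction
txn.commit_with_ts(200)  # succeeds because timestamps were reset
```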
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10677
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D39513543
      
      Pulled By: riversand963
      
      fbshipit-source-id: bea01cac149bff3a23a2978fc0c3b198243a6291
    • Add comments describing {Put,Get}Entity, update/clarify comment for Get and iterator (#10676) · 87c8bb4b
      Committed by Levi Tamasi
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10676
      
      Reviewed By: riversand963
      
      Differential Revision: D39512081
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 55704478ceb8081003eceeb0c5a3875cb806587e
    • Bypass a MultiGet test when async_io is used (#10669) · bb9a6d4e
      Committed by anand76
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10669
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D39492658
      
      Pulled By: anand1976
      
      fbshipit-source-id: abef79808e30762654680f7dd7e46487c631febc
  4. 14 Sep, 2022 (3 commits)
    • Change MultiGet multi-level optimization to default on (#10671) · 7b11d484
      Committed by anand76
      Summary:
      Change the ```ReadOptions.optimize_multiget_for_io``` flag to default on. It doesn't impact regular MultiGet users, as it's only applicable when ```ReadOptions.async_io``` is also set to true.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10671
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D39477439
      
      Pulled By: anand1976
      
      fbshipit-source-id: 47abcdbfa69f9bc60422ab68a238b232e085d4ba
    • Add wide-column support to iterators (#10670) · 06ab0a8b
      Committed by Levi Tamasi
      Summary:
      The patch extends the iterator API with a new `columns` method which
      can be used to retrieve all wide columns for the current key. Similarly to
      the `Get` and `GetEntity` APIs, the classic `value` API returns the value
      of the default (anonymous) column for wide-column entities, and `columns`
      returns an entity with a single default column for plain old key-values.
      (The goal here is to maintain the invariant that `value()` is the same as
the value of the default column in `columns()`.) The patch also involves a
smaller refactoring: historically, `value()` was computed on the fly — the
`Slice` to be returned was decided at call time based on the direction of
the iteration, whether a merge had been performed, and so on; with the
patch, the value to be exposed is stored in a member `Slice value_` when
the iterator lands on a new key, and `value()` simply returns this `Slice`.
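The invariant above can be sketched in Python (a hypothetical model, not the RocksDB C++ iterator API): `value()` always equals the default (anonymous) column in `columns()`, whether the entry is a wide-column entity or a plain key-value, and both are cached when the iterator lands on a new key.

```python
DEFAULT_COLUMN = ""  # stand-in for RocksDB's anonymous default column

class WideColumnIterator:
    def __init__(self, entries):
        # entries: list of (key, payload); a dict payload models a
        # wide-column entity, a plain string models an old-style key-value.
        self._entries = entries
        self._pos = -1
        self._value = None      # cached when landing on a new key
        self._columns = None

    def next(self):
        self._pos += 1
        if self._pos >= len(self._entries):
            return False
        _, payload = self._entries[self._pos]
        if isinstance(payload, dict):
            # Wide-column entity: value() exposes the default column.
            self._columns = dict(payload)
            self._value = payload.get(DEFAULT_COLUMN, "")
        else:
            # Plain key-value: columns() exposes a single default column.
            self._value = payload
            self._columns = {DEFAULT_COLUMN: payload}
        return True

    def value(self):
        return self._value

    def columns(self):
        return self._columns

it = WideColumnIterator([
    ("k1", "plain"),
    ("k2", {DEFAULT_COLUMN: "v0", "attr": "v1"}),
])
while it.next():
    assert it.value() == it.columns()[DEFAULT_COLUMN]
```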
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10670
      
      Test Plan: Ran `make check` and a simple blackbox crash test.
      
      Reviewed By: riversand963
      
      Differential Revision: D39475551
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 29e7a6ed9ef340841aab36803b832b7c8f668b0b
      06ab0a8b
    • C
      Cache fragmented range tombstone list for mutable memtables (#10547) · f291eefb
      Changyu Bi 提交于
      Summary:
Each read from a memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. https://github.com/facebook/rocksdb/issues/10380 improved this inefficiency by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone list can be constructed in either the read path (this PR) or the write path (https://github.com/facebook/rocksdb/issues/10584). With either implementation, each `DeleteRange()` invalidates the cache, and the difference is where the cache is re-constructed. `CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core has a shared_ptr to a shared_ptr pointing to the current cache. Each read thread only updates the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables.
      
The choice between the write path and the read path is not an easy one: both are improvements over the lack of caching in the current implementation, but they favor different operations and could cause a regression in the other operation (read vs. write). The write path caching in (https://github.com/facebook/rocksdb/issues/10584) leads to a cleaner implementation, but I chose the read path caching here to avoid a significant regression in write performance when there is a considerable number of range tombstones in a single memtable (the numbers from the benchmark below suggest >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance.
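The read-path caching idea can be sketched as follows (a hedged Python model; the real C++ `CoreLocalArray`/`shared_ptr` machinery is only approximated by an object cache): each `DeleteRange()` invalidates the cached fragmented list, and the first subsequent read rebuilds it, so reads between writes share one cached list.

```python
# Conceptual model of per-memtable range tombstone caching (read path).
class MemtableRangeDelCache:
    def __init__(self):
        self._tombstones = []      # raw (start, end) tombstones, append-only
        self._fragmented = None    # cached "FragmentedRangeTombstoneList"

    def delete_range(self, start, end):
        self._tombstones.append((start, end))
        self._fragmented = None    # every DeleteRange() invalidates the cache

    def fragmented_list(self):
        # Read path: rebuild only when the cache was invalidated.
        if self._fragmented is None:
            # Sorting stands in for the real fragmenting step.
            self._fragmented = sorted(self._tombstones)
        return self._fragmented

    def covers(self, key):
        return any(s <= key < e for s, e in self.fragmented_list())
```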
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10547
      
      Test Plan:
      - TestGet() in stress test is updated in https://github.com/facebook/rocksdb/issues/10553 to compare Get() result against expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
      - Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones.
      ```
      ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000
      ```
      Write perf regressed since the cost of constructing fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8e172dc5f1ef65c9d0a6eedbd9987b2c72 is included in the last column as a reference to see performance impact on multi-thread reads if `CoreLocalArray` is not used.
      
      micros/op averaged over 5 runs: first 4 columns are for fillrandom, last 4 columns are for readrandom.
      |   |fillrandom main           | write path caching          | read path caching          |memtable V3 (https://github.com/facebook/rocksdb/issues/10308)     | readrandom main            | write path caching           | read path caching            |memtable V3      |
|---|---|---|---|---|---|---|---|---|
      | 0                    |6.35                           |6.15                           |5.82                           |6.12                           |2.24                           |2.26                           |2.03                           |2.07                           |
      | 1                    |5.99                           |5.88                           |5.77                           |6.28                           |2.65                           |2.27                           |2.24                           |2.5                            |
      | 10                   |6.15                           |6.02                           |5.92                           |5.95                           |5.15                           |2.61                           |2.31                           |2.53                           |
      | 100                  |5.95                           |5.78                           |5.88                           |6.23                           |28.31                          |2.34                           |2.45                           |2.94                           |
      | 100 25 threads       |52.01                          |45.85                          |46.18                          |47.52                          |35.97                          |3.34                           |3.34                           |3.56                           |
      | 1000                 |6.0                            |7.07                           |5.98                           |6.08                           |333.18                         |2.86                           |2.7                            |3.6                            |
      | 1000 25 threads      |52.6                           |148.86                         |79.06                          |45.52                          |473.49                         |3.66                           |3.48                           |4.38                           |
      
  - Benchmark performance of `readwhilewriting` from https://github.com/facebook/rocksdb/issues/10552, with 100 range tombstones written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes`
      
      readrandom micros/op:
      |  |main            |write path caching           |read path caching            |memtable V3      |
      |---|---|---|---|---|
      | single thread        |48.28                          |1.55                           |1.52                           |1.96                           |
      | 25 threads           |64.3                           |2.55                           |2.67                           |2.64                           |
      
      Reviewed By: ajkr
      
      Differential Revision: D38895410
      
      Pulled By: cbi42
      
      fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559
      f291eefb
  5. 13 9月, 2022 4 次提交
    • A
      Async optimization in scan path (#10602) · 03fc4397
      Akanksha Mahajan 提交于
      Summary:
Optimizations:
1. In FilePrefetchBuffer, when data overlaps between two buffers, the data is copied from the first buffer and then from the second buffer into a third buffer to return one contiguous buffer. With this optimization, ReadAsync is called on the first buffer as soon as it is empty, instead of being blocked until the data has been copied out of the second buffer.
2. For a fixed readahead_size, FilePrefetchBuffer issues two async read calls during a seek: one of length + readahead_size_/2 on the first buffer (if the buffer is empty), and one of readahead_size_/2 on the second buffer.
      
      - Add readahead_size to db_stress for stress testing these changes in https://github.com/facebook/rocksdb/pull/10632
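The split described in point 2 can be sketched with a hypothetical helper (illustrative only, not the RocksDB implementation): the first async read covers the requested data plus half the readahead, and the second prefetches the other half into the second buffer.

```python
def plan_async_reads(offset, length, readahead_size):
    """Return the (offset, size) pairs for the two async reads issued on
    seek with a fixed readahead_size, per the description above."""
    # First buffer: the requested data plus readahead_size/2.
    first = (offset, length + readahead_size // 2)
    # Second buffer: the next readahead_size/2 bytes.
    second = (offset + first[1], readahead_size // 2)
    return first, second
```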
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10602
      
      Test Plan:
      - CircleCI tests
      - stress_test completed successfully
      export CRASH_TEST_EXT_ARGS="--async_io=1"
      make crash_test -j32
      - db_bench showed no regression
         With this PR:
      ```
       ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1
      Set seed to 1661876074584472 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      Integrated BlobDB: blob cache disabled
      RocksDB:    version 7.7.0
      Date:       Tue Aug 30 09:14:34 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    50000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    25939.9 MB (estimated)
      FileSize:   13732.9 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main1]
      seekrandom   :  270878.018 micros/op 3 ops/sec 30.068 seconds 111 operations;  618.7 MB/s (111 of 111 found)
      
       ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=true -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1
      Set seed to 1661875332862922 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      Integrated BlobDB: blob cache disabled
      RocksDB:    version 7.7.0
      Date:       Tue Aug 30 09:02:12 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    50000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    25939.9 MB (estimated)
      FileSize:   13732.9 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main1]
      seekrandom   :  358352.488 micros/op 2 ops/sec 30.102 seconds 84 operations;  474.4 MB/s (84 of 84 found)
      ```
      
      Without PR in main:
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1
      Set seed to 1661876425983045 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      Integrated BlobDB: blob cache disabled
      RocksDB:    version 7.7.0
      Date:       Tue Aug 30 09:20:26 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    50000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    25939.9 MB (estimated)
      FileSize:   13732.9 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main1]
      seekrandom   :  280881.953 micros/op 3 ops/sec 30.054 seconds 107 operations;  605.2 MB/s (107 of 107 found)
      
       ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=0
      Set seed to 1661876475267771 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      Integrated BlobDB: blob cache disabled
      RocksDB:    version 7.7.0
      Date:       Tue Aug 30 09:21:15 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    50000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    25939.9 MB (estimated)
      FileSize:   13732.9 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main1]
      seekrandom   :  363155.084 micros/op 2 ops/sec 30.142 seconds 83 operations;  468.1 MB/s (83 of 83 found)
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D39141328
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 560655922c1a437a8569c228abb31b8c0b413120
      03fc4397
    • A
      db_stress option to preserve all files until verification success (#10659) · 03c4ea26
      Andrew Kryczka 提交于
      Summary:
      In `db_stress`, DB and expected state files containing changes leading up to a verification failure are often deleted, which makes debugging such failures difficult. On the DB side, flushed WAL files and compacted SST files are marked obsolete and then deleted. Without those files, we cannot pinpoint where a key that failed verification changed unexpectedly. On the expected state side, files for verifying prefix-recoverability in the presence of unsynced data loss are deleted before verification. These include a baseline state file containing the expected state at the time of the last successful verification, and a trace file containing all operations since then. Without those files, we cannot know the sequence of DB operations expected to be recovered.
      
      This PR attempts to address this gap with a new `db_stress` flag: `preserve_unverified_changes`. Setting `preserve_unverified_changes=1` has two effects.
      
      First, prior to startup verification, `db_stress` hardlinks all DB and expected state files in "unverified/" subdirectories of `FLAGS_db` and `FLAGS_expected_values_dir`. The separate directories are needed because the pre-verification opening process deletes files written by the previous `db_stress` run as described above. These "unverified/" subdirectories are cleaned up following startup verification success.
      
      I considered other approaches for preserving DB files through startup verification, like using a read-only DB or preventing deletion of DB files externally, e.g., in the `Env` layer. However, I decided against it since such an approach would not work for expected state files, and I did not want to change the DB management logic. If there were a way to disable DB file deletions before regular DB open, I would have preferred to use that.
      
      Second, `db_stress` attempts to keep all DB and expected state files that were live at some point since the start of the `db_stress` run. This is a bit tricky and involves the following changes.
      
      - Open the DB with `disable_auto_compactions=1` and `avoid_flush_during_recovery=1`
      - DisableFileDeletions()
      - EnableAutoCompactions()
      
      For this part, too, I would have preferred to use a hypothetical API that disables DB file deletion before regular DB open.
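The hardlinking step can be sketched as follows (a hedged Python sketch with illustrative file and directory names; the real `db_stress` logic covers both `FLAGS_db` and `FLAGS_expected_values_dir`): before startup verification, snapshot every file via hard links so that verification-time deletions cannot destroy debugging evidence, and remove the snapshot only after verification succeeds.

```python
import os

def preserve_unverified(src_dir):
    """Hard-link every file in src_dir into an 'unverified/' subdirectory."""
    dst_dir = os.path.join(src_dir, "unverified")
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            os.link(src, os.path.join(dst_dir, name))  # hard link, no copy
    return dst_dir

def cleanup_after_success(dst_dir):
    """Called only after startup verification succeeds."""
    for name in os.listdir(dst_dir):
        os.unlink(os.path.join(dst_dir, name))
    os.rmdir(dst_dir)
```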
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10659
      
      Reviewed By: hx235
      
      Differential Revision: D39407454
      
      Pulled By: ajkr
      
      fbshipit-source-id: 6e981025c7dce147649d2e770728471395a7fa53
      03c4ea26
    • A
      Fix stress test failure for async_io (#10660) · bd2ad2f9
      Akanksha Mahajan 提交于
      Summary:
Sanitize initial_auto_readahead_size if it's greater than max_auto_readahead_size in the async_io case.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10660
      
Test Plan: Ran db_stress with initial_auto_readahead_size greater than max_auto_readahead_size.
      
      Reviewed By: anand1976
      
      Differential Revision: D39408095
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 07f933242f636cfbc7ccf042e0c8b959a8ec5f3a
      bd2ad2f9
    • H
      Inject spurious wakeup and sleep before acquiring db mutex to expose race condition (#10291) · f79b3d19
      Hui Xiao 提交于
      Summary:
      **Context/Summary:**
Previous experience with bugs and flaky tests taught us that RocksDB has features vulnerable to race conditions triggered by acquiring the db mutex at a particular timing. This PR aggressively exposes those vulnerable features by injecting spurious wakeups and sleeps so that the db mutex is acquired at various timings, in order to surface such race conditions.
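The injection idea can be sketched conceptually (Python `threading` stands in for the C++ db mutex; the class and flag names here are made up for illustration): when coercion is enabled, a short random sleep before each acquisition shifts the timing at which the mutex is taken, so latent race conditions surface more often.

```python
import random
import threading
import time

class CoercedMutex:
    """Wraps a lock and, when coercion is enabled, sleeps for a short
    random interval before acquiring it."""
    def __init__(self, coerce_context_switch=False):
        self._lock = threading.Lock()
        self._coerce = coerce_context_switch

    def lock(self):
        if self._coerce:
            time.sleep(random.uniform(0, 0.001))  # injected delay
        self._lock.acquire()

    def unlock(self):
        self._lock.release()
```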
      
      **Testing:**
      - `COERCE_CONTEXT_SWITCH=1 make -j56 check / make -j56 db_stress` should reveal
          - flaky tests caused by db mutex related race condition
             - Reverted https://github.com/facebook/rocksdb/pull/9528
             - A/B testing on `COMPILE_WITH_TSAN=1 make -j56 listener_test` w/ and w/o `COERCE_CONTEXT_SWITCH=1` followed by `./listener_test --gtest_filter=EventListenerTest.MultiCF --gtest_repeat=10`
       - `COERCE_CONTEXT_SWITCH=1` can cause the expected test failure (i.e., expose the target TSAN data race error) within 10 runs, while the other could not.
             - This proves our injection can expose flaky tests caused by db mutex related race condition faster.
          -  known or new race-condition-type of internal bug by continuously running this PR
      - Performance
   - High ops-threads time: COERCE_CONTEXT_SWITCH=1 regressed to roughly 4x slower (2:01.16 vs 0:22.10 elapsed). This PR will be run as a separate CI job, so this regression won't affect any existing job.
      ```
      TEST_TMPDIR=$db /usr/bin/time ./db_stress \
      --ops_per_thread=100000 --expected_values_dir=$exp --clear_column_family_one_in=0 \
--write_buffer_size=524288 --target_file_size_base=524288 --ingest_external_file_one_in=100 --compact_files_one_in=1000 --compact_range_one_in=1000
      ```
  - Start-up time: COERCE_CONTEXT_SWITCH=1 regressed start-up time by less than 25% (0:01.51 vs 0:01.29 elapsed)
      ```
      TEST_TMPDIR=$db ./db_stress -ops_per_thread=100000000 -expected_values_dir=$exp --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress
      
      TEST_TMPDIR=$db /usr/bin/time ./db_stress \
      --ops_per_thread=1 -reopen=0 --expected_values_dir=$exp --clear_column_family_one_in=0 --destroy_db_initially=0
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10291
      
      Reviewed By: ajkr
      
      Differential Revision: D39231182
      
      Pulled By: hx235
      
      fbshipit-source-id: 7ab6695430460e0826727fd8c66679b32b3e44b6
      f79b3d19
  6. 12 9月, 2022 1 次提交
    • A
      Build and link libfolly with RocksDB (#10103) · be09943f
      anand76 提交于
      Summary:
The current integration with folly requires cherry-picking folly source files to include in RocksDB for external CI builds. It's not scalable as we depend on more folly features, such as coroutines. This PR adds a dependency from RocksDB to the folly library when ```USE_FOLLY``` or ```USE_COROUTINES``` is set. We build folly using the build scripts in ```third-party/folly```, relying on it to download and build its dependencies. A new ```Makefile``` target, ```build_folly```, is provided to make building folly easier.
      
      A new option, ```USE_FOLLY_LITE``` is added to retain the old model of compiling selected folly sources with RocksDB. This might be useful for short-term development.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10103
      
      Reviewed By: pdillinger
      
      Differential Revision: D38426787
      
      Pulled By: anand1976
      
      fbshipit-source-id: 33bc84abd9fdc7e2567749f02aa1b2494eb62b2f
      be09943f
  7. 10 9月, 2022 2 次提交
  8. 09 9月, 2022 4 次提交
    • A
      minor cleanups to db_crashtest.py (#10654) · 4100eb30
      Andrew Kryczka 提交于
      Summary:
      Expanded `all_params` to include all parameters crash test may set. Previously, `atomic_flush` was not included in `all_params` and thus was not visible to `finalize_and_sanitize()`. The consequence was manual crash test runs could provide unsafe combinations of parameters to `db_stress`. For example, running `db_crashtest.py` with `-atomic_flush=0` could cause `db_stress` to run with `-atomic_flush=0 -disable_wal=1`, which is known to produce inconsistencies across column families.
      
      While expanding `all_params`, I found we cannot have an entry in it for both `db_stress` and `db_crashtest.py`. So I renamed `enable_tiered_storage` to `test_tiered_storage` for `db_crashtest.py`, which appears more conventional anyways.
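A hypothetical miniature of `finalize_and_sanitize()` shows why visibility matters: once `atomic_flush` is part of the parameter set the function sees, unsafe combinations can be rewritten before they reach `db_stress`. (The real script has many more rules; the specific rewrite below is one possible rule, chosen for illustration.)

```python
def finalize_and_sanitize(params):
    sanitized = dict(params)
    # atomic_flush=0 with disable_wal=1 is known to produce inconsistencies
    # across column families, so force the WAL back on in that combination.
    if sanitized.get("atomic_flush") == 0 and sanitized.get("disable_wal") == 1:
        sanitized["disable_wal"] = 0
    return sanitized
```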
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10654
      
      Reviewed By: hx235
      
      Differential Revision: D39369349
      
      Pulled By: ajkr
      
      fbshipit-source-id: 31d9010c760c868b20d5e9bd78ba75c8ff3ce348
      4100eb30
    • G
      Add PerfContext counters for CompressedSecondaryCache (#10650) · 0148c493
      gitbw95 提交于
      Summary:
      Add PerfContext counters for CompressedSecondaryCache.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10650
      
      Test Plan: Unit Tests.
      
      Reviewed By: anand1976
      
      Differential Revision: D39354712
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 1b90d3df99d08ddecd351edfd48d1e3723fdbc15
      0148c493
    • Y
      Fix overlapping check by excluding timestamp (#10615) · 3d67d791
      Yanqin Jin 提交于
      Summary:
With user-defined timestamps, overlap checking should exclude the
timestamp part of the key. This has already been done for the range
checking of files in sstableKeyCompare(), but not yet when checking
against concurrent compactions.
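The idea can be illustrated with a hypothetical helper (not sstableKeyCompare itself): with an N-byte timestamp appended to each user key, overlap checks must compare only the key prefix, or two versions of the same key would incorrectly look disjoint.

```python
TS_SIZE = 8  # e.g. user_timestamp_size=8

def strip_ts(key):
    # Drop the trailing timestamp bytes from a user key.
    return key[:-TS_SIZE]

def ranges_overlap(start_a, end_a, start_b, end_b):
    # Compare user keys with the timestamp excluded (closed ranges).
    a0, a1 = strip_ts(start_a), strip_ts(end_a)
    b0, b1 = strip_ts(start_b), strip_ts(end_b)
    return a0 <= b1 and b0 <= a1
```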
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615
      
      Test Plan:
      (Will add more tests)
      
      make check
      (Repro seems easier with this commit sha: git checkout 78bbdef5)
      rm -rf /dev/shm/rocksdb/* &&
      mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected &&
      ./db_stress
      --allow_data_in_errors=True --clear_column_family_one_in=0
      --continuous_verification_interval=0 --data_block_index_type=1
      --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5
      --delrangepercent=0
      --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected
      --iterpercent=0 --max_background_compactions=20
      --max_bytes_for_level_base=10485760 --max_key=25000000
      --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1
      --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0
      --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0
      --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3
      --target_file_size_base=65536 --target_file_size_multiplier=2
      --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1
      --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1
      --write_buffer_size=65536 --writepercent=60 -disable_wal=1
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D39146797
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
      3d67d791
    • L
      Eliminate some allocations/copies around the blob cache (#10647) · fe56cb9a
      Levi Tamasi 提交于
      Summary:
      Historically, `BlobFileReader` has returned the blob(s) read from the file
      in the `PinnableSlice` provided by the client. This interface was
      preserved when caching was implemented for blobs, which meant that
      the blob data was copied multiple times when caching was in use: first,
      into the client-provided `PinnableSlice` (by `BlobFileReader::SaveValue`),
      and then, into the object stored in the cache (by `BlobSource::PutBlobIntoCache`).
      The patch eliminates these copies and the related allocations by changing
      `BlobFileReader` so it returns its results in the form of heap-allocated `BlobContents`
      objects that can be directly inserted into the cache. The allocations backing
      these `BlobContents` objects are made using the blob cache's allocator if the
      blobs are to be inserted into the cache (i.e. if a cache is configured and the
      `fill_cache` read option is set). Note: this PR focuses on the common case when
      blobs are compressed; some further small optimizations are possible for uncompressed
      blobs.
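The before/after data flow can be sketched conceptually (Python object identity stands in for C++ allocations; the function names and dict-as-cache are illustrative, not the RocksDB API):

```python
def read_blob_old(file_bytes, cache, key):
    pinnable = bytearray(file_bytes)    # copy 1: into the caller's buffer
    cache[key] = bytearray(pinnable)    # copy 2: into the cached object
    return pinnable

def read_blob_new(file_bytes, cache, key, fill_cache=True):
    contents = bytearray(file_bytes)    # single allocation ("BlobContents")
    if fill_cache:
        cache[key] = contents           # cache and reader share one object
    return contents
```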
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10647
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D39335185
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 464503d60a5520d654c8273ffb8efd5d1bcd7b36
      fe56cb9a
  9. 08 9月, 2022 2 次提交
    • P
      Always verify SST unique IDs on SST file open (#10532) · 6de7081c
      Peter Dillinger 提交于
      Summary:
      Although we've been tracking SST unique IDs in the DB manifest
      unconditionally, checking has been opt-in and with an extra pass at DB::Open
      time. This changes the behavior of `verify_sst_unique_id_in_manifest` to
      check unique ID against manifest every time an SST file is opened through
      table cache (normal DB operations), replacing the explicit pass over files
      at DB::Open time. This change also enables the option by default and
      removes the "EXPERIMENTAL" designation.
      
      One possible criticism is that the option no longer ensures the integrity
      of a DB at Open time. This is far from an all-or-nothing issue. Verifying
      the IDs of all SST files hardly ensures all the data in the DB is readable.
      (VerifyChecksum is supposed to do that.) Also, with
      max_open_files=-1 (default, extremely common), all SST files are
      opened at DB::Open time anyway.
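The per-open check amounts to the following (a sketch with hypothetical names; the real check lives in the table cache when an SST file is opened):

```python
def verify_unique_id(manifest_ids, file_number, actual_unique_id):
    """Compare the unique id read from the SST file's table properties
    against the id recorded for that file in the manifest."""
    expected = manifest_ids.get(file_number)
    if expected is None:
        return True  # no id tracked for this file; nothing to verify
    return expected == actual_unique_id
```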
      
      Implementation details:
      * `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass
      that is now removed.
      * Unit tests that manipulate/corrupt table properties have to opt out of
      this check, because that corrupts the "actual" unique id. (And even for
      testing we don't currently have a mechanism to set "no unique id"
      in the in-memory file metadata for new files.)
      * A lot of other unit test churn relates to (a) default checking on, and
      (b) checking on SST open even without DB::Open (e.g. on flush)
      * Use `FileMetaData` for more `TableCache` operations (in place of
      `FileDescriptor`) so that we have access to the unique_id whenever
      we might need to open an SST file. **There is the possibility of
      performance impact because we can no longer use the more
      localized `fd` part of an `FdWithKeyRange` but instead follow the
      `file_metadata` pointer. However, this change (possible regression)
      is only done for `GetMemoryUsageByTableReaders`.**
      * Removed a completely unnecessary constructor overload of
      `TableReaderOptions`
      
      Possible follow-up:
      * Verification only happens when opening through table cache. Are there
      more places where this should happen?
      * Improve error message when there is a file size mismatch vs. manifest
      (FIXME added in the appropriate place).
      * I'm not sure there's a justification for `FileDescriptor` to be distinct from
      `FileMetaData`.
      * I'm skeptical that `FdWithKeyRange` really still makes sense for
      optimizing some data locality by duplicating some data in memory, but I
      could be wrong.
      * An unnecessary overload of NewTableReader was recently added, in
      the public API nonetheless (though unusable there). It should be cleaned
      up to put most things under `TableReaderOptions`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532
      
      Test Plan:
      updated unit tests
      
      Performance test showing no significant difference (just noise I think):
      `./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000`
      Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec
      After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38765551
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2
      6de7081c
    • B
      Avoid recompressing cold block in CompressedSecondaryCache (#10527) · d490bfcd
      Bo Wang 提交于
      Summary:
      **Summary:**
      When a block is firstly `Lookup` from the secondary cache, we just insert a dummy block in the primary cache (charging the actual size of the block) and don’t erase the block from the secondary cache. A standalone handle is returned from `Lookup`. Only if the block is hit again, we erase it from the secondary cache and add it into the primary cache.
      
      When a block is firstly evicted from the primary cache to the secondary cache, we just insert a dummy block (size 0) in the secondary cache. When the block is evicted again, it is treated as a hot block and is inserted into the secondary cache.
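The two-phase admission on eviction can be modeled in a few lines of Python (a hedged stand-in for the C++ `CompressedSecondaryCache`, with a dict as the cache): the first eviction inserts only a dummy, the second stores the block, and a successful lookup erases it.

```python
DUMMY = object()  # marker for the zero-size dummy entry

class SecondaryCacheModel:
    def __init__(self):
        self.entries = {}

    def insert_on_eviction(self, key, block):
        # First eviction from the primary cache inserts only a dummy;
        # the block is considered hot (and stored) on the second eviction.
        if self.entries.get(key) is DUMMY:
            self.entries[key] = block
        else:
            self.entries[key] = DUMMY

    def lookup(self, key, erase_handle=True):
        block = self.entries.get(key)
        if block is None or block is DUMMY:
            return None
        if erase_handle:
            del self.entries[key]
        return block
```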
      
      **Implementation Details**
      Add a new state of LRUHandle: The handle is never inserted into the LRUCache (both hash table and LRU list) and it doesn't experience the above three states. The entry can be freed when refs becomes 0.  (refs >= 1 && in_cache == false && IS_STANDALONE == true)
      
      The behaviors of  `LRUCacheShard::Lookup()` are updated if the secondary_cache is CompressedSecondaryCache:
      1. If a handle is found in primary cache:
        1.1. If the handle's value is not nullptr, it is returned immediately.
        1.2. If the handle's value is nullptr, this means the handle is a dummy one. For a dummy handle, if it was retrieved from secondary cache, it may still exist in secondary cache.
          - 1.2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
          - 1.2.2. If the handle from secondary cache is valid, erase it from the secondary cache and add it into the primary cache.
      2. If a handle is not found in primary cache:
        2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
        2.2.  If the handle from secondary cache is valid, insert a dummy block in the primary cache (charging the actual size of the block)  and return a standalone handle.
      
      The behavior of `LRUCacheShard::Promote()` is updated as follows:
      1. If `e->sec_handle` has value, one of the following steps can happen:
        1.1. Insert a dummy handle and return a standalone handle to the caller when `secondary_cache_` is `CompressedSecondaryCache` and e is a standalone handle.
        1.2. Insert the item into the primary cache and return the handle to the caller.
        1.3. Exception handling.
      2. If `e->sec_handle` has no value, mark the item as not in cache and charge the cache only for its metadata, which will shortly be released.
      
      The behavior of `CompressedSecondaryCache::Insert()` is updated:
      1. If a block is evicted from the primary cache for the first time, a dummy item is inserted.
      2. If a dummy item is found for a block, the block is inserted into the secondary cache.
      
      The behavior of `CompressedSecondaryCache::Lookup()` is updated:
      1. If a handle is not found or it is a dummy item, a nullptr is returned.
      2. If `erase_handle` is true, the handle is erased.
      
      The behaviors of  `LRUCacheShard::Release()` are adjusted for the standalone handles.
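      The two-touch admission policy described above can be sketched with a toy model. This is a hypothetical illustration, not the RocksDB API: `TwoTouchSecondaryCache`, its method signatures, and the string payloads are all invented for the sketch.
      
      ```cpp
      #include <optional>
      #include <string>
      #include <unordered_map>
      
      // Hypothetical sketch, not the RocksDB API: a secondary cache that admits
      // a block only on its second eviction from the primary cache. The first
      // Insert() records a zero-sized dummy entry; the second stores the payload.
      class TwoTouchSecondaryCache {
       public:
        // Returns true only when the block is actually stored (treated as "hot").
        bool Insert(const std::string& key, const std::string& value) {
          auto it = cache_.find(key);
          if (it == cache_.end()) {
            cache_[key] = std::nullopt;  // first eviction: dummy entry, size 0
            return false;
          }
          it->second = value;  // second eviction: block is hot, store it for real
          return true;
        }
      
        // Returns the payload if a real (non-dummy) entry exists. When
        // erase_handle is true, the entry is erased, mirroring promotion into
        // the primary cache.
        std::optional<std::string> Lookup(const std::string& key,
                                          bool erase_handle) {
          auto it = cache_.find(key);
          if (it == cache_.end() || !it->second.has_value()) return std::nullopt;
          std::optional<std::string> value = it->second;
          if (erase_handle) cache_.erase(it);
          return value;
        }
      
       private:
        // std::nullopt marks a dummy entry.
        std::unordered_map<std::string, std::optional<std::string>> cache_;
      };
      ```
      
      The point of the two touches: a block must prove it is hot before the cache pays the full cost of storing it, so blocks that are evicted or looked up only once cost almost nothing.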
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10527
      
      Test Plan:
      1. Stress tests.
      2. Unit tests.
      3. CPU profiling for db_bench.
      
      Reviewed By: siying
      
      Differential Revision: D38747613
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 74a1eba7e1957c9affb2bd2ae3e0194584fa6eca
      d490bfcd
  10. 07 Sep, 2022 3 commits
    • L
      Support custom allocators for the blob cache (#10628) · c8543296
      Committed by Levi Tamasi
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10628
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D39228165
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 591fdff08db400b170b26f0165551f86d33c1dbf
      c8543296
    • A
      Deflake blob caching tests (#10636) · 5a97e6b1
      Committed by Andrew Kryczka
      Summary:
      Example failure:
      
      ```
      db/blob/db_blob_basic_test.cc:226: Failure
      Expected equality of these values:
        i
          Which is: 1
        num_blobs
          Which is: 5
      ```
      
      I can't repro locally, but it looks like the 2KB cache is too small to guarantee no eviction happens between loading all the data into cache and reading from `kBlockCacheTier`. This 2KB setting appears to have come from a test where the cached entries are pinned, in which case a small capacity makes sense. However, such a small setting makes less sense when the blocks are evictable but must remain cached per the test's expectation. This PR increases the capacity setting to 2MB for those cases.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10636
      
      Reviewed By: cbi42
      
      Differential Revision: D39250976
      
      Pulled By: ajkr
      
      fbshipit-source-id: 769309f9a19cfac20b67b927805c8df5c1d2d1f5
      5a97e6b1
    • A
      Deflake DBErrorHandlingFSTest.*WALWriteError (#10642) · 1ffadbe9
      Committed by Andrew Kryczka
      Summary:
      Example flake: https://app.circleci.com/pipelines/github/facebook/rocksdb/17660/workflows/7a891875-f07b-4a67-b204-eaa7ca9f9aa2/jobs/467496
      
      The test could get stuck in out-of-space due to a callback executing `SetFilesystemActive(false /* active */)` after the test executed `SetFilesystemActive(true /* active */)`. This could happen because background info logging went through the SyncPoint callback "WritableFileWriter::Append:BeforePrepareWrite", probably unintentionally. The solution in this PR is to call `ClearAllCallBacks()` to wait for any such pending callbacks to drain before calling `SetFilesystemActive(true /* active */)`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10642
      
      Reviewed By: cbi42
      
      Differential Revision: D39265381
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9a2f4916ab19726c8fb4b3a3b590b1b9ed93de1b
      1ffadbe9
  11. 06 Sep, 2022 1 commit
  12. 05 Sep, 2022 2 commits
    • A
      Disable RateLimiterTest.Rate with valgrind (#10637) · 36dec11b
      Committed by Andrew Kryczka
      Summary:
      Example valgrind flake: https://app.circleci.com/pipelines/github/facebook/rocksdb/18073/workflows/3794e569-45cb-4621-a2b4-df1dcdf5cb19/jobs/475569
      
      ```
      util/rate_limiter_test.cc:358
      Expected equality of these values:
        samples_at_minimum
          Which is: 9
        samples
          Which is: 10
      ```
      
      Some other runs of `RateLimiterTest.Rate` already skip this check due to its reliance on a minimum execution speed. We know valgrind slows execution significantly, so we can disable the check in that case as well.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10637
      
      Reviewed By: cbi42
      
      Differential Revision: D39251350
      
      Pulled By: ajkr
      
      fbshipit-source-id: 41ae1ea4cd91992ea57df902f9f7fd6d182a5932
      36dec11b
    • A
      Deflake DBBlockCacheTest1.WarmCacheWithBlocksDuringFlush (#10635) · fe5fbe32
      Committed by Andrew Kryczka
      Summary:
      Previously, automatic compaction could be triggered prior to the test invoking CompactRange(), which could lead to the following flaky failure:
      
      ```
      /root/project/db/db_block_cache_test.cc:753: Failure
      Expected equality of these values:
        1 + kNumBlocks
          Which is: 11
        options.statistics->getTickerCount(BLOCK_CACHE_INDEX_ADD)
          Which is: 10
      ```
      
      A sequence leading to this failure was:
      
      * Automatic compaction
        * files [1] [2] trivially moved
        * files [3] [4] [5] [6] trivially moved
      * CompactRange()
        * files [7] [8] [9] trivially moved
        * file [10] trivially moved
      
      In such a case, the index/filter block adds that the test expected did not happen since there were no new files.
      
      This PR just tweaks settings to ensure the `CompactRange()` produces one new file.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10635
      
      Reviewed By: cbi42
      
      Differential Revision: D39250869
      
      Pulled By: ajkr
      
      fbshipit-source-id: a3c94c49069e28c49c40b4b80dae0059739d19fd
      fe5fbe32
  13. 03 Sep, 2022 1 commit
    • C
      Skip swaths of range tombstone covered keys in merging iterator (2022 edition) (#10449) · 30bc495c
      Committed by Changyu Bi
      Summary:
      Delete range logic is moved from `DBIter` to `MergingIterator`, and `MergingIterator` will seek to the end of a range deletion if possible instead of scanning through each key and checking with `RangeDelAggregator`.
      
      With the invariant that a key in level L (consider the memtable as the first level; each immutable memtable and each L0 file is a separate level) has a larger sequence number than all keys in any level >L, a range tombstone `[start, end)` from level L covers all keys in its range in any level >L. This property motivates optimizations in the iterator:
      - in `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do Seek() on `end` instead of `target` to skip some range tombstone covered keys.
      - in `Next()/Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can do `Seek` to `end` for all levels > L.
      
      This PR implements the above optimizations in `MergingIterator`. As all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to https://github.com/facebook/rocksdb/issues/7317, but this PR leaves the `InternalIterator` interface mostly unchanged. **Credit**: the cascading seek optimization and the sentinel key (discussed below) are inspired by [Pebble](https://github.com/cockroachdb/pebble/blob/master/merging_iter.go) and suggested by ajkr in https://github.com/facebook/rocksdb/issues/7317. The two optimizations are mostly implemented in `SeekImpl()/SeekForPrevImpl()` and `IsNextDeleted()/IsPrevDeleted()` in `merging_iterator.cc`. See comments for each method for more detail.
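      The cascading-seek idea can be sketched outside of RocksDB with user-key-only levels. This is a simplified, hypothetical model (no sequence numbers; `Level`, `RangeTombstone`, and `CascadingSeek` are names invented for the sketch): a tombstone at level L narrows the seek target only for the levels after it.
      
      ```cpp
      #include <algorithm>
      #include <string>
      #include <vector>
      
      struct RangeTombstone {
        std::string start, end;  // covers user keys in [start, end)
      };
      
      struct Level {
        std::vector<std::string> keys;           // sorted user keys in this level
        std::vector<RangeTombstone> tombstones;  // tombstones written at this level
      };
      
      // For each level (ordered newest to oldest), return the index this level's
      // iterator lands on for Seek(target). A tombstone at level L covers keys
      // only in levels > L, so it advances the target for deeper levels but not
      // for its own level's seek.
      std::vector<size_t> CascadingSeek(const std::vector<Level>& levels,
                                        std::string target) {
        std::vector<size_t> positions;
        for (const Level& level : levels) {
          positions.push_back(static_cast<size_t>(
              std::lower_bound(level.keys.begin(), level.keys.end(), target) -
              level.keys.begin()));
          // If a tombstone here covers the target, all deeper levels can seek
          // to its end key instead of the original target.
          for (const RangeTombstone& t : level.tombstones) {
            if (t.start <= target && target < t.end) target = t.end;
          }
        }
        return positions;
      }
      ```
      
      With two levels where level 0 holds `{"a", "z"}` plus a tombstone `["b", "m")` and level 1 holds `{"c", "n"}`, `Seek("b")` positions level 1 directly at `"n"`: the tombstone lets it skip `"c"` without comparing it at all.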
      
      One notable change is that the minHeap/maxHeap used by `MergingIterator` now contains range tombstone end keys besides point key iterators. This helps to reduce the number of key comparisons. For example, for a range tombstone `[start, end)`, a `start` and an `end` `HeapItem` are inserted into the heap. When a `HeapItem` for a range tombstone start key is popped from the minHeap, we know this range tombstone becomes "active" in the sense that, before the range tombstone's end key is popped from the minHeap, all the keys popped from this heap are covered by the range tombstone's internal key range `[start, end)`.
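      The start/end `HeapItem` mechanism can be illustrated with a toy min-heap over plain strings. This is a hypothetical sketch, not the RocksDB implementation: it collapses levels and sequence numbers into a single active-tombstone counter, and `HeapItem`, `ItemType`, and `VisiblePointKeys` are names invented here.
      
      ```cpp
      #include <queue>
      #include <string>
      #include <vector>
      
      enum class ItemType { kPointKey, kTombstoneStart, kTombstoneEnd };
      
      struct HeapItem {
        std::string key;
        ItemType type;
      };
      
      // Pop keys in ascending order. Between a tombstone's start marker and its
      // end marker, every point key popped is covered by that tombstone, so it
      // is skipped without any per-key range lookup.
      std::vector<std::string> VisiblePointKeys(std::vector<HeapItem> items) {
        auto cmp = [](const HeapItem& a, const HeapItem& b) {
          return a.key > b.key;  // min-heap on key
        };
        std::priority_queue<HeapItem, std::vector<HeapItem>, decltype(cmp)> heap(
            cmp, std::move(items));
        int active_tombstones = 0;  // starts popped whose end has not popped yet
        std::vector<std::string> visible;
        while (!heap.empty()) {
          HeapItem item = heap.top();
          heap.pop();
          if (item.type == ItemType::kTombstoneStart) {
            ++active_tombstones;
          } else if (item.type == ItemType::kTombstoneEnd) {
            --active_tombstones;
          } else if (active_tombstones == 0) {
            visible.push_back(item.key);
          }
        }
        return visible;
      }
      ```
      
      For point keys "a", "c", "e" and a tombstone `["b", "d")`, the heap pops a, b(start), c, d(end), e; "c" is dropped while the tombstone is active, leaving "a" and "e" visible.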
      
      Another major change, the *delete range sentinel key*, is made to `LevelIterator`. Before this PR, when all point keys in an SST file were iterated through in `MergingIterator`, a level iterator would advance to the next SST file in its level. In the case when an SST file has a range tombstone that covers keys beyond the SST file's last point key, advancing to the next SST file would lose this range tombstone. Consequently, `MergingIterator` could return keys that should have been deleted by some range tombstone. We prevent this by pretending that file boundaries in each SST file are sentinel keys. A `LevelIterator` now only advances the file iterator once the sentinel key is processed.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10449
      
      Test Plan:
      - Added many unit tests in db_range_del_test
      - Stress test: `./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 -delpercent=10 --iterpercent=44 --delrangepercent=2`
      - An additional iterator stress test was added to verify iterators against the expected state: https://github.com/facebook/rocksdb/issues/10538. This is based on ajkr's previous attempt https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913.
      
      ```
      python3 ./tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --max_background_compactions=8 --value_size_mult=33 --max_key=5000000 --interval=10 --duration=7200 --delrangepercent=3 --delpercent=9 --iterpercent=25 --writepercent=60 --readpercent=3 --prefixpercent=0 --num_iterations=1000 --range_deletion_width=100 --verify_iterator_with_expected_state_one_in=1
      ```
      
      - Performance benchmark: I used a similar setup as in the blog [post](http://rocksdb.org/blog/2018/11/21/delete-range.html) that introduced DeleteRange, "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written".  As expected, the performance with this PR depends on the range tombstone width.
      ```
      # Setup:
      TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000
      TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50
      
      # Scan entire DB
      TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true
      
      # Short range scan (10 Next())
      TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true
      
      # Long range scan(1000 Next())
      TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true
      ```
      Average over 10 runs (some slower tests had fewer runs):
      
      For the first column (tombstone), 0 means no range tombstones, 100-10000 is the width of the 10k range tombstones, and 1 means there is a single range tombstone in the entire DB (width 1000). The 1-tombstone case tests for regression when there are very few range tombstones in the DB, since the no-tombstone case likely takes a different code path than the cases with range tombstones.
      
      - Scan entire DB
      
      | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
      | ------------- | ------------- | ------------- |  ------------- |
      | 0 range tombstone    |2525600 (± 43564)    |2486917 (± 33698)    |-1.53%               |
      | 100   |1853835 (± 24736)    |2073884 (± 32176)    |+11.87%              |
      | 1000  |422415 (± 7466)      |1115801 (± 22781)    |+164.15%             |
      | 10000 |22384 (± 227)        |227919 (± 6647)      |+918.22%             |
      | 1 range tombstone      |2176540 (± 39050)    |2434954 (± 24563)    |+11.87%              |
      - Short range scan
      
      | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
      | ------------- | ------------- | ------------- |  ------------- |
      | 0  range tombstone   |35398 (± 533)        |35338 (± 569)        |-0.17%               |
      | 100   |28276 (± 664)        |31684 (± 331)        |+12.05%              |
      | 1000  |7637 (± 77)          |25422 (± 277)        |+232.88%             |
      | 10000 |1367                 |28667                |+1997.07%            |
      | 1 range tombstone      |32618 (± 581)        |32748 (± 506)        |+0.4%                |
      
      - Long range scan
      
      | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
      | ------------- | ------------- | ------------- |  ------------- |
      | 0 range tombstone     |2262 (± 33)          |2353 (± 20)          |+4.02%               |
      | 100   |1696 (± 26)          |1926 (± 18)          |+13.56%              |
      | 1000  |410 (± 6)            |1255 (± 29)          |+206.1%              |
      | 10000 |25                   |414                  |+1556.0%             |
      | 1 range tombstone   |1957 (± 30)          |2185 (± 44)          |+11.65%              |
      
      - Microbench does not show significant regression: https://gist.github.com/cbi42/59f280f85a59b678e7e5d8561e693b61
      
      Reviewed By: ajkr
      
      Differential Revision: D38450331
      
      Pulled By: cbi42
      
      fbshipit-source-id: b5ef12e8d8c289ed2e163ccdf277f5039b511fca
      30bc495c