1. 16 Sep 2022 (3 commits)
    • Revamp, optimize new experimental clock cache (#10626) · 57243486
      Peter Dillinger committed
      Summary:
      * Consolidates most metadata into a single word per slot so that more
      can be accomplished with a single atomic update. In the common case,
      Lookup was previously about 4 atomic updates, now just 1 atomic update.
      Common case Release was previously 1 atomic read + 1 atomic update,
      now just 1 atomic update.
      * Eliminate spins / waits / yields, which likely threaten some "lock free"
      benefits. Compare-exchange loops are only used in explicit Erase, and
      strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
      exchange.
      * Relaxes some aggressiveness and guarantees. For example,
        * Duplicate Inserts will sometimes go undetected and the shadow duplicate
          will age out with eviction.
        * In many cases, the older Inserted value for a given cache key will be kept
        (i.e. Insert does not support overwrite).
        * Entries explicitly erased (rather than evicted) might not be freed
        immediately in some rare cases.
        * With strict_capacity_limit=false, capacity limit is not tracked/enforced as
        precisely as LRUCache, but is self-correcting and should only deviate by a
        very small number of extra or fewer entries.
      * Use smaller "computed default" number of cache shards in many cases,
      because benefits to larger usage tracking / eviction pools outweigh the small
      cost of more lock-free atomic contention. The improvement in CPU and I/O
      is dramatic in some limit-memory cases.
      * Even without the sharding change, the eviction algorithm is likely more
      effective than LRU overall because it's more stateful, even though the
      "hot path" state tracking for it is essentially free with ref counting. It
      is like a generalized CLOCK with aging (see code comments). I don't have
      performance numbers showing a specific improvement, but in theory, for a
      Poisson access pattern to each block, keeping some state allows better
      estimation of time to next access (Poisson interval) than strict LRU. The
      bounded randomness in CLOCK can also reduce "cliff" effect for repeated
      range scans approaching and exceeding cache size.
      
      ## Hot path algorithm comparison
      Rough descriptions, focusing on number and kind of atomic operations:
      * Old `Lookup()` (2-5 atomic updates per probe):
      ```
      Loop:
        Increment internal ref count at slot
        If possible hit:
          Check flags atomic (and non-atomic fields)
          If cache hit:
            Three distinct updates to 'flags' atomic
            Increment refs for internal-to-external
            Return
        Decrement internal ref count
      while atomic read 'displacements' > 0
      ```
      * New `Lookup()` (1-2 atomic updates per probe):
      ```
      Loop:
        Increment acquire counter in meta word (optimistic)
        If visible entry (already read meta word):
          If match (read non-atomic fields):
            Return
          Else:
            Decrement acquire counter in meta word
        Else if invisible entry (rare, already read meta word):
          Decrement acquire counter in meta word
      while atomic read 'displacements' > 0
      ```
      * Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
      ```
      Read atomic ref count
      If last reference and invisible (rare):
        Use CAS etc. to remove
        Return
      Else:
        Decrement ref count
      ```
      * New `Release()` (1 unconditional atomic update, rarely more):
      ```
      Increment release counter in meta word
      If last reference and invisible (rare):
        Use CAS etc. to remove
        Return
      ```
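
      To make the comparison above concrete, here is a minimal C++ sketch of the single-meta-word idea: one 64-bit atomic per slot holds the acquire counter, release counter, and state bits, so the Lookup and Release hot paths are each a single unconditional fetch-add. The field layout, widths, and names below are illustrative assumptions, not the actual ClockCache encoding.
      ```
      #include <atomic>
      #include <cstdint>

      struct ClockSlotSketch {
        // Hypothetical layout: high bits = acquire counter, low bits = release
        // counter, plus a visibility flag. Outstanding refs = acquires - releases.
        std::atomic<uint64_t> meta{0};

        static constexpr uint64_t kVisibleBit = uint64_t{1} << 63;
        static constexpr uint64_t kAcquireIncrement = uint64_t{1} << 32;
        static constexpr uint64_t kReleaseIncrement = uint64_t{1};

        // "New Lookup" hot path: one optimistic atomic update per probe.
        bool TryAcquire() {
          uint64_t old = meta.fetch_add(kAcquireIncrement, std::memory_order_acquire);
          if (old & kVisibleBit) {
            return true;  // caller still verifies the key via non-atomic fields
          }
          // Empty or invisible slot: undo the optimistic reference.
          meta.fetch_sub(kAcquireIncrement, std::memory_order_release);
          return false;
        }

        // "New Release" hot path: one unconditional atomic update.
        void Release() {
          meta.fetch_add(kReleaseIncrement, std::memory_order_release);
        }
      };
      ```
      Because acquires and releases are separate counters in the same word, "last reference" can be detected from a single atomic value without a compare-exchange loop on the common path.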
      
      ## Performance test setup
      Build DB with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
      ```
      Test with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
      ```
      Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
      
      base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
      folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
      gt_clock: experimental ClockCache before this change
      new_clock: experimental ClockCache with this change
      
      ## Performance test results
      First test "hot path" read performance, with block cache large enough for whole DB:
      4181MB 1thread base -> kops/s: 47.761
      4181MB 1thread folly -> kops/s: 45.877
      4181MB 1thread gt_clock -> kops/s: 51.092
      4181MB 1thread new_clock -> kops/s: 53.944
      
      4181MB 16thread base -> kops/s: 284.567
      4181MB 16thread folly -> kops/s: 249.015
      4181MB 16thread gt_clock -> kops/s: 743.762
      4181MB 16thread new_clock -> kops/s: 861.821
      
      4181MB 24thread base -> kops/s: 303.415
      4181MB 24thread folly -> kops/s: 266.548
      4181MB 24thread gt_clock -> kops/s: 975.706
      4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
      
      4181MB 32thread base -> kops/s: 311.251
      4181MB 32thread folly -> kops/s: 274.952
      4181MB 32thread gt_clock -> kops/s: 1045.98
      4181MB 32thread new_clock -> kops/s: 1370.38
      
      4181MB 48thread base -> kops/s: 310.504
      4181MB 48thread folly -> kops/s: 268.322
      4181MB 48thread gt_clock -> kops/s: 1195.65
      4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
      
      4181MB 64thread base -> kops/s: 307.839
      4181MB 64thread folly -> kops/s: 272.172
      4181MB 64thread gt_clock -> kops/s: 1204.47
      4181MB 64thread new_clock -> kops/s: 1615.37
      
      4181MB 128thread base -> kops/s: 310.934
      4181MB 128thread folly -> kops/s: 267.468
      4181MB 128thread gt_clock -> kops/s: 1188.75
      4181MB 128thread new_clock -> kops/s: 1595.46
      
      Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
      
      Now test a large block cache with low miss ratio, but some eviction is required:
      1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
      1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
      1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
      1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
      
      1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
      1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
      1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
      1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
      
      1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
      1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
      1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
      1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
      
      610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
      610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
      610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
      610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
      
      610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
      610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
      610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
      610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
      
      610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
      610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
      610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
      610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
      
      The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
      
      Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
      
      233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
      233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
      233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
      233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
      
      233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
      233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
      233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
      233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
      
      233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
      233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
      233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
      233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
      
      89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
      89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
      89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
      89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
      
      89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
      89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
      89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
      89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
      
      89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
      89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
      89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
      89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
      
      ^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
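
      A rough sketch of the kind of "computed default" shard-count heuristic described above: give each shard a reasonably large minimum capacity instead of always using 2^6 shards. The 32MB minimum and the function shape are assumptions for illustration, not the exact RocksDB defaults.
      ```
      #include <cstddef>

      // Returns the number of shard bits (the cache gets 1 << bits shards).
      int ComputedDefaultShardBits(size_t capacity,
                                   size_t min_shard_size = size_t{32} << 20) {
        int shard_bits = 0;
        size_t num_shards = capacity / min_shard_size;
        while (num_shards >>= 1) {
          if (++shard_bits >= 6) {
            break;  // still cap at 64 shards for very large caches
          }
        }
        return shard_bits;  // e.g. ~89MB of capacity -> 1 bit -> 2 shards
      }
      ```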
      
      Even smaller cache size:
      34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
      34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
      34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
      34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
      
      34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
      34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
      34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
      34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
      
      34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
      34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
      34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
      34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
      
      As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (though the ~13% miss rate here is still much lower than base's ~94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
      
      13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
      13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
      13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
      13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
      
      13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
      13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
      13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
      13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
      
      13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
      13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
      13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
      13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
      
      gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, though it is still high enough that eviction CPU time clearly offsets the gains from reduced mutex contention:
      
      13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
      13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
      13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
      13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
      
      13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
      13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
      13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
      13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
      
      Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
      
      Reviewed By: anand1976
      
      Differential Revision: D39368406
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
    • Fix some MultiGet stats (#10673) · 37b75e13
      anand76 committed
      Summary:
      The stats were not accurate for the coroutine version of MultiGet. This PR fixes it.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10673
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D39492615
      
      Pulled By: anand1976
      
      fbshipit-source-id: b46c04e15ea27e66f4c31f00c66497aa283bf9d3
    • Fix a MultiGet crash (#10688) · c206aebd
      anand76 committed
      Summary:
      Fix a bug in the async IO/coroutine version of MultiGet that may cause a segfault or assertion failure due to accessing an invalid file index in a LevelFilesBrief. The bug is that when a MultiGetRange is split into two, we may re-process keys in the original range that were already marked to be skipped (in ```current_level_range_```) due to not overlapping the level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10688
      
      Reviewed By: gitbw95
      
      Differential Revision: D39556131
      
      Pulled By: anand1976
      
      fbshipit-source-id: 65e79438508a283cb19e64eca5c91d0714b81458
  2. 15 Sep 2022 (4 commits)
    • Refactor Compaction file cut `ShouldStopBefore()` (#10629) · 849cf1bf
      Jay Zhuang committed
      Summary:
      Consolidate the compaction output cut logic into `ShouldStopBefore()` and move
      it inside the `CompactionOutputs` class.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10629
      
      Reviewed By: cbi42
      
      Differential Revision: D39315536
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 7d81037babbd35c276bbaad02dbc2bb555fdac18
    • Fix a bug by setting up subcompaction bounds properly (#10658) · ce2c11d8
      Yanqin Jin committed
      Summary:
      When user-defined timestamp is enabled, subcompaction bounds should be set up properly. When creating InputIterator for the compaction, the `start` and `end` should have their timestamp portions set to kMaxTimestamp, which is the highest possible timestamp. This is similar to what we do with setting up their sequence numbers to `kMaxSequenceNumber`.
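
      As a minimal illustration of the fix (not the actual compaction-job code), widening a user-key bound for a timestamp-aware comparator can look like appending a maximal timestamp, analogous to pairing the internal-key bound with `kMaxSequenceNumber`. The all-0xff encoding below is an assumption for the common fixed-size timestamp.
      ```
      #include <cstddef>
      #include <string>

      // Append a maximal timestamp (all 0xff bytes for a fixed-size encoding) to a
      // user key so the resulting bound covers every timestamped version of it.
      std::string BoundWithMaxTimestamp(const std::string& user_key, size_t ts_size) {
        std::string bound = user_key;
        bound.append(ts_size, '\xff');
        return bound;
      }
      ```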
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10658
      
      Test Plan:
      ```bash
      make check
      rm -rf /dev/shm/rocksdb/* && mkdir
      /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress
      --allow_data_in_errors=True --clear_column_family_one_in=0
      --continuous_verification_interval=0 --data_block_index_type=1
      --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5
      --delrangepercent=0
      --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected
      --iterpercent=0 --max_background_compactions=20
      --max_bytes_for_level_base=10485760 --max_key=25000000
      --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1
      --ops_per_thread=300000 --paranoid_file_checks=1 --partition_filters=0
      --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0
      --snapshot_hold_ops=100000 --subcompactions=4
      --target_file_size_base=65536 --target_file_size_multiplier=2
      --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1
      --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1
      --write_buffer_size=65536 --writepercent=60 -disable_wal=1
      -column_families=1
      ```
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D39393402
      
      Pulled By: riversand963
      
      fbshipit-source-id: f276e35b19fce51a175c368a502fb0718d1f3871
    • Fix data race in accessing `cached_range_tombstone_` (#10680) · be04a3b6
      Changyu Bi committed
      Summary:
      Fix a data race introduced in https://github.com/facebook/rocksdb/issues/10547 (P5295241720), first reported by pdillinger. The race is between the `std::atomic_load_explicit` in NewRangeTombstoneIteratorInternal and the `std::atomic_store_explicit` in MemTable::Add() that operate on `cached_range_tombstone_`. P5295241720 shows that `atomic_store_explicit` initializes some mutex which `atomic_load_explicit` could be trying to call `lock()` on at the same time. This fix moves the initialization to the memtable constructor.
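
      A simplified sketch of the pattern involved (hypothetical class, not the actual MemTable code): readers take an atomic snapshot of a `shared_ptr`, the writer atomically publishes a replacement, and constructing the initial `shared_ptr` (and anything it guards) in the constructor means the first `atomic_load` can never race with that initialization.
      ```
      #include <atomic>
      #include <memory>

      struct FragmentedList {
        // fragmented range tombstones would live here
      };

      class MemTableLike {
       public:
        // Initializing in the constructor (the fix) guarantees the pointer and its
        // control block exist before any reader performs an atomic load.
        MemTableLike()
            : cached_range_tombstone_(std::make_shared<FragmentedList>()) {}

        std::shared_ptr<FragmentedList> ReadCache() const {
          return std::atomic_load_explicit(&cached_range_tombstone_,
                                           std::memory_order_acquire);
        }

        void PublishCache(std::shared_ptr<FragmentedList> fresh) {  // e.g. from Add()
          std::atomic_store_explicit(&cached_range_tombstone_, std::move(fresh),
                                     std::memory_order_release);
        }

       private:
        std::shared_ptr<FragmentedList> cached_range_tombstone_;
      };
      ```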
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10680
      
      Test Plan: `USE_CLANG=1 COMPILE_WITH_TSAN=1 make -j24 whitebox_crash_test`
      
      Reviewed By: ajkr
      
      Differential Revision: D39528696
      
      Pulled By: cbi42
      
      fbshipit-source-id: ee740841044438e18ad2b8ea567444dd542dd8e2
    • Bypass a MultiGet test when async_io is used (#10669) · bb9a6d4e
      anand76 committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10669
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D39492658
      
      Pulled By: anand1976
      
      fbshipit-source-id: abef79808e30762654680f7dd7e46487c631febc
  3. 14 Sep 2022 (2 commits)
    • Add wide-column support to iterators (#10670) · 06ab0a8b
      Levi Tamasi committed
      Summary:
      The patch extends the iterator API with a new `columns` method which
      can be used to retrieve all wide columns for the current key. Similarly to
      the `Get` and `GetEntity` APIs, the classic `value` API returns the value
      of the default (anonymous) column for wide-column entities, and `columns`
      returns an entity with a single default column for plain old key-values.
      (The goal here is to maintain the invariant that `value()` is the same as
      the value of the default column in `columns()`.) The patch also involves a
      smaller refactoring: historically, `value()` was implemented using a bunch
      of conditions, that is, the `Slice` to be returned was decided based on the
      direction of the iteration, whether a merge had been done etc. when the
      method was called; with the patch, the value to be exposed is stored in a
      member `Slice value_` when the iterator lands on a new key, and `value()`
      simply returns this `Slice`.
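
      A hedged usage sketch of the new API as I understand it from the description above (assumes a `rocksdb::DB*` opened elsewhere; error handling omitted): `value()` still returns the default column's value, while `columns()` exposes the full entity.
      ```
      #include <iostream>
      #include <memory>

      #include "rocksdb/db.h"
      #include "rocksdb/iterator.h"
      #include "rocksdb/wide_columns.h"

      void DumpEntities(rocksdb::DB* db) {
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // value() is the default (anonymous) column's value, as before.
          std::cout << it->key().ToString() << " => " << it->value().ToString() << "\n";
          // columns() returns all wide columns; for a plain key-value it holds a
          // single default column whose value matches value().
          for (const rocksdb::WideColumn& column : it->columns()) {
            std::cout << "  [" << column.name().ToString() << "] "
                      << column.value().ToString() << "\n";
          }
        }
      }
      ```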
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10670
      
      Test Plan: Ran `make check` and a simple blackbox crash test.
      
      Reviewed By: riversand963
      
      Differential Revision: D39475551
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 29e7a6ed9ef340841aab36803b832b7c8f668b0b
    • Cache fragmented range tombstone list for mutable memtables (#10547) · f291eefb
      Changyu Bi committed
      Summary:
      Each read from memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. https://github.com/facebook/rocksdb/issues/10380 improved this inefficiency by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone can be constructed in either the read path (this PR) or the write path (https://github.com/facebook/rocksdb/issues/10584). With both implementations, each `DeleteRange()` will invalidate the cache, and the difference is where the cache is re-constructed. `CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core will have a shared_ptr to a shared_ptr pointing to the current cache. Each read thread will only update the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables.
      
      The choice between write path and read path is not an easy one: they are both improvement compared to no caching in the current implementation, but they favor different operations and could cause regression in the other operation (read vs write). The write path caching in (https://github.com/facebook/rocksdb/issues/10584) leads to a cleaner implementation, but I chose the read path caching here to avoid significant regression in write performance when there is a considerable amount of range tombstones in a single memtable (the number from the benchmark below suggests >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance.
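
      A rough sketch (hypothetical names; the real code uses `CoreLocalArray` and `shared_ptr`'s aliasing constructor) of the double indirection described above: each core-local slot holds an outer `shared_ptr` whose control block is private to that core, so a reader's refcount bump does not contend with other cores, while `DeleteRange()` invalidates by publishing fresh outer pointers to every slot.
      ```
      #include <atomic>
      #include <memory>
      #include <vector>

      struct TombstoneListCache {
        // cached FragmentedRangeTombstoneList (or a "not built yet" state)
      };

      class CoreLocalTombstoneCache {
       public:
        explicit CoreLocalTombstoneCache(size_t num_cores) : slots_(num_cores) {
          Publish(std::make_shared<TombstoneListCache>());
        }

        // Reader: one atomic copy of the core-local outer pointer. The only
        // refcount touched is this core's outer control block; the caller then
        // dereferences it to reach the shared inner cache.
        std::shared_ptr<std::shared_ptr<TombstoneListCache>> Acquire(size_t core) const {
          return std::atomic_load(&slots_[core % slots_.size()]);
        }

        // Writer (e.g. DeleteRange): wrap the same inner cache in a brand-new outer
        // shared_ptr per core, invalidating whatever readers saw before.
        void Publish(std::shared_ptr<TombstoneListCache> current) {
          for (auto& slot : slots_) {
            auto outer =
                std::make_shared<std::shared_ptr<TombstoneListCache>>(current);
            std::atomic_store(&slot, std::move(outer));
          }
        }

       private:
        std::vector<std::shared_ptr<std::shared_ptr<TombstoneListCache>>> slots_;
      };
      ```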
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10547
      
      Test Plan:
      - TestGet() in stress test is updated in https://github.com/facebook/rocksdb/issues/10553 to compare Get() result against expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
      - Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones.
      ```
      ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000
      ```
      Write perf regressed since the cost of constructing fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8e172dc5f1ef65c9d0a6eedbd9987b2c72 is included in the last column as a reference to see performance impact on multi-thread reads if `CoreLocalArray` is not used.
      
      micros/op averaged over 5 runs: first 4 columns are for fillrandom, last 4 columns are for readrandom.
      | range tombstones | fillrandom main | write path caching | read path caching | memtable V3 (https://github.com/facebook/rocksdb/issues/10308) | readrandom main | write path caching | read path caching | memtable V3 |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | 0 | 6.35 | 6.15 | 5.82 | 6.12 | 2.24 | 2.26 | 2.03 | 2.07 |
      | 1 | 5.99 | 5.88 | 5.77 | 6.28 | 2.65 | 2.27 | 2.24 | 2.5 |
      | 10 | 6.15 | 6.02 | 5.92 | 5.95 | 5.15 | 2.61 | 2.31 | 2.53 |
      | 100 | 5.95 | 5.78 | 5.88 | 6.23 | 28.31 | 2.34 | 2.45 | 2.94 |
      | 100 (25 threads) | 52.01 | 45.85 | 46.18 | 47.52 | 35.97 | 3.34 | 3.34 | 3.56 |
      | 1000 | 6.0 | 7.07 | 5.98 | 6.08 | 333.18 | 2.86 | 2.7 | 3.6 |
      | 1000 (25 threads) | 52.6 | 148.86 | 79.06 | 45.52 | 473.49 | 3.66 | 3.48 | 4.38 |
      
        - Benchmark performance of `readwhilewriting` from https://github.com/facebook/rocksdb/issues/10552, 100 range tombstones are written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes`
      
      readrandom micros/op:
      |  |main            |write path caching           |read path caching            |memtable V3      |
      |---|---|---|---|---|
      | single thread        |48.28                          |1.55                           |1.52                           |1.96                           |
      | 25 threads           |64.3                           |2.55                           |2.67                           |2.64                           |
      
      Reviewed By: ajkr
      
      Differential Revision: D38895410
      
      Pulled By: cbi42
      
      fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559
  4. 13 Sep 2022 (1 commit)
    • Inject spurious wakeup and sleep before acquiring db mutex to expose race condition (#10291) · f79b3d19
      Hui Xiao committed
      Summary:
      **Context/Summary:**
      Previous experience with bugs and flaky tests taught us that there are features in RocksDB vulnerable to race conditions caused by acquiring the db mutex at a particular timing. This PR aggressively exposes those vulnerable features by injecting spurious wakeups and sleeps so that the db mutex is acquired at varied timings, in order to surface such race conditions.
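
      A minimal sketch of the injection idea (flag usage and names are assumptions, not the PR's exact hook points): under the `COERCE_CONTEXT_SWITCH` build flag, add a small random sleep and a spurious notification just before taking the db mutex, so acquisition happens at unusual timings.
      ```
      #include <chrono>
      #include <condition_variable>
      #include <mutex>
      #include <random>
      #include <thread>

      void LockDbMutexWithCoercedSwitch(std::mutex& db_mutex,
                                        std::condition_variable& db_cv) {
      #ifdef COERCE_CONTEXT_SWITCH
        // Randomized delay plus a spurious notify shifts the timing of mutex
        // acquisition, making latent mutex-timing races more likely to fire.
        thread_local std::mt19937 rng{std::random_device{}()};
        std::uniform_int_distribution<int> jitter_us(0, 500);
        std::this_thread::sleep_for(std::chrono::microseconds(jitter_us(rng)));
        db_cv.notify_one();
      #endif
        db_mutex.lock();
      }
      ```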
      
      **Testing:**
      - `COERCE_CONTEXT_SWITCH=1 make -j56 check / make -j56 db_stress` should reveal
          - flaky tests caused by db mutex related race condition
             - Reverted https://github.com/facebook/rocksdb/pull/9528
             - A/B testing on `COMPILE_WITH_TSAN=1 make -j56 listener_test` w/ and w/o `COERCE_CONTEXT_SWITCH=1` followed by `./listener_test --gtest_filter=EventListenerTest.MultiCF --gtest_repeat=10`
             - `COERCE_CONTEXT_SWITCH=1` can cause the expected test failure (i.e., expose the target TSAN data race error) within 10 runs, while the other couldn't.
             - This proves our injection can expose flaky tests caused by db mutex related race condition faster.
          - known or new race-condition-type internal bugs, found by continuously running this PR
      - Performance
         - High ops-threads time: COERCE_CONTEXT_SWITCH=1 is several times slower (2:01.16 vs 0:22.10 elapsed). This PR will be run as a separate CI job so this regression won't affect any existing job.
      ```
      TEST_TMPDIR=$db /usr/bin/time ./db_stress \
      --ops_per_thread=100000 --expected_values_dir=$exp --clear_column_family_one_in=0 \
      --write_buffer_size=524288 --target_file_size_base=524288 --ingest_external_file_one_in=100 --compact_files_one_in=1000 --compact_range_one_in=1000
      ```
        - Start-up time: COERCE_CONTEXT_SWITCH=1 regressed by less than 25% (0:01.51 vs 0:01.29 elapsed)
      ```
      TEST_TMPDIR=$db ./db_stress -ops_per_thread=100000000 -expected_values_dir=$exp --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress
      
      TEST_TMPDIR=$db /usr/bin/time ./db_stress \
      --ops_per_thread=1 -reopen=0 --expected_values_dir=$exp --clear_column_family_one_in=0 --destroy_db_initially=0
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10291
      
      Reviewed By: ajkr
      
      Differential Revision: D39231182
      
      Pulled By: hx235
      
      fbshipit-source-id: 7ab6695430460e0826727fd8c66679b32b3e44b6
  5. 09 Sep 2022 (2 commits)
    • Fix overlapping check by excluding timestamp (#10615) · 3d67d791
      Yanqin Jin committed
      Summary:
      With user-defined timestamps, checking for overlap should exclude the
      timestamp part of the key. This has already been done for range checking
      of files in sstableKeyCompare(), but not yet when checking against
      concurrent compactions.
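
      A minimal sketch of the idea (not the actual version-set code), assuming fixed-size timestamps appended to user keys: strip the timestamp before comparing, so two ranges that differ only in timestamps still register as overlapping.
      ```
      #include <cstddef>
      #include <string>

      // Compare user keys with their trailing fixed-size timestamps stripped.
      int CompareExcludingTimestamp(const std::string& a, const std::string& b,
                                    size_t ts_size) {
        return a.compare(0, a.size() - ts_size, b, 0, b.size() - ts_size);
      }

      // Two user-key ranges [a_start, a_end] and [b_start, b_end] overlap iff
      // neither ends (timestamp excluded) before the other begins.
      bool RangesOverlapExcludingTimestamp(const std::string& a_start,
                                           const std::string& a_end,
                                           const std::string& b_start,
                                           const std::string& b_end,
                                           size_t ts_size) {
        return CompareExcludingTimestamp(a_end, b_start, ts_size) >= 0 &&
               CompareExcludingTimestamp(b_end, a_start, ts_size) >= 0;
      }
      ```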
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615
      
      Test Plan:
      (Will add more tests)
      
      make check
      (Repro seems easier with this commit sha: git checkout 78bbdef5)
      rm -rf /dev/shm/rocksdb/* &&
      mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected &&
      ./db_stress
      --allow_data_in_errors=True --clear_column_family_one_in=0
      --continuous_verification_interval=0 --data_block_index_type=1
      --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5
      --delrangepercent=0
      --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected
      --iterpercent=0 --max_background_compactions=20
      --max_bytes_for_level_base=10485760 --max_key=25000000
      --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1
      --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0
      --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0
      --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3
      --target_file_size_base=65536 --target_file_size_multiplier=2
      --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1
      --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1
      --write_buffer_size=65536 --writepercent=60 -disable_wal=1
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D39146797
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
    • Eliminate some allocations/copies around the blob cache (#10647) · fe56cb9a
      Levi Tamasi committed
      Summary:
      Historically, `BlobFileReader` has returned the blob(s) read from the file
      in the `PinnableSlice` provided by the client. This interface was
      preserved when caching was implemented for blobs, which meant that
      the blob data was copied multiple times when caching was in use: first,
      into the client-provided `PinnableSlice` (by `BlobFileReader::SaveValue`),
      and then, into the object stored in the cache (by `BlobSource::PutBlobIntoCache`).
      The patch eliminates these copies and the related allocations by changing
      `BlobFileReader` so it returns its results in the form of heap-allocated `BlobContents`
      objects that can be directly inserted into the cache. The allocations backing
      these `BlobContents` objects are made using the blob cache's allocator if the
      blobs are to be inserted into the cache (i.e. if a cache is configured and the
      `fill_cache` read option is set). Note: this PR focuses on the common case when
      blobs are compressed; some further small optimizations are possible for uncompressed
      blobs.
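
      A hedged sketch of the allocation decision described above, with stand-in types (the real code uses `BlobContents` and the cache's `MemoryAllocator`): back the blob buffer with the cache's allocator only when the blob is actually headed for the cache.
      ```
      #include <cstddef>
      #include <memory>

      struct MemoryAllocatorLike {  // stand-in for the blob cache's allocator
        char* Allocate(size_t n) { return new char[n]; }
      };

      struct BlobContentsLike {  // stand-in for the heap-allocated BlobContents
        std::unique_ptr<char[]> data;
        size_t size = 0;
      };

      std::unique_ptr<BlobContentsLike> MakeBlobBuffer(size_t blob_size,
                                                       MemoryAllocatorLike* cache_allocator,
                                                       bool fill_cache) {
        auto contents = std::make_unique<BlobContentsLike>();
        contents->size = blob_size;
        if (cache_allocator != nullptr && fill_cache) {
          // The buffer comes from the cache's allocator, so the same object can be
          // handed to the blob cache without another allocation or copy.
          contents->data.reset(cache_allocator->Allocate(blob_size));
        } else {
          contents->data.reset(new char[blob_size]);
        }
        return contents;
      }
      ```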
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10647
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D39335185
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 464503d60a5520d654c8273ffb8efd5d1bcd7b36
  6. 08 Sep 2022 (2 commits)
    • Always verify SST unique IDs on SST file open (#10532) · 6de7081c
      Peter Dillinger committed
      Summary:
      Although we've been tracking SST unique IDs in the DB manifest
      unconditionally, checking has been opt-in and with an extra pass at DB::Open
      time. This changes the behavior of `verify_sst_unique_id_in_manifest` to
      check unique ID against manifest every time an SST file is opened through
      table cache (normal DB operations), replacing the explicit pass over files
      at DB::Open time. This change also enables the option by default and
      removes the "EXPERIMENTAL" designation.
      
      One possible criticism is that the option no longer ensures the integrity
      of a DB at Open time. This is far from an all-or-nothing issue. Verifying
      the IDs of all SST files hardly ensures all the data in the DB is readable.
      (VerifyChecksum is supposed to do that.) Also, with
      max_open_files=-1 (default, extremely common), all SST files are
      opened at DB::Open time anyway.
      
      Implementation details:
      * `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass
      that is now removed.
      * Unit tests that manipulate/corrupt table properties have to opt out of
      this check, because that corrupts the "actual" unique id. (And even for
      testing we don't currently have a mechanism to set "no unique id"
      in the in-memory file metadata for new files.)
      * A lot of other unit test churn relates to (a) default checking on, and
      (b) checking on SST open even without DB::Open (e.g. on flush)
      * Use `FileMetaData` for more `TableCache` operations (in place of
      `FileDescriptor`) so that we have access to the unique_id whenever
      we might need to open an SST file. **There is the possibility of
      performance impact because we can no longer use the more
      localized `fd` part of an `FdWithKeyRange` but instead follow the
      `file_metadata` pointer. However, this change (possible regression)
      is only done for `GetMemoryUsageByTableReaders`.**
      * Removed a completely unnecessary constructor overload of
      `TableReaderOptions`
      
      Possible follow-up:
      * Verification only happens when opening through table cache. Are there
      more places where this should happen?
      * Improve error message when there is a file size mismatch vs. manifest
      (FIXME added in the appropriate place).
      * I'm not sure there's a justification for `FileDescriptor` to be distinct from
      `FileMetaData`.
      * I'm skeptical that `FdWithKeyRange` really still makes sense for
      optimizing some data locality by duplicating some data in memory, but I
      could be wrong.
      * An unnecessary overload of NewTableReader was recently added, in
      the public API nonetheless (though unusable there). It should be cleaned
      up to put most things under `TableReaderOptions`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532
      
      Test Plan:
      updated unit tests
      
      Performance test showing no significant difference (just noise I think):
      `./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000`
      Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec
      After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38765551
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2
    • Avoid recompressing cold block in CompressedSecondaryCache (#10527) · d490bfcd
      Bo Wang committed
      Summary:
      **Summary:**
      When a block is first looked up (`Lookup`) from the secondary cache, we just insert a dummy block in the primary cache (charging the actual size of the block) and don’t erase the block from the secondary cache. A standalone handle is returned from `Lookup`. Only if the block is hit again do we erase it from the secondary cache and add it into the primary cache.
      
      When a block is first evicted from the primary cache to the secondary cache, we just insert a dummy block (size 0) in the secondary cache. When the block is evicted again, it is treated as a hot block and is inserted into the secondary cache.
      
      **Implementation Details**
      Add a new state of LRUHandle: The handle is never inserted into the LRUCache (both hash table and LRU list) and it doesn't experience the above three states. The entry can be freed when refs becomes 0.  (refs >= 1 && in_cache == false && IS_STANDALONE == true)
      
      The behaviors of  `LRUCacheShard::Lookup()` are updated if the secondary_cache is CompressedSecondaryCache:
      1. If a handle is found in primary cache:
        1.1. If the handle's value is not nullptr, it is returned immediately.
        1.2. If the handle's value is nullptr, this means the handle is a dummy one. For a dummy handle, if it was retrieved from secondary cache, it may still exist in secondary cache.
          - 1.2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
          - 1.2.2. If the handle from secondary cache is valid, erase it from the secondary cache and add it into the primary cache.
      2. If a handle is not found in primary cache:
        2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
        2.2.  If the handle from secondary cache is valid, insert a dummy block in the primary cache (charging the actual size of the block)  and return a standalone handle.
      
      The behaviors of `LRUCacheShard::Promote()` are updated as follows:
      1. If `e->sec_handle` has value, one of the following steps can happen:
        1.1. Insert a dummy handle and return a standalone handle to caller when `secondary_cache_` is `CompressedSecondaryCache` and e is a standalone handle.
        1.2. Insert the item into the primary cache and return the handle to caller.
        1.3. Exception handling.
      2. If `e->sec_handle` has no value, mark the item as not in cache and charge the cache as its only metadata that'll shortly be released.
      
      The behavior of  `CompressedSecondaryCache::Insert()` is updated:
      1. If a block is evicted from the primary cache for the first time, a dummy item is inserted.
      2. If a dummy item is found for a block, the block is inserted into the secondary cache.
      
      The behavior of  `CompressedSecondaryCache:::Lookup()` is updated:
      1. If a handle is not found or it is a dummy item, a nullptr is returned.
      2. If `erase_handle` is true, the handle is erased.
      
      The behaviors of  `LRUCacheShard::Release()` are adjusted for the standalone handles.
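
      A condensed sketch of the admission policy above (hypothetical container; the real code works with cache handles and charges): the first eviction from the primary cache only records a zero-sized dummy, and a block is compressed and stored for real only when it is evicted a second time.
      ```
      #include <string>
      #include <unordered_map>

      class SecondaryCacheSketch {
       public:
        void Insert(const std::string& key, std::string block) {
          auto it = entries_.find(key);
          if (it == entries_.end()) {
            entries_.emplace(key, std::string());  // dummy: marks "seen once"
            return;
          }
          if (it->second.empty()) {
            it->second = Compress(block);  // second eviction: admit the real block
          }
        }

        // Lookup returns nothing for dummy entries; erase_handle mimics the
        // "erase from secondary on promotion to primary" behavior.
        bool Lookup(const std::string& key, std::string* out, bool erase_handle) {
          auto it = entries_.find(key);
          if (it == entries_.end() || it->second.empty()) return false;
          *out = Decompress(it->second);
          if (erase_handle) entries_.erase(it);
          return true;
        }

       private:
        static std::string Compress(const std::string& s) { return s; }    // placeholder
        static std::string Decompress(const std::string& s) { return s; }  // placeholder
        std::unordered_map<std::string, std::string> entries_;
      };
      ```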
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10527
      
      Test Plan:
      1. stress tests.
      2. unit tests.
      3. CPU profiling for db_bench.
      
      Reviewed By: siying
      
      Differential Revision: D38747613
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 74a1eba7e1957c9affb2bd2ae3e0194584fa6eca
  7. 07 Sep 2022 (3 commits)
    • Support custom allocators for the blob cache (#10628) · c8543296
      Levi Tamasi committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10628
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D39228165
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 591fdff08db400b170b26f0165551f86d33c1dbf
    • Deflake blob caching tests (#10636) · 5a97e6b1
      Andrew Kryczka committed
      Summary:
      Example failure:
      
      ```
      db/blob/db_blob_basic_test.cc:226: Failure
      Expected equality of these values:
        i
          Which is: 1
        num_blobs
          Which is: 5
      ```
      
      I can't repro locally, but it looks like the 2KB cache is too small to guarantee no eviction happens between loading all the data into cache and reading from `kBlockCacheTier`. This 2KB setting appears to have come from a test where the cached entries are pinned, where it makes sense to have a small setting. However, such a small setting makes less sense when the blocks are evictable but must remain cached per the test's expectation. This PR increases the capacity setting to 2MB for those cases.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10636
      
      Reviewed By: cbi42
      
      Differential Revision: D39250976
      
      Pulled By: ajkr
      
      fbshipit-source-id: 769309f9a19cfac20b67b927805c8df5c1d2d1f5
    • Deflake DBErrorHandlingFSTest.*WALWriteError (#10642) · 1ffadbe9
      Andrew Kryczka committed
      Summary:
      Example flake: https://app.circleci.com/pipelines/github/facebook/rocksdb/17660/workflows/7a891875-f07b-4a67-b204-eaa7ca9f9aa2/jobs/467496
      
      The test could get stuck in out-of-space due to a callback executing `SetFilesystemActive(false /* active */)` after the test executed `SetFilesystemActive(true /* active */)`. This could happen because background info logging went through the SyncPoint callback "WritableFileWriter::Append:BeforePrepareWrite", probably unintentionally. The solution in this PR is to call `ClearAllCallBacks()` to wait for any such pending callbacks to drain before calling `SetFilesystemActive(true /* active */)`.
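
      A short sketch of the fixed ordering, using RocksDB's test utilities as I understand them (the header paths and fixture wiring are assumptions; these are internal, test-only headers):
      ```
      #include "test_util/sync_point.h"
      #include "utilities/fault_injection_fs.h"

      void ReactivateFilesystemSafely(ROCKSDB_NAMESPACE::FaultInjectionTestFS* fault_fs) {
        // Drain any pending SyncPoint callbacks (e.g. one registered on
        // "WritableFileWriter::Append:BeforePrepareWrite") before re-enabling the
        // filesystem, so a late callback cannot switch it back to inactive.
        ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
        fault_fs->SetFilesystemActive(true /* active */);
      }
      ```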
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10642
      
      Reviewed By: cbi42
      
      Differential Revision: D39265381
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9a2f4916ab19726c8fb4b3a3b590b1b9ed93de1b
  8. 05 Sep 2022 (1 commit)
    • Deflake DBBlockCacheTest1.WarmCacheWithBlocksDuringFlush (#10635) · fe5fbe32
      Andrew Kryczka committed
      Summary:
      Previously, automatic compaction could be triggered prior to the test invoking CompactRange(). It could lead to the following flaky failure:
      
      ```
      /root/project/db/db_block_cache_test.cc:753: Failure
      Expected equality of these values:
        1 + kNumBlocks
          Which is: 11
        options.statistics->getTickerCount(BLOCK_CACHE_INDEX_ADD)
          Which is: 10
      ```
      
      A sequence leading to this failure was:
      
      * Automatic compaction
        * files [1] [2] trivially moved
        * files [3] [4] [5] [6] trivially moved
      * CompactRange()
        * files [7] [8] [9] trivially moved
        * file [10] trivially moved
      
      In such a case, the index/filter block additions that the test expected did not happen, since there were no new files.
      
      This PR just tweaks settings to ensure the `CompactRange()` produces one new file.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10635
      
      Reviewed By: cbi42
      
      Differential Revision: D39250869
      
      Pulled By: ajkr
      
      fbshipit-source-id: a3c94c49069e28c49c40b4b80dae0059739d19fd
  9. 03 Sep 2022 (1 commit)
    • Skip swaths of range tombstone covered keys in merging iterator (2022 edition) (#10449) · 30bc495c
      Changyu Bi committed
      Summary:
      Delete range logic is moved from `DBIter` to `MergingIterator`, and `MergingIterator` will seek to the end of a range deletion if possible instead of scanning through each key and check with `RangeDelAggregator`.
      
      With the invariant that a key in level L (consider memtable as the first level, each immutable and L0 as a separate level) has a larger sequence number than all keys in any level >L, a range tombstone `[start, end)` from level L covers all keys in its range in any level >L. This property motivates optimizations in iterator:
      - in `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do Seek() on `end` instead of `target` to skip some range tombstone covered keys.
      - in `Next()/Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can do `Seek` to `end` for all levels > L.
      
      This PR implements the above optimizations in `MergingIterator`. As all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to https://github.com/facebook/rocksdb/issues/7317, but this PR leaves `InternalIterator` interface mostly unchanged. **Credit**: the cascading seek optimization and the sentinel key (discussed below) are inspired by [Pebble](https://github.com/cockroachdb/pebble/blob/master/merging_iter.go) and suggested by ajkr in https://github.com/facebook/rocksdb/issues/7317. The two optimizations are mostly implemented in `SeekImpl()/SeekForPrevImpl()` and `IsNextDeleted()/IsPrevDeleted()` in `merging_iterator.cc`. See comments for each method for more detail.
      
      One notable change is that the minHeap/maxHeap used by `MergingIterator` now contains range tombstone end keys besides point key iterators. This helps to reduce the number of key comparisons. For example, for a range tombstone `[start, end)`, a `start` and an `end` `HeapItem` are inserted into the heap. When a `HeapItem` for a range tombstone start key is popped from the minHeap, we know this range tombstone becomes "active" in the sense that, before the range tombstone's end key is popped from the minHeap, all the keys popped from this heap are covered by the range tombstone's internal key range `[start, end)`.
      
      Another major change, *delete range sentinel key*, is made to `LevelIterator`. Before this PR, when all point keys in an SST file are iterated through in `MergingIterator`, a level iterator would advance to the next SST file in its level. In the case when an SST file has a range tombstone that covers keys beyond the SST file's last point key, advancing to the next SST file would lose this range tombstone. Consequently, `MergingIterator` could return keys that should have been deleted by some range tombstone. We prevent this by pretending that file boundaries in each SST file are sentinel keys. A `LevelIterator` now only advances the file iterator once the sentinel key is processed.
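
      A simplified sketch (hypothetical interfaces, not the actual `MergingIterator` code) of the cascading-seek optimization described above: if level L has a range tombstone [start, end) covering the seek target, every older level can seek directly to end instead of target, skipping the keys that tombstone deletes.
      ```
      #include <string>
      #include <vector>

      struct LevelIter {
        virtual ~LevelIter() = default;
        virtual void Seek(const std::string& target) = 0;
        // Returns true and fills *ts_end if this level has a range tombstone
        // [ts_start, ts_end) covering `key`.
        virtual bool CoveringTombstoneEnd(const std::string& key,
                                          std::string* ts_end) = 0;
      };

      void CascadingSeek(std::vector<LevelIter*>& levels /* newest to oldest */,
                         const std::string& target) {
        std::string current_target = target;
        for (LevelIter* level : levels) {
          level->Seek(current_target);
          std::string ts_end;
          if (level->CoveringTombstoneEnd(current_target, &ts_end) &&
              ts_end > current_target) {
            // All older levels can skip ahead to the tombstone's end key.
            current_target = ts_end;
          }
        }
      }
      ```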
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10449
      
      Test Plan:
      - Added many unit tests in db_range_del_test
      - Stress test: `./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 -delpercent=10 --iterpercent=44 --delrangepercent=2`
      - Additional iterator stress test is added to verify against iterators against expected state: https://github.com/facebook/rocksdb/issues/10538. This is based on ajkr's previous attempt https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913.
      
      ```
      python3 ./tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --max_background_compactions=8 --value_size_mult=33 --max_key=5000000 --interval=10 --duration=7200 --delrangepercent=3 --delpercent=9 --iterpercent=25 --writepercent=60 --readpercent=3 --prefixpercent=0 --num_iterations=1000 --range_deletion_width=100 --verify_iterator_with_expected_state_one_in=1
      ```
      
      - Performance benchmark: I used a similar setup as in the blog [post](http://rocksdb.org/blog/2018/11/21/delete-range.html) that introduced DeleteRange, "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written".  As expected, the performance with this PR depends on the range tombstone width.
      ```
      # Setup:
      TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000
      TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50
      
      # Scan entire DB
      TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true
      
      # Short range scan (10 Next())
      TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true
      
      # Long range scan(1000 Next())
      TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true
      ```
      Avg over 10 runs (some slower tests had fewer runs):
      
      For the first column (tombstone), 0 means no range tombstone, 100-10000 means the width of each of the 10k range tombstones, and 1 means there is a single range tombstone in the entire DB (width is 1000). The 1-tombstone case tests for regression when there are very few range tombstones in the DB, since having no range tombstones at all likely takes a different code path than having some.
      
      - Scan entire DB
      
      | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
      | ------------- | ------------- | ------------- |  ------------- |
      | 0 range tombstone    |2525600 (± 43564)    |2486917 (± 33698)    |-1.53%               |
      | 100   |1853835 (± 24736)    |2073884 (± 32176)    |+11.87%              |
      | 1000  |422415 (± 7466)      |1115801 (± 22781)    |+164.15%             |
      | 10000 |22384 (± 227)        |227919 (± 6647)      |+918.22%             |
      | 1 range tombstone      |2176540 (± 39050)    |2434954 (± 24563)    |+11.87%              |
      - Short range scan
      
      | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
      | ------------- | ------------- | ------------- |  ------------- |
      | 0  range tombstone   |35398 (± 533)        |35338 (± 569)        |-0.17%               |
      | 100   |28276 (± 664)        |31684 (± 331)        |+12.05%              |
      | 1000  |7637 (± 77)          |25422 (± 277)        |+232.88%             |
      | 10000 |1367                 |28667                |+1997.07%            |
      | 1 range tombstone      |32618 (± 581)        |32748 (± 506)        |+0.4%                |
      
      - Long range scan
      
      | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
      | ------------- | ------------- | ------------- |  ------------- |
      | 0 range tombstone     |2262 (± 33)          |2353 (± 20)          |+4.02%               |
      | 100   |1696 (± 26)          |1926 (± 18)          |+13.56%              |
      | 1000  |410 (± 6)            |1255 (± 29)          |+206.1%              |
      | 10000 |25                   |414                  |+1556.0%             |
      | 1 range tombstone   |1957 (± 30)          |2185 (± 44)          |+11.65%              |
      
      - Microbench does not show significant regression: https://gist.github.com/cbi42/59f280f85a59b678e7e5d8561e693b61
      
      Reviewed By: ajkr
      
      Differential Revision: D38450331
      
      Pulled By: cbi42
      
      fbshipit-source-id: b5ef12e8d8c289ed2e163ccdf277f5039b511fca
      30bc495c
  10. 02 Sep, 2022 1 commit
    • L
      Pin the newly cached blob after insert (#10625) · b07217da
      Levi Tamasi committed
      Summary:
      With the current code, when a blob isn't found in the cache and gets read
      from the blob file and then inserted into the cache, the application gets
      passed the self-contained `PinnableSlice` resulting from the blob file read.
      The patch changes this so that the `PinnableSlice` pins the cache entry
      instead in this case.
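
      A hedged usage sketch of the observable effect (`db` and the key are placeholders, not from the patch): the `PinnableSlice` an application receives for a blob value may now pin a blob cache entry rather than own a private copy, so it should be released promptly.
      
      ```
      #include "rocksdb/db.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      void ReadBlobValue(DB* db) {
        PinnableSlice value;
        Status s = db->Get(ReadOptions(), db->DefaultColumnFamily(), "blob_key",
                           &value);
        if (s.ok()) {
          // ... use value ...
          value.Reset();  // releases the pinned cache handle (or the owned buffer)
        }
      }
      ```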
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10625
      
      Test Plan: `make check`
      
      Reviewed By: pdillinger
      
      Differential Revision: D39220904
      
      Pulled By: ltamasi
      
      fbshipit-source-id: cb9c62881e3523b1e9f614e00bf503bac2fe3b0a
      b07217da
  11. 01 Sep, 2022 2 commits
  12. 31 Aug, 2022 2 commits
    • L
      Support using cache warming with the secondary blob cache (#10603) · 01e88dfe
      Levi Tamasi committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10603
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D39117952
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 5e956fa2fc18974876a5c87686acb50718e0edb7
      01e88dfe
    • H
      Add missing mutex when reading from shared variable... · 8a85946f
      Hui Xiao committed
      Add missing mutex when reading from shared variable bg_bottom_compaction_scheduled_, bg_compaction_scheduled_ (#10610)
      
      Summary:
      **Context/Summary:**
      According to https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_job.h#L328-L332, any read of `*bg_compaction_scheduled_` or `*bg_bottom_compaction_scheduled_` should be protected by a mutex, which wasn't the case for some assert statements. This leads to a data race that can be reproduced with the following command:
      
      ```
      db=/dev/shm/rocksdb_crashtest_blackbox
      exp=/dev/shm/rocksdb_crashtest_expected
      rm -rf $db $exp
      mkdir -p $exp
      
      ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=1 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --value_size_mult=32 --writepercent=90  --compaction_pri=4 --use_txn=1 --level_compaction_dynamic_level_bytes=True  --compaction_ttl=0  --compact_files_one_in=1000000 --compact_range_one_in=1000000 --value_size_mult=32 --verify_db_one_in=1000  --write_buffer_size=65536 --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_key=25000000 --max_key_len=3 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=2097152 --target_file_size_base=2097152 --target_file_size_multiplier=2
      ```
      ```
      WARNING: ThreadSanitizer: data race (pid=73424)
        Read of size 4 at 0x7b8c0000151c by thread T13:
          #0 ReleaseSubcompactionResources internal_repo_rocksdb/repo/db/compaction/compaction_job.cc:390 (db_stress+0x630aa3)
          https://github.com/facebook/rocksdb/issues/1 rocksdb::CompactionJob::Run() internal_repo_rocksdb/repo/db/compaction/compaction_job.cc:741 (db_stress+0x630aa3)
          https://github.com/facebook/rocksdb/issues/2 rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:3436 (db_stress+0x60b2cc)
          https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2950 (db_stress+0x606d79)
          https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::BGWorkCompaction(void*) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2693 (db_stress+0x60356a)
      
        Previous write of size 4 at 0x7b8c0000151c by thread T12 (mutexes: write M438955329917552448):
          #0 rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:3018 (db_stress+0x6072a1)
          https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::BGWorkCompaction(void*) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2693 (db_stress+0x60356a)
      
      Location is heap block of size 6720 at 0x7b8c00000000 allocated by main thread:
          #0 operator new(unsigned long, std::align_val_t) <null> (db_stress+0xbab5bb)
          https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) internal_repo_rocksdb/repo/db/db_impl/db_impl_open.cc:1811 (db_stress+0x69769a)
          https://github.com/facebook/rocksdb/issues/2 rocksdb::TransactionDB::Open(rocksdb::DBOptions const&, rocksdb::TransactionDBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::TransactionDB**) internal_repo_rocksdb/repo/utilities/transactions/pessimistic_transaction_db.cc:258 (db_stress+0x8ae1f4)
          https://github.com/facebook/rocksdb/issues/3 rocksdb::StressTest::Open(rocksdb::SharedState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:2611 (db_stress+0x32b927)
          https://github.com/facebook/rocksdb/issues/4 rocksdb::StressTest::InitDb(rocksdb::SharedState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:290 (db_stress+0x34712c)
      ```
      This PR adds the missing mutex protection that should have been in place.
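
      Reduced to a self-contained, hedged sketch (std::mutex and a plain struct stand in for the DB mutex and DBImpl; names are illustrative, not the actual RocksDB code), the pattern the fix enforces looks like this:
      
      ```
      #include <cassert>
      #include <mutex>
      
      struct CompactionScheduler {
        std::mutex mu;
        int bg_compaction_scheduled = 0;  // guarded by mu
      
        void CheckInvariant() {
          // Even a debug-only read of the guarded counter must hold the lock;
          // without it, ThreadSanitizer reports a data race like the one above.
          std::lock_guard<std::mutex> lock(mu);
          assert(bg_compaction_scheduled >= 0);
        }
      };
      ```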
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10610
      
      Test Plan:
      - Past repro command
      - Existing CI
      
      Reviewed By: riversand963
      
      Differential Revision: D39143016
      
      Pulled By: hx235
      
      fbshipit-source-id: 51dd4db55ad306f3dbda5d0dd54d6f2513cf70f2
      8a85946f
  13. 30 Aug, 2022 4 commits
    • Y
      Use std::make_unique when possible (#10578) · 7c0838e6
      Yanqin Jin committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10578
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D39064748
      
      Pulled By: riversand963
      
      fbshipit-source-id: c7c135b7b713608edb14614846050ece6d4cc59d
      7c0838e6
    • H
      Sync dir containing CURRENT after RenameFile on CURRENT as much as possible (#10573) · e484b81e
      Hui Xiao committed
      Summary:
      **Context:**
      The crash test below revealed a bug where the directory containing the CURRENT file (referred to as `dir_contains_current_file` below) did not always get synced after a new CURRENT file is created and renamed into place with `RenameFile` as part of that creation.
      
      This bug exposes the risk that such an un-synced directory containing the updated CURRENT file may not survive a host crash (e.g., power loss) and thus become corrupted, which would then be followed by an unwanted recovery from a corrupted CURRENT file.
      
      The root cause is that a nullptr `FSDirectory* dir_contains_current_file` sometimes gets passed down to `SetCurrentFile()`, so in those cases `dir_contains_current_file->FSDirectory::FsyncWithDirOptions()` is skipped (which would otherwise internally call `Env/FS::SyncDir()`).
      ```
      ./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=$exp --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync_fault_injection=1 --target_file_size_base=2097 --target_file_size_multiplier=2 --test_batches_snapshots=1 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35
      ```
      
      ```
      stderr:
      WARNING: prefix_size is non-zero but memtablerep != prefix_hash
      db_stress: utilities/fault_injection_fs.cc:748: virtual rocksdb::IOStatus rocksdb::FaultInjectionTestFS::RenameFile(const std::string &, const std::string &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `tlist.find(tdn.second) == tlist.end()' failed.`
      ```
      
      **Summary:**
      The PR ensures the non-test paths pass down a non-null directory containing CURRENT (which, by current RocksDB assumption, is just the DB directory) by doing the following (a hedged sketch of the resulting rename-then-sync sequence follows this list):
      - Renamed `directory_to_fsync` to `dir_contains_current_file` in `SetCurrentFile()` to tighten the association between this directory and the CURRENT file
      - Changed the `SetCurrentFile()` API to require `dir_contains_current_file` to be passed in, instead of defaulting it to nullptr.
          - Because `SetCurrentFile()`'s `dir_contains_current_file` is passed down from `VersionSet::LogAndApply()` and then `VersionSet::ProcessManifestWrites()` (i.e., think of this as a chain of 3 functions related to MANIFEST updates), these 2 functions were also refactored to require `dir_contains_current_file`
      - Updated the non-test-path callers of these 3 functions to obtain and pass in a non-nullptr `dir_contains_current_file`, which by current RocksDB assumption is the `FSDirectory* db_dir`.
          - The `db_impl` path obtains it via `DBImpl::directories_.getDbDir()`, while callers with no access to such `directories_` create it on the fly via `FileSystem::NewDirectory(..)` and manage it with a unique pointer to ensure a short lifetime.
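
      Below is a hedged, self-contained sketch of the rename-then-sync sequence the fix aims to guarantee. It is not the actual `SetCurrentFile()` code; the function name, the on-the-fly directory creation, and the minimal error handling are illustrative.
      
      ```
      #include <memory>
      #include <string>
      
      #include "rocksdb/file_system.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      // Rename the temp file to CURRENT, then fsync the directory that contains
      // CURRENT so that the rename survives a power loss.
      IOStatus RenameAndSyncCurrent(FileSystem* fs, const std::string& dbname,
                                    const std::string& tmp,
                                    const std::string& current) {
        IOStatus s = fs->RenameFile(tmp, current, IOOptions(), nullptr);
        if (!s.ok()) {
          return s;
        }
        std::unique_ptr<FSDirectory> dir_contains_current_file;
        s = fs->NewDirectory(dbname, IOOptions(), &dir_contains_current_file,
                             nullptr);
        if (s.ok()) {
          s = dir_contains_current_file->FsyncWithDirOptions(
              IOOptions(), nullptr, DirFsyncOptions(current));
        }
        return s;
      }
      ```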
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10573
      
      Test Plan:
      - `make check`
      - Passed the repro db_stress command
      - For future improvement: since we currently don't assert that the dir containing CURRENT is non-nullptr due to https://github.com/facebook/rocksdb/pull/10573#pullrequestreview-1087698899, there is still a chance that future developers mistakenly pass down a nullptr dir containing CURRENT, resulting in a skipped dir sync and causing the bug again. Therefore a smarter test (e.g., as suggested by ajkr, "(make) unsynced data loss to be dropping files corresponding to unsynced directory entries") is still needed.
      
      Reviewed By: ajkr
      
      Differential Revision: D39005886
      
      Pulled By: hx235
      
      fbshipit-source-id: 336fb9090d0cfa6ca3dd580db86268007dde7f5a
      e484b81e
    • L
      Add a dedicated cache entry role for blobs (#10601) · 78185601
      Levi Tamasi committed
      Summary:
      The patch adds a dedicated cache entry role for blob values and switches
      to a registered deleter so that blobs show up as a separate bucket
      (as opposed to "Misc") in the cache occupancy statistics, e.g.
      
      ```
      Block cache entry stats(count,size,portion): DataBlock(133515,531.73 MB,13.6866%) BlobValue(1824855,3.10 GB,81.7071%) Misc(1,0.00 KB,0%)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10601
      
      Test Plan: Ran `make check` and tested the cache occupancy statistics using `db_bench`.
      
      Reviewed By: riversand963
      
      Differential Revision: D39107915
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 8446c3b190a41a144030df73f318eeda4398c125
      78185601
    • P
      Don't wait for indirect flush in read-only DB (#10569) · c5afbbfe
      Peter Dillinger committed
      Summary:
      Some APIs for getting live files, which are used by Checkpoint
      and BackupEngine, can optionally trigger and wait for a flush. These
      would deadlock when used on a read-only DB. Here we fix that by assuming
      the user wants the overall operation to succeed and is OK without
      flushing (because the DB is read-only).
      
      Follow-up work: the same or other issues can be hit by directly invoking
      some DB functions that are clearly not appropriate for read-only
      instance, but are not covered by overrides in DBImplReadOnly and
      CompactedDBImpl. These should be fixed to avoid similar problems on
      accidental misuse. (Long term, it would be nice to have a DBReadOnly
      class without those members, like BackupEngineReadOnly.)
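
      A hedged usage sketch of the fixed behavior (the path is a placeholder and error handling is minimal): requesting a flush through `GetLiveFiles()` on a read-only instance now simply skips the flush instead of hanging.
      
      ```
      #include <cstdint>
      #include <string>
      #include <vector>
      
      #include "rocksdb/db.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      void ListLiveFilesReadOnly() {
        DB* db = nullptr;
        Options options;
        Status s = DB::OpenForReadOnly(options, "/path/to/db", &db);
        if (!s.ok()) {
          return;
        }
        std::vector<std::string> files;
        uint64_t manifest_file_size = 0;
        // Before this fix, flush_memtable=true could deadlock on a read-only DB.
        s = db->GetLiveFiles(files, &manifest_file_size, /*flush_memtable=*/true);
        delete db;
      }
      ```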
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10569
      
      Test Plan: tests updated to catch regression (hang before the fix)
      
      Reviewed By: riversand963
      
      Differential Revision: D38995759
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f5f8bc7123e13cb45bd393dd974d7d6eda20bc68
      c5afbbfe
  14. 27 Aug, 2022 1 commit
    • L
      Improve the accounting of memory used by cached blobs (#10583) · 23376aa5
      Levi Tamasi committed
      Summary:
      The patch improves the bookkeeping around the memory usage of
      cached blobs in two ways: 1) it uses `malloc_usable_size`, which accounts
      for allocator bin sizes etc., and 2) it also considers the memory usage
      of the `BlobContents` object in addition to the blob itself. Note: some unit
      tests had been relying on the cache charge being equal to the size of the
      cached blob; these were updated.
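
      A hedged, simplified illustration of the new bookkeeping (the function and parameter names are made up for this sketch, and `malloc_usable_size` is the glibc/jemalloc-compatible call; this is not the actual RocksDB code):
      
      ```
      #include <malloc.h>  // malloc_usable_size
      
      #include <cstddef>
      
      // Charge the cache with the allocator-reported footprint of the blob buffer
      // plus the size of the wrapping object, instead of just the blob's length.
      size_t ApproximateBlobCacheCharge(char* blob_buffer,
                                        size_t wrapper_object_size) {
        return malloc_usable_size(blob_buffer) + wrapper_object_size;
      }
      ```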
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10583
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D39060680
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 3583adce2b4ce6e84861f3fadccbfd2e5a3cc482
      23376aa5
  15. 26 Aug, 2022 3 commits
    • J
      Fix periodic_task unable to re-register the same task type (#10379) · d9e71fb2
      Jay Zhuang committed
      Summary:
      Timer has a limitation that it cannot re-register a task with the same name, because cancel only marks the task as invalid and waits for the Timer thread to clean it up later; before the task is cleaned up, the same task name cannot be added. This makes a task option update, which basically cancels and re-registers the same task name, likely to fail. Change the periodic task name to a random unique id and store it in periodic_task_scheduler.
      
      Also refactor `periodic_work` to `periodic_task` so that each job functions as a `task`.
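
      A hedged sketch of the naming idea (illustrative only, not the actual `PeriodicTaskScheduler` code): every registration gets a unique name, so cancel followed by re-register can never collide with an entry the Timer thread has not yet cleaned up.
      
      ```
      #include <atomic>
      #include <cstdint>
      #include <string>
      
      // Build a unique timer task name from the task type plus a monotonically
      // increasing counter.
      std::string NextPeriodicTaskName(const std::string& task_type) {
        static std::atomic<uint64_t> counter{0};
        return task_type + "#" + std::to_string(counter.fetch_add(1));
      }
      ```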
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10379
      
      Test Plan: unittests
      
      Reviewed By: ajkr
      
      Differential Revision: D38000615
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: e4135f9422e3b53aaec8eda54f4e18ce633a279e
      d9e71fb2
    • L
      Introduce a dedicated class to represent blob values (#10571) · 3f57d84a
      Levi Tamasi committed
      Summary:
      The patch introduces a new class called `BlobContents`, which represents
      a single uncompressed blob value. We currently use `std::string` for this
      purpose; `BlobContents` is somewhat smaller but the primary reason for a
      dedicated class is that it enables certain improvements and optimizations
      like eliding a copy when inserting a blob into the cache, using custom
      allocators, or more control over and better accounting of the memory usage
      of cached blobs (see https://github.com/facebook/rocksdb/issues/10484).
      (We plan to implement these in subsequent PRs.)
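
      A hedged, heavily simplified sketch of what such a dedicated class looks like conceptually (this is not the real `BlobContents` definition and the class name is made up): it owns the uncompressed buffer directly, which is what later allows handing it to the cache without a copy and accounting for its memory precisely.
      
      ```
      #include <cstddef>
      #include <memory>
      #include <utility>
      
      // Owns the uncompressed blob buffer and exposes a read-only view of it.
      class BlobValueSketch {
       public:
        BlobValueSketch(std::unique_ptr<char[]> buffer, size_t size)
            : buffer_(std::move(buffer)), size_(size) {}
        const char* data() const { return buffer_.get(); }
        size_t size() const { return size_; }
      
       private:
        std::unique_ptr<char[]> buffer_;
        size_t size_;
      };
      ```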
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10571
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D39000965
      
      Pulled By: ltamasi
      
      fbshipit-source-id: f296eddf9dec4fc3e11cad525b462bdf63c78f96
      3f57d84a
    • A
      Ensure writes to WAL tail during `FlushWAL(true /* sync */)` will be synced (#10560) · 7ad4b386
      Andrew Kryczka committed
      Summary:
      WAL append and switch can both happen between `FlushWAL(true /* sync */)`'s sync operations and its call to `MarkLogsSynced()`. We permit this since locks need to be released for the sync operations. Such an appended/switched WAL is both inactive and incompletely synced at the time `MarkLogsSynced()` processes it.
      
      Prior to this PR, `MarkLogsSynced()` assumed all inactive WALs were fully synced and removed them from consideration for future syncs. That was wrong in the scenario described above and led to the latest append(s) never being synced. This PR changes `MarkLogsSynced()` to only remove inactive WALs from consideration for which all flushed data has been synced.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10560
      
      Test Plan: repro unit test for the scenario described above. Without this PR, it fails on "key2" not found
      
      Reviewed By: riversand963
      
      Differential Revision: D38957391
      
      Pulled By: ajkr
      
      fbshipit-source-id: da77175eba97ff251a4219b227b3bb2d4843ed26
      7ad4b386
  16. 25 Aug, 2022 2 commits
  17. 24 Aug, 2022 2 commits
  18. 20 Aug, 2022 2 commits
    • A
      MultiGet async IO across multiple levels (#10535) · 35cdd3e7
      anand76 committed
      Summary:
      This PR exploits parallelism in MultiGet across levels. It applies only to the coroutine version of MultiGet. Previously, MultiGet file reads from SST files in the same level were parallelized. With this PR, MultiGet batches with keys distributed across multiple levels are read in parallel. This is accomplished by splitting the keys not present in a level (determined by bloom filtering) into a separate batch, and processing the new batch in parallel with the original batch.
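
      A hedged usage sketch (`db` is assumed to be an already-open `DB*` and the keys are placeholders): the batched `MultiGet` API with `ReadOptions::async_io` enabled is the path that benefits from the cross-level parallelism described above.
      
      ```
      #include <vector>
      
      #include "rocksdb/db.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      void MultiGetWithAsyncIO(DB* db) {
        ReadOptions read_options;
        read_options.async_io = true;  // effective on builds with coroutine support
        std::vector<Slice> keys = {"key1", "key2", "key3"};
        std::vector<PinnableSlice> values(keys.size());
        std::vector<Status> statuses(keys.size());
        db->MultiGet(read_options, db->DefaultColumnFamily(), keys.size(),
                     keys.data(), values.data(), statuses.data());
      }
      ```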
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10535
      
      Test Plan:
      1. Ensure existing MultiGet unit tests pass, updating them as necessary
      2. New unit tests - TODO
      3. Run stress test - TODO
      
      No noticeable regression (<1%) without async IO -
      Without PR: `multireadrandom :       7.261 micros/op 1101724 ops/sec 60.007 seconds 66110936 operations;  571.6 MB/s (8168992 of 8168992 found)`
      With PR: `multireadrandom :       7.305 micros/op 1095167 ops/sec 60.007 seconds 65717936 operations;  568.2 MB/s (8271992 of 8271992 found)`
      
      For a fully cached DB, but with async IO option on, no regression observed (<1%) -
      Without PR: `multireadrandom :       5.201 micros/op 1538027 ops/sec 60.005 seconds 92288936 operations;  797.9 MB/s (11540992 of 11540992 found) `
      With PR: `multireadrandom :       5.249 micros/op 1524097 ops/sec 60.005 seconds 91452936 operations;  790.7 MB/s (11649992 of 11649992 found) `
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38774009
      
      Pulled By: anand1976
      
      fbshipit-source-id: c955e259749f1c091590ade73105b3ee46cd0007
      35cdd3e7
    • L
      Add support for wide-column point lookups (#10540) · 81388b36
      Levi Tamasi committed
      Summary:
      The patch adds a new API `GetEntity` that can be used to perform
      wide-column point lookups. It also extends the `Get` code path and
      the `MemTable` / `MemTableList` and `Version` / `GetContext` logic
      accordingly so that wide-column entities can be served from both
      memtables and SSTs. If the result of a lookup is a wide-column entity
      (`kTypeWideColumnEntity`), it is passed to the application in deserialized
      form; if it is a plain old key-value (`kTypeValue`), it is presented as a
      wide-column entity with a single default (anonymous) column.
      (In contrast, regular `Get` returns plain old key-values as-is, and
      returns the value of the default column for wide-column entities, see
      https://github.com/facebook/rocksdb/issues/10483 .)
      
      The result of `GetEntity` is a self-contained `PinnableWideColumns` object.
      `PinnableWideColumns` contains a `PinnableSlice`, which either stores the
      underlying data in its own buffer or holds on to a cache handle. It also contains
      a `WideColumns` instance, which indexes the contents of the `PinnableSlice`,
      so applications can access the values of columns efficiently.
      
      There are several pieces of functionality which are currently not supported
      for wide-column entities: there is currently no `MultiGetEntity` or wide-column
      iterator; also, `Merge` and `GetMergeOperands` are not supported, and there
      is no `GetEntity` implementation for read-only and secondary instances.
      We plan to implement these in future PRs.
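
      A hedged usage sketch of the new API (`db` and the key are placeholders; error handling is minimal):
      
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/wide_columns.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      void LookupEntity(DB* db) {
        PinnableWideColumns result;
        Status s = db->GetEntity(ReadOptions(), db->DefaultColumnFamily(),
                                 "my_key", &result);
        if (s.ok()) {
          for (const WideColumn& column : result.columns()) {
            // column.name() and column.value() are Slices indexing into the
            // self-contained PinnableWideColumns object.
            (void)column;
          }
        }
      }
      ```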
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10540
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38847474
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 42311a34ccdfe88b3775e847a5e2a5296e002b5b
      81388b36
  19. 18 Aug, 2022 2 commits
    • A
      Prevent a case of WriteBufferManager flush thrashing (#6364) · 91166012
      Andrew Kryczka committed
      Summary:
      Previously, the flushes triggered by `WriteBufferManager` could affect
      the same CF repeatedly if it happens to get consecutive writes. Such
      flushes are not particularly useful for reducing memory usage since
      they switch nearly-empty memtables to immutable while they've just begun
      filling their first arena block. In fact they may not even reduce the
      mutable memory count if they involve replacing one mutable memtable containing
      one arena block with a new mutable memtable containing one arena block.
      Further, if such switches happen even a few times before a flush finishes,
      the immutable memtable limit will be reached and writes will stall.
      
      This PR adds a heuristic to not switch memtables to immutable for CFs
      that already have one or more immutable memtables awaiting flush. There
      is a memory usage regression if the user continues writing to the same
      CF, the DB does not have any CFs eligible for switching, flushes
      are not finishing, and the `WriteBufferManager` was constructed with
      `allow_stall=false`. Before, it would grow by switching nearly empty
      memtables until writes stall. Now, it would grow by filling memtables
      until writes stall. This feels like an acceptable behavior change because
      users who prefer to stall over violate the memory limit should be using
      `allow_stall=true`, which is unaffected by this PR.
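
      A hedged configuration sketch (the buffer size is arbitrary): users who prefer stalling over exceeding the memory limit should construct the `WriteBufferManager` with `allow_stall=true`, which is the configuration unaffected by this change.
      
      ```
      #include <memory>
      
      #include "rocksdb/options.h"
      #include "rocksdb/write_buffer_manager.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      Options MakeOptionsWithStallingWbm() {
        Options options;
        // allow_stall=true: stall writes at the limit instead of exceeding it.
        options.write_buffer_manager = std::make_shared<WriteBufferManager>(
            /*buffer_size=*/64 << 20, /*cache=*/nullptr, /*allow_stall=*/true);
        return options;
      }
      ```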
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6364
      
      Test Plan:
      - Command:
      
      `rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num_multi_db=8 -num_column_families=2 -write_buffer_size=4194304 -db_write_buffer_size=16777216 -compression_type=none -statistics=true -target_file_size_base=4194304 -max_bytes_for_level_base=16777216`
      
      - `rocksdb.db.write.stall` count before this PR: 175
      - `rocksdb.db.write.stall` count after this PR: 0
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D20167197
      
      Pulled By: ajkr
      
      fbshipit-source-id: 4a64064e9bc33d57c0a35f15547542d0191d0cb7
      91166012
    • A
      Fix range deletion handling in async MultiGet (#10534) · 65814a4a
      anand76 committed
      Summary:
      The fix in https://github.com/facebook/rocksdb/issues/10513 was not complete w.r.t. range deletion handling. It didn't handle the case where a file with a range tombstone covering a key also overlapped another key in the batch. In that case, ```mget_range``` would be non-empty. However, ```mget_range``` would only have the second key and, therefore, the first key would be skipped when iterating through the range tombstones in ```TableCache::MultiGet```.
      
      Test plan -
      1. Add a unit test
      2. Run stress tests
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10534
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38773880
      
      Pulled By: anand1976
      
      fbshipit-source-id: dae491dbe52e18bbce5179b77b63f20771a66c00
      65814a4a