1. 29 Sep 2022, 1 commit
  2. 28 Sep 2022, 3 commits
  3. 27 Sep 2022, 5 commits
    • Fix segfault in Iterator::Refresh() (#10739) · df492791
      Committed by Changyu Bi
      Summary:
      When a new internal iterator is constructed during iterator refresh, the pointer to the previous memtable range tombstone iterator was not cleared. This could cause a segfault in future `Refresh()` calls when they try to free the memtable range tombstones. This PR fixes the issue.
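
      A minimal sketch of the dangling-pointer pattern being fixed, with hypothetical member and function names (not the actual patch):
      ```
      // Hypothetical sketch: when Refresh() rebuilds the internal iterator,
      // the old memtable range tombstone iterator is freed along with it, so
      // the cached pointer must be cleared. Leaving it dangling makes the
      // next Refresh() try to free it again, causing the segfault above.
      void RefreshInternalIterator() {
        delete memtable_range_tombstone_iter_;     // freed with the old iterator
        memtable_range_tombstone_iter_ = nullptr;  // the step the bug was missing
        // ... construct the new internal iterator ...
      }
      ```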
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10739
      
      Test Plan: Added a unit test in db_range_del_test.cc that reproduces this issue.
      
      Reviewed By: ajkr, riversand963
      
      Differential Revision: D39825283
      
      Pulled By: cbi42
      
      fbshipit-source-id: 3b59a2b73865aed39e28cdd5c1b57eed7991b94c
    • Support WriteCommit policy with sync_fault_injection=1 (#10624) · aed30ddf
      Committed by Hui Xiao
      Summary:
      **Context:**
      Prior to this PR, correctness testing with un-synced data loss [disabled](https://github.com/facebook/rocksdb/pull/10605) transactions (`use_txn=1`), and thus all `txn_write_policy` values. This PR improves that by adding support for one policy: WriteCommit (`txn_write_policy=0`).
      
      **Summary:**
      The key to this support is (a) handling Mark{Begin, End}Prepare/MarkCommit/MarkRollback correctly when constructing ExpectedState under the WriteCommit policy and (b) monitoring CI jobs and resolving any test incompatibility issues until the jobs are stable. (b) is covered by the test plan.
      
      For (a)
      - During prepare (i.e., between `MarkBeginPrepare()` and `MarkEndPrepare(xid)`), `ExpectedStateTraceRecordHandler` buffers all writes by adding them to an internal `WriteBatch`.
      - On `MarkEndPrepare()`, that `WriteBatch` is associated with the transaction's `xid`.
      - During commit (i.e., on `MarkCommit(xid)`), `ExpectedStateTraceRecordHandler` retrieves and iterates the internal `WriteBatch` and finally applies those writes to `ExpectedState`.
      - During rollback (i.e., on `MarkRollback(xid)`), `ExpectedStateTraceRecordHandler` erases the internal `WriteBatch` from the map. (See the sketch after this list.)
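
      A condensed, self-contained sketch of (a); the class and member names below are simplified stand-ins for the actual handler:
      ```
      #include <map>
      #include <string>
      #include "rocksdb/write_batch.h"

      // Sketch: buffer prepared writes per transaction and apply them to the
      // expected state on commit; a rollback simply drops the batch.
      class WriteCommitHandlerSketch {
       public:
        void MarkBeginPrepare() { buffered_writes_.Clear(); }
        void MarkEndPrepare(const std::string& xid) {
          // Associate the buffered writes with the transaction's xid.
          xid_to_batch_.emplace(xid, buffered_writes_);
          buffered_writes_.Clear();
        }
        void MarkCommit(const std::string& xid) {
          // Retrieve the batch and apply its writes to ExpectedState
          // (iteration via WriteBatch::Iterate() is elided here).
          ApplyToExpectedState(xid_to_batch_.at(xid));
          xid_to_batch_.erase(xid);
        }
        void MarkRollback(const std::string& xid) { xid_to_batch_.erase(xid); }

       private:
        void ApplyToExpectedState(const rocksdb::WriteBatch& /*batch*/) { /* elided */ }
        rocksdb::WriteBatch buffered_writes_;
        std::map<std::string, rocksdb::WriteBatch> xid_to_batch_;
      };
      ```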
      
      For (b) - one major issue described below:
      - TransactionDB in db_stress recovers prepared-but-not-committed txns from the previous crashed run by randomly committing or rolling them back at the start of the current run; see a historical [PR](https://github.com/facebook/rocksdb/commit/6d06be22c083ccf185fd38dba49fde73b644b4c1) that predates correctness testing.
      - We then verify those processed keys in the recovered db against their expected state.
      - However, now that we turn on `sync_fault_injection=1`, the expected state is constructed from the trace instead of from the LATEST.state of the previous run. The expected state used to verify those processed keys therefore won't contain UNKNOWN_SENTINEL as it should; see test 1 for a failing case.
      - Therefore, we decided to manually update the expected state to UNKNOWN_SENTINEL as part of the processing.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10624
      
      Test Plan:
      1. A test exposing the major issue described above. It fails without setting UNKNOWN_SENTINEL in the expected state during processing and passes with it:
      ```
      db=/dev/shm/rocksdb_crashtest_blackbox
      exp=/dev/shm/rocksdb_crashtest_expected
      dbt=$db.tmp
      expt=$exp.tmp
      
      rm -rf $db $exp
      mkdir -p $exp
      
      echo "RUN 1"
      ./db_stress \
      --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
      --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
      pid=$!
      sleep 0.2
      sleep 20
      kill $pid
      sleep 0.2
      
      echo "RUN 2"
      ./db_stress \
      --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
      --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
      pid=$!
      sleep 0.2
      sleep 20
      kill $pid
      sleep 0.2
      
      echo "RUN 3"
      ./db_stress \
      --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
      --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1
      ```
      
      2. Manual testing to ensure ExpectedState is constructed correctly during recovery by verifying it against previously crashed TransactionDB's WAL.
         - Run the following command to crash a TransactionDB with the WriteCommit policy, then run `./ldb dump_wal` on its WAL file:
      ```
      db=/dev/shm/rocksdb_crashtest_blackbox
      exp=/dev/shm/rocksdb_crashtest_expected
      rm -rf $db $exp
      mkdir -p $exp
      
      ./db_stress \
      	--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
      	--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
      pid=$!
      sleep 30
      kill $pid
      sleep 1
      ```
      - Run the following command to verify recovery of the crashed db under a debugger. Compare the step-wise results with the WAL records (e.g., WriteBatch contents, xids, prepare/commit/rollback markers):
      ```
         ./db_stress \
      	--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
      	--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1
      ```
      3. Automatic testing by triggering all RocksDB stress/crash test jobs for 3 rounds with no failure.
      
      Reviewed By: ajkr, riversand963
      
      Differential Revision: D39199373
      
      Pulled By: hx235
      
      fbshipit-source-id: 7a1dec0e3e2ee6ea86ddf5dd19ceb5543a3d6f0c
    • Add OpenSSL to docker image (#10741) · 5d7cf311
      Committed by anand76
      Summary:
      Update the docker image with OpenSSL, required by the folly build.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10741
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D39831081
      
      Pulled By: anand1976
      
      fbshipit-source-id: 900154f70a456d1b6f9e384b8bdbcc227af4adbc
    • Update HISTORY to mention PR #10724 (#10737) · 52f24117
      Committed by Yanqin Jin
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10737
      
      Reviewed By: cbi42
      
      Differential Revision: D39825386
      
      Pulled By: riversand963
      
      fbshipit-source-id: a3c55f2777e034d6ae6ff44ef0219d9fbbf1cc96
    • Small cleanup in NonBatchedOpsStressTest::VerifyDb (#10740) · 2280b261
      Committed by Levi Tamasi
      Summary:
      The PR cleans up the logic in `NonBatchedOpsStressTest::VerifyDb` so that
      the verification method is picked using a single random draw. It also
      eliminates some repeated key comparisons and makes some small code
      hygiene improvements.
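
      As an illustration only, a single-draw selection could look like the following; the method choices are placeholders, not the actual db_stress code:
      ```
      // One random draw decides the verification method, instead of a
      // separate coin flip per method.
      const int kNumMethods = 3;
      switch (thread->rand.Uniform(kNumMethods)) {
        case 0:
          // verify via Get()
          break;
        case 1:
          // verify via MultiGet()
          break;
        default:
          // verify via an iterator
          break;
      }
      ```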
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10740
      
      Test Plan: Ran a simple blackbox crash test.
      
      Reviewed By: riversand963
      
      Differential Revision: D39828646
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 60ee5a3bb1851278f62c7d83b0c93b902ed9702e
  4. 24 Sep 2022, 2 commits
  5. 23 Sep 2022, 5 commits
  6. 22 Sep 2022, 7 commits
  7. 21 Sep 2022, 1 commit
  8. 20 Sep 2022, 3 commits
  9. 19 Sep 2022, 1 commit
  10. 17 Sep 2022, 3 commits
    • Add enable_split_merge option for CompressedSecondaryCache (#10690) · 2cc5b395
      Committed by gitbw95
      Summary:
      `enable_custom_split_merge` is added to enable the custom split and merge feature, which splits the compressed value into chunks so that they better fit the jemalloc bins.
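
      A minimal usage sketch, assuming the `CompressedSecondaryCacheOptions` API of this release; the capacity value is an arbitrary example:
      ```
      #include "rocksdb/cache.h"

      rocksdb::CompressedSecondaryCacheOptions opts;
      opts.capacity = 64 << 20;               // 64 MB secondary cache (arbitrary)
      opts.enable_custom_split_merge = true;  // split values into chunks that
                                              // better fit jemalloc bins
      std::shared_ptr<rocksdb::SecondaryCache> secondary_cache =
          rocksdb::NewCompressedSecondaryCache(opts);
      ```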
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10690
      
      Test Plan:
      Unit Tests
      Stress Tests
      
      Reviewed By: anand1976
      
      Differential Revision: D39567604
      
      Pulled By: gitbw95
      
      fbshipit-source-id: f6d1d46200f365220055f793514601dcb0edc4b7
    • Fix an incorrect MultiGet assertion (#10695) · e053ccde
      Committed by anand76
      Summary:
      The assertion in ```FilePickerMultiGet::ReplaceRange()``` was incorrect. The function should only be called to replace the range after finishing the search in the current level, which is indicated by ```hit_file_ == nullptr```, i.e., no more overlapping files in this level.
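
      Illustratively, the corrected precondition amounts to something like the following (simplified):
      ```
      // ReplaceRange() may only be called once the search in the current
      // level has finished, i.e., no more overlapping files in this level.
      assert(hit_file_ == nullptr);
      ```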
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10695
      
      Reviewed By: gitbw95
      
      Differential Revision: D39583217
      
      Pulled By: anand1976
      
      fbshipit-source-id: d4cedfb2b62fb9f3a083e9848a403ae6342f0519
    • Call experimental new clock cache HyperClockCache (#10684) · 0f91c72a
      Committed by Peter Dillinger
      Summary:
      This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons:
      * We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache."
      * We want to highlight the key feature: it's fast (especially under parallel load)
      * Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements.
      * We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.)
      
      Some API detail:
      * To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`.
      * Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated.
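
      A minimal construction sketch using `MakeSharedCache()` as described above; the capacity and per-entry charge values are arbitrary examples:
      ```
      #include "rocksdb/cache.h"

      rocksdb::HyperClockCacheOptions opts(
          /*_capacity=*/1 << 30,                  // 1 GB (arbitrary)
          /*_estimated_entry_charge=*/8 * 1024);  // required per-entry estimate
      std::shared_ptr<rocksdb::Cache> cache = opts.MakeSharedCache();
      ```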
      
      For performance tests see https://github.com/facebook/rocksdb/pull/10626
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684
      
      Test Plan: no interesting functional changes; tests updated
      
      Reviewed By: anand1976
      
      Differential Revision: D39547800
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf
  11. 16 Sep 2022, 7 commits
    • Revamp, optimize new experimental clock cache (#10626) · 57243486
      Committed by Peter Dillinger
      Summary:
      * Consolidates most metadata into a single word per slot so that more
      can be accomplished with a single atomic update. In the common case,
      Lookup was previously about 4 atomic updates, now just 1 atomic update.
      Common case Release was previously 1 atomic read + 1 atomic update,
      now just 1 atomic update.
      * Eliminate spins / waits / yields, which likely threaten some "lock free"
      benefits. Compare-exchange loops are only used in explicit Erase, and
      strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
      exchange.
      * Relaxes some aggressiveness and guarantees. For example,
        * Duplicate Inserts will sometimes go undetected and the shadow duplicate
          will age out with eviction.
        * In many cases, the older Inserted value for a given cache key will be kept
        (i.e. Insert does not support overwrite).
        * Entries explicitly erased (rather than evicted) might not be freed
        immediately in some rare cases.
        * With strict_capacity_limit=false, capacity limit is not tracked/enforced as
        precisely as LRUCache, but is self-correcting and should only deviate by a
        very small number of extra or fewer entries.
      * Use smaller "computed default" number of cache shards in many cases,
      because benefits to larger usage tracking / eviction pools outweigh the small
      cost of more lock-free atomic contention. The improvement in CPU and I/O
      is dramatic in some limit-memory cases.
      * Even without the sharding change, the eviction algorithm is likely more
      effective than LRU overall because it's more stateful, even though the
      "hot path" state tracking for it is essentially free with ref counting. It
      is like a generalized CLOCK with aging (see code comments). I don't have
      performance numbers showing a specific improvement, but in theory, for a
      Poisson access pattern to each block, keeping some state allows better
      estimation of time to next access (Poisson interval) than strict LRU. The
      bounded randomness in CLOCK can also reduce "cliff" effect for repeated
      range scans approaching and exceeding cache size.
      
      ## Hot path algorithm comparison
      Rough descriptions, focusing on number and kind of atomic operations:
      * Old `Lookup()` (2-5 atomic updates per probe):
      ```
      Loop:
        Increment internal ref count at slot
        If possible hit:
          Check flags atomic (and non-atomic fields)
          If cache hit:
            Three distinct updates to 'flags' atomic
            Increment refs for internal-to-external
            Return
        Decrement internal ref count
      while atomic read 'displacements' > 0
      ```
      * New `Lookup()` (1-2 atomic updates per probe):
      ```
      Loop:
        Increment acquire counter in meta word (optimistic)
        If visible entry (already read meta word):
          If match (read non-atomic fields):
            Return
          Else:
            Decrement acquire counter in meta word
        Else if invisible entry (rare, already read meta word):
          Decrement acquire counter in meta word
      while atomic read 'displacements' > 0
      ```
      * Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
      ```
      Read atomic ref count
      If last reference and invisible (rare):
        Use CAS etc. to remove
        Return
      Else:
        Decrement ref count
      ```
      * New `Release()` (1 unconditional atomic update, rarely more):
      ```
      Increment release counter in meta word
      If last reference and invisible (rare):
        Use CAS etc. to remove
        Return
      ```
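
      A toy illustration of the single-meta-word idea (not RocksDB's actual bit layout): packing the acquire and release counters into one atomic word makes each hot-path operation a single unconditional fetch-add:
      ```
      #include <atomic>
      #include <cstdint>

      struct SlotMeta {
        // Hypothetical packing: low 32 bits count acquires, high 32 bits
        // count releases; the reference count is acquires minus releases.
        std::atomic<uint64_t> meta{0};
        static constexpr uint64_t kAcquire = uint64_t{1};
        static constexpr uint64_t kRelease = uint64_t{1} << 32;

        // Lookup hot path: one unconditional atomic update.
        uint64_t Acquire() {
          return meta.fetch_add(kAcquire, std::memory_order_acquire);
        }
        // Release hot path: also one unconditional atomic update.
        uint64_t Release() {
          return meta.fetch_add(kRelease, std::memory_order_release);
        }
      };
      ```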
      
      ## Performance test setup
      Build DB with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
      ```
      Test with
      ```
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
      ```
      Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
      
      base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
      folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
      gt_clock: experimental ClockCache before this change
      new_clock: experimental ClockCache with this change
      
      ## Performance test results
      First test "hot path" read performance, with block cache large enough for whole DB:
      4181MB 1thread base -> kops/s: 47.761
      4181MB 1thread folly -> kops/s: 45.877
      4181MB 1thread gt_clock -> kops/s: 51.092
      4181MB 1thread new_clock -> kops/s: 53.944
      
      4181MB 16thread base -> kops/s: 284.567
      4181MB 16thread folly -> kops/s: 249.015
      4181MB 16thread gt_clock -> kops/s: 743.762
      4181MB 16thread new_clock -> kops/s: 861.821
      
      4181MB 24thread base -> kops/s: 303.415
      4181MB 24thread folly -> kops/s: 266.548
      4181MB 24thread gt_clock -> kops/s: 975.706
      4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
      
      4181MB 32thread base -> kops/s: 311.251
      4181MB 32thread folly -> kops/s: 274.952
      4181MB 32thread gt_clock -> kops/s: 1045.98
      4181MB 32thread new_clock -> kops/s: 1370.38
      
      4181MB 48thread base -> kops/s: 310.504
      4181MB 48thread folly -> kops/s: 268.322
      4181MB 48thread gt_clock -> kops/s: 1195.65
      4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
      
      4181MB 64thread base -> kops/s: 307.839
      4181MB 64thread folly -> kops/s: 272.172
      4181MB 64thread gt_clock -> kops/s: 1204.47
      4181MB 64thread new_clock -> kops/s: 1615.37
      
      4181MB 128thread base -> kops/s: 310.934
      4181MB 128thread folly -> kops/s: 267.468
      4181MB 128thread gt_clock -> kops/s: 1188.75
      4181MB 128thread new_clock -> kops/s: 1595.46
      
      Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
      
      Now test a large block cache with low miss ratio, but some eviction is required:
      1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
      1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
      1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
      1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
      
      1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
      1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
      1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
      1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
      
      1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
      1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
      1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
      1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
      
      610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
      610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
      610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
      610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
      
      610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
      610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
      610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
      610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
      
      610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
      610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
      610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
      610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
      
      The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
      
      Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
      
      233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
      233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
      233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
      233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
      
      233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
      233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
      233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
      233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
      
      233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
      233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
      233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
      233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
      
      89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
      89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
      89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
      89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
      
      89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
      89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
      89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
      89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
      
      89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
      89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
      89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
      89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
      
      ^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
      
      Even smaller cache size:
      34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
      34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
      34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
      34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
      
      34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
      34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
      34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
      34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
      
      34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
      34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
      34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
      34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
      
      As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
      
      13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
      13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
      13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
      13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
      
      13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
      13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
      13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
      13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
      
      13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
      13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
      13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
      13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
      
      gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
      
      13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
      13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
      13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
      13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
      
      13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
      13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
      13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
      13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
      
      Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
      
      Reviewed By: anand1976
      
      Differential Revision: D39368406
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
    • Fix some MultiGet stats (#10673) · 37b75e13
      Committed by anand76
      Summary:
      The stats were not accurate for the coroutine version of MultiGet. This PR fixes it.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10673
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D39492615
      
      Pulled By: anand1976
      
      fbshipit-source-id: b46c04e15ea27e66f4c31f00c66497aa283bf9d3
    • Re-enable user-defined timestamp and subcompactions (#10689) · 088b9844
      Committed by Yanqin Jin
      Summary:
      Hopefully, we can re-enable the combination of user-defined timestamp and subcompactions
      after https://github.com/facebook/rocksdb/issues/10658.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10689
      
      Test Plan:
      Make sure the following succeeds on devserver.
      make crash_test_with_ts
      
      Reviewed By: ltamasi
      
      Differential Revision: D39556558
      
      Pulled By: riversand963
      
      fbshipit-source-id: 4695f420b1bc9ebf3b24640b693746f4db82c149
    • Fix a MultiGet crash (#10688) · c206aebd
      Committed by anand76
      Summary:
      Fix a bug in the async IO/coroutine version of MultiGet that may cause a segfault or assertion failure due to accessing an invalid file index in a LevelFilesBrief. The bug is that when a MultiGetRange is split into two, we may re-process keys in the original range that were already marked to be skipped (in ```current_level_range_```) due to not overlapping the level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10688
      
      Reviewed By: gitbw95
      
      Differential Revision: D39556131
      
      Pulled By: anand1976
      
      fbshipit-source-id: 65e79438508a283cb19e64eca5c91d0714b81458
    • move db_stress locking to `StressTest::Test*()` functions (#10678) · 6ce782be
      Committed by Andrew Kryczka
      Summary:
      One problem of the previous strategy was `NonBatchedOpsStressTest::TestIngestExternalFile()` could release the lock for `rand_keys[0]` in `rand_column_families[0]`, and then subsequent operations in the same loop iteration (e.g., `TestPut()`) would run without locking. This PR changes the strategy so each `Test*()` function is responsible for acquiring and releasing its own locks.
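
      A hypothetical sketch of the new strategy; the helper names are illustrative rather than the actual db_stress code:
      ```
      // Each Test*() function takes and drops its own lock, so no operation
      // can run on a key whose lock was released by an earlier operation in
      // the same loop iteration.
      rocksdb::Status StressTest::TestPut(ThreadState* thread, int cf,
                                          int64_t key) {
        // RAII guard over the single key this operation touches.
        MutexLock l(thread->shared->GetMutexForKey(cf, key));
        // ... perform the Put and update the expected state under the lock ...
        return rocksdb::Status::OK();
      }
      ```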
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10678
      
      Reviewed By: hx235
      
      Differential Revision: D39516401
      
      Pulled By: ajkr
      
      fbshipit-source-id: bf67f12ebbd293ba8c24fdf8754ff28737bcd758
    • Support JemallocNodumpAllocator for the block/blob cache in db_bench (#10685) · 7dad4852
      Committed by Levi Tamasi
      Summary:
      The patch makes it possible to use the `JemallocNodumpAllocator` with the
      block/blob caches in `db_bench`. In addition to its stated purpose of excluding
      cache contents from core dumps, `JemallocNodumpAllocator` also uses
      a dedicated arena and jemalloc tcaches for cache allocations, which can
      reduce fragmentation and thus memory usage.
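
      A minimal wiring sketch, assuming a jemalloc-enabled build (otherwise `NewJemallocNodumpAllocator` returns a non-OK status); the capacity is an arbitrary example:
      ```
      #include "rocksdb/cache.h"
      #include "rocksdb/memory_allocator.h"

      rocksdb::JemallocAllocatorOptions jopts;
      std::shared_ptr<rocksdb::MemoryAllocator> allocator;
      rocksdb::Status s = rocksdb::NewJemallocNodumpAllocator(jopts, &allocator);
      if (s.ok()) {
        rocksdb::LRUCacheOptions cache_opts;
        cache_opts.capacity = 1 << 30;  // 1 GB block cache (arbitrary)
        cache_opts.memory_allocator = allocator;
        std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(cache_opts);
      }
      ```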
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10685
      
      Reviewed By: riversand963
      
      Differential Revision: D39552261
      
      Pulled By: ltamasi
      
      fbshipit-source-id: b5c58eab6b7c1baa9a307d9f1248df1d7a77d2b5
    • Disable PersistentCacheTierTest.BasicTest (#10683) · b418ace3
      Committed by Bo Wang
      Summary:
      Disable this flaky test since PersistentCache is not used.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10683
      
      Test Plan: Unit Tests
      
      Reviewed By: cbi42
      
      Differential Revision: D39545974
      
      Pulled By: gitbw95
      
      fbshipit-source-id: ac53e96f6ba880e7612e325eb5ff22ee2799efed
  12. 15 Sep 2022, 2 commits