1. 23 Mar 2022 (2 commits)
    • Add async_io read option in db_bench (#9735) · f07eec1b
      Authored by Akanksha Mahajan
      Summary:
      Add async_io Read option in db_bench
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9735
      
      Test Plan:
      ./db_bench -use_existing_db=true
      -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32
      -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680
      -duration=120 -ops_between_duration_checks=1 -async_io=1
      
      Reviewed By: riversand963
      
      Differential Revision: D35058482
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 1522b638c79f6d85bb7408c67f6ab76dbabeeee7
    • For db_bench --benchmarks=fillseq with --num_multi_db load databases … (#9713) · 63a284a6
      Authored by Mark Callaghan
      Summary:
      …in order
      
      This fixes https://github.com/facebook/rocksdb/issues/9650
      For db_bench --benchmarks=fillseq --num_multi_db=X it loads databases in sequence
      rather than randomly choosing a database per Put. The benefits are:
      1) avoids long delays between flushing memtables
      2) avoids flushing memtables for all of them at the same point in time
      3) puts same number of keys per database so that query tests will find keys as expected
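      The per-Put database choice described above can be sketched as follows (a minimal illustration, not the actual db_bench code; `SequentialDbIndex` is a hypothetical helper name):

      ```cpp
      #include <cstdint>

      // Illustrative sketch only (not the actual db_bench code): with
      // --num_multi_db databases and --num keys per database, sequential
      // loading derives the target database from the key ordinal, so each
      // database receives exactly keys_per_db contiguous keys. The old
      // behavior picked a database at random per Put.
      inline int SequentialDbIndex(int64_t key_ordinal, int64_t keys_per_db) {
        return static_cast<int>(key_ordinal / keys_per_db);
      }
      ```

      With --num=10000000 this reproduces the "switch to db 1 at 10000000" trace shown below.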
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9713
      
      Test Plan:
      Using db_bench.1 without the change and db_bench.2 with the change:
      
      for i in 1 2; do rm -rf /data/m/rx/* ; time ./db_bench.$i --db=/data/m/rx --benchmarks=fillseq --num_multi_db=4 --num=10000000; du -hs /data/m/rx ; done
      
       --- without the change
          fillseq      :       3.188 micros/op 313682 ops/sec;   34.7 MB/s
          real    2m7.787s
          user    1m52.776s
          sys     0m46.549s
          2.7G    /data/m/rx
      
       --- with the change
      
          fillseq      :       3.149 micros/op 317563 ops/sec;   35.1 MB/s
          real    2m6.196s
          user    1m51.482s
          sys     0m46.003s
          2.7G    /data/m/rx
      
          Also, temporarily added a printf to confirm that the code switches to the next database at the right time
          ZZ switch to db 1 at 10000000
          ZZ switch to db 2 at 20000000
          ZZ switch to db 3 at 30000000
      
      for i in 1 2; do rm -rf /data/m/rx/* ; time ./db_bench.$i --db=/data/m/rx --benchmarks=fillseq,readrandom --num_multi_db=4 --num=100000; du -hs /data/m/rx ; done
      
       --- without the change, smaller database; note that not all keys are found by readrandom because each database ends up with fewer or more than --num keys
      
          fillseq      :       3.176 micros/op 314805 ops/sec;   34.8 MB/s
          readrandom   :       1.913 micros/op 522616 ops/sec;   57.7 MB/s (99873 of 100000 found)
      
       --- with the change, smaller database, note that all keys are found by readrandom
      
          fillseq      :       3.110 micros/op 321566 ops/sec;   35.6 MB/s
          readrandom   :       1.714 micros/op 583257 ops/sec;   64.5 MB/s (100000 of 100000 found)
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D35030168
      
      Pulled By: mdcallag
      
      fbshipit-source-id: 2a18c4ec571d954cf5a57b00a11802a3608823ee
  2. 22 Mar 2022 (1 commit)
    • Make mixgraph easier to use (#9711) · 1ca1562e
      Authored by Mark Callaghan
      Summary:
      Changes:
      * improves monitoring by displaying average size of a Put value and average scan length
      * forces the minimum value size to be 10. Before this it was 0 if you didn't set the distribution parameters.
      * uses reasonable defaults for the distribution parameters that determine value size and scan length
      * includes seeks in the "reads ... found" message; before this change they were missing
      
      This is for https://github.com/facebook/rocksdb/issues/9672
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9711
      
      Test Plan:
      Before this change:
      
      ./db_bench --benchmarks=fillseq,mixgraph --mix_get_ratio=50 --mix_put_ratio=25 --mix_seek_ratio=25 --num=100000 --value_k=0.2615 --value_sigma=25.45 --iter_k=2.517 --iter_sigma=14.236
      fillseq      :       4.289 micros/op 233138 ops/sec;   25.8 MB/s
      mixgraph     :      18.461 micros/op 54166 ops/sec;  755.0 MB/s ( Gets:50164 Puts:24919 Seek:24917 of 50164 in 75081 found)
      
      After this change:
      
      ./db_bench --benchmarks=fillseq,mixgraph --mix_get_ratio=50 --mix_put_ratio=25 --mix_seek_ratio=25 --num=100000 --value_k=0.2615 --value_sigma=25.45 --iter_k=2.517 --iter_sigma=14.236
      fillseq      :       3.974 micros/op 251553 ops/sec;   27.8 MB/s
      mixgraph     :      16.722 micros/op 59795 ops/sec;  833.5 MB/s ( Gets:50164 Puts:24919 Seek:24917, reads 75081 in 75081 found, avg size: 36.0 value, 504.9 scan)
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D35030190
      
      Pulled By: mdcallag
      
      fbshipit-source-id: d8f555f28d869f752ddb674a524108884511b151
  3. 09 Mar 2022 (1 commit)
    • Rate-limit automatic WAL flush after each user write (#9607) · ca0ef54f
      Authored by Hui Xiao
      Summary:
      **Context:**
      WAL flush is currently not rate-limited by `Options::rate_limiter`. This PR provides rate-limiting for the automatic WAL flush, the one that happens after each user write operation (i.e., `Options::manual_wal_flush == false`), by adding `WriteOptions::rate_limiter_options`.
      
      Note that we are NOT rate-limiting WAL flushes that do NOT automatically happen after each user write, such as `Options::manual_wal_flush == true` + manual `FlushWAL()` (rate-limiting multiple WAL flushes), for the benefits of:
      - being consistent with [ReadOptions::rate_limiter_priority](https://github.com/facebook/rocksdb/blob/7.0.fb/include/rocksdb/options.h#L515)
      - being able to turn off rate-limiting for some WAL flushes but not all (e.g., for the WAL flush of a critical user write like a service's heartbeat)
      
      `WriteOptions::rate_limiter_options` only accepts `Env::IO_USER` and `Env::IO_TOTAL` currently, due to an implementation constraint.
      - The constraint is that we currently queue parallel writes (including WAL writes) based on a FIFO policy that does not factor rate limiter priority into this layer's scheduling. If we allowed lower priorities such as `Env::IO_HIGH/MID/LOW`, a write specified with a lower priority that arrives before one specified with a higher priority (even by a tiny bit in arrival time) would block the latter, leading to a "priority inversion" issue that contradicts what we promise for rate-limiting priority. Therefore we only allow `Env::IO_USER` and `Env::IO_TOTAL` until that scheduling is improved.
      
      A pre-requisite to this feature is to support operation-level rate limiting in `WritableFileWriter`, which is also included in this PR.
      
      **Summary:**
      - Renamed test suite `DBRateLimiterTest` to `DBRateLimiterOnReadTest` for adding a new test suite
      - Accept `rate_limiter_priority` in `WritableFileWriter`'s private and public write functions
      - Passed `WriteOptions::rate_limiter_options` to `WritableFileWriter` in the path of automatic WAL flush.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9607
      
      Test Plan:
      - Added new unit test to verify existing flush/compaction rate-limiting does not break, since `DBTest, RateLimitingTest` is disabled and current db-level rate-limiting tests focus on read only (e.g, `db_rate_limiter_test`, `DBTest2, RateLimitedCompactionReads`).
      - Added new unit test `DBRateLimiterOnWriteWALTest, AutoWalFlush`
      - `strace -ftt -e trace=write ./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -rate_limit_auto_wal_flush=1 -rate_limiter_bytes_per_sec=15 -rate_limiter_refill_period_us=1000000 -write_buffer_size=100000000 -disable_auto_compactions=1 -num=100`
         - verified that WAL flushes (i.e., the system call _write_) were chunked into 15 bytes and each _write_ was roughly 1 second apart
         - verified the chunking disappeared when `-rate_limit_auto_wal_flush=0`
      - crash test: `python3 tools/db_crashtest.py blackbox --disable_wal=0  --rate_limit_auto_wal_flush=1 --rate_limiter_bytes_per_sec=10485760 --interval=10` killed as normal
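      The chunking observed under strace can be modeled with a simple token-bucket sketch (an assumption about the mechanism, not the actual RocksDB RateLimiter code): with -rate_limiter_bytes_per_sec=15 and a 1-second refill period, each refill grants at most 15 bytes, so a larger WAL flush is split into 15-byte writes roughly 1 second apart.

      ```cpp
      #include <algorithm>
      #include <cstdint>
      #include <vector>

      // Simplified token-bucket model (assumption, not the actual RocksDB
      // RateLimiter): each refill period grants bytes_per_refill tokens, so
      // a write of total_bytes is split into chunks of at most that size,
      // one chunk (one system-call write) per refill period.
      std::vector<int64_t> ChunkedWriteSizes(int64_t total_bytes,
                                             int64_t bytes_per_refill) {
        std::vector<int64_t> chunks;
        while (total_bytes > 0) {
          int64_t chunk = std::min(total_bytes, bytes_per_refill);
          chunks.push_back(chunk);  // caller would block until the next refill
          total_bytes -= chunk;
        }
        return chunks;
      }
      ```

      For example, a 100-byte flush at 15 bytes per refill yields seven writes, the last one 10 bytes.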
      
      **Benchmarked on flush/compaction to ensure no performance regression:**
      - compaction with rate-limiting  (see table 1, avg over 1280-run):  pre-change: **915635 micros/op**; post-change:
         **907350 micros/op (improved by 0.106%)**
      ```
      #!/bin/bash
      TEST_TMPDIR=/dev/shm/testdb
      START=1
      NUM_DATA_ENTRY=8
      N=10
      
      rm -f compact_bmk_output.txt compact_bmk_output_2.txt dont_care_output.txt
      for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
      do
          NUM_RUN=$(($N*(2**($i-1))))
          for j in $(eval echo "{$START..$NUM_RUN}")
          do
             ./db_bench --benchmarks=fillrandom -db=$TEST_TMPDIR -disable_auto_compactions=1 -write_buffer_size=6710886 > dont_care_output.txt && ./db_bench --benchmarks=compact -use_existing_db=1 -db=$TEST_TMPDIR -level0_file_num_compaction_trigger=1 -rate_limiter_bytes_per_sec=100000000 | egrep 'compact'
          done > compact_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' compact_bmk_output.txt >> compact_bmk_output_2.txt
      done
      ```
      - compaction w/o rate-limiting  (see table 2, avg over 640-run):  pre-change: **822197 micros/op**; post-change: **823148 micros/op (regressed by 0.12%)**
      ```
      Same as above script, except that -rate_limiter_bytes_per_sec=0
      ```
      - flush with rate-limiting (see table 3, avg over 320-run, run on the [patch](https://github.com/hx235/rocksdb/commit/ee5c6023a9f6533fab9afdc681568daa21da4953) to augment current db_bench): pre-change: **745752 micros/op**; post-change: **745331 micros/op (improved by 0.06%)**
      ```
       #!/bin/bash
      TEST_TMPDIR=/dev/shm/testdb
      START=1
      NUM_DATA_ENTRY=8
      N=10
      
      rm -f flush_bmk_output.txt flush_bmk_output_2.txt
      
      for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
      do
          NUM_RUN=$(($N*(2**($i-1))))
          for j in $(eval echo "{$START..$NUM_RUN}")
          do
             ./db_bench -db=$TEST_TMPDIR -write_buffer_size=1048576000 -num=1000000 -rate_limiter_bytes_per_sec=100000000 -benchmarks=fillseq,flush | egrep 'flush'
          done > flush_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' flush_bmk_output.txt >> flush_bmk_output_2.txt
      done
      
      ```
      - flush w/o rate-limiting (see table 4, avg over 320-run, run on the [patch](https://github.com/hx235/rocksdb/commit/ee5c6023a9f6533fab9afdc681568daa21da4953) to augment current db_bench): pre-change: **487512 micros/op**, post-change: **485856 micros/op (improved by 0.34%)**
      ```
      Same as above script, except that -rate_limiter_bytes_per_sec=0
      ```
      
      | table 1 - compact with rate-limiting|
      #-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
      -- | -- | -- | -- | -- | --
      10 | 896978 | 16046.9 | 901242 | 15670.9 | 0.475373978
      20 | 893718 | 15813 | 886505 | 17544.7 | -0.8070778478
      40 | 900426 | 23882.2 | 894958 | 15104.5 | -0.6072681153
      80 | 906635 | 21761.5 | 903332 | 23948.3 | -0.3643141948
      160 | 898632 | 21098.9 | 907583 | 21145 | 0.9960695813
      320 | 905252 | 22785.5 | 908106 | 25325.5 | 0.3152713278
      640 | 905213 | 23598.6 | 906741 | 21370.5 | 0.1688000504
      **1280** | **908316** | **23533.1** | **907350** | **24626.8** | **-0.1063506533**
      average over #-run | 901896.25 | 21064.9625 | 901977.125 | 20592.025 | 0.008967217682
      
      | table 2 - compact w/o rate-limiting|
      #-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
      -- | -- | -- | -- | -- | --
      10 | 811211 | 26996.7 | 807586 | 28456.4 | -0.4468627768
      20 | 815465 | 14803.7 | 814608 | 28719.7 | -0.105093413
      40 | 809203 | 26187.1 | 797835 | 25492.1 | -1.404839082
      80 | 822088 | 28765.3 | 822192 | 32840.4 | 0.01265071379
      160 | 821719 | 36344.7 | 821664 | 29544.9 | -0.006693285661
      320 | 820921 | 27756.4 | 821403 | 28347.7 | 0.05871454135
      **640** | **822197** | **28960.6** | **823148** | **30055.1** | **0.1156657103**
      average over #-run | 818000 | 27100 | 815000 | 29100 | -0.25
      
      | table 3 - flush with rate-limiting|
      #-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op  (%)
      -- | -- | -- | -- | -- | --
      10 | 741721 | 11770.8 | 740345 | 5949.76 | -0.1855144994
      20 | 735169 | 3561.83 | 743199 | 9755.77 | 1.09226586
      40 | 743368 | 8891.03 | 742102 | 8683.22 | -0.1703059588
      80 | 742129 | 8148.51 | 743417 | 9631.58 | 0.1735547324
      160 | 749045 | 9757.21 | 746256 | 9191.86 | -0.3723407806
      **320** | **745752** | **9819.65** | **745331** | **9840.62** | **-0.0564530836**
      640 | 749006 | 11080.5 | 748173 | 10578.7 | -0.1112140624
      average over #-run | 743741.4286 | 9004.218571 | 744117.5714 | 9090.215714 | 0.05057441238
      
      | table 4 - flush w/o rate-limiting|
      #-run | (pre-change) avg micros/op | std micros/op | (post-change)  avg micros/op | std micros/op | change in avg micros/op (%)
      -- | -- | -- | -- | -- | --
      10 | 477283 | 24719.6 | 473864 | 12379 | -0.7163464863
      20 | 486743 | 20175.2 | 502296 | 23931.3 | 3.195320734
      40 | 482846 | 15309.2 | 489820 | 22259.5 | 1.444352858
      80 | 491490 | 21883.1 | 490071 | 23085.7 | -0.2887139108
      160 | 493347 | 28074.3 | 483609 | 21211.7 | -1.973864238
      **320** | **487512** | **21401.5** | **485856** | **22195.2** | **-0.3396839462**
      640 | 490307 | 25418.6 | 485435 | 22405.2 | -0.9936631539
      average over #-run | 487000 | 22400 | 487000 | 21100 | 0
      
      Reviewed By: ajkr
      
      Differential Revision: D34442441
      
      Pulled By: hx235
      
      fbshipit-source-id: 4790f13e1e5c0a95ae1d1cc93ffcf69dc6e78bdd
  4. 24 Feb 2022 (1 commit)
    • Add a secondary cache implementation based on LRUCache 1 (#9518) · f706a9c1
      Authored by Bo Wang
      Summary:
      **Summary:**
      RocksDB uses a block cache to reduce IO and make queries more efficient. The block cache is based on the LRU algorithm (LRUCache) and keeps objects containing uncompressed data, such as Block, ParsedFullFilterBlock etc. It allows the user to configure a second level cache (rocksdb::SecondaryCache) to extend the primary block cache by holding items evicted from it. Some of the major RocksDB users, like MyRocks, use direct IO and would like to use a primary block cache for uncompressed data and a secondary cache for compressed data. The latter allows us to mitigate the loss of the Linux page cache due to direct IO.
      
      This PR includes a concrete implementation of rocksdb::SecondaryCache that integrates with compression libraries such as LZ4 and implements an LRU cache to hold compressed blocks.
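      The lookup flow this enables can be sketched with a toy two-level cache (an illustration only; the names and structure below are assumptions, not the rocksdb::SecondaryCache API): a primary-cache miss falls through to the secondary cache, and a secondary hit is uncompressed and promoted back into the primary cache.

      ```cpp
      #include <optional>
      #include <string>
      #include <unordered_map>

      // Toy model of the lookup flow (names and structure are assumptions,
      // not the rocksdb::SecondaryCache API): the primary tier holds
      // uncompressed blocks, the secondary tier holds compressed ones.
      struct TwoLevelCache {
        std::unordered_map<std::string, std::string> primary;    // uncompressed
        std::unordered_map<std::string, std::string> secondary;  // compressed

        // Stand-in for LZ4 etc.; real decompression is out of scope here.
        static std::string Uncompress(const std::string& compressed) {
          return compressed;
        }

        std::optional<std::string> Lookup(const std::string& key) {
          auto p = primary.find(key);
          if (p != primary.end()) return p->second;
          auto s = secondary.find(key);
          if (s != secondary.end()) {
            std::string value = Uncompress(s->second);
            primary.emplace(key, value);  // promote the item back to primary
            secondary.erase(s);
            return value;
          }
          return std::nullopt;  // miss in both tiers
        }
      };
      ```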
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9518
      
      Test Plan:
      In this PR, lru_secondary_cache_test.cc includes the following tests:
      1. Unit tests for the secondary cache with either compression or no compression, such as basic tests and failure tests.
      2. Integration tests with both the primary cache and this secondary cache.
      
      **Follow Up:**
      
      1. Statistics (e.g. compression ratio) will be added in another PR.
      2. Once this implementation is ready, I will do some shadow testing and benchmarking with UDB to measure the impact.
      
      Reviewed By: anand1976
      
      Differential Revision: D34430930
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 218d78b672a2f914856d8a90ff32f2f5b5043ded
  5. 18 Feb 2022 (1 commit)
    • Add record to set WAL compression type if enabled (#9556) · 39b0d921
      Authored by Siddhartha Roychowdhury
      Summary:
      When WAL compression is enabled, add a record (a new record type) to store the compression type, indicating that all subsequent records are compressed. The log reader will store the compression type when this record is encountered and use the type to uncompress the subsequent records. Compression and decompression will be implemented in subsequent diffs.
      Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled.
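      The reader-side behavior described above can be sketched like this (a toy model; the record type name and log layout are assumptions, not the actual log format):

      ```cpp
      #include <string>
      #include <vector>

      // Toy model (assumption, not the actual WAL format): the writer emits
      // one kSetCompressionType record up front; the reader remembers the
      // type and applies it to every subsequent record.
      enum class RecordType { kSetCompressionType, kData };
      struct Record {
        RecordType type;
        std::string payload;
      };

      std::string ReadLog(const std::vector<Record>& log) {
        std::string compression = "none";
        std::string out;
        for (const Record& r : log) {
          if (r.type == RecordType::kSetCompressionType) {
            compression = r.payload;  // all later records use this type
          } else {
            out += "[" + compression + "]" + r.payload;  // stand-in for decode
          }
        }
        return out;
      }
      ```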
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556
      
      Reviewed By: anand1976
      
      Differential Revision: D34308216
      
      Pulled By: sidroyc
      
      fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705
  6. 17 Feb 2022 (1 commit)
    • Add rate limiter priority to ReadOptions (#9424) · babe56dd
      Authored by Andrew Kryczka
      Summary:
      Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.
      
      `RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.
      
      There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases where I believe it would be good to let the user control the priority some day (e.g., file footer reads), and no TODO in cases where I believe it doesn't matter (e.g., trace file reads).
      
      The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.
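      The resulting read-path decision can be sketched as follows (a simplified model under assumptions, not the actual RandomAccessFileReader code): a read is charged against the rate limiter only when one is configured and the caller passed a priority other than Env::IO_TOTAL, which stands for "no limit".

      ```cpp
      // Simplified model (assumption, not the actual RocksDB code) of the
      // decision RandomAccessFileReader::Read() makes after this change:
      // charge the read against the rate limiter only when one is configured
      // and the caller's priority is not IO_TOTAL (the "no limit" sentinel).
      enum class IOPriority { kLow, kMid, kHigh, kUser, kTotal };

      bool ShouldRateLimitRead(IOPriority pri, bool rate_limiter_configured) {
        return rate_limiter_configured && pri != IOPriority::kTotal;
      }
      ```

      Compaction reads now take this path too, by setting the priority to the equivalent of Env::IO_LOW on their ReadOptions.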
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424
      
      Test Plan:
      - new unit tests
      - new benchmarks on a ~50MB database with a 1MB/s read rate limit and 100ms refill interval; verified with strace that reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
        - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
        - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
      - crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`
      
      Reviewed By: hx235
      
      Differential Revision: D33747386
      
      Pulled By: ajkr
      
      fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
  7. 12 Feb 2022 (1 commit)
    • Hide deprecated, inefficient block-based filter from public API (#9535) · 479eb1aa
      Authored by Peter Dillinger
      Summary:
      This change removes the ability to configure the deprecated,
      inefficient block-based filter in the public API. Options that would
      have enabled it now use "full" (and optionally partitioned) filters.
      Existing block-based filters can still be read and used, and a "back
      door" way to build them still exists, for testing and in case of trouble.
      
      About the only way this removal would cause an issue for users is if
      temporary memory for filter construction greatly increases. In
      HISTORY.md we suggest a few possible mitigations: partitioned filters,
      smaller SST files, or setting reserve_table_builder_memory=true.
      
      Or users who have customized a FilterPolicy using the
      CreateFilter/KeyMayMatch mechanism removed in https://github.com/facebook/rocksdb/issues/9501 will have to upgrade
      their code. (It's long past time for people to move to the new
      builder/reader customization interface.)
      
      This change also introduces some internal-use-only configuration strings
      for testing specific filter implementations while bypassing some
      compatibility / intelligence logic. This is intended to hint at a path
      toward making FilterPolicy Customizable, but it also gives us a "back
      door" way to configure block-based filter.
      
      Aside: updated db_bench so that -readonly implies -use_existing_db
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9535
      
      Test Plan:
      Unit tests updated. Specifically,
      
      * BlockBasedTableTest.BlockReadCountTest is tweaked to validate the back
      door configuration interface and ignoring of `use_block_based_builder`.
      * BlockBasedTableTest.TracingGetTest is migrated from testing the
      block-based filter access pattern to the full filter access pattern, by
      re-ordering some things.
      * Options test (pretty self-explanatory)
      
      Performance test - create with `./db_bench -db=/dev/shm/rocksdb1 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0` with and without `-use_block_based_filter`, which creates a DB with 21 SST files in L0. Read with `./db_bench -db=/dev/shm/rocksdb1 -readonly -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=30`
      
      Without -use_block_based_filter: readrandom 464 ops/sec, 689280 KB DB
      With -use_block_based_filter: readrandom 169 ops/sec, 690996 KB DB
      No consistent difference with fillrandom
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D34153871
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 31f4a933c542f8f09aca47fa64aec67832a69738
  8. 09 Feb 2022 (2 commits)
  9. 04 Feb 2022 (1 commit)
    • Introduce a CountedFileSystem for counting file operations (#9283) · aae30937
      Authored by mrambacher
      Summary:
      Added a CountedFileSystem that tracks a number of file operations (opens, closes, deletes, renames, flushes, syncs, fsyncs, reads, writes). This class was based on the ReportFileOpEnv from db_bench.
      
      This is a stepping stone PR to be able to change the SpecialEnv into a SpecialFileSystem, where several of the file varieties wish to do operation counting.
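      The counting idea can be sketched like this (an illustration only; the names below are assumptions, not the actual CountedFileSystem interface): each file operation bumps a shared counter and then performs the underlying operation.

      ```cpp
      #include <algorithm>
      #include <atomic>
      #include <cstddef>
      #include <string>

      // Sketch (assumption: not the actual CountedFileSystem interface) of
      // the wrapping idea: count the operation, then delegate to the
      // underlying file.
      struct OpCounters {
        std::atomic<long> reads{0};
        std::atomic<long> writes{0};
      };

      struct CountedFile {
        OpCounters* counters;
        std::string contents;  // stand-in for the wrapped file

        size_t Read(size_t n) {
          counters->reads.fetch_add(1);  // count first, then delegate
          return std::min(n, contents.size());
        }
        void Write(const std::string& data) {
          counters->writes.fetch_add(1);
          contents += data;
        }
      };
      ```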
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9283
      
      Reviewed By: pdillinger
      
      Differential Revision: D33062004
      
      Pulled By: mrambacher
      
      fbshipit-source-id: d0d297a7fb9c48c06cbf685e5fa755c27193b6f5
  10. 02 Feb 2022 (1 commit)
    • Revise APIs related to user-defined timestamp (#8946) · 3122cb43
      Authored by Yanqin Jin
      Summary:
      ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
      Namely, `WriteOptions` should not include information about "what-to-write", but should just
      include information about "how-to-write".
      
      According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
      this PR removes `WriteOptions::timestamp` for compliance.
      After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
      of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
      `SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
      made me reconsider doing it in another PR (maybe).
      
      For better checking and returning errors early, we also add a new set of APIs to `WriteBatch` that take
      extra `timestamp` information when writing to `WriteBatch`es.
      This set of APIs in `WriteBatchWithIndex` is currently not supported, and is on our TODO list.
      
      Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
      `WriteBatch::UpdateTimestamps()`, since this method requires that all keys already have space
      allocated for timestamps, and multiple timestamps can be updated.
      
      The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
      size of the default column family. This will be used to allocate space when calling APIs that do not
      specify a column family handle.
      
      Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
      some assertions about timestamp to returning Status code.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946
      
      Test Plan:
      make check
      ./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
      ./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0
      
      Make sure there is no perf regression by running the following
      ```
      ./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
      ```
      
      Before this PR
      ```
      DB path: [/dev/shm/rocksdb]
      fillrandom   :       1.831 micros/op 546235 ops/sec;   60.4 MB/s
      ```
      After this PR
      ```
      DB path: [/dev/shm/rocksdb]
      fillrandom   :       1.820 micros/op 549404 ops/sec;   60.8 MB/s
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D33721359
      
      Pulled By: riversand963
      
      fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
  11. 01 Feb 2022 (1 commit)
    • Ignore `total_order_seek` in DB::Get (#9427) · f6d7ec1d
      Authored by Peter Dillinger
      Summary:
      Apparently setting total_order_seek=true for DB::Get was
      intended to allow accurate read semantics if the current prefix
      extractor doesn't match what was used to generate SST files on
      disk. But since prefix_extractor was made a mutable option in 5.14.0, we
      have been able to detect this case and provide the correct semantics
      regardless of the total_order_seek option. Since that time, the option
      has only made Get() slower in a reasonably common case: prefix_extractor
      unchanged and whole_key_filtering=false.
      
      So this change primarily removes unnecessary effect of
      total_order_seek on Get. Also cleans up some related comments.
      
      Also adds a -total_order_seek option to db_bench and canonicalizes
      handling of ReadOptions in db_bench so that command line options have
      the expected association with library features. (There is potential
      for change in regression test behavior, but the old behavior is likely
      indefensible, or some other inconsistency would need to be fixed.)
      
      TODO in follow-up work: there should be no reason for Get() to depend on
      current prefix extractor at all.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9427
      
      Test Plan:
      Unit tests updated.
      
      Performance (using db_bench update)
      
      Create DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12 -whole_key_filtering=0`
      
      Test with and without `-total_order_seek` on `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=readrandom -num=10000000 -duration=40 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`
      
      Before this change, total_order_seek=false: 25188 ops/sec
      Before this change, total_order_seek=true:   1222 ops/sec (~20x slower)
      
      After this change, total_order_seek=false: 24570 ops/sec
      After this change, total_order_seek=true:  25012 ops/sec (indistinguishable)
      
      Reviewed By: siying
      
      Differential Revision: D33753458
      
      Pulled By: pdillinger
      
      fbshipit-source-id: bf892f34907a5e407d9c40bd4d42f0adbcbe0014
  12. 29 Jan 2022 (1 commit)
    • Remove deprecated API AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds (#9455) · 42cca28e
      Authored by Hui Xiao
      Summary:
      **Context/Summary:**
      AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds has been marked as deprecated and it's time to actually remove the code.
      - Keep `rate_limit_delay_max_milliseconds` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file that still contains this option (e.g., an old option file generated by RocksDB before the deprecation)
      - Keep `rate_limit_delay_max_milliseconds` under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9455
      
      Test Plan: Rely on my eyeball and CI
      
      Reviewed By: ajkr
      
      Differential Revision: D33811664
      
      Pulled By: hx235
      
      fbshipit-source-id: 866859427fe710354a90f1095057f80116365ff0
  13. 28 Jan 2022 (2 commits)
    • Remove unused API base_background_compactions (#9462) · 22321e10
      Authored by Jay Zhuang
      Summary:
      The API was deprecated a long time ago. Clean up the codebase by removing it.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9462
      
      Test Plan: CI, fake release: D33835220
      
      Reviewed By: riversand963
      
      Differential Revision: D33835103
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 6d2dc12c8e7fdbe2700865a3e61f0e3f78bd8184
    • Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit (#9452) · 1e0e883c
      Authored by Hui Xiao
      Summary:
      **Context/Summary:**
      AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code.
      - Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file that still contains these options (e.g., an old option file generated by RocksDB before the deprecation)
      - Keep `soft_rate_limit`/`hard_rate_limit` under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452
      
      Test Plan: Rely on my eyeball and CI
      
      Reviewed By: ajkr
      
      Differential Revision: D33804938
      
      Pulled By: hx235
      
      fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945
      1e0e883c
  14. 27 January 2022 (1 commit)
  15. 25 January 2022 (1 commit)
  16. 19 January 2022 (1 commit)
  17. 05 January 2022 (1 commit)
  18. 30 December 2021 (1 commit)
    • S
      Improve SimulatedHybridFileSystem (#9301) · a931bacf
      Committed by sdong
      Summary:
      Several improvements to SimulatedHybridFileSystem:
      (1) Allow a mode where all I/Os to all files simulate HDD. This can be enabled in db_bench using -simulate_hdd
      (2) Latency calculation is slightly more accurate
      (3) Allow simulating more than one HDD spindle.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9301
      
      Test Plan: Run db_bench and observe the results are reasonable.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D33141662
      
      fbshipit-source-id: b736e58c4ba910d06899cc9ccec79b628275f4fa
      a931bacf
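      A rough sketch of the spindle idea in (3): with N simulated spindles, up to N requests are serviced concurrently and the rest queue behind them. The constants and the queueing model below are assumptions for illustration, not the actual SimulatedHybridFileSystem latency math.

      ```cpp
      #include <cassert>
      #include <cstddef>

      // Assumed per-request HDD cost: one seek plus transfer time; requests
      // beyond the spindle count wait out full extra service "waves".
      double SimulatedHddLatencyUs(size_t bytes, int requests_ahead, int spindles) {
        const double kSeekUs = 8000.0;     // assumed average seek latency
        const double kBytesPerUs = 100.0;  // assumed ~100 MB/s transfer rate
        double service = kSeekUs + static_cast<double>(bytes) / kBytesPerUs;
        int waves_waited = requests_ahead / spindles;  // queueing behind busy spindles
        return service * (1 + waves_waited);
      }
      ```

      Under this model, doubling the spindle count roughly halves the queueing delay for a backlog of requests, which is the effect the new knob is meant to expose.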
  19. 30 November 2021 (1 commit)
    • Y
      Fix build for msvc (#9230) · 42fef022
      Committed by Yanqin Jin
      Summary:
      Test plan
      
      With Visual Studio 2017.
      ```
      cd rocksdb
      mkdir build && cd build
      cmake -G "Visual Studio 15 Win64" -DWITH_GFLAGS=1 ..
      MSBuild rocksdb.sln /m /TARGET:cache_bench /TARGET:db_bench /TARGET:db_stress
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9230
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D32705095
      
      Pulled By: riversand963
      
      fbshipit-source-id: 101e3533f5178b24c0535ddc47a39347ccfcf92c
      42fef022
  20. 20 November 2021 (1 commit)
    • L
      Support readahead during compaction for blob files (#9187) · dc5de45a
      Committed by Levi Tamasi
      Summary:
      The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
      that can be used to enable prefetching data from blob files during compaction.
      This is important when using storage with higher latencies like HDDs or remote filesystems.
      If enabled, prefetching is used for all cases when blobs are read during compaction,
      namely garbage collection, compaction filters (when the existing value has to be read from
      a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187
      
      Test Plan: Ran `make check` and the stress/crash test.
      
      Reviewed By: riversand963
      
      Differential Revision: D32565512
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
      dc5de45a
  21. 11 November 2021 (1 commit)
    • A
      Reuse internal auto readahead_size at each level (except L0) for iterations (#9056) · 17ce1ca4
      Committed by Akanksha Mahajan
      Summary:
      RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if the user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However, within each level, when the iterator moves to the next file, readahead_size starts again from 8KB.
      
      This PR introduces a new ReadOption "adaptive_readahead" which, when set to true, maintains readahead_size at each level. So when the iterator moves from one file to another, the new file's readahead_size continues from the previous file's readahead_size instead of starting from scratch. However, if reads are not sequential, it falls back to 8KB (the default) with no prefetching for that block.
      
      1. If a block is found in the cache but was eligible for prefetch (i.e., the block wasn't in RocksDB's prefetch buffer), readahead_size decreases by 8KB.
      2. It maintains readahead_size for L1 - Ln levels.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056
      
      Test Plan:
      Added new unit tests
      Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true, and there was no regression when the new feature is enabled.
      
      Reviewed By: anand1976
      
      Differential Revision: D31773640
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98
      17ce1ca4
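      The growth, carry-over, and shrink rules described above can be sketched as below (8KB start, doubling up to max_auto_readahead_size, minus 8KB on an eligible cache hit). This is a simplified model, not the actual prefetch buffer code.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstddef>

      constexpr size_t kInitAutoReadaheadSize = 8 * 1024;  // assumed 8KB start

      // Doubles on each additional sequential read, capped at the configured max.
      size_t GrowReadahead(size_t cur, size_t max_auto_readahead_size) {
        return std::min(cur * 2, max_auto_readahead_size);
      }

      // With adaptive_readahead, a block found in the block cache that was
      // eligible for prefetch shrinks the readahead by 8KB (floored at 8KB here).
      size_t ShrinkOnCacheHit(size_t cur) {
        return cur > 2 * kInitAutoReadaheadSize ? cur - kInitAutoReadaheadSize
                                                : kInitAutoReadaheadSize;
      }
      ```

      The behavioral change of this PR is what happens between files: when the iterator moves to the next file in the same level, the current value is carried over rather than reset to kInitAutoReadaheadSize.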
  22. 05 November 2021 (1 commit)
  23. 02 November 2021 (1 commit)
    • S
      Try to start TTL compactions earlier when kMinOverlappingRatio is used (#8749) · a2b9be42
      Committed by sdong
      Summary:
      Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate that by picking those files earlier in the normal compaction-picking process. This is only implemented for kMinOverlappingRatio with leveled compaction, as it is the default value and changing the other compaction styles is more complicated.
      
      When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in the normal compaction-picking process, with the hope that by the time TTL is reached, very few extra compactions are needed.
      
      In order for this to work, another change is made: during a compaction, if an output-level file is older than ttl/2, cut output files based on the original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level and new data is merged from the upper level, the new data falling into this range isn't reset with the old timestamp. Without this change, in many cases most files in a level would keep the old timestamp even though they contain newer data, and we would be stuck in that state.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749
      
      Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D30735261
      
      fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e
      a2b9be42
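      The age-based boosting can be sketched as a priority multiplier that kicks in once a file is older than ttl/2. The linear scaling below is an assumption for illustration, not the exact formula used in compaction picking.

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Returns a compaction-priority multiplier: 1.0 below ttl/2, growing
      // linearly to 2.0 at full TTL (the scaling factor is illustrative).
      double TtlBoost(uint64_t file_age_sec, uint64_t ttl_sec) {
        if (ttl_sec == 0 || file_age_sec <= ttl_sec / 2) {
          return 1.0;  // young file (or TTL disabled): no boost
        }
        double over = static_cast<double>(file_age_sec - ttl_sec / 2);
        return 1.0 + over / (ttl_sec / 2);
      }
      ```

      A picker multiplying each file's normal kMinOverlappingRatio score by such a factor would drain aging files gradually instead of in a burst at TTL expiry, which is the stated goal.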
  24. 29 October 2021 (1 commit)
    • P
      Implement XXH3 block checksum type (#9069) · a7d4bea4
      Committed by Peter Dillinger
      Summary:
      XXH3 - latest hash function that is extremely fast on large
      data, easily faster than crc32c on most any x86_64 hardware. In
      integrating this hash function, I have handled the compression type byte
      in a non-standard way to avoid using the streaming API (extra data
      movement and active code size because of hash function complexity). This
      approach got a thumbs-up from Yann Collet.
      
      Existing functionality change:
      * reject bad ChecksumType in options with InvalidArgument
      
      This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
      likely to be handled through different configuration than ChecksumType.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069
      
      Test Plan:
      tests updated, and substantially expanded. Unit tests now check
      that we don't accidentally change the values generated by the checksum
      algorithms ("schema test") and that we properly handle
      invalid/unrecognized checksum types in options or in file footer.
      
      DBTestBase::ChangeOptions (etc.) updated from two to one configuration
      changing from default CRC32c ChecksumType. The point of this test code
      is to detect possible interactions among features, and the likelihood of
      some bad interaction being detected by including configurations other
      than XXH3 and CRC32c--and then not detected by stress/crash test--is
      extremely low.
      
      Stress/crash test also updated (manual run long enough to see it accepts
      new checksum type). db_bench also updated for microbenchmarking
      checksums.
      
       ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
      
      ./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
      crc32c       :       0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
      xxhash       :       0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
      xxhash64     :       0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
      xxh3         :       0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
      crc32c       :       0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
      xxhash       :       0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
      xxhash64     :       0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
      xxh3         :       0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
      crc32c       :       0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
      xxhash       :       0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
      xxhash64     :       0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
      xxh3         :       0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)
      
      As you can see, especially once warmed up, xxh3 is fastest.
      
       ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
      
      Test
      
          for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done
      
      Results (ops/sec)
      
          for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
      
      results-0 252118 # kNoChecksum
      results-1 251588 # kCRC32c
      results-2 251863 # kxxHash
      results-3 252016 # kxxHash64
      results-4 252038 # kXXH3
      
      Reviewed By: mrambacher
      
      Differential Revision: D31905249
      
      Pulled By: pdillinger
      
      fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
      a7d4bea4
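      As a sanity check on the microbenchmark table above, micros/op converts to MB/s through the 4096 bytes hashed per op:

      ```cpp
      #include <cassert>
      #include <cmath>
      #include <cstddef>

      // Converts a micros/op figure into MB/s given the bytes hashed per op
      // (4096 in the table above). "MB" here is 2^20 bytes.
      double ChecksumMBPerSec(double micros_per_op, size_t bytes_per_op = 4096) {
        double ops_per_sec = 1e6 / micros_per_op;
        return ops_per_sec * bytes_per_op / (1024.0 * 1024.0);
      }
      ```

      For example, 0.171 micros/op for xxh3 works out to roughly 22.8 GB/s, consistent with the ~22884 MB/s line (the table's figures use the exact measured ops/sec, so they differ slightly from the rounded micros/op).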
  25. 21 October 2021 (1 commit)
    • S
      Incremental Space Amp Compactions in Universal Style (#8655) · c66b4429
      Committed by sdong
      Summary:
      This commit introduces incremental compaction in universal style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implementation simply picks up files of about max_compaction_bytes in total size to compact, and executes the compaction if the penalty is not too big. More optimizations can be done in the future, e.g., prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce the frequency of full compactions, although it can introduce a penalty.
      
      In order to cut files more efficiently so that more files from upper levels can be included, the SST file cutting threshold (for the current file plus overlapping parent-level files) is set to 1.5X of the target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 The number of files indeed increases, but it is not out of control.
      
      Two set of write benchmarks are run:
      1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163
      2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655
      
      Test Plan: Add unit tests to the functionality.
      
      Reviewed By: ajkr
      
      Differential Revision: D31787034
      
      fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
      c66b4429
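      The file-picking idea (grab files totaling about max_compaction_bytes) can be sketched greedily. The real picker also weighs the space-amp penalty and overlapping parent-level files, which are omitted here.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <vector>

      // Greedily picks a prefix of candidate files whose total size stays
      // within max_compaction_bytes (always takes at least one file).
      std::vector<size_t> PickIncremental(const std::vector<uint64_t>& file_sizes,
                                          uint64_t max_compaction_bytes) {
        std::vector<size_t> picked;
        uint64_t total = 0;
        for (size_t i = 0; i < file_sizes.size(); ++i) {
          if (!picked.empty() && total + file_sizes[i] > max_compaction_bytes) {
            break;  // adding this file would exceed the budget
          }
          total += file_sizes[i];
          picked.push_back(i);
        }
        return picked;
      }
      ```

      Bounding each incremental compaction this way is what lets the scheme replace occasional huge full compactions with a stream of small, cheap ones.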
  26. 15 October 2021 (1 commit)
  27. 12 October 2021 (1 commit)
    • L
      Make it possible to force the garbage collection of the oldest blob files (#8994) · 3e1bf771
      Committed by Levi Tamasi
      Summary:
      The current BlobDB garbage collection logic works by relocating the valid
      blobs from the oldest blob files as they are encountered during compaction,
      and cleaning up blob files once they contain nothing but garbage. However,
      with sufficiently skewed workloads, it is theoretically possible to end up in a
      situation when few or no compactions get scheduled for the SST files that contain
      references to the oldest blob files, which can lead to increased space amp due
      to the lack of GC.
      
      In order to efficiently handle such workloads, the patch adds a new BlobDB
      configuration option called `blob_garbage_collection_force_threshold`,
      which signals to BlobDB to schedule targeted compactions for the SST files
      that keep alive the oldest batch of blob files if the overall ratio of garbage in
      the given blob files meets the threshold *and* all the given blob files are
      eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
      if the new option is set to 0.9, targeted compactions will get scheduled if the
      sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
      oldest blob files, assuming all affected blob files are below the age-based cutoff.)
      The net result of these targeted compactions is that the valid blobs in the oldest
      blob files are relocated and the oldest blob files themselves cleaned up (since
      *all* SST files that rely on them get compacted away).
      
      These targeted compactions are similar to periodic compactions in the sense
      that they force certain SST files that otherwise would not get picked up to undergo
      compaction and also in the sense that instead of merging files from multiple levels,
      they target a single file. (Note: such compactions might still include neighboring files
      from the same level due to the need of having a "clean cut" boundary but they never
      include any files from any other level.)
      
      This functionality is currently only supported with the leveled compaction style
      and is inactive by default (since the default value is set to 1.0, i.e. 100%).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994
      
      Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.
      
      Reviewed By: riversand963
      
      Differential Revision: D31489850
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
      3e1bf771
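      The threshold check itself reduces to a garbage-ratio comparison over the oldest batch of blob files; a minimal sketch, with the `blob_garbage_collection_age_cutoff` eligibility check assumed to have happened already:

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Returns true if targeted compactions should be scheduled: the overall
      // garbage ratio of the oldest eligible blob files meets the force
      // threshold.
      bool ShouldForceBlobGc(uint64_t garbage_bytes, uint64_t total_bytes,
                             double force_threshold) {
        if (total_bytes == 0) return false;
        return static_cast<double>(garbage_bytes) / total_bytes >= force_threshold;
      }
      ```

      With the default threshold of 1.0, this fires only when the files are pure garbage, i.e., the feature is effectively inactive, matching the description above.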
  28. 01 October 2021 (1 commit)
  29. 19 September 2021 (1 commit)
  30. 18 September 2021 (1 commit)
  31. 11 September 2021 (1 commit)
  32. 08 September 2021 (1 commit)
    • M
      Make MemTableRepFactory into a Customizable class (#8419) · beed8647
      Committed by mrambacher
      Summary:
      This PR does the following:
      -> Makes MemTableRepFactory a Customizable class, creatable/configurable via CreateFromString
      -> Makes the existing implementations compatible with configurations
      -> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
      
      New tests were added to validate the functionality and all existing tests pass.  db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D29558961
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
      beed8647
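      The CreateFromString pattern boils down to resolving an id string to a concrete implementation. The tiny registry below illustrates the idea only; it is not the real Customizable/ObjectRegistry machinery.

      ```cpp
      #include <cassert>
      #include <memory>
      #include <string>

      struct MemTableRepBase {
        virtual ~MemTableRepBase() = default;
        virtual const char* Name() const = 0;
      };
      struct SkipListRep : MemTableRepBase {
        const char* Name() const override { return "skip_list"; }
      };
      struct VectorRep : MemTableRepBase {
        const char* Name() const override { return "vector"; }
      };

      // Minimal stand-in for Customizable::CreateFromString: resolve an id to
      // an implementation; unknown ids fail by returning nullptr.
      std::unique_ptr<MemTableRepBase> CreateRepFromString(const std::string& id) {
        if (id == "skip_list") return std::make_unique<SkipListRep>();
        if (id == "vector") return std::make_unique<VectorRep>();
        return nullptr;
      }
      ```

      The same shape is what lets tools like db_bench select a memtable representation from a plain string flag.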
  33. 19 August 2021 (1 commit)
    • M
      Allow Replayer to report the results of TraceRecords. (#8657) · d10801e9
      Committed by Merlin Mao
      Summary:
      `Replayer::Execute()` can directly return the result (e.g., request latency, DB::Get() return code, returned value, etc.)
      `Replayer::Replay()` reports the results via a callback function.
      
      New interface:
      `TraceRecordResult` in "rocksdb/trace_record_result.h".
      
      `DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657
      
      Reviewed By: ajkr
      
      Differential Revision: D30290216
      
      Pulled By: autopear
      
      fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
      d10801e9
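      The callback-reporting style described above can be sketched like this. The types and signatures are illustrative stand-ins, not the actual interface in "rocksdb/trace_record_result.h".

      ```cpp
      #include <cassert>
      #include <functional>
      #include <vector>

      // Illustrative stand-in for a per-record execution result.
      struct FakeTraceRecordResult {
        int status_code;    // e.g., 0 for OK
        double latency_us;  // request latency
      };

      // Replay-style API sketch: execute every record and report each result
      // through a caller-supplied callback, mirroring the Replayer::Replay()
      // behavior described above.
      void ReplayAll(const std::vector<FakeTraceRecordResult>& trace,
                     const std::function<void(const FakeTraceRecordResult&)>& cb) {
        for (const auto& rec : trace) {
          cb(rec);  // a real replayer would execute the record first
        }
      }
      ```

      The callback form lets callers aggregate latencies or error counts without the replayer needing to know what they will do with each result.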
  34. 12 August 2021 (1 commit)
    • M
      Make TraceRecord and Replayer public (#8611) · f58d2767
      Committed by Merlin Mao
      Summary:
      New public interfaces:
      `TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
      `Replayer`, available in `rocksdb/utilities/replayer.h`.
      
      User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.
      
      Unit tests:
      - `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
      - `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611
      
      Reviewed By: ajkr
      
      Differential Revision: D30266329
      
      Pulled By: autopear
      
      fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
      f58d2767
  35. 11 August 2021 (1 commit)
    • B
      Memtable sampling for mempurge heuristic. (#8628) · e3a96c48
      Committed by Baptiste Lemaire
      Summary:
      Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
      This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then comparing this useful-payload estimate with the `double experimental_mempurge_threshold` value.
      Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
      At the moment, a `double experimental_mempurge_threshold` value other than 0.0 or `DBL_MAX` is only supported with the `SkipList` memtable representation.
      Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
      The unit tests have been readapted to support this new API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628
      
      Reviewed By: pdillinger
      
      Differential Revision: D30149315
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
      e3a96c48
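      The decision step can be sketched as below. The exact comparison is not spelled out in the summary, so treat this as an assumed model that merely matches the stated endpoints (0.0 never purges, DBL_MAX always does).

      ```cpp
      #include <cassert>
      #include <cfloat>

      // Assumed decision rule: purge when the sampled useful payload is below
      // threshold * memtable size. threshold 0.0 disables; DBL_MAX always
      // purges.
      bool DecideMemPurge(double estimated_useful_bytes, double memtable_bytes,
                          double threshold) {
        if (threshold <= 0.0 || memtable_bytes <= 0.0) return false;
        return estimated_useful_bytes < threshold * memtable_bytes;
      }
      ```

      The sampling described above (via the new `SkipList::Iterator::RandomSeek()`) is what produces the `estimated_useful_bytes` input cheaply, without scanning the whole memtable.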
  36. 10 August 2021 (2 commits)
    • A
      Simplify GenericRateLimiter algorithm (#8602) · 82b81dc8
      Committed by Andrew Kryczka
      Summary:
      `GenericRateLimiter` slow path handles requests that cannot be satisfied
      immediately.  Such requests enter a queue, and their thread stays in `Request()`
      until they are granted or the rate limiter is stopped.  These threads are
      responsible for unblocking themselves.  The work to do so is split into two main
      duties.
      
      (1) Waiting for the next refill time.
      (2) Refilling the bytes and granting requests.
      
      Prior to this PR, the slow path logic involved a leader election algorithm to
      pick one thread to perform (1) followed by (2).  It elected the thread whose
      request was at the front of the highest priority non-empty queue since that
      request was most likely to be granted.  This algorithm was efficient in terms of
      reducing intermediate wakeups, which is a thread waking up only to resume
      waiting after finding its request is not granted.  However, the conceptual
      complexity of this algorithm was too high.  It took me a long time to draw a
      timeline to understand how it works for just one edge case yet there were so
      many.
      
      This PR drops the leader election to reduce conceptual complexity.  Now, the two
      duties can be performed by whichever thread acquires the lock first.  The risk
      of this change is increasing the number of intermediate wakeups, however, we
      took steps to mitigate that.
      
      - `wait_until_refill_pending_` flag ensures only one thread performs (1). This prevents the thundering herd problem at the next refill time. The remaining threads wait on their condition variable with an unbounded duration -- thus we must remember to notify them to ensure forward progress.
      - (1) is typically done by a thread at the front of a queue. This is trivial when the queues are initially empty as the first choice that arrives must be the only entry in its queue. When queues are initially non-empty, we achieve this by having (2) notify a thread at the front of a queue (preferring higher priority) to perform the next duty.
      - We do not require any additional wakeup for (2). Typically it will just be done by the thread that finished (1).
      Combined, the second and third bullet points above suggest the refill/granting
      will typically be done by a request at the front of its queue.  This is
      important because one wakeup is saved when a granted request happens to be in an
      already running thread.
      
      Note there are a few cases that still lead to intermediate wakeup, however.  The
      first two are existing issues that also apply to the old algorithm, however, the
      third (including both subpoints) is new.
      
      - No request may be granted (only possible when rate limit dynamically decreases).
      - Requests from a different queue may be granted.
      - (2) may be run by a non-front request thread causing it to not be granted even if some requests in that same queue are granted. It can happen for a couple of (unlikely) reasons.
        - A new request may sneak in and grab the lock at the refill time, before the thread finishing (1) can wake up and grab it.
        - A new request may sneak in and grab the lock and execute (1) before (2)'s chosen candidate can wake up and grab the lock. Then that non-front request thread performing (1) can carry over to perform (2).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8602
      
      Test Plan:
      - Use existing tests. The edge cases listed in the comment are all performance related; I could not really think of any related to correctness. The logic looks the same whether a thread wakes up/finishes its work early/on-time/late, or whether the thread is chosen vs. "steals" the work.
      - Verified write throughput and CPU overhead are basically the same with and without this change, even in a rate limiter heavy workload:
      
      Test command:
      ```
      $ rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=fillrandom -num_multi_db=64 -num_low_pri_threads=64 -num_high_pri_threads=64 -write_buffer_size=262144 -target_file_size_base=262144 -max_bytes_for_level_base=1048576 -rate_limiter_bytes_per_sec=16777216 -key_size=24 -value_size=1000 -num=10000 -compression_type=none -rate_limiter_refill_period_us=1000
      ```
      
      Results before this PR:
      
      ```
      fillrandom   :     108.463 micros/op 9219 ops/sec;    9.0 MB/s
      7.40user 8.84system 1:26.20elapsed 18%CPU (0avgtext+0avgdata 256140maxresident)k
      ```
      
      Results after this PR:
      
      ```
      fillrandom   :     108.108 micros/op 9250 ops/sec;    9.0 MB/s
      7.45user 8.23system 1:26.68elapsed 18%CPU (0avgtext+0avgdata 255688maxresident)k
      ```
      
      Reviewed By: hx235
      
      Differential Revision: D30048013
      
      Pulled By: ajkr
      
      fbshipit-source-id: 6741bba9d9dfbccab359806d725105817fef818b
      82b81dc8
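      Stripped of threads and condition variables, the two duties reduce to a token-bucket refill (1) followed by granting from the front of the queue (2). The single-priority sketch below is far simpler than the real mutex/condvar-based GenericRateLimiter, but shows why a front-of-queue request is the natural grantee.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <deque>

      // Single-priority token bucket: Refill() is duty (1), GrantFront() is a
      // simplified duty (2) that grants waiting requests in FIFO order until
      // the front request no longer fits in the available bytes.
      struct SimpleRateLimiter {
        int64_t available = 0;
        int64_t refill_bytes_per_period = 0;
        std::deque<int64_t> waiting;  // pending request sizes, FIFO

        void Refill() { available += refill_bytes_per_period; }

        int GrantFront() {  // returns how many requests were granted
          int granted = 0;
          while (!waiting.empty() && waiting.front() <= available) {
            available -= waiting.front();
            waiting.pop_front();
            ++granted;
          }
          return granted;
        }
      };
      ```

      In the real implementation both duties run under one mutex, so "whichever thread acquires the lock first" can safely perform them back to back, which is exactly the simplification this PR makes.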
    • S
      Move old files to warm tier in FIFO compactions (#8310) · e7c24168
      Committed by sdong
      Summary:
      Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310
      
      Test Plan: Add several unit tests.
      
      Reviewed By: ajkr
      
      Differential Revision: D28493792
      
      fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501
      e7c24168
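      The selection rule can be sketched simply: under FIFO compaction, files whose age exceeds the configured threshold become candidates for the warm tier. The threshold name below is illustrative, not the actual option field.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <vector>

      // Returns the indices of files old enough to move to the warm tier.
      std::vector<size_t> FilesToMoveToWarm(
          const std::vector<uint64_t>& file_age_sec,
          uint64_t warm_age_threshold_sec) {
        std::vector<size_t> out;
        for (size_t i = 0; i < file_age_sec.size(); ++i) {
          if (file_age_sec[i] >= warm_age_threshold_sec) out.push_back(i);
        }
        return out;
      }
      ```

      Since FIFO files are strictly ordered by age, in practice only the oldest suffix of the file list ever qualifies, which keeps the check cheap.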