1. 06 Aug 2022, 1 commit
    • Fragment memtable range tombstone in the write path (#10380) · 9d77bf8f
      Committed by Changyu Bi
      Summary:
      - Right now, each read fragments the memtable range tombstones (https://github.com/facebook/rocksdb/issues/4808). This PR explores the idea of fragmenting memtable range tombstones in the write path, so reads can simply use the cached fragmented tombstone list without paying any fragmenting cost. This PR only does the caching for immutable memtables, and does so right before a memtable is added to the immutable memtable list. The fragmentation is done without holding the mutex to minimize its performance impact. (A usage sketch of the range-deletion API this targets follows this list.)
      - db_bench is updated to print out the number of range deletions executed if there is any.
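      As a usage sketch of the write path this optimizes (the DB path and keys below are made up for illustration, not taken from the benchmark), each `DeleteRange()` call adds a range tombstone to the active memtable; with this PR those tombstones are fragmented once, right before the memtable becomes immutable, instead of on every read:
      
      ```
      #include <cassert>
      #include <string>
      #include "rocksdb/db.h"
      
      int main() {
        rocksdb::DB* db;
        rocksdb::Options options;
        options.create_if_missing = true;
        rocksdb::Status s =
            rocksdb::DB::Open(options, "/tmp/range_tombstone_demo", &db);
        assert(s.ok());
        s = db->Put(rocksdb::WriteOptions(), "key1", "value1");
        assert(s.ok());
        // Writes a range tombstone covering ["key1", "key9") into the memtable.
        s = db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                            "key1", "key9");
        assert(s.ok());
        std::string value;
        s = db->Get(rocksdb::ReadOptions(), "key1", &value);
        assert(s.IsNotFound());  // key1 is covered by the range tombstone
        delete db;
        return 0;
      }
      ```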
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380
      
      Test Plan:
      - CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
      - Benchmark: as this PR only optimizes the immutable memtable path, the number of writes in the benchmark is chosen such that an immutable memtable is created and range tombstones are in that memtable.
      
      ```
      single thread:
      ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100
      
      multi-thread:
      ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
      ```
      Commit 99cdf16464a057ca44de2f747541dedf651bae9e is included in the benchmark results. It was an earlier attempt where tombstones are fragmented for each write operation; reader threads share the fragmented list through a shared_ptr, which slows down multi-threaded read performance as seen in the benchmark results.
      Results are averaged over 5 runs.
      
      Single thread result:
      | Max # tombstones  | main fillrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR | main readrandom micros/op |  99cdf16464a057ca44de2f747541dedf651bae9e | Post PR |
      | ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
      | 0    |6.68     |6.57     |6.72     |4.72     |4.79     |4.54     |
      | 1    |6.67     |6.58     |6.62     |5.41     |4.74     |4.72     |
      | 10   |6.59     |6.5      |6.56     |7.83     |4.69     |4.59     |
      | 100  |6.62     |6.75     |6.58     |29.57    |5.04     |5.09     |
      | 1000 |6.54     |6.82     |6.61     |320.33   |5.22     |5.21     |
      
      32-thread result: note that "Max # tombstones" is per thread.
      | Max # tombstones  | main fillrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR | main readrandom micros/op |  99cdf16464a057ca44de2f747541dedf651bae9e | Post PR |
      | ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
      | 0    |234.52   |260.25   |239.42   |5.06     |5.38     |5.09     |
      | 1    |236.46   |262.0    |231.1    |19.57    |22.14    |5.45     |
      | 10   |236.95   |263.84   |251.49   |151.73   |21.61    |5.73     |
      | 100  |268.16   |296.8    |280.13   |2308.52  |22.27    |6.57     |
      
      Reviewed By: ajkr
      
      Differential Revision: D37916564
      
      Pulled By: cbi42
      
      fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
  2. 05 Aug 2022, 3 commits
    • Fix data race reported on SetIsInSecondaryCache in LRUCache (#10472) · f28d0c20
      Committed by Bo Wang
      Summary:
      Currently, `SetIsInSecondaryCache` is called after `Promote`. After `Promote`, a handle can already be accessed while its flags are still being set. This causes a data race.
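      A generic C++ illustration of the ordering problem (this is not the RocksDB cache code; the types are made up): a handle's flags must be fully set before the handle is published to other threads, which is what moving `SetIsInSecondaryCache` ahead of `Promote` accomplishes.
      
      ```
      #include <atomic>
      
      struct Handle {
        std::atomic<unsigned> flags{0};
      };
      
      std::atomic<Handle*> published{nullptr};
      
      void PublishFixed(Handle* h, unsigned flag_bits) {
        // Set all flags first ...
        h->flags.store(flag_bits, std::memory_order_relaxed);
        // ... then make the handle visible to readers. Publishing first and
        // setting flags afterwards lets a reader observe the handle while the
        // flag write is still in flight.
        published.store(h, std::memory_order_release);
      }
      ```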
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10472
      
      Test Plan:
      unit tests
      stress tests
      
      Reviewed By: pdillinger
      
      Differential Revision: D38403991
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 0aaa2d2edeaf5bc799fcce605648fe49eb7119c2
    • Break TableReader MultiGet into filter and lookup stages (#10432) · bf4532eb
      Committed by anand76
      Summary:
      This PR is the first step in enhancing the coroutines MultiGet to be able to lookup a batch in parallel across levels. By having a separate TableReader function for probing the bloom filters, we can quickly figure out which overlapping keys from a batch are definitely not in the file and can move on to the next level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10432
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38245910
      
      Pulled By: anand1976
      
      fbshipit-source-id: 3d20db2350378c3fe6f086f0c7ba5ff01d7f04de
    • Deflake DBWALTest.RaceInstallFlushResultsWithWalObsoletion (#10456) · 538df26f
      Committed by Yanqin Jin
      Summary:
      Existing DBWALTest.RaceInstallFlushResultsWithWalObsoletion test relies
      on a specific interleaving of two background flush threads. We call them
      bg1 and bg2, and assume bg1 starts to install flush results ahead of
      bg2. After bg1 enters `ProcessManifestWrites`, bg1 waits for bg2 to also
      enter `MemTableList::TryInstallMemtableFlushResults()` before bg1 can
      proceed with the MANIFEST write. However, if bg2 calls `SyncClosedLogs()`
      and needs to commit to the MANIFEST but falls behind bg1, then bg2
      must wait for bg1 to finish writing to the MANIFEST. This is a circular
      dependency.
      
      Fix this by allowing bg2 to start only after bg1 grabs the chance to
      sync the WAL and commit to MANIFEST.
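      A minimal sketch of the sync-point dependency mechanism this kind of fix relies on; the sync point names below are hypothetical placeholders, not the markers used in the actual test. `LoadDependency` makes the second point in each pair wait until the first point has been reached.
      
      ```
      #include "test_util/sync_point.h"
      
      void SetupFlushOrdering() {
        // bg2's flush start (second marker) waits for bg1 to grab the
        // WAL-sync / MANIFEST-commit step (first marker).
        ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
            {{"Hypothetical::Bg1GrabsManifestWrite",
              "Hypothetical::Bg2StartsFlush"}});
        ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
      }
      ```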
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10456
      
      Test Plan:
      1. make check
      
      2. export TEST_TMPDIR=/dev/shm && gtest-parallel -r 1000 -w 32 ./db_wal_test --gtest_filter=DBWALTest.RaceInstallFlushResultsWithWalObsoletion
      
      Reviewed By: ltamasi
      
      Differential Revision: D38391856
      
      Pulled By: riversand963
      
      fbshipit-source-id: 55f647d5b94e534c008a4dd2fb082675ddf58c96
  3. 04 Aug 2022, 2 commits
    • Avoid allocations/copies for large `GetMergeOperands()` results (#10458) · 504fe4de
      Committed by Andrew Kryczka
      Summary:
      This PR avoids allocations and copies for the result of `GetMergeOperands()` when the average operand size is at least 256 bytes and the total operands size is at least 32KB. The `GetMergeOperands()` API already used `PinnableSlice` outputs but was calling `PinSelf()` (i.e., allocating and copying) for each operand. When this optimization takes effect, we instead call `PinSlice()` to skip that allocation and copy. Resources are pinned in order for the `PinnableSlice` to point to valid memory even after `GetMergeOperands()` returns.
      
      The pinned resources include a referenced `SuperVersion`, a `MergingContext`, and a `PinnedIteratorsManager`. They are bundled into a `GetMergeOperandsState`. We use `SharedCleanablePtr` to share that bundle among all `PinnableSlice`s populated by `GetMergeOperands()`. That way, the last `PinnableSlice` to be `Reset()` will cleanup the bundle, including unreferencing the `SuperVersion`.
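      A caller-side sketch of the `GetMergeOperands()` API this optimizes (the operand bound is an arbitrary value for illustration). When the optimization takes effect, the returned `PinnableSlice`s reference pinned memory rather than copies, so they must stay alive, and be `Reset()` or destroyed to release the shared cleanup bundle described above:
      
      ```
      #include <vector>
      #include "rocksdb/db.h"
      
      void ReadOperands(rocksdb::DB* db, const rocksdb::Slice& key) {
        constexpr int kMaxOperands = 100;  // assumed upper bound for this sketch
        std::vector<rocksdb::PinnableSlice> operands(kMaxOperands);
        rocksdb::GetMergeOperandsOptions opts;
        opts.expected_max_number_of_operands = kMaxOperands;
        int num_operands = 0;
        rocksdb::Status s = db->GetMergeOperands(
            rocksdb::ReadOptions(), db->DefaultColumnFamily(), key,
            operands.data(), &opts, &num_operands);
        if (s.ok()) {
          for (int i = 0; i < num_operands; ++i) {
            // operands[i].data()/size() may point into pinned resources.
            (void)operands[i];
          }
        }
      }
      ```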
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10458
      
      Test Plan:
      - new DB level test
      - measured benefit/regression in a number of memtable scenarios
      
      Setup command:
      ```
      $ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -num=$num -writes=16384 -key_size=16 -value_size=$value_sz -compression_type=none -write_buffer_size=1048576000
      ```
      
      Benchmark command:
      ```
      ./db_bench -threads=$threads -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=1048576000 -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -num=$num -duration=10
      ```
      
      Worst regression is when a key has many tiny operands:
      
      - Parameters: num=1 (implying 16384 operands per key), value_sz=8, threads=1
      - `GetMergeOperands()` latency increases 682 micros -> 800 micros (+17%)
      
      The regression disappears into the noise (<1% difference) if we remove the `Reset()` loop and the size counting loop. The former is arguably needed regardless of this PR as the convention in `Get()` and `MultiGet()` is to `Reset()` the input `PinnableSlice`s at the start. The latter could be optimized to count the size as we accumulate operands rather than after the fact.
      
      Best improvement is when a key has large operands and high concurrency:
      
      - Parameters: num=4 (implying 4096 operands per key), value_sz=2KB, threads=32
      - `GetMergeOperands()` latency decreases 11492 micros -> 437 micros (-96%).
      
      Reviewed By: cbi42
      
      Differential Revision: D38336578
      
      Pulled By: ajkr
      
      fbshipit-source-id: 48146d127e04cb7f2d4d2939a2b9dff3aba18258
    • Fix the error path of PLUGIN_ROOT (#10446) · d23752f6
      Committed by Qiaolin Yu
      Summary:
      When we try to use RocksDB with plugins as a third-party library for other databases, the plugin folder cannot be compiled correctly because of a wrong PLUGIN_ROOT variable. This change fixes the error so that the build works when the RocksDB directory is not the root directory.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10446
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38371321
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0801b7b7dfa87751c8332fb52aac569dcdd72b5d
      Co-authored-by: SuperMT <supertempler@gmail.com>
  4. 03 Aug 2022, 5 commits
    • increase buffer size in PosixFileSystem::GetAbsolutePath to PATH_MAX (#10413) · 8d664ccb
      Committed by Vladimir Kikhtenko
      Summary:
      RocksDB fails to open a database with a relative path when the length of the cwd
      is longer than 256 bytes. This happens due to ERANGE from the getcwd call.
      Here we simply increase the buffer size to the common PATH_MAX value.
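      A standalone sketch of the failure mode (not the RocksDB code itself): with a small fixed buffer, `getcwd()` fails with `ERANGE` once the working directory no longer fits, which is what broke relative-path opens for long cwds.
      
      ```
      #include <limits.h>  // PATH_MAX
      #include <unistd.h>  // getcwd
      #include <string>
      
      std::string CurrentDir() {
        char buf[PATH_MAX];  // the buffer was 256 bytes before this change
        if (getcwd(buf, sizeof(buf)) == nullptr) {
          // errno == ERANGE would mean the buffer is still too small.
          return "";
        }
        return std::string(buf);
      }
      ```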
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10413
      
      Reviewed By: riversand963
      
      Differential Revision: D38189254
      
      Pulled By: ajkr
      
      fbshipit-source-id: 8a0d3a78bbe87645499fbf29fb12bd3d04cd4657
    • Split cache to minimize internal fragmentation (#10287) · 87b82f28
      Committed by Bo Wang
      Summary:
      ### **Summary:**
      To minimize the internal fragmentation caused by the variable size of the compressed blocks, the original block is split according to the jemalloc bin size in `Insert()` and then merged back in `Lookup()`.  Based on the analysis of the results of the following tests, from the overall internal fragmentation perspective, this PR does mitigate the internal fragmentation issue.
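      The split/merge idea, sketched with hypothetical helpers (this is not the CompressedSecondaryCache code): on insert the block is cut into chunks no larger than a chosen jemalloc bin size, and on lookup the chunks are concatenated back into the original block.
      
      ```
      #include <algorithm>
      #include <string>
      #include <vector>
      
      std::vector<std::string> SplitIntoBins(const std::string& block,
                                             size_t bin_size) {
        std::vector<std::string> chunks;
        for (size_t off = 0; off < block.size(); off += bin_size) {
          chunks.emplace_back(block, off, std::min(bin_size, block.size() - off));
        }
        return chunks;
      }
      
      std::string MergeChunks(const std::vector<std::string>& chunks) {
        std::string merged;
        for (const auto& c : chunks) {
          merged += c;
        }
        return merged;
      }
      ```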
      
      _I ran more myshadow tests with the latest commit. I finished several myshadow A/B tests and the results are promising. For the config of 4GB primary cache and 3GB secondary cache, the jemalloc resident stat shows a consistent ~0.15GB memory saving; the allocated and active stats show similar savings. The CPU usage is almost the same before and after this PR._
      
      To evaluate the issue of memory fragmentations and the benefits of this PR, I conducted two sets of local tests as follows.
      
      **T1**
      Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
      Values:     100 bytes each (50 bytes after compression)
      Entries:    90000000
      RawSize:    9956.4 MB (estimated)
      FileSize:   5664.8 MB (estimated)
      
      | Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
      | - | - | - |
      | T1_3 | 4000 | 4000 |
      | T1_4 | 2000 | 3000 |
      
      Populate the DB:
      ./db_bench --benchmarks=fillrandom --num=90000000 -db=/mem_fragmentation/db_bench_1
      Overwrite it to a stable state:
      ./db_bench --benchmarks=overwrite --num=90000000 -use_existing_db -db=/mem_fragmentation/db_bench_1
      
      Run read tests with different cache settings:
      T1_3:
      MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_3_20220710 -duration=1800 &
      
      T1_4:
      MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_4_20220710 -duration=1800 &
      
      For T1_3 and T1_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats.
      
      | Test Name | T1_3 | T1_3 after mem defrag | T1_4 | T1_4 after mem defrag |
      | - | - | - | - | - |
      | allocated (MB)  | 8728 | 8076 | 5518 | 5043 |
      | available (MB)  | 8753 | 8092 | 5536 | 5051 |
      | external fragmentation rate  | 0.003 | 0.002 | 0.003 | 0.0016 |
      | resident (MB)  | 8956 | 8365 | 5655 | 5235 |
      
      **T2**
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     256 bytes each (128 bytes after compression)
      Entries:    40000000
      RawSize:    10986.3 MB (estimated)
      FileSize:   6103.5 MB (estimated)
      
      | Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
      | - | - | - |
      | T2_3 | 4000 | 4000 |
      | T2_4 | 2000 | 3000 |
      
      Create DB (10GB):
      ./db_bench -benchmarks=fillrandom -use_direct_reads=true -num=40000000 -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2
      Overwrite it to a stable state:
      ./db_bench --benchmarks=overwrite --num=40000000 -use_existing_db -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2
      
      Run read tests with different cache settings:
      T2_3:
      MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_3 -duration=1800  &
      
      T2_4:
      MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_4 -duration=1800  &
      
      For T2_3 and T2_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats.
      
      | Test Name |  T2_3 | T2_3 after mem defrag | T2_4 | T2_4 after mem defrag |
      | -  | - | - | - | - |
      | allocated (MB)  | 8425 | 8093 | 5426 | 5149 |
      | available (MB)  | 8489 | 8138 | 5435 | 5158 |
      | external fragmentation rate  | 0.008 | 0.0055 | 0.0017 | 0.0017 |
      | resident (MB)  | 8676 | 8392 | 5541 | 5321 |
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10287
      
      Test Plan: Unit tests.
      
      Reviewed By: anand1976
      
      Differential Revision: D37743362
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 0010c5af08addeacc5ebbc4ffe5be882fb1d38ad
    • Fix race in ExitAsBatchGroupLeader with pipelined writes (#9944) · bef3127b
      Committed by mpoeter
      Summary:
      Resolves https://github.com/facebook/rocksdb/issues/9692
      
      This PR adds a unit test that reproduces the race described in https://github.com/facebook/rocksdb/issues/9692, together with a corresponding fix.
      
      The unit test does not have any assertions, because I could not find a reliable and safe way to assert that the writers list does not form a cycle. With the old (buggy) code, the test simply hangs, while with the fix it passes successfully.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9944
      
      Reviewed By: pdillinger
      
      Differential Revision: D36134604
      
      Pulled By: riversand963
      
      fbshipit-source-id: ef636c5a79ddbef18658ab2f19ca9210a427324a
    • Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460) · 27f3af59
      Committed by Peter Dillinger
      Summary:
      TL;DR: due to a recent change, if you drop a column family,
      often that DB will no longer fsync after writing new SST files
      to remaining or new column families, which could lead to data
      loss on power loss.
      
      More bug detail:
      The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at
      DB::Close time rather than waiting for DB object destruction.
      Unfortunately, it also closes shared FSDirectory objects on
      DropColumnFamily (& destroy remaining handles), which can lead
      to use-after-Close on FSDirectory shared with remaining column
      families. Those "uses" are only Fsyncs (or redundant Closes). In
      the default Posix filesystem, an Fsync on a closed FSDirectory is a
      quiet no-op. Consequently, under most configurations, if you drop
      a column family, that DB will no longer fsync after writing new SST
      files to column families sharing the same directory.
      
      More fix detail:
      Basically, this removes unnecessary Close ops on destroying
      ColumnFamilyData. We let `shared_ptr` take care of calling the
      destructor at the right time. If the intent was to require Close be
      called before destroying FSDirectory, that was not made clear by the
      author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which
      could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did
      not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow
      timely destruction of FSDirectory to suffice as Close (in
      CountedFileSystem). Added a TODO to revisit that.
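      A generic sketch of the lifetime idea (the class names are illustrative, not the RocksDB types): rather than explicitly `Close()`-ing a directory object that other column families may still share, the `shared_ptr` destructor closes it exactly once, after the last owner is gone, so no owner can fsync an already-closed handle.
      
      ```
      #include <memory>
      
      class DirHandle {
       public:
        ~DirHandle() { CloseIfOpen(); }
        void Fsync() {
          // Safe: the object stays alive as long as any owner holds the
          // shared_ptr, so this never runs on a closed handle.
        }
      
       private:
        void CloseIfOpen() { /* close the underlying fd exactly once */ }
      };
      
      void DropOneOwner(std::shared_ptr<DirHandle>& owner) {
        owner.reset();  // destructor (and the close) runs only for the last owner
      }
      ```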
      
      Also in this PR:
      * Added a TODO to share FSDirectory instances between DB and its column
      families. (Already shared among column families.)
      * Made DB::Close attempt to close all its open FSDirectory objects even
      if there is a failure in closing one. Also code clean-up around this
      logic.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460
      
      Test Plan:
      add an assert to check for use-after-Close. With that
      existing tests can detect the misuse. With fix, tests pass (except noted
      relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049)
      
      Reviewed By: ajkr
      
      Differential Revision: D38357922
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137
    • regression_test.sh: kill very old db_bench (and more) (#10441) · 9da97a37
      Committed by Peter Dillinger
      Summary:
      If a db_bench process hangs or runs away on a machine, that
      could prevent regression_test.sh from ever making progress. To fix that,
      regression_test.sh will now kill any db_bench process that is >12 hours
      old. Also made this more reliable by not using string matching (grep) to
      get db_bench process IDs.
      
      I also had to make some other updates to get local runs working
      reliably:
      * Fix some quoting hell and other dubious complexity with db_bench_cmd
      * Only save a DB for re-use when building it passes
      * Report failed command in more cases
      * Add safeguards against "rm -rf ."
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10441
      
      Test Plan:
      manual (local and remote), with temporary changes e.g. to have
      a manageable age threshold etc.
      
      Reviewed By: riversand963
      
      Differential Revision: D38285537
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 4d598876aedc38ac4bd9d8ddf32c5995d8e44db8
  5. 02 Aug 2022, 4 commits
    • Do not put blobs read during compaction into cache (#10457) · cc8ded61
      Committed by Levi Tamasi
      Summary:
      During compaction, blobs are currently read using the default
      `ReadOptions`, which has the `fill_cache` flag set to true. Earlier,
      this didn't make any difference since we didn't have a blob cache;
      however, now we have to explicitly set this flag to false to avoid
      polluting the cache during compaction.
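      A minimal sketch of the relevant setting (the helper function is made up for illustration): compaction-time blob reads now use a `ReadOptions` with `fill_cache` disabled so they do not push hot entries out of the blob cache.
      
      ```
      #include "rocksdb/options.h"
      
      rocksdb::ReadOptions MakeCompactionBlobReadOptions() {
        rocksdb::ReadOptions read_options;
        read_options.fill_cache = false;  // do not cache blobs read for compaction
        return read_options;
      }
      ```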
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10457
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D38333528
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 5b4d49a1e39543bee73c7df2aa9194fb101875e2
    • Remove unused fields from FileMetaData (temporarily) (#10443) · fbfcf5cb
      Committed by Yanqin Jin
      Summary:
      FileMetaData::[min|max]_timestamp are not currently being used or
      tracked by RocksDB, even when user-defined timestamp is enabled. Each of
      them is a std::string which can occupy 32 bytes. Remove them for now.
      They may be added back when we have a pressing need for them. When we do
      add them back, consider storing them in a more compact way, e.g. one
      boolean flag and a byte array of size 16.
      
      Per file min/max timestamp bounds are available as table properties.
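      A hypothetical layout for the suggestion above (not code in this PR): one flag plus a fixed 16-byte array could hold the packed min/max timestamps instead of two `std::string`s.
      
      ```
      #include <array>
      #include <cstdint>
      
      struct CompactTimestampBounds {
        bool has_timestamps = false;
        std::array<uint8_t, 16> packed_min_max{};  // e.g. two 8-byte timestamps
      };
      ```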
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10443
      
      Test Plan: make check
      
      Reviewed By: pdillinger
      
      Differential Revision: D38292275
      
      Pulled By: riversand963
      
      fbshipit-source-id: 841dc4e855ad8f8481c80cb020603de9607c9c94
    • Use EnvLogger instead of PosixLogger (#10436) · cc209980
      Committed by sdong
      Summary:
      EnvLogger was built as a replacement for PosixLogger that supports multiple Envs. Make FileSystem use EnvLogger by default, remove the Posix-FS-specific implementation, and remove the PosixLogger code.
      Some hacky changes are made to make sure iostats are not polluted by logging, in order to pass existing unit tests.
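      A small usage sketch (the log path is made up): `NewEnvLogger()` creates an EnvLogger on top of any Env, which is why it can replace the Posix-only PosixLogger as the default info logger.
      
      ```
      #include <memory>
      #include "rocksdb/env.h"
      
      rocksdb::Status MakeInfoLogger(std::shared_ptr<rocksdb::Logger>* logger) {
        return rocksdb::NewEnvLogger("/tmp/rocksdb_info.log",
                                     rocksdb::Env::Default(), logger);
      }
      ```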
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10436
      
      Test Plan: Run db_bench and watch info log files.
      
      Reviewed By: anand1976
      
      Differential Revision: D38259855
      
      fbshipit-source-id: 67d65874bfba7a33535b6d0dd0ed92cbbc9888b8
    • Add CompressedSecondaryCache into stress test (#10442) · e1b176d2
      Committed by gitbw95
      Summary:
      The secondary cache is randomly disabled or enabled with CompressedSecondaryCache.
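      A configuration sketch of the cache setup the stress test now exercises (the capacities are arbitrary example values): a compressed secondary cache attached to the primary LRU block cache.
      
      ```
      #include "rocksdb/cache.h"
      
      std::shared_ptr<rocksdb::Cache> MakeBlockCacheWithSecondary() {
        rocksdb::CompressedSecondaryCacheOptions secondary_opts;
        secondary_opts.capacity = 256 << 20;  // 256 MB compressed secondary cache
      
        rocksdb::LRUCacheOptions primary_opts;
        primary_opts.capacity = 1 << 30;  // 1 GB primary block cache
        primary_opts.secondary_cache =
            rocksdb::NewCompressedSecondaryCache(secondary_opts);
        return rocksdb::NewLRUCache(primary_opts);
      }
      ```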
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10442
      
      Test Plan: - To test that the CompressedSecondaryCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=--duration=960 blackbox_crash_test`
      
      Reviewed By: anand1976
      
      Differential Revision: D38290796
      
      Pulled By: gitbw95
      
      fbshipit-source-id: bb7027b39e0ed9c0c62835abe09e759898130ec8
  6. 01 Aug 2022, 1 commit
    • Provide support for subcompactions with user-defined timestamps (#10344) · 56463d44
      Committed by Akanksha Mahajan
      Summary:
      The subcompaction logic currently picks file boundaries as subcompaction boundaries. This is not compatible with user-defined timestamps because of two issues.
      Issue 1: ReadOptions.iterate_lower_bound and ReadOptions.iterate_upper_bound contain timestamps, which leads to an assertion failure because BlockBasedTableIterator expects bounds without timestamps. As a result, due to the wrong comparison, the end key is returned as a user_key, triggering the assertion.
      Issue 2: Two keys that differ only by user timestamp might get processed by two different subcompactions (and thus two different CompactionIterator state machines), which in turn can cause data correctness issues.
      
      This PR provides support to re-enable subcompactions with user-defined timestamps.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10344
      
      Test Plan:
      Added new unit test
      - Without the fix for Issue 1, the unit test MultipleSubCompactions fails with this error:
      ```
      db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed.
      Received signal 6 (Aborted)
      #0   /usr/local/fbcode/platform009/lib/libc.so.6(gsignal+0x100) [0x7f8fbbbfe530] db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed.
      Aborted (core dumped)
      ```
      Ran stress test
      `make crash_test_with_ts -j32`
      
      Reviewed By: riversand963
      
      Differential Revision: D38220841
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 5d5cae2bd37fcaeba1e77fce0a69070ad4158ccb
  7. 30 Jul 2022, 4 commits
  8. 29 Jul 2022, 2 commits
  9. 28 Jul 2022, 5 commits
  10. 27 Jul 2022, 5 commits
    • ldb to display public unique id and dump work with key range (#10417) · 6a0010eb
      Committed by Jay Zhuang
      Summary:
      Two ldb command improvements:
      1. `ldb manifest_dump --verbose` displays both the internal unique id and the public id, which is useful to manually check sst_unique_id between the manifest and the SST;
      2. `ldb dump` has `--from/--to` options, but they were not working. Add support for them.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10417
      
      Test Plan:
      run the command locally
      ```
      $ ldb manifest_dump --path=MANIFEST-000026 --verbose
      ...
      AddFile: 0 18 1023 'bar' seq:6, type:1 .. 'foo' seq:5, type:1 oldest_ancester_time:1658787615 file_creation_time:1658787615 file_checksum: file_checksum_func_name: Unknown unique_id(internal): {8800772265202404198,16149248642318466463} public_unique_id: F3E0A029B631D7D4-6E402DE08E771780
      ```
      ```
      $ ldb dump --path=000036.sst --from=key000006 --to=key000009
      Sst file format: block-based
      'key000006' seq:2411, type:1 => value6
      'key000007' seq:2412, type:1 => value7
      'key000008' seq:2413, type:1 => value8
      ...
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D38136140
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 8be6eeaa07ff9f089e33011ebe90fd0b69d33bf3
    • Allow sufficient subcompactions under round-robin compaction priority (#10422) · c945a9a6
      Committed by Zichen Zhu
      Summary:
      Allow sufficient subcompactions to be used when the number of input files is less than `max_subcompactions` under round-robin compaction priority.
      
      Test Case:
      Add `RoundRobinWithoutAdditionalResources` into `db_compaction_test`
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10422
      
      Reviewed By: ajkr
      
      Differential Revision: D38186545
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: b8e5098306f1e5b9561dfafafc8300a38f7fe88e
    • Towards a production-quality ClockCache (#10418) · 9d7de651
      Committed by Guido Tagliavini Ponce
      Summary:
      In this PR we bring ClockCache closer to production quality. We implement the following changes:
      1. Fixed a few bugs in ClockCache.
      2. ClockCache now fully supports ``strict_capacity_limit == false``: When an insertion over capacity is commanded, we allocate a handle separately from the hash table.
      3. ClockCache now runs on almost every test in cache_test. The only exceptions are a test that requires the LRU policy and a test that dynamically increases the table capacity.
      4. ClockCache now supports dynamically decreasing capacity via SetCapacity. (This is easy: we shrink the capacity upper bound and run the clock algorithm.)
      5. Old FastLRUCache tests in lru_cache_test.cc are now also used on ClockCache.
      
      As a byproduct of 1. and 2. we are able to turn on ClockCache in the stress tests.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10418
      
      Test Plan:
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 check``
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 check``
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
      - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
      
      Reviewed By: pdillinger
      
      Differential Revision: D38170673
      
      Pulled By: guidotag
      
      fbshipit-source-id: 508987b9dc9d9d68f1a03eefac769820b680340a
    • Transaction.prepare should be public (#10412) · 8db8b98f
      Committed by Alan Paxton
      Summary:
      The absence of a public modifier appears to be an omission. prepare() is necessary for the TM to participate as a peer in a distributed transaction.
      
      Also add basic “yes it does work in java” tests.
      
      Resolves https://github.com/facebook/rocksdb/issues/10283
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10412
      
      Reviewed By: ajkr
      
      Differential Revision: D38135513
      
      Pulled By: riversand963
      
      fbshipit-source-id: ff52b96bc7218bc3bf12845dee49f5d8edf0e297
    • Deflake FlushStaleColumnFamilies test (#10409) · 31344714
      Committed by Jay Zhuang
      Summary:
      Make the Stale Flush test more robust by explicitly checking the target CF is
      flushed.  Currently it's flaky because the default CF may have more than 3
      SSTs.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10409
      
      Test Plan:
      The test is more likely to fail on a resource-limited host:
      ```
      gtest-parallel ./column_family_test --gtest_filter=FormatDef/ColumnFamilyTest.FlushStaleColumnFamilies/0 -r 1000 -w 100
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D38116383
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: e27cc56f76f14d0936504f126104e3d87e3d0d5f
  11. 26 Jul 2022, 6 commits
    • full_history_ts_low should be const (#10411) · 84e9b6ee
      Committed by Jay Lee
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10411
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38131644
      
      Pulled By: riversand963
      
      fbshipit-source-id: d241521dccff1ab8882ae0726ec368f84b7e8311
    • Add checksum handshake for WAL fragment decompression (#10339) · 2fc6df37
      Committed by Changyu Bi
      Summary:
      If WAL compression is enabled, WAL fragment decompression results are concatenated together in `log::Reader::ReadPhysicalRecord()`. This PR adds a checksum handshake to protect against memory corruption during the copying process.
      
      `checksum` is renamed to `record_checksum` in `ReadRecord()` to differentiate it from `checksum_` flag that specifies whether CRC32C checksum is verified.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10339
      
      Test Plan: added checksum verification in log_test.cc, `make check -j32`.
      
      Reviewed By: ajkr
      
      Differential Revision: D37763734
      
      Pulled By: cbi42
      
      fbshipit-source-id: c4faa7c76b9ff1df35026edf31adfe4b47ae3154
    • Run new benchmark script in branch. (#10303) · e637470f
      Committed by Alan Paxton
      Summary:
      Configure CI to run modernised benchmark script
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10303
      
      Reviewed By: ramvadiv
      
      Differential Revision: D37719116
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 79ecb1cd0abd4d800c6906ba6673268c2adee10e
    • Account for DB ID in stress testing block cache keys (#10388) · 01a2e202
      Committed by Peter Dillinger
      Summary:
      I recently discovered that block cache keys are slightly lower
      quality than previously thought, because my stress testing tool failed
      to simulate the effect of DB ID differences. This change updates the
      tool and gives us data to guide future developments. (No changes to
      production code here and now.)
      
      Nevertheless, the following promise still holds
      
      ```
      // In fact, if our SST files are all < 4TB (see
      // BlockBasedTable::kMaxFileSizeStandardEncoding), then SST files generated
      // in a single process are guaranteed to have unique cache keys, unless/until
      // number session ids * max file number = 2**86 ...
      ```
      
      because although different DB IDs could cause collision in file number
      and offset data, that would have to be using the same DB session (lower)
      to cause a block cache key collision, which is not possible in the same
      process. (A session is associated with only one DB ID.)
      
      This change fixes cache_bench -stress_cache_key to set and reset DB IDs in
      a parameterized way to evaluate the effect. Previous results assumed to
      be representative (using -sck_keep_bits=43):
      
      ```
      15 collisions after 15 x 90 days, est 90 days between (1.03763e+20 corrected)
      ```
      
      or expected collision on a single machine every 104 billion billion
      days (see "corrected" value).
      
      After accounting for DB IDs, testing the cases where DB IDs never really change, change at an
      intermediate rate, and change very frequently (using default -sck_db_count=100):
      
      ```
      -sck_newdb_nreopen=1000000000:
      15 collisions after 2 x 90 days, est 12 days between (1.38351e+19 corrected)
      -sck_newdb_nreopen=10000:
      17 collisions after 2 x 90 days, est 10.5882 days between (1.22074e+19 corrected)
      -sck_newdb_nreopen=100:
      19 collisions after 2 x 90 days, est 9.47368 days between (1.09224e+19 corrected)
      ```
      
      or roughly 10x more often than previously thought (still extremely if
      not impossibly rare), and better than random base cache keys
      (with -sck_randomize), though < 10x better than random:
      
      ```
      31 collisions after 1 x 90 days, est 2.90323 days between (3.34719e+18 corrected)
      ```
      
      If we simply fixed this by ignoring DB ID for cache keys, we would
      potentially have a shortage of entropy for some cases, such as small
      file numbers and offsets (e.g. many short-lived processes each using
      SstFileWriter to create a small file), because existing DB session IDs
      only provide ~103 bits of entropy. We could upgrade the entropy in DB
      session IDs to accommodate, but it's not known what all would be
      affected by changing from 20 digit session IDs to something larger.
      
      Instead, my plan is to
      1) Move to block cache keys derived from SST unique IDs (so that we can
      derive block cache keys from manifest data without reading file on
      storage), and show no significant regression in expected collision
      rate.
      2) Generate better SST unique IDs in format_version=6 (https://github.com/facebook/rocksdb/issues/9058),
      which should have ~100x lower expected/predicted collision rate based
      on simulations with this stress test:
      ```
      ./cache_bench -stress_cache_key -sck_keep_bits=39 -sck_newdb_nreopen=100 -sck_footer_unique_id
      ...
      15 collisions after 19 x 90 days, est 114 days between (2.10293e+21 corrected)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10388
      
      Test Plan: no production changes
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37986714
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e759b2469e3365cb01c6661a69e0ab849ef4c3df
    • Fix a bug in hash linked list (#10401) · 4e007480
      Committed by sdong
      Summary:
      In hash linked list, with a bucket of only one record, the following sequence can cause users to temporarily miss a record:
      
      Thread 1: Fetch the structure bucket x points to, which is a node n1 for a key, with its next pointer null.
      Thread 2: Insert a key to bucket x that is larger than the existing key. This makes n1->next point to a new node n2, and updates bucket x to point to n1.
      Thread 1: See that n1->next is not null, so it thinks n1 is the header of a linked list and ignores the key of n1.
      
      Fix it by refetching the structure that bucket x points to when it sees n1->next is not null. This should work because if n1->next is not null, bucket x should already point to a linked list or skip list header.
      
      A related change is to reverse the order of testing for linked list and skip list. This is because after refetching the bucket, it might end up with a skip list rather than a linked list.
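      A generic illustration of the refetch fix (this is not the RocksDB memtable code): if the node first fetched from the bucket turns out to have a non-null next pointer, the bucket head is refetched, because a concurrent insert may have turned the bucket into a list (or skip list) headed by a different structure.
      
      ```
      #include <atomic>
      
      struct Node {
        std::atomic<Node*> next{nullptr};
      };
      
      Node* ResolveBucketHead(std::atomic<Node*>& bucket) {
        Node* n = bucket.load(std::memory_order_acquire);
        if (n != nullptr && n->next.load(std::memory_order_acquire) != nullptr) {
          // A concurrent insert happened: the bucket no longer holds a single
          // record, so refetch to get the current list header.
          n = bucket.load(std::memory_order_acquire);
        }
        return n;
      }
      ```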
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10401
      
      Test Plan: Run existing tests and make sure at least it doesn't regress.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38064471
      
      fbshipit-source-id: 142bb85e1546c803f47e3357aef3e76debccd8df
    • Lock-free ClockCache (#10390) · 6a160e1f
      Committed by Guido Tagliavini Ponce
      Summary:
      ClockCache is now completely free of locks. As part of this PR we have also pushed the clock algorithm functionality out of ClockCacheShard into ClockHandleTable, so that ClockCacheShard acts more as an interface and less as an actual data structure.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10390
      
      Test Plan:
      - ``make -j24 check``
      - ``make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache --cache_size=1073741824 --block_size=16384" blackbox_crash_test_with_atomic_flush``
      
      Reviewed By: pdillinger
      
      Differential Revision: D38106945
      
      Pulled By: guidotag
      
      fbshipit-source-id: 6cbf6bd2397dc9f582809ccff5118a8a33ea6cb1
  12. 25 Jul 2022, 1 commit
    • Support subcmpct using reserved resources for round-robin priority (#10341) · 8860fc90
      Committed by Zichen Zhu
      Summary:
      Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows:
       * Constraint 1: We can only pick consecutive files
         - Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files
         - Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys)
       * Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes`
       * Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)`
       * Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3
      
      More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`.
      
      The above optimization accelerates the process of moving the compaction cursor, which further reduces write-amp. Since a large compaction may lead to a high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps:
      * Step 1: Initialize the planned number of subcompactions against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()`
      * Step 2: Call `AcquireSubcompactionResources()` when the max subcompactions limit is not sufficient. We may or may not obtain the desired resources (any additional resources obtained are recorded in `extra_num_subcompaction_threads_reserved_`). The subcompaction limit is changed, and `num_planned_subcompactions` is updated with `GetSubcompactionLimit()`
      * Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the actual number of subcompactions is less than the number of planned subcompactions)
      
      More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341
      
      Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc`
      
      Reviewed By: ajkr, hx235
      
      Differential Revision: D37792644
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
  13. 24 Jul 2022, 1 commit
    • Improve SubCompaction Partitioning (#10393) · 252bea40
      Committed by sdong
      Summary:
      Unit tests still haven't been fixed, and more tests need to be added. But I ran some simple fillrandom db_bench runs and the partitioning feels reasonable.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393
      
      Test Plan:
      1. Make sure existing tests pass. This should cover the basic subcompaction logic being correct and the partitioning result being reasonable;
      2. Add a new unit test for ApproximateKeyAnchors();
      3. Run some db_bench with max_subcompaction = 4 and watch that the compaction is indeed partitioned evenly.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38043783
      
      fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f