1. 24 June 2022, 7 commits
    • Wrapper for benchmark.sh to run a sequence of db_bench tests (#10215) · 60619057
      Committed by Mark Callaghan
      Summary:
      This provides two things:
      1) Runs a sequence of db_bench tests. This sequence was chosen to provide
      good coverage with less variance.
      2) Makes it easier to do A/B testing for multiple binaries. This combines
      the report.tsv files into summary.tsv to make it easier to compare results
      across multiple binaries.
      
      Example output for 2) is:
      
      ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
      1115171 446.7   9GB             8.9     1.0     454.7   26      26      0       0       0.9     0.5     2       7       51      5547    20      0.0     0       0.1     0.1     0.2     fillseq.wal_disabled.v400       2022-04-12T08:53:51     6.0
      1045726 418.9   8GB     0.0GB   8.4     1.0     432.4   27      26      0       0       1.0     0.5     2       6       102     5618    20      0.0     0       0.1     0.0     0.1     fillseq.wal_disabled.v400       2022-04-12T12:25:36     6.28
      
      ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
      2969192 1189.3  16GB            0.0             0.0     0       0       0       0       10.8    9.3     25      33      49      13551   1781    0.0     0       48.2    6.8     16.8    readrandom.t32  2022-04-12T08:54:28     6.0
      2692922 1078.6  16GB    0.0GB   0.0             0.0     0       0       0       0       11.9    10.2    30      38      56      49735   1781    0.0     0       47.8    6.7     16.8    readrandom.t32  2022-04-12T12:26:15     6.28
      
      ...
      
      ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
      180227  72.2    38GB            1126.4  8.7     643.2   3286    3218    0       0       177.6   50.2    2687    4083    6148    854083  1793    68.4    7804    17.0    5.9     0.5     overwrite.t32.s0        2022-04-12T11:55:21     6.0
      236512  94.7    31GB    0.0GB   1502.9  8.9     862.2   5242    5125    0       0       135.3   59.9    2537    3268    5404    18545   1785    49.7    5112    25.5    8.0     9.4     overwrite.t32.s0        2022-04-12T15:27:25     6.28
      
      Example output with formatting preserved is here:
      https://gist.github.com/mdcallag/4432e5bbaf91915c916d46bd6ce3c313
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10215
      
      Test Plan: run it
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37299892
      
      Pulled By: mdcallag
      
      fbshipit-source-id: e6e0ed638fd7e8deeb869d700593fdc3eba899c8
    • Add suggest_compact_range() and suggest_compact_range_cf() to C API. (#10175) · 2a3792ed
      Committed by Yueh-Hsuan Chiang
      Summary:
      Add suggest_compact_range() and suggest_compact_range_cf() to C API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10175
      
      Test Plan:
      Verifying the result requires SyncPoint, which is not available in c_test.c,
      so the test currently invokes the functions and makes sure they do not crash.
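      For reference, a minimal call sketch (hedged: the exact signatures are assumed to mirror the other range functions in `c.h`, with an error out-parameter since `SuggestCompactRange()` returns a `Status`; `db` is a previously opened `rocksdb_t*`):
      
      ```cpp
      #include <stdio.h>
      #include <stdlib.h>
      
      #include "rocksdb/c.h"
      
      void suggest_full_range(rocksdb_t* db) {
        char* err = NULL;
        /* Hint the compaction scheduler about the key range ["a", "z"). */
        rocksdb_suggest_compact_range(db, "a", 1, "z", 1, &err);
        if (err != NULL) {
          fprintf(stderr, "suggest_compact_range failed: %s\n", err);
          free(err);
        }
      }
      ```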
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37305191
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0fe257b45914f6c9aeb985d8b1820dafc57a20db
    • Cut output files at compaction cursors (#10227) · 17a1d65e
      Committed by zczhu
      Summary:
      The files behind the compaction cursor contain newer data than the files ahead of it. If a compaction writes a file that spans from before its output level's cursor to after it, then data before the cursor will be contaminated with the old timestamp from the data after the cursor. To avoid this, we can split the output file into two: one entirely before the cursor and one entirely after it. Note that, in rare cases, we **DO NOT** need to cut the file if it is a trivial move, since the file will not be contaminated by older files. In such cases, the compact cursor is not guaranteed to be the boundary of the file, but it does not hurt the round-robin selection process. (A simplified sketch of the cut condition follows.)
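      A simplified sketch of the cut condition (illustrative names, not the actual RocksDB internals):
      
      ```cpp
      #include "rocksdb/comparator.h"
      #include "rocksdb/slice.h"
      
      // Cut the current output file when the next key reaches the output level's
      // compact cursor, unless the compaction is a trivial move (which cannot mix
      // data from before and after the cursor).
      bool ShouldCutAtCompactCursor(const rocksdb::Slice& next_key,
                                    const rocksdb::Slice& cursor_key,
                                    const rocksdb::Comparator* cmp,
                                    bool is_trivial_move) {
        if (is_trivial_move) {
          return false;  // the file moves down unchanged; no mixing possible
        }
        // The first key at or after the cursor starts a new output file.
        return cmp->Compare(next_key, cursor_key) >= 0;
      }
      ```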
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10227
      
      Test Plan:
      Add 'RoundRobinCutOutputAtCompactCursor' unit test in `db_compaction_test`
      
      Task: [T122216351](https://www.internalfb.com/intern/tasks/?t=122216351)
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37388088
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 9246a6a084b6037b90d6ab3183ba4dfb75a3378d
    • Read from blob cache first when MultiGetBlob() (#10225) · ba1f62dd
      Committed by Gang Liao
      Summary:
      There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
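      A simplified sketch of the cache-first read order (the helper names are hypothetical, not RocksDB's actual internals):
      
      ```cpp
      #include <cstdint>
      #include <string>
      #include <vector>
      
      struct BlobReq {
        uint64_t offset;
        std::string* value;
        bool done = false;
      };
      
      bool LookupBlobCache(uint64_t offset, std::string* value);    // hypothetical
      void ReadBlobsFromFile(const std::vector<BlobReq*>& misses);  // hypothetical
      
      void MultiGetBlobSketch(std::vector<BlobReq>& reqs) {
        std::vector<BlobReq*> misses;
        for (auto& req : reqs) {
          if (LookupBlobCache(req.offset, req.value)) {
            req.done = true;         // served from the blob cache
          } else {
            misses.push_back(&req);  // needs a raw blob-file read
          }
        }
        ReadBlobsFromFile(misses);  // batched I/O only for the cache misses
      }
      ```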
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10225
      
      Test Plan:
      Added test cases for MultiGetBlob.
      In this task, we added the new API MultiGetBlob() for BlobSource.
      
      This PR is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Reviewed By: ltamasi
      
      Differential Revision: D37358364
      
      Pulled By: gangliao
      
      fbshipit-source-id: aff053a37615d96d768fb9aedde17da5618c7ae6
    • Fix key size in cache_bench (#10234) · b52620ab
      Committed by Guido Tagliavini Ponce
      Summary:
      cache_bench wasn't generating 16B keys, which are necessary for FastLRUCache. Also:
      - Added asserts in cache_bench, which assumes that inserts never fail. When they fail (for example, if we use keys of the wrong size), memory allocated to the values is leaked, and eventually the program crashes. (A sketch of the added check follows.)
      - Moved kCacheKeySize to the right spot.
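      A sketch of the kind of check added, assuming the standard `Cache::Insert()` signature:
      
      ```cpp
      #include <cassert>
      
      #include "rocksdb/cache.h"
      
      // Treat a failed Insert() as fatal instead of silently leaking the value.
      void InsertOrDie(rocksdb::Cache* cache, const rocksdb::Slice& key,
                       char* value, size_t charge) {
        rocksdb::Status s = cache->Insert(
            key, value, charge, [](const rocksdb::Slice& /*key*/, void* v) {
              delete[] static_cast<char*>(v);
            });
        assert(s.ok());  // if the insert fails and is ignored, `value` leaks
      }
      ```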
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10234
      
      Test Plan:
      ``make -j24 check``. Also, run cache_bench with FastLRUCache and check that memory usage doesn't blow up:
      ``./cache_bench -cache_type=fast_lru_cache -num_shard_bits=6 -skewed=true \
                              -lookup_insert_percent=100 -lookup_percent=0 -insert_percent=0 -erase_percent=0 \
                              -populate_cache=true -cache_size=1073741824 -ops_per_thread=10000000 \
                              -value_bytes=8192 -resident_ratio=1 -threads=16``
      
      Reviewed By: pdillinger
      
      Differential Revision: D37382949
      
      Pulled By: guidotag
      
      fbshipit-source-id: b697a942ebb215de5d341f98dc8566763436ba9b
    • Don't count no prefix as Bloom hit (#10244) · f81ea75d
      Committed by Peter Dillinger
      Summary:
      When a key is "out of domain" for the prefix_extractor (no
      prefix assigned), the Bloom filter is not queried. PerfContext
      was counting this as a Bloom "hit" while Statistics doesn't count it
      as a prefix Bloom check. It's more accurate to call it neither
      a hit nor a miss, so this changes the counting to make PerfContext
      behave more like Statistics.
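      From the user side, the effect is visible in the PerfContext counters; a hedged readout sketch (`db` is assumed open and `key_outside_prefix_domain` is illustrative):
      
      ```cpp
      #include "rocksdb/perf_context.h"
      #include "rocksdb/perf_level.h"
      
      rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableCount);
      rocksdb::get_perf_context()->Reset();
      
      std::string value;
      db->Get(rocksdb::ReadOptions(), key_outside_prefix_domain, &value);
      
      // With this change, a key that gets no prefix assigned increments neither
      // counter (previously it was counted as a hit).
      uint64_t hits = rocksdb::get_perf_context()->bloom_sst_hit_count;
      uint64_t misses = rocksdb::get_perf_context()->bloom_sst_miss_count;
      ```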
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10244
      
      Test Plan:
      tests updates and expanded (Get and MultiGet). Iterator test
      coverage of the change will come in next PR
      
      Reviewed By: bjlemaire
      
      Differential Revision: D37371297
      
      Pulled By: pdillinger
      
      fbshipit-source-id: fed132fba6a92b2314ab898d449fce2d1586c157
    • Dynamically changeable `MemPurge` option (#10011) · 5879053f
      Committed by Baptiste Lemaire
      Summary:
      **Summary**
      Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled.
      
      **Motivation**
      RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not currently seem possible.
      Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement.
      
      **Content of this PR**
      This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes.
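      A minimal toggle sketch, assuming an open DB and a column family handle:
      
      ```cpp
      // Enable MemPurge for one column family at runtime...
      rocksdb::Status s =
          db->SetOptions(cf_handle, {{"experimental_mempurge_threshold", "1.0"}});
      // ...and later disable it again, without closing and reopening the DB.
      s = db->SetOptions(cf_handle, {{"experimental_mempurge_threshold", "0.0"}});
      ```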
      
      **Benchmarking**
      I will add numbers to prove that there is no performance impact within the next 12 hours.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011
      
      Reviewed By: pdillinger
      
      Differential Revision: D36462357
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802
  2. 23 June 2022, 4 commits
    • Add the blob cache to the stress tests and the benchmarking tool (#10202) · 2352e2df
      Committed by Gang Liao
      Summary:
      In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`.
      As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.
      
      This PR is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202
      
      Reviewed By: ltamasi
      
      Differential Revision: D37325739
      
      Pulled By: gangliao
      
      fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8
    • Fix typo in comments and code (#10233) · c073ed76
      Committed by Bo Wang
      Summary:
      Fix typo in comments and code.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10233
      
      Test Plan: Existing unit tests should pass.
      
      Reviewed By: jay-zhuang, anand1976
      
      Differential Revision: D37356702
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 32c019adcc6dcc95a9882b38147a310091368e51
    • Add get_column_family_metadata() and related functions to C API (#10207) · e103b872
      Committed by Yueh-Hsuan Chiang
      Summary:
      * Add metadata-related structs and functions to the C API, including
        - `rocksdb_get_column_family_metadata()` and `rocksdb_get_column_family_metadata_cf()`,
           which return `rocksdb_column_family_metadata_t`.
        - `rocksdb_column_family_metadata_t` and its get functions & destroy function.
        - `rocksdb_level_metadata_t` and its get functions & destroy function.
        - `rocksdb_file_metadata_t` and its get functions & destroy function. (A usage sketch follows.)
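      A usage sketch (hedged: the accessor names below are assumed from the list above; `db` is a previously opened `rocksdb_t*`):
      
      ```cpp
      #include <stdio.h>
      
      #include "rocksdb/c.h"
      
      void print_cf_metadata(rocksdb_t* db) {
        /* Metadata of the default column family. */
        rocksdb_column_family_metadata_t* meta =
            rocksdb_get_column_family_metadata(db);
        uint64_t size = rocksdb_column_family_metadata_get_size(meta);
        size_t file_count = rocksdb_column_family_metadata_get_file_count(meta);
        size_t level_count = rocksdb_column_family_metadata_get_level_count(meta);
        printf("size=%llu files=%zu levels=%zu\n", (unsigned long long)size,
               file_count, level_count);
        rocksdb_column_family_metadata_destroy(meta);
      }
      ```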
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10207
      
      Test Plan:
      Extend the existing c_test.c to include additional checks for column_family_metadata
      inside CheckCompaction.
      
      Reviewed By: riversand963
      
      Differential Revision: D37305209
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0a5183206353acde145f5f9b632c3bace670aa6e
    • Adapt benchmark result script to new fields. (#10120) · a16e2ff8
      Committed by Alan Paxton
      Summary:
      Recently merged CI benchmark scripts were failing.
      
      There has clearly been a major revision of the fields of benchmark output. The upload script expects and sanity-checks the existence of some fields (it also changes the date to conform to the OpenSearch format), so the script needs to change.
      
      Also add a bit more exception checking to make it more obvious when this happens again.
      
      We have deleted the existing report.tsv from the benchmark machine. An existing report.tsv is appended to by default, so if the fields change, later rows no longer match the header. This makes for an upload that dies halfway through the report file, when the format no longer matches the header.
      
      Reinstate the config.yml for running the benchmarks, so we can once again test it in situ.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10120
      
      Reviewed By: pdillinger
      
      Differential Revision: D37314908
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 34f5243fee694b75c6838eb55d3398e4273254b2
  3. 22 June 2022, 8 commits
    • Continue to deflake BackupEngineTest.Concurrency (#10228) · 36fefd7e
      Committed by Yanqin Jin
      Summary:
      Even after https://github.com/facebook/rocksdb/issues/10069, `BackupEngineTest.Concurrency` is still flaky with decreased probability of failure.
      
      Repro steps as follows
      ```bash
      make backup_engine_test
      gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency
      ```
      
      The first two commits of this PR demonstrate how the test is flaky. https://github.com/facebook/rocksdb/issues/10069 handles the case in which
      `Rename()` file returns `IOError` with subcode `PathNotFound`, and `CreateLoggerFromOptions()`
      allows the operation to succeed, as expected by the test. However, `BackupEngineTest` uses
      `RemapFileSystem` on top of `ChrootFileSystem` which can return `NotFound` instead of `IOError`.
      
      This behavior is different from `Env::Default()` which returns PathNotFound if the src of `rename()`
      does not exist. We should make the behaviors of the test Env/FS match a real Env/FS.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10228
      
      Test Plan:
      ```bash
      make check
      gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D37337241
      
      Pulled By: riversand963
      
      fbshipit-source-id: 07a53115e424467b55a731866e571f0ad4c6635d
    • Expose the initial logger creation error (#10223) · 9586dcf1
      Committed by Yanqin Jin
      Summary:
      https://github.com/facebook/rocksdb/issues/9984 changes the behavior of RocksDB: if logger creation fails during `SanitizeOptions()`,
      `DB::Open()` will fail. However, since `SanitizeOptions()` is called in `DBImpl::DBImpl()`, we cannot
      directly expose the error to the caller without some additional work.
      This is a first-version proposal which:
      - Adds a new member `init_logger_creation_s` to `DBImpl` to store the result of init logger creation
      - Checks the error during `DB::Open()` and returns it to the caller if non-ok
      
      This is not ideal. We could alternatively move the logger creation logic out of `SanitizeOptions()`.
      Since `SanitizeOptions()` is used in other places, we need to check whether this change breaks anything,
      in case other callers of `SanitizeOptions()` assume that a logger should be created.
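      From the caller's perspective, the error now surfaces through the usual `DB::Open()` status check; a minimal sketch:
      
      ```cpp
      #include "rocksdb/db.h"
      
      rocksdb::Options options;
      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
      if (!s.ok()) {
        // With this change, a logger-creation failure during option
        // sanitization is returned here instead of being silently swallowed.
      }
      ```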
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10223
      
      Test Plan: make check
      
      Reviewed By: pdillinger
      
      Differential Revision: D37321717
      
      Pulled By: riversand963
      
      fbshipit-source-id: 58042358a86369d606549dd9938933dd47591c4b
    • Update API comment about Options::best_efforts_recovery (#10180) · 42c631b3
      Committed by Yanqin Jin
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10180
      
      Reviewed By: pdillinger
      
      Differential Revision: D37182037
      
      Pulled By: riversand963
      
      fbshipit-source-id: a8dc865b86e2249beb7a543c317e94a14781e910
    • Add data block hash index to crash test, fix MultiGet issue (#10220) · 84210c94
      Committed by Peter Dillinger
      Summary:
      There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data
      block hash index, which was not caught because data block hash index was
      never added to stress tests. This change fixes both issues.
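      For context, enabling the data block hash index looks roughly like this (the option fields below predate this PR; `options` is a `rocksdb::Options` being prepared for `DB::Open()`):
      
      ```cpp
      #include "rocksdb/table.h"
      
      rocksdb::BlockBasedTableOptions bbto;
      bbto.data_block_index_type =
          rocksdb::BlockBasedTableOptions::kDataBlockBinaryAndHash;
      bbto.data_block_hash_table_util_ratio = 0.75;  // hash table load factor
      options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
      ```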
      
      Fixes https://github.com/facebook/rocksdb/issues/10186
      
      I intend to pick this into the 7.4.0 release candidate
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220
      
      Test Plan:
      Failure quickly reproduces in crash test with
      kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing
      the failure with a unit test I believe would be too tricky and fragile
      to be worthwhile.
      
      Reviewed By: anand1976
      
      Differential Revision: D37315647
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba
    • Refactor wal filter processing during recovery (#10214) · d654888b
      Committed by Yanqin Jin
      Summary:
      So that DBImpl::RecoverLogFiles does not have to deal with implementation
      details of WalFilter.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10214
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D37299122
      
      Pulled By: riversand963
      
      fbshipit-source-id: acf1a80f1ef75da393d375f55968b2f3ac189816
    • Update LZ4 library for platform009 (#10224) · f7605ec6
      Committed by Bo Wang
      Summary:
      Update LZ4 library for platform009.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10224
      
      Test Plan: Current unit tests should pass.
      
      Reviewed By: anand1976
      
      Differential Revision: D37321801
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 8a3d3019d9f7478ac737176f2d2f443c0159829e
    • Add basic kRoundRobin compaction policy (#10107) · 30141461
      Committed by zczhu
      Summary:
      Add `kRoundRobin` as a compaction priority. The implementation is as follows. (A configuration sketch follows the list.)
      
      - Define a cursor as the smallest internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` to `VersionStorageInfo`, where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In the round-robin compaction policy, we just need to select the first file (assuming files are sorted) whose smallest InternalKey is larger than or equal to the cursor. After a file is chosen, we create a new `Fsize` vector that places the selected file at the first position in `temp`; the next cursor is then updated as the smallest InternalKey in the successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`).
      - After a compaction succeeds, typically in `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into a local variable and finally apply the change to `vstorage` in the `SaveTo` function.
      - Cursors are persisted pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor is not encoded to the MANIFEST.
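      A minimal configuration sketch for the new priority:
      
      ```cpp
      #include "rocksdb/options.h"
      
      rocksdb::Options options;
      options.compaction_style = rocksdb::kCompactionStyleLevel;
      options.compaction_pri = rocksdb::kRoundRobin;  // pick files round-robin per level
      ```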
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107
      
      Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test`
      
      Reviewed By: ajkr
      
      Differential Revision: D37316037
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
    • Destroy initial db dir for a test in DBWALTest (#10221) · b012d235
      Committed by Yanqin Jin
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10221
      
      Reviewed By: hx235
      
      Differential Revision: D37316280
      
      Pulled By: riversand963
      
      fbshipit-source-id: 062781acec2f36beebc62003bcc8ec280488d572
  4. 21 June 2022, 6 commits
  5. 20 June 2022, 2 commits
  6. 19 June 2022, 1 commit
  7. 18 June 2022, 5 commits
    • Fix race condition with WAL tracking and `FlushWAL(true /* sync */)` (#10185) · d5d8920f
      Committed by Andrew Kryczka
      Summary:
      `FlushWAL(true /* sync */)` is used internally and for manual WAL sync. It had a bug when used together with `track_and_verify_wals_in_manifest` where the synced size tracked in MANIFEST was larger than the number of bytes actually synced.
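      For context, the affected combination looks roughly like this (sketch; `db` is assumed open):
      
      ```cpp
      rocksdb::Options options;
      options.track_and_verify_wals_in_manifest = true;  // synced WAL sizes go to MANIFEST
      // ...
      rocksdb::Status s = db->FlushWAL(true /* sync */);
      // Before this fix, the synced size recorded in the MANIFEST could exceed
      // the bytes actually synced, tripping verification later.
      ```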
      
      The bug could be repro'd almost immediately with the following crash test command: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --max_bytes_for_level_base=2097152 --target_file_size_base=524288 --duration=3600 --interval=10 --sync_fault_injection=1 --disable_wal=0 --checkpoint_one_in=1000 --max_key=10000 --value_size_mult=33`.
      
      An example error message produced by the above command is shown below. The error sometimes arose from the checkpoint and other times arose from the main stress test DB.
      
      ```
      Corruption: Size mismatch: WAL (log number: 119) in MANIFEST is 27938 bytes , but actually is 27859 bytes on disk.
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10185
      
      Test Plan:
      - repro unit test
      - the above crash test command no longer finds the error. It does find a different error after a while longer such as "Corruption: WAL file 481 required by manifest but not in directory list"
      
      Reviewed By: riversand963
      
      Differential Revision: D37200993
      
      Pulled By: ajkr
      
      fbshipit-source-id: 98e0071c1a89f4d009888512ed89f9219779ae5f
    • Add rate-limiting support to batched MultiGet() (#10159) · a5d773e0
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.
      
      However, due to a recent urgent user request, we decided to pursue an elementary (but conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge, as we will consume a huge amount of bandwidth from the rate limiter at once, causing a burst on disk. This is not what we ultimately want from the rate limiter, so follow-up work is noted through TODO comments. (A sketch of the approach follows.)
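      A hedged sketch of this elementary approach (illustrative, not the exact RocksDB code):
      
      ```cpp
      #include <algorithm>
      
      #include "rocksdb/rate_limiter.h"
      
      // Accumulate rate-limiter grants until they cover the whole MultiRead(),
      // then issue the read in one shot.
      void RequestTokensForMultiRead(rocksdb::RateLimiter* rate_limiter,
                                     size_t total_bytes) {
        size_t remaining = total_bytes;
        while (remaining > 0) {
          size_t allowed = std::min<size_t>(
              remaining, static_cast<size_t>(rate_limiter->GetSingleBurstBytes()));
          rate_limiter->Request(allowed, rocksdb::Env::IO_USER, nullptr /* stats */,
                                rocksdb::RateLimiter::OpType::kRead);
          remaining -= allowed;
        }
        // ... now perform the MultiRead() covering total_bytes ...
      }
      ```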
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159
      
      Test Plan:
      - Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
      - Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of  `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
      - Stress test:
         - Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work
      
      Reviewed By: ajkr, anand1976
      
      Differential Revision: D37135172
      
      Pulled By: hx235
      
      fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
    • Read blob from blob cache if exists when GetBlob() (#10178) · c965c9ef
      Committed by Gang Liao
      Summary:
      There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
      In this task, we added a new abstraction layer `BlobSource` to retrieve blobs from either blob cache or raw blob file. Note: For simplicity, the current PR only includes `GetBlob()`.  `MultiGetBlob()` will be included in the next PR.
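      A configuration sketch, assuming the blob cache is exposed as the `blob_cache` column family option from this line of work:
      
      ```cpp
      #include "rocksdb/cache.h"
      #include "rocksdb/options.h"
      
      rocksdb::Options options;
      options.enable_blob_files = true;
      options.blob_cache = rocksdb::NewLRUCache(1024 * 1024 * 1024);  // 1 GiB
      ```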
      
      This PR is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10178
      
      Reviewed By: ltamasi
      
      Differential Revision: D37250507
      
      Pulled By: gangliao
      
      fbshipit-source-id: 3fc4a55a0cea955a3147bdc7dba06430e377259b
    • Use optimized folly DistributedMutex in LRUCache when available (#10179) · 1aac8145
      Committed by Peter Dillinger
      Summary:
      folly DistributedMutex is faster than standard mutexes, though it
      imposes some static obligations on usage. See
      https://github.com/facebook/folly/blob/main/folly/synchronization/DistributedMutex.h
      for details. Here we use this alternative for our Cache implementations
      (especially LRUCache) for better locking performance, when RocksDB is
      compiled with folly.
      
      Also added information about which distributed mutex implementation is
      being used to cache_bench output and to DB LOG.
      
      Intended follow-up:
      * Use DMutex in more places, perhaps improving API to support non-scoped
      locking
      * Fix linking with fbcode compiler (needs ROCKSDB_NO_FBCODE=1 currently)
      
      Credit: Thanks Siying for reminding me about this line of work that was previously
      left unfinished.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10179
      
      Test Plan:
      for correctness, existing tests. CircleCI config updated.
      Also Meta-internal buck build updated.
      
      For performance, ran simultaneous before & after cache_bench. Out of three
      comparison runs, the middle improvement to ops/sec was +21%:
      
      Baseline: USE_CLANG=1 DEBUG_LEVEL=0 make -j24 cache_bench (fbcode
      compiler)
      
      ```
      Complete in 20.201 s; Rough parallel ops/sec = 1584062
      Thread ops/sec = 107176
      
      Operation latency (ns):
      Count: 32000000 Average: 9257.9421  StdDev: 122412.04
      Min: 134  Median: 3623.0493  Max: 56918500
      Percentiles: P50: 3623.05 P75: 10288.02 P99: 30219.35 P99.9: 683522.04 P99.99: 7302791.63
      ```
      
      New: (add USE_FOLLY=1)
      
      ```
      Complete in 16.674 s; Rough parallel ops/sec = 1919135  (+21%)
      Thread ops/sec = 135487
      
      Operation latency (ns):
      Count: 32000000 Average: 7304.9294  StdDev: 108530.28
      Min: 132  Median: 3777.6012  Max: 91030902
      Percentiles: P50: 3777.60 P75: 10169.89 P99: 24504.51 P99.9: 59721.59 P99.99: 1861151.83
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D37182983
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a17eb05f25b832b6a2c1356f5c657e831a5af8d1
    • Fix overflow in ribbon_bench after #10184 (#10195) · f87adcfb
      Committed by Peter Dillinger
      Summary:
      Ribbon micro-bench needs updating after re-numbering
      `BloomLikeFilterPolicy::GetAllFixedImpls()` entries. (CircleCI nightly
      failure.)
      
      Also fixed memory leaks while using ASAN to validate my fix. (I assume
      the leaks weren't intentional for some performance characteristic.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10195
      
      Test Plan: run with ASAN
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37244459
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5a363e10de3c4c9c88099c937e3dc3b4cf24fd30
  8. 17 June 2022, 6 commits
    • Add WriteOptions::protection_bytes_per_key (#10037) · 5d6005c7
      Committed by Andrew Kryczka
      Summary:
      Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.
      
      Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user.
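      A minimal usage sketch (`db`, `key`, and `value` assumed):
      
      ```cpp
      rocksdb::WriteOptions wo;
      wo.protection_bytes_per_key = 8;  // currently the only supported nonzero value
      rocksdb::Status s = db->Put(wo, key, value);
      // Any other nonzero value makes the write return Status::NotSupported.
      ```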
      
      There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037
      
      Test Plan:
      - Manual
        - Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
        - Enabled in MyShadow for 1+ week
      - Automated
        - Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
        - Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection
      
      Reviewed By: cbi42
      
      Differential Revision: D36614569
      
      Pulled By: ajkr
      
      fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
    • Fix a false negative merge conflict (#10192) · f62c1e1e
      Committed by Peter Dillinger
      Summary:
      .. between https://github.com/facebook/rocksdb/issues/10184 and https://github.com/facebook/rocksdb/issues/10122 not detected by source control,
      leading to non-compiling code.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10192
      
      Test Plan: updated test
      
      Reviewed By: hx235
      
      Differential Revision: D37231921
      
      Pulled By: pdillinger
      
      fbshipit-source-id: fa21488716f4c006b111b8c4127d71c757c935c3
    • Update HISTORY.md for #10114 (#10189) · 8cf86258
      Committed by Changyu Bi
      Summary:
      Update HISTORY.md for https://github.com/facebook/rocksdb/issues/10114: write batch checksum verification before writing to WAL.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10189
      
      Reviewed By: ajkr
      
      Differential Revision: D37226366
      
      Pulled By: cbi42
      
      fbshipit-source-id: cd2f076961abc35f35783e0f2cc3beda68cdb446
    • More testing w/prefix extractor, small refactor (#10122) · fff302d9
      Committed by Peter Dillinger
      Summary:
      There was an interesting code path not covered by testing that
      is difficult to replicate in a unit test, which is now covered using a
      sync point. Specifically, the case of table_prefix_extractor == null and
      !need_upper_bound_check in `BlockBasedTable::PrefixMayMatch`, which
      can happen if table reader is open before extractor is registered with global
      object registry, but is later registered and re-set with SetOptions. (We
      don't have sufficient testing control over object registry to set that up
      repeatedly.)
      
      Also, this function has been renamed to `PrefixRangeMayMatch` for clarity
      vs. other functions that are not the same.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10122
      
      Test Plan: unit tests expanded
      
      Reviewed By: siying
      
      Differential Revision: D36944834
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 9e52d9da1929a3e42bbc230fcdc3599949de7bdb
    • Remove deprecated block-based filter (#10184) · 126c2237
      Committed by Peter Dillinger
      Summary:
      In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
      the public API, because of its inefficiency. Although we normally maintain read compatibility
      on old DBs forever, filters are not required for reading a DB, only for optimizing read
      performance. Thus, it should be acceptable to remove this code and the substantial
      maintenance burden it carries as useful features are developed and validated (such
      as user timestamp).
      
      This change completely removes the code for reading and writing the old block-based
      filters, net removing about 1370 lines of code no longer needed. Options removed from
      testing / benchmarking tools. The prior existence is only evident in a couple of places:
      * `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
      a major release to minimize source code incompatibilities.
      * A warning is logged when an old table file is opened that used the old block-based
      filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
      should suffice. Unfortunately, sst_dump does not tell you whether a file uses
      block-based filter, and the structure of the code makes it very difficult to fix.
      * To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
      for metaindex is maintained (for now).
      
      Other notes:
      * In some cases where numbers are associated with filter configurations, we have had to
      update the assigned numbers so that they all correspond to something that exists.
      * Fixed potential stat counting bug by assuming `filter_checked = false` for cases
      like `filter == nullptr` rather than assuming `filter_checked = true`
      * Removed obsolete `block_offset` and `prefix_extractor` parameters from several
      functions.
      * Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
      because the caller guarantees the prefix extractor exists and is compatible
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184
      
      Test Plan:
      tests updated, manually test new warning in LOG using base version to
      generate a DB
      
      Reviewed By: riversand963
      
      Differential Revision: D37212647
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
    • Update stats to help users estimate MultiGet async IO impact (#10182) · a6691d0f
      Committed by anand76
      Summary:
      Add a couple of stats to help users estimate the impact of potential MultiGet perf improvements -
      1. NUM_LEVEL_READ_PER_MULTIGET - A histogram stat for number of levels that required MultiGet to read from a file
      2. MULTIGET_COROUTINE_COUNT - A ticker stat to count the number of times the coroutine version of MultiGetFromSST was used
      
      The NUM_DATA_BLOCKS_READ_PER_LEVEL stat is obsoleted as it doesn't provide useful information for MultiGet optimization.
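      A sketch of reading the new stats (enum names per this PR):
      
      ```cpp
      #include "rocksdb/statistics.h"
      
      rocksdb::Options options;
      options.statistics = rocksdb::CreateDBStatistics();
      // ... run a MultiGet() workload ...
      rocksdb::HistogramData levels_per_multiget;
      options.statistics->histogramData(rocksdb::NUM_LEVEL_READ_PER_MULTIGET,
                                        &levels_per_multiget);
      uint64_t coroutine_count =
          options.statistics->getTickerCount(rocksdb::MULTIGET_COROUTINE_COUNT);
      ```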
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10182
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D37213296
      
      Pulled By: anand1976
      
      fbshipit-source-id: 5d2b7708017c0e278578ae4bffac3926f6530efb
  9. 16 June 2022, 1 commit
    • Abort in dbg mode after logging (#10183) · 4d31d3c2
      Committed by Yanqin Jin
      Summary:
      In the CompactionIterator code, there are multiple places where the process
      will abort in debug mode before logging the error message describing the
      cause. This PR changes only the logging behavior for the compaction iterator so
      that the error message is written to LOG before the process aborts in debug
      mode.
      
      Also updated the triggering condition for an assertion for single delete with
      user-defined timestamp.
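      A sketch of the resulting pattern (illustrative; `info_log_`, `ikey_`, and the message are placeholders):
      
      ```cpp
      // Log first, then abort in debug builds, so the cause is preserved in LOG.
      ROCKS_LOG_FATAL(info_log_, "Unexpected key type during compaction: %d",
                      static_cast<int>(ikey_.type));
      assert(false);
      ```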
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10183
      
      Test Plan: make check
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D37190218
      
      Pulled By: riversand963
      
      fbshipit-source-id: 741bb007067be7cfbe94ac9e530ad4b2b339c009