1. 06 Jul 2022, 4 commits
    • Expand stress test coverage for user-defined timestamp (#10280) · caced09e
      Committed by Yanqin Jin
      Summary:
      Before this PR, we call `now()` to get the wall time before performing point-lookup and range
      scans when user-defined timestamp is enabled.
      
      With this PR, we expand the coverage to:
      - read with an older timestamp that is larger than the wall time when the process starts but potentially smaller than now()
      - add coverage for `ReadOptions::iter_start_ts != nullptr`
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10280
      
      Test Plan:
      ```bash
      make check
      ```
      
      Also,
      ```bash
      TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
      ```
      
      So far, we have had four successful runs of the above
      
      In addition,
      ```bash
      TEST_TMPDIR=/dev/shm/rocksdb make crash_test
      ```
      Succeeded twice showing no regression.
      
      Reviewed By: ltamasi
      
      Differential Revision: D37539805
      
      Pulled By: riversand963
      
      fbshipit-source-id: f2d9887ad95245945ce17a014d55bb93f00e1cb5
    • Add the git hash and full RocksDB version to report.tsv (#10277) · 9eced1a3
      Committed by Mark Callaghan
      Summary:
      Previously the version was displayed as $major.$minor.
      This changes it to $major.$minor.$patch.
      
      This also adds the git hash of the commit from which RocksDB was built to the end of report.tsv. I confirmed that benchmark_log_tool.py still parses it and that the people
      who consume/graph these results are OK with it.
      
      Example output:
      ops_sec	mb_sec	lsm_sz	blob_sz	c_wgb	w_amp	c_mbps	c_wsecs	c_csecs	b_rgb	b_wgb	usec_op	p50	p99	p99.9	p99.99	pmax	uptime	stall%	Nstall	u_cpu	s_cpu	rss	test	date	version	job_id	githash
      609488	244.1	1GB	0.0GB,	1.4	0.7	93.3	39	38	0	0	1.6	1.0	4	15	26	5365	15	0.0	0	0.1	0.0	0.5	fillseq.wal_disabled.v400	2022-06-29T13:36:05	7.5.0		61152544
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10277
      
      Test Plan: Run it
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37532418
      
      Pulled By: mdcallag
      
      fbshipit-source-id: 55e472640d51265819b228d3373c9fa9b62b660d
    • Try to trivial move more than one files (#10190) · a9565ccb
      Committed by sdong
      Summary:
      In leveled compaction, try to trivially move more than one file when possible, up to 4 files or max_compaction_bytes. This allows higher write throughput for use cases where data is loaded in sequential order and applying compaction results is the bottleneck.
      
      When picking a file to compact, if it has no overlapping files in the next level, try to expand the move to include the next file as long as there is still no overlap.
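      The expansion step described above can be sketched as follows (a hedged illustration with simplified integer key ranges; `FileRange`, `Overlaps`, and `ExpandTrivialMove` are hypothetical names, not the actual compaction-picker code):

      ```cpp
      #include <cstddef>
      #include <vector>

      // Hypothetical stand-in for an SST file's smallest/largest user keys.
      struct FileRange {
        int smallest;
        int largest;
      };

      // True if `f` overlaps any file in the next level.
      bool Overlaps(const FileRange& f, const std::vector<FileRange>& next_level) {
        for (const auto& g : next_level) {
          if (f.smallest <= g.largest && g.smallest <= f.largest) {
            return true;
          }
        }
        return false;
      }

      // Starting from `start`, count how many consecutive files can be moved
      // together: keep expanding while the next file still has no overlap in
      // the next level, up to `max_files` (the PR caps the move at 4 files or
      // max_compaction_bytes).
      size_t ExpandTrivialMove(const std::vector<FileRange>& level,
                               const std::vector<FileRange>& next_level,
                               size_t start, size_t max_files) {
        size_t count = 0;
        for (size_t i = start; i < level.size() && count < max_files; ++i) {
          if (Overlaps(level[i], next_level)) break;
          ++count;
        }
        return count;
      }
      ```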
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10190
      
      Test Plan:
      Add some unit tests.
      For performance, Try to run
      ./db_bench_multi_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
      Together with https://github.com/facebook/rocksdb/pull/10188 , stalling will be eliminated in this benchmark.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37230647
      
      fbshipit-source-id: 42b260f545c46abc5d90335ac2bbfcd09602b549
    • Update code comment and logging for secondary instance (#10260) · d6b9c4ae
      Committed by Yanqin Jin
      Summary:
      Before this PR, applications were required to open a RocksDB secondary
      instance with `max_open_files = -1`. This is a hacky workaround that
      prevents IOErrors on the secondary instance during point lookups or range
      scans caused by the primary instance deleting table files. It is not
      necessary if the application can coordinate the primary and secondaries
      so that the primary does not delete files that are still being used by the
      secondaries, or if users provide a custom Env/FS implementation that
      deletes files only after all primary and secondary instances
      indicate they are obsolete.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10260
      
      Test Plan: make check
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37462633
      
      Pulled By: riversand963
      
      fbshipit-source-id: 9c2fc939f49663efa61e3d60c8f1e01d64b9d72c
  2. 04 Jul 2022, 1 commit
  3. 02 Jul 2022, 1 commit
    • Fix CalcHashBits (#10295) · 54f678cd
      Committed by Guido Tagliavini Ponce
      Summary:
      We fix two bugs in CalcHashBits. The first one is an off-by-one error: the desired number of table slots is the real number ``capacity / (kLoadFactor * handle_charge)``, which should not be rounded down. The second one is that we should disallow inputs that set the element charge to 0, namely ``estimated_value_size == 0 && metadata_charge_policy == kDontChargeCacheMetadata``.
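      A minimal sketch of the corrected rounding (hedged illustration; `CalcHashBitsSketch` is a hypothetical stand-in, not the lru_cache implementation): the desired slot count is a real-valued ratio, so the bit count must cover its ceiling, and a zero element charge must be rejected up front.

      ```cpp
      #include <cassert>
      #include <cstddef>

      // Smallest number of hash bits such that 2^bits covers the desired slot
      // count capacity / (load_factor * handle_charge), rounded *up*.
      int CalcHashBitsSketch(size_t capacity, size_t handle_charge,
                             double load_factor) {
        assert(handle_charge > 0);  // zero element charge is a disallowed input
        double num_slots =
            static_cast<double>(capacity) / (load_factor * handle_charge);
        int bits = 0;
        while (static_cast<double>(size_t{1} << bits) < num_slots) {
          ++bits;  // rounding the slot count down here was the off-by-one bug
        }
        return bits;
      }
      ```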
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10295
      
      Test Plan: CalcHashBits is tested by CalcHashBitsTest (in lru_cache_test.cc). The test now iterates over many more inputs; it covers, in particular, the rounding error edge case. Overall, the test is now more robust. Run ``make -j24 check``.
      
      Reviewed By: pdillinger
      
      Differential Revision: D37573797
      
      Pulled By: guidotag
      
      fbshipit-source-id: ea4f4439f7196ab1c1afb88f566fe92850537262
  4. 01 Jul 2022, 10 commits
    • Add FLAGS_compaction_pri into crash_test (#10255) · e716bda0
      Committed by zczhu
      Summary:
      Add FLAGS_compaction_pri into correctness test
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10255
      
      Test Plan: run crash_test with FLAGS_compaction_pri
      
      Reviewed By: ajkr
      
      Differential Revision: D37510372
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 73d93a0a047d0c3993c8a512383dd6ee6acef641
    • Fix bug in Logger creation if dbname and db_log_dir are on different filesystem (#10292) · 11215e0f
      Committed by Akanksha Mahajan
      Summary:
      If dbname and db_log_dir are on different filesystems (one
      local and one remote), creation of dbname will fail because that path
      does not exist with respect to db_log_dir.
      This patch ignores the error returned on creation of dbname. If the two
      are on the same filesystem, creation of db_log_dir will automatically
      surface any error encountered while creating dbname.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10292
      
      Test Plan: Existing unit tests
      
      Reviewed By: riversand963
      
      Differential Revision: D37567773
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 005d28c536208d4c126c8cb8e196d1d85b881100
    • Multi-File Trivial Move in L0->L1 (#10188) · 4428c761
      Committed by sdong
      Summary:
      In leveled compaction, an L0->L1 trivial move can now move more than one file in a single compaction. This allows L0 files to be moved down faster when data is loaded in sequential order, making the slowdown or stop conditions harder to hit. An L0->L1 trivial move is also attempted when only some files qualify:
      1. Always try to find an L0->L1 trivial move starting from the oldest files, and keep including newer files until adding another file would no longer allow a trivial move.
      2. Modify the trivial-move condition so that such a compaction is tagged as a trivial move.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10188
      
      Test Plan:
      See throughput improvements with db_bench with fast fillseq benchmark and small L0 files:
      
      ./db_bench_l0_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
      
      The throughput improved by about 50%. Stalling still happens though.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37224743
      
      fbshipit-source-id: 8958d97f22e12bdfc14d2e85930f6fa0070e9659
    • Remove compact cursor when split sub-compactions (#10289) · 4f51101d
      Committed by zczhu
      Summary:
      In round-robin compaction priority, when splitting a compaction into sub-compactions, the earlier implementation took the compact cursor into account to make full use of the available sub-compactions. But this may result in unbalanced sub-compactions, so we remove that behavior here. The removal does not affect the cursor-based splitting mechanism within a sub-compaction, so the output files are still guaranteed to be split according to the cursor.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10289
      
      Reviewed By: ajkr
      
      Differential Revision: D37559091
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: b8b45b99f63b09cf873f7f049bcb4ab13871fffc
    • Add undefok for BlobDB options not supported prior to 7.5 (#10276) · 720ab355
      Committed by Mark Callaghan
      Summary:
      This adds --undefok to support use of this script with BlobDB for db_bench versions prior
      to 7.5, before the options land in a release.
      
      While there is a limit to how far back this script can go with respect to backwards compatibility,
      this is an easy change to support early 7.x releases.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10276
      
      Test Plan: Run it with versions of db_bench that do not and then do support these options
      
      Reviewed By: gangliao
      
      Differential Revision: D37529299
      
      Pulled By: mdcallag
      
      fbshipit-source-id: 7bb1feec5c68760e6d64792c585bfbde4f5e52d8
    • Change The Way Level Target And Compaction Score Are Calculated (#10057) · b397dcd3
      Committed by sdong
      Summary:
      The current level targets for dynamic leveling have a problem: the target level size changes dramatically after an L0->L1 compaction. When there are many L0 bytes, lower-level compactions are delayed, but they resume after the L0->L1 compaction finishes, so the expected write-amplification benefits might not be realized. The proposal here is to keep the level target sizes as they are and instead adjust the score of each level to prioritize the levels that most need compaction.
      Basic idea:
      (1) Target level sizes are not adjusted; scores are. The reasoning is that with parallel compactions, holding compactions back might not be desirable, but we would like compactions to be scheduled from the level we feel needs them most. For example, if we have an extra-large L2, we would like all compactions to be scheduled as L2->L3 compactions rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on: should we compact L2->L3 or L4->L5? So the proposal for that is:
      (2) The score is calculated as actual level size / (target size + estimated upper-level bytes coming down). The reasoning is that with a large amount of pending L0/L1 bytes coming down, compacting L2->L3 might be more expensive, since once the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, bytes coming down to L5 affect the L5->L6 fanout much less. So when calculating the score, we can adjust it by adding the estimated downward bytes to the target level size.
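      The adjusted score in (2) can be written as a one-line formula (a sketch of the description above; `LevelScoreSketch` is an illustrative name, not the actual VersionStorageInfo code):

      ```cpp
      // score = actual level size / (target size + estimated bytes coming down
      // from upper levels). Pending upper-level bytes inflate the denominator,
      // deprioritizing the level until those bytes arrive.
      double LevelScoreSketch(double actual_bytes, double target_bytes,
                              double estimated_incoming_bytes) {
        return actual_bytes / (target_bytes + estimated_incoming_bytes);
      }
      ```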
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10057
      
      Test Plan: Repurpose tests VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario.
      
      Reviewed By: ajkr
      
      Differential Revision: D37539742
      
      fbshipit-source-id: 9c154cbfe92023f918cf5d80875d8776ad4831a4
    • Enable blob caching for MultiGetBlob in RocksDB (#10272) · 056e08d6
      Committed by Gang Liao
      Summary:
      - [x] Enabled blob caching for MultiGetBlob in RocksDB
      - [x] Refactored MultiGetBlob logic and interface in RocksDB
      - [x] Cleaned up Version::MultiGetBlob() and moved 'blob'-related code snippets into BlobSource
      - [x] Add End-to-end test cases in db_blob_basic_test and also add unit tests in blob_source_test
      
      This task is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10272
      
      Reviewed By: ltamasi
      
      Differential Revision: D37558112
      
      Pulled By: gangliao
      
      fbshipit-source-id: a73a6a94ffdee0024d5b2a39e6d1c1a7d38664db
    • include compaction cursors in VersionEdit debug string (#10288) · 20754b36
      Committed by Andrew Kryczka
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10288
      
      Test Plan:
      try it out -
      
      ```
      $ ldb manifest_dump --db=/dev/shm/rocksdb.0uWV/rocksdb_crashtest_whitebox/ --hex --verbose | grep CompactCursor | head -3
        CompactCursor: 1 '00000000000011D9000000000000012B0000000000000266' seq:0, type:1
        CompactCursor: 1 '0000000000001F35000000000000012B0000000000000022' seq:0, type:1
        CompactCursor: 2 '00000000000011D9000000000000012B0000000000000266' seq:0, type:1
      ```
      
      Reviewed By: littlepig2013
      
      Differential Revision: D37557177
      
      Pulled By: ajkr
      
      fbshipit-source-id: 7b76b857d9e7a9f3d53398a61bb1d4b78873b91e
    • Add load_latest_options() to C api (#10152) · 17a6f7fa
      Committed by Yueh-Hsuan Chiang
      Summary:
      Add load_latest_options() to C api.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10152
      
      Test Plan:
      Extend the existing c_test by reopening db using the latest options file
      at different parts of the test.
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D37305225
      
      Pulled By: ajkr
      
      fbshipit-source-id: 8b3bab73f56fa6fcbdba45aae393145d007b3962
    • Fix assertion error with read_opts.iter_start_ts (#10279) · b87c3557
      Committed by Yanqin Jin
      Summary:
      If the internal iterator is not valid, `SeekToLast` with iter_start_ts should set `valid_` to false without an assertion failure.
      Test Plan: make check
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10279
      
      Reviewed By: ltamasi
      
      Differential Revision: D37539393
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8e94057838f8a05144fad5768f4d62f1893ec315
  5. 30 Jun 2022, 5 commits
    • Clock cache (#10273) · 57a0e2f3
      Committed by Guido Tagliavini Ponce
      Summary:
      This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
      - Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
      - Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
      - Middle insertions into the clock list.
      - A test that exercises the clock eviction policy.
      - Update the Java API of ClockCache and Java calls to C++.
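      The double-hashing probe mentioned in the first bullet can be sketched as follows (hedged illustration assuming a power-of-two table size; `ProbeSlot` is a hypothetical name, not the ClockCache code):

      ```cpp
      #include <cstdint>

      // Slot i of a double-hashing probe sequence driven by two 32-bit hashes.
      // Forcing the increment odd makes it coprime with the power-of-two table
      // size, so the sequence eventually visits every slot.
      uint32_t ProbeSlot(uint32_t h1, uint32_t h2, uint32_t i,
                         uint32_t table_size) {
        uint32_t increment = h2 | 1;
        return (h1 + i * increment) & (table_size - 1);
      }
      ```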
      
      Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273
      
      Test Plan: ``make -j24 check``
      
      Reviewed By: pdillinger
      
      Differential Revision: D37522461
      
      Pulled By: guidotag
      
      fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
    • Fix GetWindowsErrSz nullptr bug (#10282) · c2dc4c0c
      Committed by Johnny Shaw
      Summary:
      `GetWindowsErrSz` may assign a `nullptr` to `std::string` in the event it cannot format the error code to a string. This will result in a crash when `std::string` attempts to calculate the length from `nullptr`.
      
      The change here checks the output of `FormatMessageA` and only assigns to the output `std::string` if it is not null. Additionally, the call to free the buffer is made only if `FormatMessageA` returned a non-null value. In the event `FormatMessageA` does not produce a string, an empty string is returned instead.
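      The guard boils down to never constructing a `std::string` from a possibly-null `char*`. A minimal sketch with a generic stand-in (FormatMessageA is Windows-only; `SafeErrString` is a hypothetical helper mirroring the fix):

      ```cpp
      #include <string>

      // Only build a std::string when the formatter actually produced a
      // buffer; std::string(nullptr) is undefined behavior and crashes in
      // practice, so return an empty string instead.
      std::string SafeErrString(const char* formatted) {
        return formatted != nullptr ? std::string(formatted) : std::string();
      }
      ```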
      
      Fixes https://github.com/facebook/rocksdb/issues/10274
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10282
      
      Reviewed By: riversand963
      
      Differential Revision: D37542143
      
      Pulled By: ajkr
      
      fbshipit-source-id: c21f5119ddb451f76960acec94639d0f538052f2
    • WriteBatch reorder fields to reduce padding (#10266) · 490fcac0
      Committed by leipeng
      Summary:
      This reordering reduces sizeof(WriteBatch) by 16 bytes.
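      The effect of such a reordering can be shown with a toy struct (illustrative only; these are not WriteBatch's real members, and the exact sizes assume a typical 64-bit ABI):

      ```cpp
      #include <cstdint>

      struct Padded {     // 32 bytes: 7 pad bytes after each bool
        uint64_t a;
        bool b;
        uint64_t c;
        bool d;
      };

      struct Reordered {  // 24 bytes: same members, bools share one tail pad
        uint64_t a;
        uint64_t c;
        bool b;
        bool d;
      };

      static_assert(sizeof(Reordered) < sizeof(Padded),
                    "largest-alignment-first ordering removes padding");
      ```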
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10266
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D37505201
      
      Pulled By: ajkr
      
      fbshipit-source-id: 6cb6c3735073fcb63921f822d5e15670fecb1c26
    • Fix A Bug Where Concurrent Compactions Cause Further Slowing Down (#10270) · 61152544
      Committed by sdong
      Summary:
      Currently, when installing a new super version while a stalling condition is triggered, we compare the estimated compaction bytes to the previous value, and if the new value is larger or equal, we reduce the slowdown write rate. However, if concurrent compactions happen, the same previous value might be used for several comparisons. The result is that although some compactions reduce the estimated compaction bytes, we treat them as a signal for further slowing down. In some cases this causes the slowdown rate to drop all the way to the minimum, far lower than needed.
      
      Fix the bug by not triggering a recalculation if the new super version doesn't have a Version or memtable change. With this fix, compaction completions are still undercounted by this algorithm, but that is still better than the current bug, where they are effectively counted negatively.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10270
      
      Test Plan: Run a benchmark where the slowdown rate was unnecessarily dropped to the minimum and verify that it returns to a normal value.
      
      Reviewed By: ajkr
      
      Differential Revision: D37497327
      
      fbshipit-source-id: 9bca961cc38fed965c3af0fa6c9ca0efaa7637c4
    • Expose LRU cache num_shard_bits paramater in C api (#10222) · 12bfd519
      Committed by Edvard Davtyan
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10222
      
      Reviewed By: cbi42
      
      Differential Revision: D37358171
      
      Pulled By: ajkr
      
      fbshipit-source-id: e86285fdceaec943415ee9d482090009b00cbc95
  6. 29 Jun 2022, 5 commits
    • Benchmark fix write amplification computation (#10236) · 28f2d3cc
      Committed by Mark Callaghan
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10236
      
      Reviewed By: ajkr
      
      Differential Revision: D37489898
      
      Pulled By: mdcallag
      
      fbshipit-source-id: 4b4565973b1f2c47342b4d1b857c8f89e91da145
    • Support `iter_start_ts` for backward iteration (#10200) · b6cfda12
      Committed by Yanqin Jin
      Summary:
      Resolves https://github.com/facebook/rocksdb/issues/9761
      
      With this PR, applications can create an iterator with the following
      ```cpp
      ReadOptions read_opts;
      read_opts.timestamp = &ts_ub;
      read_opts.iter_start_ts = &ts_lb;
      auto* it = db->NewIterator(read_opts);
      it->SeekToLast();
      // or it->SeekForPrev("foo");
      it->Prev();
      ...
      ```
      The application can access different versions of the same user key via `key()`, `value()`, and `timestamp()`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10200
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D37258074
      
      Pulled By: riversand963
      
      fbshipit-source-id: 3f0b866ade50dcff7ef60d506397a9dd6ec91565
    • Update/clarify required properties for prefix extractors (#10245) · d96febee
      Committed by Peter Dillinger
      Summary:
      Most of the properties listed as required for prefix extractors
      are not really required but offer some conveniences. This updates API
      comments to clarify actual requirements, and adds tests to demonstrate
      how previously presumed requirements can be safely violated.
      
      This might seem like a useless exercise, but this relaxing of requirements
      would be needed if we generalize prefixes to group keys not just at the
      byte level but also based on bits or arbitrary value ranges. For
      applications without a "natural" prefix size, having only byte-level
      granularity often means one prefix size to the next differs in magnitude
      by a factor of 256.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10245
      
      Test Plan: Tests added, also covering missing Iterator cases from https://github.com/facebook/rocksdb/issues/10244
      
      Reviewed By: bjlemaire
      
      Differential Revision: D37371559
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ab2dd719992eea7656e9042cf8542393e02fa244
    • Deflake RateLimiting/BackupEngineRateLimitingTestWithParam (#10271) · ca81b80d
      Committed by Andrew Kryczka
      Summary:
      We saw flakes with the following failure:
      
      ```
      [ RUN      ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
      utilities/backup/backup_engine_test.cc:2667: Failure
      Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4
      terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
      what():  utilities/backup/backup_engine_test.cc:2667: Failure
      Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4
      Received signal 6 (Aborted)
      t/run-backup_engine_test-RateLimiting-BackupEngineRateLimitingTestWithParam.RateLimiting-1: line 4: 1032887 Aborted                 (core dumped) TEST_TMPDIR=$d ./backup_engine_test --gtest_filter=RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
      ```
      
      Investigation revealed we forgot to use the mock time `SystemClock` for
      restore rate limiting. Then the test used wall clock time, which made
      the execution of "GenericRateLimiter::Request:PostTimedWait"
      non-deterministic as wall clock time might have advanced enough that
      waiting was not needed.
      
      This PR changes restore rate limiting to use
      mock time, which guarantees we always execute
      "GenericRateLimiter::Request:PostTimedWait". Then the assertions that
      rely on times recorded inside that callback should be robust.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10271
      
      Test Plan:
      Applied the following patch which guaranteed repro before the fix.
      Verified the test passes after this PR even with that patch applied.
      
      ```
       diff --git a/util/rate_limiter.cc b/util/rate_limiter.cc
      index f369e3220..6b3ed82fa 100644
       --- a/util/rate_limiter.cc
      +++ b/util/rate_limiter.cc
      @@ -158,6 +158,7 @@ void GenericRateLimiter::SetBytesPerSecond(int64_t bytes_per_second) {
      
       void GenericRateLimiter::Request(int64_t bytes, const Env::IOPriority pri,
                                        Statistics* stats) {
      +  usleep(100000);
         assert(bytes <= refill_bytes_per_period_.load(std::memory_order_relaxed));
         bytes = std::max(static_cast<int64_t>(0), bytes);
         TEST_SYNC_POINT("GenericRateLimiter::Request");
      ```
      
      Reviewed By: hx235
      
      Differential Revision: D37499848
      
      Pulled By: ajkr
      
      fbshipit-source-id: fd790d5a192996be8ba13b656751ccc7d8cb8f6e
    • Add blob cache tickers, perf context statistics, and DB properties (#10203) · d7ebb58c
      Committed by Gang Liao
      Summary:
      In order to be able to monitor the performance of the new blob cache, we made the follow changes:
      - Add blob cache hit/miss/insertion tickers (see https://github.com/facebook/rocksdb/wiki/Statistics)
      - Extend the perf context similarly (see https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context)
      - Implement new DB properties (see e.g. https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h#L1042-L1051) that expose the capacity and current usage of the blob cache.
      
      This PR is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10203
      
      Reviewed By: ltamasi
      
      Differential Revision: D37478658
      
      Pulled By: gangliao
      
      fbshipit-source-id: d8ee3f41d47315ef725e4551226330b4b6832e40
  7. 28 Jun 2022, 4 commits
  8. 26 Jun 2022, 1 commit
    • Add API for writing wide-column entities (#10242) · c73d2a9d
      Committed by Levi Tamasi
      Summary:
      The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
      a new API called `PutEntity` that can be used to write a wide-column entity
      to the database. The new API is added to both `DB` and `WriteBatch`. Note
      that currently there is no way to retrieve these entities; more precisely, all
      read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
      encounter a wide-column entity that is required to answer a query. Read-side
      support (as well as other missing functionality like `Merge`, compaction filter,
      and timestamp support) will be added in later PRs.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D37369748
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
  9. 25 Jun 2022, 4 commits
    • Temporarily disable mempurge in crash test (#10252) · f322f273
      Committed by Andrew Kryczka
      Summary:
      Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252
      
      Reviewed By: riversand963
      
      Differential Revision: D37432948
      
      Pulled By: ajkr
      
      fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a
    • Pass rate_limiter_priority through filter block reader functions to FS (#10251) · 8e63d90f
      Committed by Bo Wang
      Summary:
      With https://github.com/facebook/rocksdb/pull/9996 , we can pass the rate_limiter_priority to FS for most cases. This PR is to update the code path for filter block reader.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10251
      
      Test Plan: Current unit tests should pass.
      
      Reviewed By: pdillinger
      
      Differential Revision: D37427667
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 1ce5b759b136efe4cfa48a6b97e2f837ff087433
    • Fix the flaky cursor persist test (#10250) · 410ca2ef
      Committed by zczhu
      Summary:
      The 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test` may occasionally fail due to an inconsistent LSM state. The issue is fixed by adding `Flush()` and `WaitForFlushMemTable()` to produce a more predictable and stable LSM state.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10250
      
      Test Plan: 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test`
      
      Reviewed By: jay-zhuang, riversand963
      
      Differential Revision: D37426091
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 56fbaab0384c380c1f279a16dc8732b139c9f611
    • Reduce overhead of SortFileByOverlappingRatio() (#10161) · 246d4697
      Committed by sdong
      Summary:
      Currently SortFileByOverlappingRatio() is O(n log n). This is usually fine, but when there are a lot of files in an LSM-tree, SortFileByOverlappingRatio() can take a non-trivial amount of time. The problem is severe when the user is loading keys in sorted order, where compaction consists only of trivial moves; this operation then becomes the bottleneck and limits total throughput. This commit makes SortFileByOverlappingRatio() find only the top 50 files by score. 50 files are usually enough for the parallel compactions needed at the level, and in case they are not, we fall back to random selection, which should be acceptable.
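      The core idea (selecting only the top-k scores instead of fully sorting) can be sketched with std::nth_element (hedged illustration; `TopKByScore` is an illustrative name, not the RocksDB code):

      ```cpp
      #include <algorithm>
      #include <cstddef>
      #include <functional>
      #include <vector>

      // Keep the k highest scores sorted at the front of the vector; the rest
      // is left in unspecified order. nth_element averages O(n), versus
      // O(n log n) for sorting everything.
      void TopKByScore(std::vector<double>& scores, size_t k) {
        if (scores.size() > k) {
          std::nth_element(scores.begin(), scores.begin() + k, scores.end(),
                           std::greater<double>());
        }
        std::sort(scores.begin(), scores.begin() + std::min(k, scores.size()),
                  std::greater<double>());
      }
      ```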
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10161
      
      Test Plan:
      Run a fillseq that generates a lot of files and observe improved throughput (although stalls are not yet eliminated). The command run:
      
      TEST_TMPDIR=/dev/shm/ ./db_bench_sort --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000
      
      The throughput improved by 11%.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37129469
      
      fbshipit-source-id: 492da2ef5bfc7cdd6daa3986b50d2ff91f88542d
  10. 24 Jun 2022, 5 commits
    • BlobDB in crash test hitting assertion (#10249) · 052666ae
      Committed by Gang Liao
      Summary:
      This task is to fix assertion failures during crash test runs. The cache entry size might not match the value size because the value size can include the on-disk (possibly compressed) size. Therefore, we removed the assertions.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10249
      
      Reviewed By: ltamasi
      
      Differential Revision: D37407576
      
      Pulled By: gangliao
      
      fbshipit-source-id: 577559f267c5b2437bcd0631cd0efabb6dde3b69
    • Fix race condition between file purge and backup/checkpoint (#10187) · 725df120
      Committed by Yanqin Jin
      Summary:
      Resolves https://github.com/facebook/rocksdb/issues/10129
      
      I extracted this fix from https://github.com/facebook/rocksdb/issues/7516 since it is also a bug in the main branch, and we want to
      separate it from the main part of the PR.
      
      There can be a race condition between two threads. Thread 1 executes
      `DBImpl::FindObsoleteFiles()` while thread 2 executes `GetSortedWals()`.
      ```
      Time   thread 1                                thread 2
        |  mutex_.lock
        |  read disable_delete_obsolete_files_
        |  ...
        |  wait on log_sync_cv and release mutex_
        |                                          mutex_.lock
        |                                          ++disable_delete_obsolete_files_
        |                                          mutex_.unlock
        |                                          mutex_.lock
        |                                          while (pending_purge_obsolete_files > 0) { bg_cv.wait;}
        |                                          wake up with mutex_ locked
        |                                          compute WALs tracked by MANIFEST
        |                                          mutex_.unlock
        |  wake up with mutex_ locked
        |  ++pending_purge_obsolete_files_
        |  mutex_.unlock
        |
        |  delete obsolete WAL
        |                                          WAL missing but tracked in MANIFEST.
        V
      ```
      
      The proposed fix eliminates the possibility of the above by incrementing
      `pending_purge_obsolete_files_` before `FindObsoleteFiles()` can possibly release the mutex.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10187
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D37214235
      
      Pulled By: riversand963
      
      fbshipit-source-id: 556ab1b58ae6d19150169dfac4db08195c797184
      725df120
    • M
      Wrapper for benchmark.sh to run a sequence of db_bench tests (#10215) · 60619057
      Authored by Mark Callaghan
      Summary:
      This provides two things:
      1) Runs a sequence of db_bench tests. This sequence was chosen to provide
      good coverage with less variance.
      2) Makes it easier to do A/B testing for multiple binaries. This combines
      the report.tsv files into summary.tsv to make it easier to compare results
      across multiple binaries.
      
      Example output for 2) is:
      
      ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
      1115171 446.7   9GB             8.9     1.0     454.7   26      26      0       0       0.9     0.5     2       7       51      5547    20      0.0     0       0.1     0.1     0.2     fillseq.wal_disabled.v400       2022-04-12T08:53:51     6.0
      1045726 418.9   8GB     0.0GB   8.4     1.0     432.4   27      26      0       0       1.0     0.5     2       6       102     5618    20      0.0     0       0.1     0.0     0.1     fillseq.wal_disabled.v400       2022-04-12T12:25:36     6.28
      
      ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
      2969192 1189.3  16GB            0.0             0.0     0       0       0       0       10.8    9.3     25      33      49      13551   1781    0.0     0       48.2    6.8     16.8    readrandom.t32  2022-04-12T08:54:28     6.0
      2692922 1078.6  16GB    0.0GB   0.0             0.0     0       0       0       0       11.9    10.2    30      38      56      49735   1781    0.0     0       47.8    6.7     16.8    readrandom.t32  2022-04-12T12:26:15     6.28
      
      ...
      
      ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
      180227  72.2    38GB            1126.4  8.7     643.2   3286    3218    0       0       177.6   50.2    2687    4083    6148    854083  1793    68.4    7804    17.0    5.9     0.5     overwrite.t32.s0        2022-04-12T11:55:21     6.0
      236512  94.7    31GB    0.0GB   1502.9  8.9     862.2   5242    5125    0       0       135.3   59.9    2537    3268    5404    18545   1785    49.7    5112    25.5    8.0     9.4     overwrite.t32.s0        2022-04-12T15:27:25     6.28
      
      Example output with formatting preserved is here:
      https://gist.github.com/mdcallag/4432e5bbaf91915c916d46bd6ce3c313
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10215
      
      Test Plan: run it
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37299892
      
      Pulled By: mdcallag
      
      fbshipit-source-id: e6e0ed638fd7e8deeb869d700593fdc3eba899c8
      60619057
    • Y
      Add suggest_compact_range() and suggest_compact_range_cf() to C API. (#10175) · 2a3792ed
      Authored by Yueh-Hsuan Chiang
      Summary:
      Add suggest_compact_range() and suggest_compact_range_cf() to C API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10175
      
      Test Plan:
      Since verifying the result requires SyncPoint, which is not available in c_test.c,
      the test currently just invokes the functions and makes sure they do not crash.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37305191
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0fe257b45914f6c9aeb985d8b1820dafc57a20db
      2a3792ed
    • Z
      Cut output files at compaction cursors (#10227) · 17a1d65e
      Authored by zczhu
      Summary:
      The files behind the compaction cursor contain newer data than the files ahead of it. If a compaction writes a file that spans from before its output level's cursor to after it, then data before the cursor will be contaminated with the old timestamp from the data after the cursor. To avoid this, we can split the output file into two: one entirely before the cursor and one entirely after it. Note that, in rare cases, we **DO NOT** need to cut the file if it is a trivial move, since the file will not be contaminated by older files. In that case, the compaction cursor is not guaranteed to be a file boundary, but this does not hurt the round-robin selection process.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10227
      
      Test Plan:
      Added the `RoundRobinCutOutputAtCompactCursor` unit test in `db_compaction_test`
      
      Task: [T122216351](https://www.internalfb.com/intern/tasks/?t=122216351)
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37388088
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 9246a6a084b6037b90d6ab3183ba4dfb75a3378d
      17a1d65e