  1. Jul 12, 2022 (1 commit)
  2. Jul 09, 2022 (2 commits)
  3. Jul 08, 2022 (1 commit)
  4. Jul 07, 2022 (8 commits)
    • Eliminate the copying of blobs when serving reads from the cache (#10297) · c987eb47
      Gang Liao committed
      Summary:
      The blob cache enables an optimization on the read path: when a blob is found in the cache, we can avoid copying it into the buffer provided by the application. Instead, we can simply transfer ownership of the cache handle to the target `PinnableSlice`. (Note: this relies on the `Cleanable` interface, which is implemented by `PinnableSlice`.)
      
      This has the potential to save a lot of CPU, especially with large blob values.
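
      A minimal sketch of the idea, assuming the cache entry stores the blob as a `Slice`; `ServeBlobFromCache` is an illustrative name, not the actual BlobSource code:
      ```cpp
      #include <rocksdb/cache.h>
      #include <rocksdb/slice.h>

      using namespace ROCKSDB_NAMESPACE;

      // Hand the cache handle to the PinnableSlice instead of copying the blob.
      // PinnableSlice is Cleanable, so the registered cleanup releases the
      // handle once the application is done with the value. No copy happens.
      void ServeBlobFromCache(Cache* cache, Cache::Handle* handle,
                              PinnableSlice* value) {
        const Slice* blob = static_cast<const Slice*>(cache->Value(handle));
        value->PinSlice(
            *blob,
            [](void* arg1, void* arg2) {
              static_cast<Cache*>(arg1)->Release(
                  static_cast<Cache::Handle*>(arg2));
            },
            cache, handle);
      }
      ```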
      
      This task is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10297
      
      Reviewed By: riversand963
      
      Differential Revision: D37640311
      
      Pulled By: gangliao
      
      fbshipit-source-id: 92de0e35cc703d06c87c5c1861cc2899ec52234a
    • Midpoint insertions in ClockCache (#10305) · c277aeb4
      Guido Tagliavini Ponce committed
      Summary:
      When an element is first inserted into the ClockCache, it is now assigned either medium or high clock priority, depending on whether its cache priority is low or high, respectively. This is a variant of LRUCache's midpoint insertions. The main difference is that LRUCache can specify the capacity allocated to high-priority elements via the ``high_pri_pool_ratio`` parameter. In contrast, in ClockCache, low- and high-priority elements compete for all cache slots, and one group can take over the other (of course, it takes more low-priority insertions to push out high-priority elements). However, just like LRUCache, ClockCache provides the following guarantee: a high-priority element will not be evicted before a low-priority element that was inserted earlier in time.
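
      A sketch of the insertion-priority mapping described above (the enum and helper are illustrative, not ClockCache's actual internals):
      ```cpp
      enum class ClockPriority { kLow, kMedium, kHigh };

      // New entries never start at kLow: low cache priority maps to medium
      // clock priority and high maps to high, mirroring LRUCache's midpoint
      // insertion without a dedicated high-priority pool.
      ClockPriority InsertionClockPriority(bool high_cache_priority) {
        return high_cache_priority ? ClockPriority::kHigh
                                   : ClockPriority::kMedium;
      }
      ```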
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10305
      
      Test Plan: ``make -j24 check``
      
      Reviewed By: pdillinger
      
      Differential Revision: D37607787
      
      Pulled By: guidotag
      
      fbshipit-source-id: 24d9f2523d2f4e6415e7f0029cc061fa275c2040
    • Replace the output split key with its pointer in subcompaction (#10316) · 8debfe2b
      zczhu committed
      Summary:
      The earlier implementation of cutting the output files with a compact cursor under Round-Robin priority uses `Valid()` to determine whether the `output_split_key` is valid in `ShouldStopBefore`. This contributes to excessive CPU computation, as pointed out by [this issue](https://github.com/facebook/rocksdb/issues/10315). In this PR, we change the type of `output_split_key` to `InternalKey*` and set it to `nullptr` if it is not going to be used in `ShouldStopBefore`; the `Valid()` check can then be avoided by testing the pointer.
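
      A simplified sketch of the change, with the key type reduced to `std::string` for illustration (the real code uses `InternalKey*`):
      ```cpp
      #include <string>

      struct SubcompactionState {
        // nullptr means the split key will not be used; checking the pointer
        // replaces the per-key output_split_key_.Valid() call.
        const std::string* output_split_key = nullptr;

        bool ShouldStopBefore(const std::string& key) const {
          return output_split_key != nullptr && key >= *output_split_key;
        }
      };
      ```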
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10316
      
      Reviewed By: ajkr
      
      Differential Revision: D37661492
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 66ff1105f3378e5573d3a126fdaff9bb23b5498f
    • Have Cache use Status::MemoryLimit (#10262) · e6c5e0ab
      Peter Dillinger committed
      Summary:
      I noticed it would clean up some things to have Cache::Insert()
      return our MemoryLimit Status instead of Incomplete for the case in
      which the capacity limit is reached. I suspect this fixes some existing but
      unknown bugs where this Incomplete could be confused with other uses
      of Incomplete, especially no_io cases. This is the most suspicious case I
      noticed, but was not able to reproduce a bug, in part because the existing
      code is not covered by unit tests (FIXME added): https://github.com/facebook/rocksdb/blob/57adbf0e9187331cb39bf5cdb5f5d67faeee5f63/table/get_context.cc#L397
      
      I audited all the existing uses of IsIncomplete and updated those that
      seemed relevant.
      
      HISTORY updated with a clear warning to users of strict_capacity_limit=true
      to update uses of `IsIncomplete()`.
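
      A hedged sketch of the caller-side update the HISTORY warning implies (the wrapper is illustrative):
      ```cpp
      #include <rocksdb/cache.h>
      #include <rocksdb/status.h>

      using namespace ROCKSDB_NAMESPACE;

      // Callers that set strict_capacity_limit=true should now check
      // IsMemoryLimit() where they previously checked IsIncomplete().
      bool InsertOrReportFull(Cache* cache, const Slice& key, void* value,
                              size_t charge, Cache::DeleterFn deleter) {
        Status s = cache->Insert(key, value, charge, deleter);
        return !s.IsMemoryLimit();  // false: capacity reached, not inserted
      }
      ```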
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10262
      
      Test Plan: updated unit tests
      
      Reviewed By: hx235
      
      Differential Revision: D37473155
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 4bd9d9353ccddfe286b03ebd0652df8ce20f99cb
    • Allow user to pass git command to makefile (#10318) · 071fe39c
      Manuel Ung committed
      Summary:
      This allows users to pass their git command with extra options if necessary.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10318
      
      Reviewed By: ajkr
      
      Differential Revision: D37661175
      
      Pulled By: lth
      
      fbshipit-source-id: 2a7cf27626c74f167471e6ec57e3870630a582b0
    • Provide support for direct_reads with async_io (#10197) · 2acbf386
      Akanksha Mahajan committed
      Summary:
      Provide support for use_direct_reads with async_io.
      
      Test Plan:
      -  Updated unit tests
      -  db_bench: Results in https://github.com/facebook/rocksdb/pull/10197#issuecomment-1159239420
      - db_stress
      ```
      export CRASH_TEST_EXT_ARGS=" --async_io=1 --use_direct_reads=1"
      make crash_test -j
      ```
      - Ran db_bench on the previous RocksDB version before any async_io implementation (as there have been many changes in different PRs in this area) https://github.com/facebook/rocksdb/pull/10197#issuecomment-1160781563.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10197
      
      Reviewed By: anand1976
      
      Differential Revision: D37255646
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: fec61ae15bf4d625f79dea56e4f86e0e307ba920
    • Set the value for --version, add --build_info (#10275) · 177b2fa3
      Mark Callaghan committed
      Summary:
      ./db_bench --version
      db_bench version 7.5.0
      
      ./db_bench --build_info
       (RocksDB) 7.5.0
          rocksdb_build_date: 2022-06-29 09:58:04
          rocksdb_build_git_sha: d96febee
          rocksdb_build_git_tag: print_version_githash
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10275
      
      Test Plan: run it
      
      Reviewed By: ajkr
      
      Differential Revision: D37524720
      
      Pulled By: mdcallag
      
      fbshipit-source-id: 0f6c819dbadf7b033a4a3ba2941992bb76b4ff99
    • Updated NewDataBlockIterator to not fetch compression dict for non-data blocks (#10310) · f9cfc6a8
      Changyu Bi committed
      Summary:
      During MyShadow testing, ajkr helped me find out that with partitioned index and dictionary compression enabled, `PartitionedIndexIterator::InitPartitionedIndexBlock()` spent a considerable amount of time (1-2% CPU) fetching the uncompression dictionary. Fetching the uncompression dict was not needed since the index blocks were not compressed (and even if they were, they would use an empty dictionary). This should only affect use cases with partitioned index, dictionary compression, and without the uncompression dictionary pinned. This PR updates NewDataBlockIterator to not fetch the uncompression dictionary when it is not for data blocks.
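
      A hedged sketch of the guard; the enum and helper are illustrative stand-ins for the checks in `NewDataBlockIterator`:
      ```cpp
      enum class BlockType { kData, kIndex, kFilter, kOther };

      // Only data blocks may need the uncompression dictionary; index blocks
      // are not dictionary-compressed, so skip the (possibly cached) fetch.
      bool ShouldFetchUncompressionDict(BlockType block_type,
                                        bool dict_compression_enabled) {
        return block_type == BlockType::kData && dict_compression_enabled;
      }
      ```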
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10310
      
      Test Plan:
      1. `make check`
      2. Perf benchmark: 1.5% (143950 -> 146176) improvement in op/sec for partitioned index + dict compression benchmark.
      For default config without partitioned index and without dict compression, there is no regression in readrandom perf from multiple runs of db_bench.
      
      ```
      # Set up for partitioned index with dictionary compression
      TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -partition_index=true  -compression_max_dict_bytes=16384 -compression_zstd_max_train_bytes=1638400
      
      # Pre PR
      TEST_TMPDIR=/dev/shm ./db_bench_main -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true
      readrandom [AVG    50 runs] : 143950 (± 1108) ops/sec;   15.9 (± 0.1) MB/sec
      readrandom [MEDIAN 50 runs] : 144406 ops/sec;   16.0 MB/sec
      
      # Post PR
      TEST_TMPDIR=/dev/shm ./db_bench_opt -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true
      readrandom [AVG    50 runs] : 146176 (± 1121) ops/sec;   16.2 (± 0.1) MB/sec
      readrandom [MEDIAN 50 runs] : 146014 ops/sec;   16.2 MB/sec
      
      # Set up for no partitioned index and no dictionary compression
      TEST_TMPDIR=/dev/shm/baseline ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false
      # Pre PR
      TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_main --use_existing_db=true "--benchmarks=readrandom[-X50]"
      readrandom [AVG    50 runs] : 158546 (± 1000) ops/sec;   17.5 (± 0.1) MB/sec
      readrandom [MEDIAN 50 runs] : 158280 ops/sec;   17.5 MB/sec
      
      # Post PR
      TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_opt --use_existing_db=true "--benchmarks=readrandom[-X50]"
      readrandom [AVG    50 runs] : 161061 (± 1520) ops/sec;   17.8 (± 0.2) MB/sec
      readrandom [MEDIAN 50 runs] : 161596 ops/sec;   17.9 MB/sec
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D37631358
      
      Pulled By: cbi42
      
      fbshipit-source-id: 6ca2665e270e63871968e061ba4a99d3136785d9
  5. Jul 06, 2022 (5 commits)
    • Handoff checksum during WAL replay (#10212) · 0ff77131
      Changyu Bi committed
      Summary:
      Added checksum protection for write batch content from the time it is read from the WAL until per key-value checksums are computed on the write batch. This gives full coverage of write batch integrity from WAL replay to the memtable.
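
      Conceptually, the handoff looks like the following sketch (hypothetical names; the real change threads the protection through the WriteBatch internals):
      ```cpp
      #include <cstdint>
      #include <string>

      // Assume a crc32c implementation; RocksDB ships one under util/crc32c.h.
      uint32_t Crc32c(const char* data, size_t n);

      // Verify the WAL-record checksum at the moment per key-value checksums
      // are computed, so the batch bytes are never trusted unprotected
      // in between WAL read and memtable insertion.
      bool HandoffChecksum(const std::string& batch_contents,
                           uint32_t wal_record_crc) {
        return Crc32c(batch_contents.data(), batch_contents.size()) ==
               wal_record_crc;
      }
      ```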
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10212
      
      Test Plan:
      - Added unit test and the existing tests (replay code path covers the change in this PR): `make -j32 check`
      - Stress test: ran `db_stress` for 30min.
      - Perf regression:
      ```
      # setup
      TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576000
      # benchmark db open time
      TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=overwrite -write_buffer_size=1048576000 -writes=1 -report_open_timing=true
      
      For 20 runs, pre-PR avg: 3734.31ms, post-PR avg: 3790.06 ms (~1.5% regression).
      
      Pre-PR
      OpenDb:     3714.36 milliseconds
      OpenDb:     3622.71 milliseconds
      OpenDb:     3591.17 milliseconds
      OpenDb:     3674.7 milliseconds
      OpenDb:     3615.79 milliseconds
      OpenDb:     3982.83 milliseconds
      OpenDb:     3650.6 milliseconds
      OpenDb:     3809.26 milliseconds
      OpenDb:     3576.44 milliseconds
      OpenDb:     3638.12 milliseconds
      OpenDb:     3845.68 milliseconds
      OpenDb:     3677.32 milliseconds
      OpenDb:     3659.64 milliseconds
      OpenDb:     3837.55 milliseconds
      OpenDb:     3899.64 milliseconds
      OpenDb:     3840.72 milliseconds
      OpenDb:     3802.71 milliseconds
      OpenDb:     3573.27 milliseconds
      OpenDb:     3895.76 milliseconds
      OpenDb:     3778.02 milliseconds
      
      Post-PR:
      OpenDb:     3880.46 milliseconds
      OpenDb:     3709.02 milliseconds
      OpenDb:     3954.67 milliseconds
      OpenDb:     3955.64 milliseconds
      OpenDb:     3958.64 milliseconds
      OpenDb:     3631.28 milliseconds
      OpenDb:     3721 milliseconds
      OpenDb:     3729.89 milliseconds
      OpenDb:     3730.55 milliseconds
      OpenDb:     3966.32 milliseconds
      OpenDb:     3685.54 milliseconds
      OpenDb:     3573.17 milliseconds
      OpenDb:     3703.75 milliseconds
      OpenDb:     3873.62 milliseconds
      OpenDb:     3704.4 milliseconds
      OpenDb:     3820.98 milliseconds
      OpenDb:     3721.62 milliseconds
      OpenDb:     3770.86 milliseconds
      OpenDb:     3949.78 milliseconds
      OpenDb:     3760.07 milliseconds
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D37302092
      
      Pulled By: cbi42
      
      fbshipit-source-id: 7346e625f453ce4c0e5d708776cd1fb2af6b068b
    • Expand stress test coverage for user-defined timestamp (#10280) · caced09e
      Yanqin Jin committed
      Summary:
      Before this PR, we call `now()` to get the wall time before performing point-lookup and range
      scans when user-defined timestamp is enabled.
      
      With this PR, we expand the coverage to:
      - read with an older timestamp which is larger than the wall time when the process starts but potentially smaller than now()
      - add coverage for `ReadOptions::iter_start_ts != nullptr`
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10280
      
      Test Plan:
      ```bash
      make check
      ```
      
      Also,
      ```bash
      TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
      ```
      
      So far, we have had four successful runs of the above.
      
      In addition,
      ```bash
      TEST_TMPDIR=/dev/shm/rocksdb make crash_test
      ```
      Succeeded twice showing no regression.
      
      Reviewed By: ltamasi
      
      Differential Revision: D37539805
      
      Pulled By: riversand963
      
      fbshipit-source-id: f2d9887ad95245945ce17a014d55bb93f00e1cb5
    • Add the git hash and full RocksDB version to report.tsv (#10277) · 9eced1a3
      Mark Callaghan committed
      Summary:
      Previously the version was displayed as $major.$minor
      This changes it to $major.$minor.$patch

      This also adds the git hash of the commit from which RocksDB was built to the end of report.tsv. I confirmed that benchmark_log_tool.py still parses it and that the people
      who consume/graph these results are OK with it.
      
      Example output:
      ops_sec	mb_sec	lsm_sz	blob_sz	c_wgb	w_amp	c_mbps	c_wsecs	c_csecs	b_rgb	b_wgb	usec_op	p50	p99	p99.9	p99.99	pmax	uptime	stall%	Nstall	u_cpu	s_cpu	rss	test	date	version	job_id	githash
      609488	244.1	1GB	0.0GB,	1.4	0.7	93.3	39	38	0	0	1.6	1.0	4	15	26	5365	15	0.0	0	0.1	0.0	0.5	fillseq.wal_disabled.v400	2022-06-29T13:36:05	7.5.0		61152544
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10277
      
      Test Plan: Run it
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37532418
      
      Pulled By: mdcallag
      
      fbshipit-source-id: 55e472640d51265819b228d3373c9fa9b62b660d
    • Try to trivial move more than one file (#10190) · a9565ccb
      sdong committed
      Summary:
      In leveled compaction, try to trivially move more than one file if possible, up to 4 files or max_compaction_bytes. This allows higher write throughput for some use cases where data is loaded in sequential order and applying compaction results is the bottleneck.

      When picking a file to compact, if it doesn't have overlapping files in the next level, try to expand to the next file as long as there is still no overlap.
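
      A hedged sketch of the expansion loop described above (`FileMeta` and the picker are illustrative, not the actual compaction-picker code):
      ```cpp
      #include <cstddef>
      #include <cstdint>
      #include <vector>

      struct FileMeta {
        uint64_t size;
        bool overlaps_next_level;
      };

      // Starting from a file with no next-level overlap, keep adding adjacent
      // files while they also don't overlap, up to 4 files or
      // max_compaction_bytes (the limits described above).
      std::vector<size_t> PickTrivialMoveFiles(
          const std::vector<FileMeta>& files, size_t start,
          uint64_t max_bytes) {
        std::vector<size_t> picked;
        uint64_t total = 0;
        for (size_t i = start; i < files.size() && picked.size() < 4; ++i) {
          if (files[i].overlaps_next_level ||
              total + files[i].size > max_bytes) {
            break;
          }
          total += files[i].size;
          picked.push_back(i);
        }
        return picked;
      }
      ```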
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10190
      
      Test Plan:
      Add some unit tests.
      For performance, try to run
      ./db_bench_multi_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
      Together with https://github.com/facebook/rocksdb/pull/10188, stalling will be eliminated in this benchmark.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37230647
      
      fbshipit-source-id: 42b260f545c46abc5d90335ac2bbfcd09602b549
    • Update code comment and logging for secondary instance (#10260) · d6b9c4ae
      Yanqin Jin committed
      Summary:
      Before this PR, applications are required to open a RocksDB secondary
      instance with `max_open_files = -1`. This is a hacky workaround that
      prevents IOErrors on the secondary instance during point-lookups or range
      scans caused by the primary instance deleting the table files. This is not
      necessary if the application can coordinate the primary and secondaries
      so that the primary does not delete files that are still being used by the
      secondaries. Alternatively, users can provide a custom Env/FS implementation
      that deletes the files only after all primary and secondary instances
      indicate files are obsolete and deleted.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10260
      
      Test Plan: make check
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37462633
      
      Pulled By: riversand963
      
      fbshipit-source-id: 9c2fc939f49663efa61e3d60c8f1e01d64b9d72c
  6. Jul 04, 2022 (1 commit)
  7. Jul 02, 2022 (1 commit)
    • Fix CalcHashBits (#10295) · 54f678cd
      Guido Tagliavini Ponce committed
      Summary:
      We fix two bugs in CalcHashBits. The first one is an off-by-one error: the desired number of table slots is the real number ``capacity / (kLoadFactor * handle_charge)``, which should not be rounded down. The second one is that we should disallow inputs that set the element charge to 0, namely ``estimated_value_size == 0 && metadata_charge_policy == kDontChargeCacheMetadata``.
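
      The corrected computation looks roughly like the following sketch (not the exact cache source; assumes capacity > 0):
      ```cpp
      #include <cmath>
      #include <cstdint>

      // Round the slot count *up* (the off-by-one fix) and reject a zero
      // element charge, i.e. estimated_value_size == 0 combined with
      // kDontChargeCacheMetadata.
      int CalcHashBitsSketch(uint64_t capacity, uint64_t handle_charge,
                             double load_factor) {
        if (handle_charge == 0) {
          return -1;  // disallowed input
        }
        const double slots =
            static_cast<double>(capacity) /
            (load_factor * static_cast<double>(handle_charge));
        return static_cast<int>(std::ceil(std::log2(std::ceil(slots))));
      }
      ```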
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10295
      
      Test Plan: CalcHashBits is tested by CalcHashBitsTest (in lru_cache_test.cc). The test now iterates over many more inputs; it covers, in particular, the rounding error edge case. Overall, the test is now more robust. Run ``make -j24 check``.
      
      Reviewed By: pdillinger
      
      Differential Revision: D37573797
      
      Pulled By: guidotag
      
      fbshipit-source-id: ea4f4439f7196ab1c1afb88f566fe92850537262
  8. Jul 01, 2022 (10 commits)
    • Add FLAGS_compaction_pri into crash_test (#10255) · e716bda0
      zczhu committed
      Summary:
      Add FLAGS_compaction_pri into correctness test
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10255
      
      Test Plan: run crash_test with FLAGS_compaction_pri
      
      Reviewed By: ajkr
      
      Differential Revision: D37510372
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 73d93a0a047d0c3993c8a512383dd6ee6acef641
    • Fix bug in Logger creation if dbname and db_log_dir are on different filesystems (#10292) · 11215e0f
      Akanksha Mahajan committed
      Summary:
      If dbname and db_log_dir are on different filesystems (one
      local and one remote), creation of dbname will fail because that path
      doesn't exist with respect to db_log_dir.
      This patch ignores the error returned on creation of dbname. If they
      are on the same filesystem, db_log_dir creation will automatically return
      the error in case there is any error in creation of dbname.
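
      A hedged sketch of the described behavior; `Env::CreateDirIfMissing` is the real API, while the wrapper and its logic are schematic:
      ```cpp
      #include <rocksdb/env.h>

      #include <string>

      using namespace ROCKSDB_NAMESPACE;

      // When db_log_dir is set, dbname may live on a different filesystem, so
      // a failure to create it there is ignored; any real problem surfaces
      // when db_log_dir itself is created on that filesystem.
      Status PrepareLogDir(Env* env, const std::string& dbname,
                           const std::string& db_log_dir) {
        Status s = env->CreateDirIfMissing(dbname);
        if (!db_log_dir.empty()) {
          s = env->CreateDirIfMissing(db_log_dir);  // real errors show here
        }
        return s;
      }
      ```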
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10292
      
      Test Plan: Existing unit tests
      
      Reviewed By: riversand963
      
      Differential Revision: D37567773
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 005d28c536208d4c126c8cb8e196d1d85b881100
    • Multi-File Trivial Move in L0->L1 (#10188) · 4428c761
      sdong committed
      Summary:
      In leveled compaction, an L0->L1 trivial move will allow more than one file to be moved in one compaction. This allows L0 files to be moved down faster when data is loaded in sequential order, making the slowdown or stop conditions harder to hit. Also, an L0->L1 trivial move is sought even when only some files qualify.
      1. We always try to find an L0->L1 trivial move from the oldest files. Keep including newer files until adding a new file won't trigger a trivial move.
      2. Modify the trivial move condition so that this compaction is tagged as a trivial move.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10188
      
      Test Plan:
      See throughput improvements with db_bench with fast fillseq benchmark and small L0 files:
      
      ./db_bench_l0_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
      
      The throughput improved by about 50%. Stalling still happens though.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37224743
      
      fbshipit-source-id: 8958d97f22e12bdfc14d2e85930f6fa0070e9659
    • Remove compact cursor when splitting sub-compactions (#10289) · 4f51101d
      zczhu committed
      Summary:
      In round-robin compaction priority, when splitting the compaction into sub-compactions, the earlier implementation took the compact cursor into account to make full use of the available sub-compactions. But this may result in unbalanced sub-compactions, so we remove that behavior here. The removal does not affect the cursor-based splitting mechanism within a sub-compaction, and thus the output files are still ensured to be split according to the cursor.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10289
      
      Reviewed By: ajkr
      
      Differential Revision: D37559091
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: b8b45b99f63b09cf873f7f049bcb4ab13871fffc
    • Add undefok for BlobDB options not supported prior to 7.5 (#10276) · 720ab355
      Mark Callaghan committed
      Summary:
      This adds --undefok to support use of this script with BlobDB for db_bench versions prior
      to 7.5, before the options land in a release.

      While there is a limit to how far back this script can go with respect to backward compatibility,
      this is an easy change to support early 7.x releases.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10276
      
      Test Plan: Run it with versions of db_bench that do not and then do support these options
      
      Reviewed By: gangliao
      
      Differential Revision: D37529299
      
      Pulled By: mdcallag
      
      fbshipit-source-id: 7bb1feec5c68760e6d64792c585bfbde4f5e52d8
    • Change The Way Level Target And Compaction Score Are Calculated (#10057) · b397dcd3
      sdong committed
      Summary:
      The current level targets for dynamic leveling have a problem: the target level size will change dramatically after an L0->L1 compaction. When there are many L0 bytes, lower-level compactions are delayed, but they will be resumed after the L0->L1 compaction finishes, so the expected write amplification benefits might not be realized. The proposal here is to keep the level target sizes unchanged and instead adjust the score for each level to prioritize the levels that most need compaction.
      Basic idea:
      (1) The target level size isn't adjusted, but the score is. The reasoning is that with parallel compactions, holding compactions back might not be desirable, but we would like compactions to be scheduled from the level that needs them most. For example, if we have an extra-large L2, we would like all compactions to be scheduled for L2->L3, rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on: should we compact L2->L3 or L4->L5? So the proposal for that is:
      (2) The score is calculated as actual level size / (target size + estimated upper bytes coming down). The reasoning is that if we have a large amount of pending L0/L1 bytes coming down, compacting L2->L3 might be more expensive, as once the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, when bytes come down to L5, the impact on the L5->L6 fanout is much smaller. So when calculating the score, we can adjust the target level size by adding the estimated downward bytes.
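
      The adjusted score from (2) reduces to a one-line formula; a sketch with the names spelled out:
      ```cpp
      #include <cstdint>

      // score = actual level size / (target size + estimated bytes coming
      // down). A level expecting many incoming bytes is de-prioritized,
      // since its fanout will change dramatically once those bytes arrive.
      double LevelCompactionScore(uint64_t level_bytes, uint64_t target_bytes,
                                  uint64_t estimated_incoming_bytes) {
        return static_cast<double>(level_bytes) /
               static_cast<double>(target_bytes + estimated_incoming_bytes);
      }
      ```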
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10057
      
      Test Plan: Repurpose the VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario.
      
      Reviewed By: ajkr
      
      Differential Revision: D37539742
      
      fbshipit-source-id: 9c154cbfe92023f918cf5d80875d8776ad4831a4
    • Enable blob caching for MultiGetBlob in RocksDB (#10272) · 056e08d6
      Gang Liao committed
      Summary:
      - [x] Enabled blob caching for MultiGetBlob in RocksDB
      - [x] Refactored MultiGetBlob logic and interface in RocksDB
      - [x] Cleaned up Version::MultiGetBlob() and moved 'blob'-related code snippets into BlobSource
      - [x] Added end-to-end test cases in db_blob_basic_test and unit tests in blob_source_test
      
      This task is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10272
      
      Reviewed By: ltamasi
      
      Differential Revision: D37558112
      
      Pulled By: gangliao
      
      fbshipit-source-id: a73a6a94ffdee0024d5b2a39e6d1c1a7d38664db
    • Include compaction cursors in VersionEdit debug string (#10288) · 20754b36
      Andrew Kryczka committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10288
      
      Test Plan:
      try it out -
      
      ```
      $ ldb manifest_dump --db=/dev/shm/rocksdb.0uWV/rocksdb_crashtest_whitebox/ --hex --verbose | grep CompactCursor | head -3
        CompactCursor: 1 '00000000000011D9000000000000012B0000000000000266' seq:0, type:1
        CompactCursor: 1 '0000000000001F35000000000000012B0000000000000022' seq:0, type:1
        CompactCursor: 2 '00000000000011D9000000000000012B0000000000000266' seq:0, type:1
      ```
      
      Reviewed By: littlepig2013
      
      Differential Revision: D37557177
      
      Pulled By: ajkr
      
      fbshipit-source-id: 7b76b857d9e7a9f3d53398a61bb1d4b78873b91e
    • Add load_latest_options() to C api (#10152) · 17a6f7fa
      Yueh-Hsuan Chiang committed
      Summary:
      Add load_latest_options() to C api.
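
      A hedged usage sketch from C++; the parameter list mirrors the C++ `LoadLatestOptions`, but check `rocksdb/c.h` for the exact signature:
      ```cpp
      #include <rocksdb/c.h>

      #include <cstddef>

      int main() {
        char* err = nullptr;
        rocksdb_options_t* db_opts = nullptr;
        size_t num_cfs = 0;
        char** cf_names = nullptr;
        rocksdb_options_t** cf_opts = nullptr;
        // Load the DB options and per-column-family options from the most
        // recent OPTIONS file under /tmp/testdb.
        rocksdb_load_latest_options("/tmp/testdb",
                                    rocksdb_create_default_env(),
                                    /*ignore_unknown_options=*/0,
                                    /*cache=*/nullptr, &db_opts, &num_cfs,
                                    &cf_names, &cf_opts, &err);
        // ... open the DB with the loaded options, then free them ...
        return err == nullptr ? 0 : 1;
      }
      ```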
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10152
      
      Test Plan:
      Extend the existing c_test by reopening db using the latest options file
      at different parts of the test.
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D37305225
      
      Pulled By: ajkr
      
      fbshipit-source-id: 8b3bab73f56fa6fcbdba45aae393145d007b3962
    • Fix assertion error with read_opts.iter_start_ts (#10279) · b87c3557
      Yanqin Jin committed
      Summary:
      If the internal iterator is not valid, `SeekToLast` with iter_start_ts should set `valid_` to false without an assertion failure.
      Test plan: make check
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10279
      
      Reviewed By: ltamasi
      
      Differential Revision: D37539393
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8e94057838f8a05144fad5768f4d62f1893ec315
  9. Jun 30, 2022 (5 commits)
    • Clock cache (#10273) · 57a0e2f3
      Guido Tagliavini Ponce committed
      Summary:
      This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
      - Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
      - Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
      - Middle insertions into the clock list.
      - A test that exercises the clock eviction policy.
      - Update the Java API of ClockCache and Java calls to C++.
      
      Along the way, we improved the code and comment quality of FastLRUCache. These changes are relatively minor.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273
      
      Test Plan: ``make -j24 check``
      
      Reviewed By: pdillinger
      
      Differential Revision: D37522461
      
      Pulled By: guidotag
      
      fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
    • Fix GetWindowsErrSz nullptr bug (#10282) · c2dc4c0c
      Johnny Shaw committed
      Summary:
      `GetWindowsErrSz` may assign a `nullptr` to `std::string` in the event it cannot format the error code to a string. This will result in a crash when `std::string` attempts to calculate the length from `nullptr`.
      
      The change here checks the output from `FormatMessageA` and only assigns to the output `std::string` if it is not null. Additionally, the call to free the buffer is only made if a non-null value is returned from `FormatMessageA`. In the event `FormatMessageA` does not output a string, an empty string is returned instead.
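
      A sketch of the corrected pattern (simplified; the Win32 calls are real, the wrapper name is illustrative):
      ```cpp
      #include <windows.h>

      #include <string>

      std::string GetWindowsErrSzSketch(DWORD err) {
        LPSTR buf = nullptr;
        const DWORD len = FormatMessageA(
            FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM |
                FORMAT_MESSAGE_IGNORE_INSERTS,
            nullptr, err, MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT),
            reinterpret_cast<LPSTR>(&buf), 0, nullptr);
        std::string msg;
        if (len != 0 && buf != nullptr) {
          msg.assign(buf, len);  // never construct std::string from nullptr
          LocalFree(buf);        // free only when a buffer was allocated
        }
        return msg;              // empty when formatting fails
      }
      ```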
      
      Fixes https://github.com/facebook/rocksdb/issues/10274
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10282
      
      Reviewed By: riversand963
      
      Differential Revision: D37542143
      
      Pulled By: ajkr
      
      fbshipit-source-id: c21f5119ddb451f76960acec94639d0f538052f2
    • WriteBatch reorder fields to reduce padding (#10266) · 490fcac0
      leipeng committed
      Summary:
      This reordering reduces sizeof(WriteBatch) by 16 bytes.
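
      The mechanism, illustrated on a toy struct rather than WriteBatch's actual members; sizes assume a typical 64-bit ABI:
      ```cpp
      #include <cstdint>

      struct Before {   // 32 bytes: each 4-byte member is padded out to 8
        void* p;        // 8
        uint32_t a;     // 4 + 4 padding
        void* q;        // 8
        uint32_t b;     // 4 + 4 padding
      };

      struct After {    // 24 bytes: pointers first, small members packed last
        void* p;        // 8
        void* q;        // 8
        uint32_t a;     // 4
        uint32_t b;     // 4
      };
      ```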
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10266
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D37505201
      
      Pulled By: ajkr
      
      fbshipit-source-id: 6cb6c3735073fcb63921f822d5e15670fecb1c26
    • Fix A Bug Where Concurrent Compactions Cause Further Slowing Down (#10270) · 61152544
      sdong committed
      Summary:
      Currently, when installing a new super version and the stalling condition triggers, we compare the estimated compaction bytes to the previous value, and if the new value is larger than or equal to the previous one, we reduce the slowdown write rate. However, if concurrent compactions happen, the same value might be used. The result is that, although some compactions reduce the estimated compaction bytes, we treat them as a signal for further slowing down. In some cases, this causes the slowdown rate to drop all the way to the minimum, far lower than needed.

      Fix the bug by not triggering a re-calculation if a new super version doesn't have a Version or memtable change. With this fix, the number of compaction finishes is still undercounted in this algorithm, but that is still better than the current bug, where they are negatively counted.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10270
      
      Test Plan: Run a benchmark where the slowdown rate is unnecessarily dropped to the minimum and see that it is back to a normal value.
      
      Reviewed By: ajkr
      
      Differential Revision: D37497327
      
      fbshipit-source-id: 9bca961cc38fed965c3af0fa6c9ca0efaa7637c4
    • Expose LRU cache num_shard_bits parameter in C api (#10222) · 12bfd519
      Edvard Davtyan committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10222
      
      Reviewed By: cbi42
      
      Differential Revision: D37358171
      
      Pulled By: ajkr
      
      fbshipit-source-id: e86285fdceaec943415ee9d482090009b00cbc95
  10. Jun 29, 2022 (5 commits)
    • Benchmark fix write amplification computation (#10236) · 28f2d3cc
      Mark Callaghan committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10236
      
      Reviewed By: ajkr
      
      Differential Revision: D37489898
      
      Pulled By: mdcallag
      
      fbshipit-source-id: 4b4565973b1f2c47342b4d1b857c8f89e91da145
    • Support `iter_start_ts` for backward iteration (#10200) · b6cfda12
      Yanqin Jin committed
      Summary:
      Resolves https://github.com/facebook/rocksdb/issues/9761
      
      With this PR, applications can create an iterator with the following
      ```cpp
      ReadOptions read_opts;
      read_opts.timestamp = &ts_ub;
      read_opts.iter_start_ts = &ts_lb;
      auto* it = db->NewIterator(read_opts);
      it->SeekToLast();
      // or it->SeekForPrev("foo");
      it->Prev();
      ...
      ```
      The application can access different versions of the same user key via `key()`, `value()`, and `timestamp()`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10200
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D37258074
      
      Pulled By: riversand963
      
      fbshipit-source-id: 3f0b866ade50dcff7ef60d506397a9dd6ec91565
    • Update/clarify required properties for prefix extractors (#10245) · d96febee
      Peter Dillinger committed
      Summary:
      Most of the properties listed as required for prefix extractors
      are not really required but offer some conveniences. This updates API
      comments to clarify actual requirements, and adds tests to demonstrate
      how previously presumed requirements can be safely violated.
      
      This might seem like a useless exercise, but this relaxing of requirements
      would be needed if we generalize prefixes to group keys not just at the
      byte level but also based on bits or arbitrary value ranges. For
      applications without a "natural" prefix size, having only byte-level
      granularity often means one prefix size to the next differs in magnitude
      by a factor of 256.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10245
      
      Test Plan: Tests added, also covering missing Iterator cases from https://github.com/facebook/rocksdb/issues/10244
      
      Reviewed By: bjlemaire
      
      Differential Revision: D37371559
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ab2dd719992eea7656e9042cf8542393e02fa244
    • Deflake RateLimiting/BackupEngineRateLimitingTestWithParam (#10271) · ca81b80d
      Andrew Kryczka committed
      Summary:
      We saw flakes with the following failure:
      
      ```
      [ RUN      ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
      utilities/backup/backup_engine_test.cc:2667: Failure
      Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4
      terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
      what():  utilities/backup/backup_engine_test.cc:2667: Failure
      Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4
      Received signal 6 (Aborted)
      t/run-backup_engine_test-RateLimiting-BackupEngineRateLimitingTestWithParam.RateLimiting-1: line 4: 1032887 Aborted                 (core dumped) TEST_TMPDIR=$d ./backup_engine_test --gtest_filter=RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
      ```
      
      Investigation revealed we forgot to use the mock time `SystemClock` for
      restore rate limiting. Then the test used wall clock time, which made
      the execution of "GenericRateLimiter::Request:PostTimedWait"
      non-deterministic as wall clock time might have advanced enough that
      waiting was not needed.
      
      This PR changes restore rate limiting to use
      mock time, which guarantees we always execute
      "GenericRateLimiter::Request:PostTimedWait". Then the assertions that
      rely on times recorded inside that callback should be robust.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10271
      
      Test Plan:
      Applied the following patch which guaranteed repro before the fix.
      Verified the test passes after this PR even with that patch applied.
      
      ```
      diff --git a/util/rate_limiter.cc b/util/rate_limiter.cc
      index f369e3220..6b3ed82fa 100644
      --- a/util/rate_limiter.cc
      +++ b/util/rate_limiter.cc
      @@ -158,6 +158,7 @@ void GenericRateLimiter::SetBytesPerSecond(int64_t bytes_per_second) {
       
       void GenericRateLimiter::Request(int64_t bytes, const Env::IOPriority pri,
                                        Statistics* stats) {
      +  usleep(100000);
         assert(bytes <= refill_bytes_per_period_.load(std::memory_order_relaxed));
         bytes = std::max(static_cast<int64_t>(0), bytes);
         TEST_SYNC_POINT("GenericRateLimiter::Request");
      ```
      
      Reviewed By: hx235
      
      Differential Revision: D37499848
      
      Pulled By: ajkr
      
      fbshipit-source-id: fd790d5a192996be8ba13b656751ccc7d8cb8f6e
    • Add blob cache tickers, perf context statistics, and DB properties (#10203) · d7ebb58c
      Gang Liao committed
      Summary:
      In order to be able to monitor the performance of the new blob cache, we made the following changes:
      - Add blob cache hit/miss/insertion tickers (see https://github.com/facebook/rocksdb/wiki/Statistics)
      - Extend the perf context similarly (see https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context)
      - Implement new DB properties (see e.g. https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h#L1042-L1051) that expose the capacity and current usage of the blob cache; a usage sketch follows.
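
      A hedged usage sketch; the property names are as added in `include/rocksdb/db.h`, but verify against the header:
      ```cpp
      #include <rocksdb/db.h>

      #include <string>

      using namespace ROCKSDB_NAMESPACE;

      // Read the new blob cache properties off an open DB.
      void LogBlobCacheStats(DB* db) {
        std::string capacity, usage;
        db->GetProperty("rocksdb.blob-cache-capacity", &capacity);
        db->GetProperty("rocksdb.blob-cache-usage", &usage);
        // e.g. print or export capacity/usage here
      }
      ```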
      
      This PR is a part of https://github.com/facebook/rocksdb/issues/10156
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10203
      
      Reviewed By: ltamasi
      
      Differential Revision: D37478658
      
      Pulled By: gangliao
      
      fbshipit-source-id: d8ee3f41d47315ef725e4551226330b4b6832e40
  11. Jun 28, 2022 (1 commit)