- 02 5月, 2023 1 次提交
-
-
由 Andrew Kryczka 提交于
Summary: RocksDB's jemalloc no-dump allocator (`NewJemallocNodumpAllocator()`) was using a single manual arena. This arena's lock contention could be very high when thread caching is disabled for RocksDB blocks (e.g., when using `MALLOC_CONF='tcache_max:4096'` and `rocksdb_block_size=16384`). This PR changes the jemalloc no-dump allocator to use a configurable number of manual arenas. That number is required to be a power of two so we can avoid division. The allocator shards allocation requests randomly across those manual arenas. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11400 Test Plan: - mysqld setup - Branch: fb-mysql-8.0.28 (https://github.com/facebook/mysql-5.6/commit/653eba2e56cfba4eac0c851ac9a70b2da9607527) - Build: `mysqlbuild.sh --clean --release` - Set env var `MALLOC_CONF='tcache_max:$tcache_max'` - Added CLI args `--rocksdb_cache_dump=false --rocksdb_block_cache_size=4294967296 --rocksdb_block_size=16384` - Ran under /usr/bin/time - Large database scenario - Setup command: `mysqlslap -h 127.0.0.1 -P 13020 --auto-generate-sql=1 --auto-generate-sql-load-type=write --auto-generate-sql-guid-primary=1 --number-char-cols=8 --auto-generate-sql-execute-number=262144 --concurrency=32 --no-drop` - Benchmark command: `mysqlslap -h 127.0.0.1 -P 13020 --query='select count(*) from mysqlslap.t1;' --number-of-queries=320 --concurrency=32` - Results: | tcache_max | num_arenas | Peak RSS MB (% change) | Query latency seconds (% change) | |---|---|---|---| | 4096 | **(baseline)** | 4541 | 37.1 | | 4096 | 1 | 4535 (-0.1%) | 36.7 (-1%) | | 4096 | 8 | 4687 (+3%) | 10.2 (-73%) | | 16384 | **(baseline)** | 4514 | 8.4 | | 16384 | 1 | 4526 (+0.3%) | 8.5 (+1%) | | 16384 | 8 | 4580 (+1%) | 8.5 (+1%) | Reviewed By: pdillinger Differential Revision: D45220794 Pulled By: ajkr fbshipit-source-id: 9a50c9872bdef5d299e52b115a65ee8a5557d58d
-
- 29 4月, 2023 1 次提交
-
-
由 Levi Tamasi 提交于
Summary: The old `StackableDB` based BlobDB implementation relies on a DB listener to track the total size of the SST files in the database and to trigger FIFO eviction. Some test cases in `BlobDBTest` assume that the listener is notified by the time `DB::Flush` returns, which is not guaranteed (side note: `TEST_WaitForFlushMemTable` would not guarantee this either). The patch fixes these tests by using `SyncPoint`s to make sure the listener is actually called before verifying the FIFO behavior. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11417 Test Plan: ``` make -j56 COERCE_CONTEXT_SWITCH=1 blob_db_test ./blob_db_test --gtest_filter=BlobDBTest.FIFOEviction_TriggerOnSSTSizeChange ./blob_db_test --gtest_filter=BlobDBTest.FilterForFIFOEviction ./blob_db_test --gtest_filter=BlobDBTest.FIFOEviction_NoEnoughBlobFilesToEvict ``` Reviewed By: ajkr Differential Revision: D45407135 Pulled By: ltamasi fbshipit-source-id: fcd63d76937d2c975f569a6635ce8730772a3d75
-
- 26 4月, 2023 2 次提交
-
-
由 Changyu Bi 提交于
Summary: add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are 1. checksum construction and verification in block.cc/h 2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h) 3. unit tests/crash test updates Tests: * Added unit tests * Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576` Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled. Performance: Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory. For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates): ``` SETUP make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench ./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none BENCHMARK ./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE The readrandom ops/sec looks like the following: Block cache size: 2GB 1.2GB * 0.9 1.2GB * 0.8 1.2GB * 0.5 8MB Main 240805 223604 198176 161653 139040 PR prot_bytes=0 238691 226693 200127 161082 141153 PR prot_bytes=1 214983 193199 178532 137013 108211 prot_bytes=1 vs -10% -15% -10.8% -15% -23% prot_bytes=0 ``` The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287 Reviewed By: ajkr Differential Revision: D43970708 Pulled By: cbi42 fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
-
由 leipeng 提交于
Summary: In db_impl.cc DBImpl::MultiGet: delete unused var `superversions_to_delete` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11395 Reviewed By: ajkr Differential Revision: D45240896 Pulled By: cbi42 fbshipit-source-id: 0fff99b0d794b6f6d4aadee6036bddd6cb19eb31
-
- 25 4月, 2023 3 次提交
-
-
由 Hui Xiao 提交于
Summary: **Context/Summary:** To better utilize `DbStressFSWrapper` for some assertion, https://github.com/facebook/rocksdb/pull/11288 removed an io_uring stress test hack for POSIX FS not supporting read async added in https://github.com/facebook/rocksdb/pull/11242 = It was removed based on the assumption that a later PR https://github.com/facebook/rocksdb/pull/11296 is sufficient to serve as an alternative workaround. But recent stress tests has shown the opposite, mostly because 11296 approach might be subjected to incompleteness when more `ReadOptions` are passed down as what https://github.com/facebook/rocksdb/pull/11288 has done. As a short-term solution to both work around POSIX FS constraint above and utilize `DbStressFSWrapper` for 11288 assertion, I proposed this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11404 Test Plan: - Stress test ensures 11288's assertion is still effective in `DbStressFSWrapper` ``` ./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=16 --bottommost_compression_type=disable --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=32767 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4 --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --enable_thread_tracking=0 --expected_values_dir=$exp --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --kill_random_test=888887 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=8388608 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --ribbon_starting_level=1 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=2 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=1 --writepercent=35 ``` - Monitor future stress test to show `MultiGet error: Not implemented: ReadAsync` is gone Reviewed By: ltamasi Differential Revision: D45242280 Pulled By: hx235 fbshipit-source-id: 9823e3fbd4e9672efdd31478a2f2cbd68a98bdf5
-
由 Peter Dillinger 提交于
Summary: Update and clean up history. Update version number. Add to compatibility test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11405 Reviewed By: ltamasi Differential Revision: D45242779 Pulled By: pdillinger fbshipit-source-id: 860bd047584d051472ba9ccefae7ebc3f37b1d8f
-
由 Peter Dillinger 提交于
Summary: This time a particular version of bzip2 is under-compressing vs. expectation in BlockBasedTableTest.CompressionRatioThreshold. We'll exempt that algorithm like I did for DBStatisticsTest.CompressionStatsTest. https://app.circleci.com/pipelines/github/facebook/rocksdb/26869/workflows/a46246db-73c7-4946-af82-10a78a7df6af/jobs/596124 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11403 Test Plan: CI Reviewed By: ltamasi Differential Revision: D45233441 Pulled By: pdillinger fbshipit-source-id: 506c8dfe5e0397c78193359df6288397bf0667c9
-
- 23 4月, 2023 1 次提交
-
-
由 Peter Dillinger 提交于
Summary: Tweak some bounds and things, and reduce risk of surprise results by running on all supported compressions (mostly). Also improves the precise compressibility of CompressibleString by using RandomBinaryString. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11396 Test Plan: updated tests Reviewed By: ltamasi Differential Revision: D45211938 Pulled By: pdillinger fbshipit-source-id: 9dc1dd8574a60a9364efe18558be66d31a35598b
-
- 22 4月, 2023 3 次提交
-
-
由 Peter Dillinger 提交于
Summary: ## Option API updates * Add new CompressionOptions::max_compressed_bytes_per_kb, which corresponds to 1024.0 / min allowable compression ratio. This avoids the hard-coded minimum ratio of 8/7. * Remove unnecessary constructor for CompressionOptions. * Document undocumented CompressionOptions. Use idiom for default values shown clearly in one place (not precariously repeated). ## Stat API updates * Deprecate the BYTES_COMPRESSED, BYTES_DECOMPRESSED histograms. Histograms incur substantial extra space & time costs compared to tickers, and the distribution of uncompressed data block sizes tends to be uninteresting. If we're interested in that distribution, I don't see why it should be limited to blocks stored as compressed. * Deprecate the NUMBER_BLOCK_NOT_COMPRESSED ticker, because the name is very confusing. * New or existing tickers relevant to compression: * BYTES_COMPRESSED_FROM * BYTES_COMPRESSED_TO * BYTES_COMPRESSION_BYPASSED * BYTES_COMPRESSION_REJECTED * COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES (both existing) * NUMBER_BLOCK_COMPRESSED (existing) * NUMBER_BLOCK_COMPRESSION_BYPASSED * NUMBER_BLOCK_COMPRESSION_REJECTED * BYTES_DECOMPRESSED_FROM * BYTES_DECOMPRESSED_TO We can compute a number of things with these stats: * "Successful" compression ratio: BYTES_COMPRESSED_FROM / BYTES_COMPRESSED_TO * Compression ratio of data on which compression was attempted: (BYTES_COMPRESSED_FROM + BYTES_COMPRESSION_REJECTED) / (BYTES_COMPRESSED_TO + BYTES_COMPRESSION_REJECTED) * Compression ratio of data that could be eligible for compression: (BYTES_COMPRESSED_FROM + X) / (BYTES_COMPRESSED_TO + X) where X = BYTES_COMPRESSION_REJECTED + NUMBER_BLOCK_COMPRESSION_REJECTED * Overall SST compression ratio (compression disabled vs. actual): (Y - BYTES_COMPRESSED_TO + BYTES_COMPRESSED_FROM) / Y where Y = COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES Keeping _REJECTED separate from _BYPASSED helps us to understand "wasted" CPU time in compression. ## BlockBasedTableBuilder Various small refactorings, optimizations, and name clean-ups. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11388 Test Plan: unit tests added * `options_settable_test.cc`: use non-deprecated idiom for configuring CompressionOptions from string. The old idiom is tested elsewhere and does not need to be updated to support the new field. Reviewed By: ajkr Differential Revision: D45128202 Pulled By: pdillinger fbshipit-source-id: 5a652bf5c022b7ec340cf79018cccf0686962803
-
由 Changyu Bi 提交于
Summary: This happens when the persisted merge operator not a RocksDB built-in one. This PR improves this error message to include the actual persisted merge operator name. when there is a merge_operator mismatch in `SanityCheckCFOptions()`, for example, going from merge operator "CustomMergeOp" to nullptr, an error message like the following is returned: "failed the verification on ColumnFamilyOptions::merge_operator--- The specified one is nullptr while the **persisted one is nullptr**." This happens when the persisted merge operator not a RocksDB built-in one. This PR improves this error message to include the actual persisted merge operator name. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11393 Test Plan: add unit test to check error message when going from merge op -> nullptr and going from merge op1 to merge op 2. Reviewed By: ajkr Differential Revision: D45190131 Pulled By: cbi42 fbshipit-source-id: 67712c2fec29c654c15166d1be985e710e6081e5
-
由 Hui Xiao 提交于
Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
-
- 21 4月, 2023 3 次提交
-
-
由 Andrew Kryczka 提交于
Summary: As titled Pull Request resolved: https://github.com/facebook/rocksdb/pull/11390 Reviewed By: cbi42 Differential Revision: D45148830 Pulled By: ajkr fbshipit-source-id: 9a8dfd040514bae3d8ed9e97a79cae7683f2749a
-
由 Andrew Kryczka 提交于
Summary: The old cleanup code had a race condition: 1. Test thread: DestroyDB() marked a file as trash 2. DeleteScheduler thread: Got the file's size and decided to delete it in chunks 3. Test thread: DestroyDir() deleted that trash file 4. DeleteScheduler thread: Began deleting in chunks starting by calling ReopenWritableFile(). Unfortunately this recreates the deleted trash file 5. Test thread: DestroyDir() fails to remove the parent directory because it contains the file created in 4. 6. Test thread: Checkpoint::Create() fails due to the directory already existing It could be repro'd with the following patch/command. Patch: ``` diff --git a/file/delete_scheduler.cc b/file/delete_scheduler.cc index 8a2d1615d..337d24a60 100644 --- a/file/delete_scheduler.cc +++ b/file/delete_scheduler.cc @@ -317,6 +317,12 @@ Status DeleteScheduler::DeleteTrashFile(const std::string& path_in_trash, &num_hard_links, nullptr); if (my_status.ok()) { if (num_hard_links == 1) { + // Give some time for DestroyDir() to delete file entries. Then, the + // below `ReopenWritableFile()` will recreate files, preventing the + // parent directory from being deleted. + if (rand() % 2 == 0) { + usleep(1000); + } std::unique_ptr<FSWritableFile> wf; my_status = fs_->ReopenWritableFile(path_in_trash, FileOptions(), &wf, nullptr); diff --git a/file/file_util.cc b/file/file_util.cc index 43608fcdc..2cee1ad8e 100644 --- a/file/file_util.cc +++ b/file/file_util.cc @@ -263,6 +263,13 @@ Status DestroyDir(Env* env, const std::string& dir) { } } + // Give some time for the DeleteScheduler thread's ReopenWritableFile() to + // recreate deleted files + if (dir.find("checkpoint") != std::string::npos) { + fprintf(stderr, "waiting to destroy %s\n", dir.c_str()); + usleep(10000); + } + if (s.ok()) { s = env->DeleteDir(dir); // DeleteDir might or might not report NotFound ``` Command: ``` TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=131072 --target_file_size_base=131072 --max_bytes_for_level_base=524288 --checkpoint_one_in=100 --clear_column_family_one_in=0 --max_key=1000 --value_size_mult=33 --sst_file_manager_bytes_per_truncate=4096 --sst_file_manager_bytes_per_sec=1048576 --interval=3 --compression_type=none --sync_fault_injection=1 ``` Obviously we don't want to use scheduled deletion here as we need the checkpoint directory deleted immediately. I suspect the DestroyDir() was an attempt to fixup incomplete DestroyDB()s. Now that we expect DestroyDB() to be complete I removed that code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11389 Reviewed By: hx235 Differential Revision: D45137142 Pulled By: ajkr fbshipit-source-id: 2af743d342c77cc414fd25fc4c9d7c9c6079ad24
-
由 Changyu Bi 提交于
Summary: during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr), - before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move), - after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction. Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved. This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375 Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed. Reviewed By: ajkr Differential Revision: D44943518 Pulled By: cbi42 fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
-
- 19 4月, 2023 1 次提交
-
-
由 Andrew Kryczka 提交于
Summary: build-windows-vs2022 jobs (e.g., https://app.circleci.com/pipelines/github/facebook/rocksdb/26641/workflows/7d1c58b8-7dd6-4dd6-a222-ecdfb0892c3b/jobs/593583) began failing with: ``` (CustomBuild target) -> CUSTOMBUILD : error : Source option 7 is no longer supported. Use 8 or later. [C:\Users\circleci.PACKER-64370BA5\project\build\java\rocksdbjni_classes.vcxproj] ``` So, this PR tries setting `-source 8` instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11385 Reviewed By: ltamasi Differential Revision: D45058172 Pulled By: ajkr fbshipit-source-id: b0daa1ea6f576c8417add40bd6c92710d329c44d
-
- 18 4月, 2023 3 次提交
-
-
由 Peter Dillinger 提交于
Summary: Because of this failure with snappy 1.1.8, ROCKSDB_NO_FBCODE=1 ``` Value 3531 is not in range [2000, 3525] table/table_test.cc:4231: Failure ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11384 Test Plan: run updated test in failing configuration Reviewed By: ajkr Differential Revision: D45057161 Pulled By: pdillinger fbshipit-source-id: 397054f08033315e2e2bd9410f1fa32ddbf3b9c8
-
由 Andrew Kryczka 提交于
Summary: This test exhibited the following flaky failure: ``` db/db_write_test.cc:653: Failure db_->Resume() Corruption: Not active ``` I was able to repro it by applying the following patch to coerce a specific race condition: ``` diff --git a/db/db_write_test.cc b/db/db_write_test.cc index d82c57376..775ba3cde 100644 --- a/db/db_write_test.cc +++ b/db/db_write_test.cc @@ -636,6 +636,10 @@ TEST_P(DBWriteTest, LockWALInEffect) { ASSERT_TRUE(dbfull()->WALBufferIsEmpty()); ASSERT_OK(db_->UnlockWAL()); + // Test thread: sleep interval: [0, 3) + // In this interval, the file system is active + sleep(3); + // Fail the WAL flush if applicable fault_fs->SetFilesystemActive(false); Status s = Put("key2", "value"); @@ -649,6 +653,11 @@ TEST_P(DBWriteTest, LockWALInEffect) { ASSERT_OK(db_->LockWAL()); ASSERT_OK(db_->UnlockWAL()); } + + // Test thread: sleep interval: [3, 6) + // In this interval, the file system is inactive + sleep(3); + fault_fs->SetFilesystemActive(true); ASSERT_OK(db_->Resume()); // Writes should work again diff --git a/db/flush_job.cc b/db/flush_job.cc index 8193f594f..602ee2c9f 100644 --- a/db/flush_job.cc +++ b/db/flush_job.cc @@ -979,6 +979,10 @@ Status FlushJob::WriteLevel0Table() { DirFsyncOptions(DirFsyncOptions::FsyncReason::kNewFileSynced)); } TEST_SYNC_POINT_CALLBACK("FlushJob::WriteLevel0Table", &mems_); + // Flush thread: sleep interval: [0, 4) + // Upon awakening, the file system will be inactive. Then the MANIFEST + // update will fail. + sleep(4); db_mutex_->Lock(); } base_->Unref(); ``` The fix for this scenario is explained in the code change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11382 Reviewed By: cbi42 Differential Revision: D45027632 Pulled By: ajkr fbshipit-source-id: 6bfa35a5781c0c080fb74e13f2b2c9f871f7effb
-
由 Andrew Kryczka 提交于
Summary: In CircleCI build-linux-arm-test-full job (https://app.circleci.com/pipelines/github/facebook/rocksdb/26462/workflows/a9d39d2c-c970-4b0f-9c10-7743beb9771b/jobs/591722), this test exhibited the following flaky failure: ``` db/db_bloom_filter_test.cc:2506: Failure Expected: (TestGetTickerCount(options, BLOOM_FILTER_USEFUL)) > (65000 * 2), actual: 120558 vs 130000 ``` I ssh'd to an instance and observed it cuts memtables at slightly different points across runs. Logging in `ConcurrentArena` pointed to `try_lock()` returning false at different points across runs. This PR changes the approach to allow a fixed number of keys per memtable flush. I verified the bloom filter useful count is deterministic now even on the CircleCI ARM instance. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11383 Reviewed By: cbi42 Differential Revision: D45036829 Pulled By: ajkr fbshipit-source-id: b602dacb63955f1af09bf0ed409cde0552805a08
-
- 16 4月, 2023 2 次提交
-
-
由 Murali Vilayannur 提交于
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11342 Reviewed By: ajkr Differential Revision: D45026838 Pulled By: mnv104 fbshipit-source-id: 099ed9579922b8fa6e7d3332bbb829d50ec47d91
-
由 mayue.fight 提交于
Summary: When calculating the largest_key in ImportColumnFamilyJob::GetIngestedFileInfo, only the first element of range_del_iter is calculated. If range_del_iter has multiple elements, the largest_key will be wrong Pull Request resolved: https://github.com/facebook/rocksdb/pull/11381 Reviewed By: cbi42 Differential Revision: D44981450 Pulled By: ajkr fbshipit-source-id: 584bc7da86295568a96984d2951644f289e578c7
-
- 15 4月, 2023 2 次提交
-
-
由 Changyu Bi 提交于
Summary: Before this PR, in `LevelCompactionBuilder::TryExtendNonL0TrivialMove(index)`, we start from a file at index and expand the compaction input towards right to find files to trivial move. This PR adds the logic to also expand towards left. Another major change made in this PR is to not expand L0 files through `TryExtendNonL0TrivialMove()`. This happens currently when compacting L0 files to an empty output level. The condition for expanding files in `TryExtendNonL0TrivialMove()` is to check atomic boundary, which does not take into account that L0 files can overlap in key range and are not sorted in key order. So it may include more L0 files than needed and disallow a trivial move. This change is included in this PR so that we don't make it worse by always expanding L0 in both direction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11347 Test Plan: * new unit test * Benchmark does not show obvious improvement or regression: ``` Write sequentially ./db_bench --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216 Main: fillseq : 4.726 micros/op 211592 ops/sec 472.607 seconds 100000000 operations; 23.4 MB/s This PR: fillseq : 4.755 micros/op 210289 ops/sec 475.534 seconds 100000000 operations; 23.3 MB/s Write randomly ./db_bench --benchmarks=fillrandom --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216 Main: fillrandom : 16.351 micros/op 61159 ops/sec 1635.066 seconds 100000000 operations; 6.8 MB/s This PR: fillrandom : 15.798 micros/op 63298 ops/sec 1579.817 seconds 100000000 operations; 7.0 MB/s ``` Reviewed By: ajkr Differential Revision: D44645650 Pulled By: cbi42 fbshipit-source-id: 8631f3a6b3f01decbbf18c34f2b62833cb4f9733
-
由 mayue.fight 提交于
Summary: **Context/Summary:** ASSERT_EQ will only verify the code of Status, but will not check the state message of Status. - Assert by checking Status state in `ImportColumnFamilyTest` - Forgot to set db_comparator_name when creating ExportImportFilesMetaData in `ImportColumnFamilyNegativeTest` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11372 Reviewed By: ajkr Differential Revision: D45004343 Pulled By: cbi42 fbshipit-source-id: a13d45521df17ead3d6d4c1c1fe1e4c95397ce8b
-
- 13 4月, 2023 1 次提交
-
-
由 Jeff Palm 提交于
Summary: util/ribbon_test.cc: avoid ambiguous reversed operator error in c++20 (and enable checking for the error) Code would produce errors like this, when compiled with -Wambiguous-reversed-operator under c++20. ``` util/ribbon_test.cc:695:20: error: ISO C++20 considers use of overloaded operator '!=' (with operand types 'KeyGen' (aka '(anonymous namespace)::StandardKeyGen') and 'KeyGen') to be ambiguou s despite there being a unique best viable function with non-reversed arguments [-Werror,-Wambiguous-reversed-operator] while (cur != batch_end) { ~~~ ^ ~~~~~~~~~ util/ribbon_test.cc:111:8: note: candidate function with non-reversed arguments bool operator!=(const StandardKeyGen& other) { ^ util/ribbon_test.cc:107:8: note: ambiguous candidate function with reversed arguments bool operator==(const StandardKeyGen& other) { ^ ``` This will become a hard error in future standards. Confirmed that no errors were generated when building using clang and c++20: ``` USE_CLANG=1 USE_COROUTINES=1 make ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11371 Reviewed By: meyering Differential Revision: D44921027 Pulled By: cbi42 fbshipit-source-id: ef25b78260920a4d75a718310688d3a2487ffa87
-
- 12 4月, 2023 1 次提交
-
-
由 Yu Zhang 提交于
Summary: This option is immutable through the life time of the DB open. For now, updating its value between different DB open sessions is also a non compatible change. When I work on support for updating comparator, the type of updates accepted for this option will be supported then. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11362 Test Plan: `make check` Reviewed By: ltamasi Differential Revision: D44873870 Pulled By: jowlyzhang fbshipit-source-id: aa02094754b58d99abf9af4c9a8108c1350254cb
-
- 11 4月, 2023 2 次提交
-
-
由 Andrew Kryczka 提交于
Summary: Fixed the following failure: ``` third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc: In function ‘bool testing::internal::StackGrowsDown()’: third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:8681:24: error: ‘dummy’ may be used uninitialized [-Werror=maybe-uninitialized] 8681 | StackLowerThanAddress(&dummy, &result); | ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~ third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:8671:13: note: by argument 1 of type ‘const void*’ to ‘void testing::internal::StackLowerThanAddress(const void*, bool*)’ declared here 8671 | static void StackLowerThanAddress(const void* ptr, bool* result) { | ^~~~~~~~~~~~~~~~~~~~~ third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:8679:7: note: ‘dummy’ declared here 8679 | int dummy; | ^~~~~ ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11361 Reviewed By: cbi42 Differential Revision: D44838033 Pulled By: ajkr fbshipit-source-id: 27d68b5a24a15723bbaaa7de45ccd70a60fe259e
-
由 Niklas Fiekas 提交于
Summary: Makes it easier to use generated Rust bindings. Constness of these is already part of the C++ API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11243 Reviewed By: hx235 Differential Revision: D44840394 Pulled By: ajkr fbshipit-source-id: bcd1aeb8c959c304148d25b00043bb8c4cd3e0a4
-
- 08 4月, 2023 7 次提交
-
-
由 Zdenek Korcak 提交于
Summary: copy paste typo Pull Request resolved: https://github.com/facebook/rocksdb/pull/11325 Reviewed By: hx235 Differential Revision: D44378512 Pulled By: ajkr fbshipit-source-id: 509ed2697c06eed975914359ece0459a0ea40312
-
由 nccx 提交于
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11357 Reviewed By: hx235 Differential Revision: D44774454 Pulled By: ajkr fbshipit-source-id: f3912316b6cd4e0b41310590c93f914f1d943044
-
由 leipeng 提交于
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11348 Reviewed By: hx235 Differential Revision: D44774863 Pulled By: ajkr fbshipit-source-id: ba4bd959650228a71fca6bf62840ae9d7373d6f0
-
由 nccx 提交于
Summary: The CI systems other than CircleCI are almost always in a failing state. Since CircleCI covers linux, macos, and windows, we can remove the others. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11354 Reviewed By: hx235 Differential Revision: D44774627 Pulled By: ajkr fbshipit-source-id: c83b298ec5afe4ea410744eda6cc98fc6a3365f1
-
由 Changyu Bi 提交于
Summary: valgrind complains "Conditional jump or move depends on uninitialised value(s)". A sample error message: ``` [ RUN ] DBCompactionTest.DrainUnnecessaryLevelsAfterDBBecomesSmall ==3353864== Conditional jump or move depends on uninitialised value(s) ==3353864== at 0x8647B4: rocksdb::VersionStorageInfo::ComputeCompactionScore(rocksdb::ImmutableOptions const&, rocksdb::MutableCFOptions const&) (version_set.cc:3414) ==3353864== by 0x86B340: rocksdb::VersionSet::AppendVersion(rocksdb::ColumnFamilyData*, rocksdb::Version*) (version_set.cc:4946) ==3353864== by 0x876B88: rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const*) (version_set.cc:6876) ==3353864== by 0xBA66FE: rocksdb::VersionEditHandler::CreateCfAndInit(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const&) (version_edit_handler.cc:483) ==3353864== by 0xBA4A81: rocksdb::VersionEditHandler::Initialize() (version_edit_handler.cc:187) ==3353864== by 0xBA3927: rocksdb::VersionEditHandlerBase::Iterate(rocksdb::log::Reader&, rocksdb::Status*) (version_edit_handler.cc:31) ==3353864== by 0x870173: rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool) (version_set.cc:5729) ==3353864== by 0x7538FA: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*) (db_impl_open.cc:522) ==3353864== by 0x75BA0F: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (db_impl_open.cc:1928) ==3353864== by 0x75A735: rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (db_impl_open.cc:1743) ==3353864== by 0x75A510: rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**) (db_impl_open.cc:1720) ==3353864== by 0x5925FD: rocksdb::DBTestBase::TryReopen(rocksdb::Options const&) (db_test_util.cc:710) ==3353864== Uninitialised value was created by a heap allocation ==3353864== at 0x4842F0F: operator new(unsigned long) (vg_replace_malloc.c:422) ==3353864== by 0x876AF4: rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const*) (version_set.cc:6870) ==3353864== by 0xBA66FE: rocksdb::VersionEditHandler::CreateCfAndInit(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const&) (version_edit_handler.cc:483) ==3353864== by 0xBA4A81: rocksdb::VersionEditHandler::Initialize() (version_edit_handler.cc:187) ==3353864== by 0xBA3927: rocksdb::VersionEditHandlerBase::Iterate(rocksdb::log::Reader&, rocksdb::Status*) (version_edit_handler.cc:31) ==3353864== by 0x870173: rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool) (version_set.cc:5729) ==3353864== by 0x7538FA: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*) (db_impl_open.cc:522) ==3353864== by 0x75BA0F: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (db_impl_open.cc:1928) ==3353864== by 0x75A735: rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (db_impl_open.cc:1743) ==3353864== by 0x75A510: rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**) (db_impl_open.cc:1720) ==3353864== by 0x5925FD: rocksdb::DBTestBase::TryReopen(rocksdb::Options const&) (db_test_util.cc:710) ==3353864== by 0x591F73: rocksdb::DBTestBase::Reopen(rocksdb::Options const&) (db_test_util.cc:662) ``` This is likely about `lowest_unnecessary_level_` even though it would be initialized in `CalculateBaseBytes()` before being used in `ComputeCompactionScore()`. Initialize it also in VersionStorageInfo constructor to prevent valgrind from complaining. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11359 Test Plan: - ran a test with valgrind which gave the error message above before this PR: `valgrind --track-origins=yes ./db_compaction_test --gtest_filter="*DrainUnnecessaryLevelsAfterDBBecomesSmall*"` Reviewed By: hx235 Differential Revision: D44799112 Pulled By: cbi42 fbshipit-source-id: 557208a66f04a2163b418b2a651bdb7e777c4511
-
由 Peter Dillinger 提交于
Summary: After https://github.com/facebook/rocksdb/issues/11301, I wasn't sure whether I had regressed block cache tracing with MultiGet. Demo PR https://github.com/facebook/rocksdb/issues/11330 shows the flawed state of tracing MultiGet before my change, and based on the unit test, there was essentially no change in tracing behavior with https://github.com/facebook/rocksdb/issues/11301. This change is to leave that code and behavior better than I found it. This change is not intended to change any production behaviors except when block cache tracing is active, though might improve general read path efficiency by disabling some related tracking when such tracing is disabled. More detail on production code: * Refactoring to consolidate the construction of BlockCacheTraceRecord, and other related functionality, in block-based table reader, though it's somewhat awkward to preserve an optimization to avoid copying Slices into temporary strings in BlockCacheLookupContext. * Accurately track cache hits and misses (etc.) for each data block accessed by a MultiGet(). (Previously reported hits as misses.) * Reduced repeated checking of `block_cache_tracer_` state (by creating lookup_context only when active) for efficiency and to reduce the risk of corner case bugs where tracing is enabled or disabled for different parts of a read op. (See a TODO below) * Improved estimate calculation for num_keys_in_block (see code comment) Possible follow-up: * `XXX:` use_cache=true means double cache query? (possible double-query of block cache when allow_mmap_reads=true) * `TODO:` need more than one lookup_context here to track individual filter and index partition hits and misses * `TODO:` optimize more state checks of `block_cache_tracer_` down to `lookup_context != nullptr` * Pre-existing `XXX:` There appear to be 'break' statements above that bypass this writing of the block cache trace record * Expand test coverage (see below) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11339 Test Plan: * Added a basic unit test for block cache tracing MultiGet, for now just covering one data block with two keys. * Added HitMissCountingCache to independently verify that the actual block cache trace and expected block cache trace also agree with the actual number of cache hits / misses (nothing missing or mislabeled). For now only used with MultiGet test. * Better testing of num_keys_in_block, for now just with MultiGet * Misc improvements to table_test to improve clarity, such as making it clear that certain keys are auto-inserted at the start of every test. Performance test: Testing multireadrandom as in https://github.com/facebook/rocksdb/issues/11301, except averaging over distinct runs rather than [-X30] which doesn't seem to sufficiently reset after each run to work as an independent test run. Base with revert of 11301: 3148926 ops/sec Base: 3019146 ops/sec New: 2999529 ops/sec Possibly a tiny MultiGet CPU regression with this change. We are now always allocating an additional vector for the LookupContexts. I'm still contemplating options to try to correct the regression in https://github.com/facebook/rocksdb/issues/11301. Testing readrandom: Base with revert of 11301: 2311988 Base: 2281726 New: 2299722 Possibly a tiny Get CPU improvement with this change. We are now avoiding some unnecessary LookupContext population. Reviewed By: akankshamahajan15 Differential Revision: D44557845 Pulled By: pdillinger fbshipit-source-id: b841691799d2a48fb59cc8880dc7cbb1e107ae3d
-
由 Changyu Bi 提交于
Summary: when data block hash index finds a key of op_type `kTypeMerge`, do not redo data block seek. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11356 Test Plan: - added new unit test - crash test: `python3 tools/db_crashtest.py whitebox --simple --use_merge=1 --data_block_index_type=1` - benchmark see slight improvement in read throughput: ``` TEST_TMPDIR=/dev/shm/hashindex ./db_bench -benchmarks=mergerandom -use_existing_db=false -num=10000000 -compression_type=none -level_compaction_dynamic_level_bytes=1 -merge_operator=PutOperator -write_buffer_size=1000000 --use_data_block_hash_index=1 TEST_TMPDIR=/dev/shm/hashindex ./db_bench -benchmarks=readrandom[-X10] -use_existing_db=true -num=10000000 -merge_operator=PutOperator -readonly=1 -disable_auto_compactions=1 -reads=100000 Main: readrandom [AVG 10 runs] : 29526 (± 1118) ops/sec; 2.1 (± 0.1) MB/sec Post-PR: readrandom [AVG 10 runs] : 31095 (± 662) ops/sec; 2.2 (± 0.0) MB/sec ``` Reviewed By: pdillinger Differential Revision: D44759895 Pulled By: cbi42 fbshipit-source-id: 387f0c35938c7e0e96b810ca3babf1967fc68191
-
- 07 4月, 2023 2 次提交
-
-
由 Wentian Guo 提交于
Summary: If RocksDB enables user-defined timestamp, then RocksDB read path can filter table files by the min/max timestamps of each file. If application wants to lookup a key that is the most recent and visible to a certain timestamp ts, then we can compare ts with the min_ts of each file. If ts < min_ts, then we know all keys in the file is not visible at time ts, then we do not have to open the file. This can also save an IO. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11332 Reviewed By: pdillinger Differential Revision: D44763497 Pulled By: guowentian fbshipit-source-id: abde346b9f18480fe03c04e4006e7d62aa9c22a8
-
由 Changyu Bi 提交于
Summary: When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, this is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amp. This PR boosts compaction score for these levels to allow RocksDB to automatically drain these levels. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic without needing user to do a one time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340 Test Plan: - New unit tests - `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run. Reviewed By: ajkr Differential Revision: D44563884 Pulled By: cbi42 fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8
-
- 06 4月, 2023 2 次提交
-
-
由 anand76 提交于
Summary: VerifyFileChecksums currently interprets the readahead_size as a payload of readahead_size for calculating the checksum, plus a prefetch of an additional readahead_size. Hence each read is readahead_size * 2. This change treats it as chunks of readahead_size for checksum calculation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11328 Test Plan: Add a unit test Reviewed By: pdillinger Differential Revision: D44718781 Pulled By: anand1976 fbshipit-source-id: 79bae1ebaa27de2a13bc86f5910bf09356936e63
-
由 Hui Xiao 提交于
Summary: **Context/Summary:** As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11355 Test Plan: - Ran previously failed tests and they succeed - Perf `./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3` Reviewed By: ajkr Differential Revision: D44719333 Pulled By: hx235 fbshipit-source-id: 23d22f314144071d97f7106ff1241c31c0bdf08b
-
- 05 4月, 2023 3 次提交
-
-
由 Andrew Kryczka 提交于
Summary: This is mostly taken from https://github.com/facebook/rocksdb/issues/10427 with my own comments addressed. This PR plumbs the user’s `ReadOptions` down to `GetOrReadIndexBlock()`, `GetOrReadFilterBlock()`, and `GetFilterPartitionBlock()`. Now those functions no longer have to make up a `ReadOptions` with incomplete information. I also let `PartitionIndexReader::NewIterator()` pass through its caller's `ReadOptions::verify_checksums`, which was inexplicably dropped previously. Fixes https://github.com/facebook/rocksdb/issues/10463 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11208 Test Plan: Functional: - Measured `-verify_checksum=false` applies to metadata blocks read outside of table open - setup command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56` - run command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=readrandom -use_existing_db=true -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56 -duration=10 -threads=32 -cache_size=131072 -statistics=true -verify_checksum=false -open_files=20 -cache_index_and_filter_blocks=true` - before: `rocksdb.block.checksum.compute.count COUNT : 384353` - after: `rocksdb.block.checksum.compute.count COUNT : 22` Performance: - Setup command (tmpfs, 128MB logical data size, cache indexes/filters without pinning so index/filter lookups go through table reader): `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=131072 -target_file_size_base=131072 -max_bytes_for_level_base=524288 -compression_type=none -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1` - Measured point lookup performance. Database is fully cached to emphasize any new callstack overheads - Command: `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=readrandom[-W1][-X20] -use_existing_db=true -cache_index_and_filter_blocks=true -disable_auto_compactions=true -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1 -duration=10 -cache_size=1048576000` - Before: `readrandom [AVG 20 runs] : 274848 (± 3717) ops/sec; 8.4 (± 0.1) MB/sec` - After: `readrandom [AVG 20 runs] : 277904 (± 4474) ops/sec; 8.5 (± 0.1) MB/sec` Reviewed By: hx235 Differential Revision: D43145366 Pulled By: ajkr fbshipit-source-id: 75ec062ece86a82cd788783de9de2c72df57f994
-
由 Peter Dillinger 提交于
Summary: I previously misread or misinterpreted API contracts for SecondaryCache and this should correct the record. (Follow-up item from https://github.com/facebook/rocksdb/issues/11301) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11316 Test Plan: comments only Reviewed By: anand1976 Differential Revision: D44245107 Pulled By: pdillinger fbshipit-source-id: 3f8ddec150674b75728f1730f99b963bbf7b76e7
-
由 Peter Dillinger 提交于
Summary: ... which increases default number of shards from 16 to 64. Although the default block cache size is only recommended for applications where RocksDB is not performance-critical, under stress conditions, block cache mutex contention could become a performance bottleneck. This change of default should alleviate that. Note that reducing the size of cache shards (recommended minimum 512MB) could cause thrashing, e.g. on filter blocks, so capacity needs to increase to safely increase number of shards. The 8MB default dates back to 2011 or earlier (f779e7a5), when the most simultaneous threads you could get from a single CPU socket was 20 (e.g. Intel Xeon E7-8870). Now more than 100 is available. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11350 Test Plan: unit tests updated Reviewed By: cbi42 Differential Revision: D44674873 Pulled By: pdillinger fbshipit-source-id: 91ed3070789b42679283c7e6dc97c41a6a97bdf4
-