1. 26 Jul 2022 (3 commits)
    • Account for DB ID in stress testing block cache keys (#10388) · 01a2e202
      Committed by Peter Dillinger
      Summary:
      I recently discovered that block cache keys are slightly lower
      quality than previously thought, because my stress testing tool failed
      to simulate the effect of DB ID differences. This change updates the
      tool and gives us data to guide future developments. (No changes to
      production code here and now.)
      
      Nevertheless, the following promise still holds
      
      ```
      // In fact, if our SST files are all < 4TB (see
      // BlockBasedTable::kMaxFileSizeStandardEncoding), then SST files generated
      // in a single process are guaranteed to have unique cache keys, unless/until
      // number session ids * max file number = 2**86 ...
      ```
      
      because although different DB IDs could cause collision in file number
      and offset data, that would have to be using the same DB session (lower)
      to cause a block cache key collision, which is not possible in the same
      process. (A session is associated with only one DB ID.)
      
       This change updates cache_bench -stress_cache_key to set and reset DB IDs in
       a parameterized way so we can evaluate the effect. Previous results, assumed
       to be representative (using -sck_keep_bits=43):
      
      ```
      15 collisions after 15 x 90 days, est 90 days between (1.03763e+20 corrected)
      ```
      
      or expected collision on a single machine every 104 billion billion
      days (see "corrected" value).
      
       After accounting for DB IDs that never really change, change at an intermediate
       rate, or change very frequently (using the default -sck_db_count=100):
      
      ```
      -sck_newdb_nreopen=1000000000:
      15 collisions after 2 x 90 days, est 12 days between (1.38351e+19 corrected)
      -sck_newdb_nreopen=10000:
      17 collisions after 2 x 90 days, est 10.5882 days between (1.22074e+19 corrected)
      -sck_newdb_nreopen=100:
      19 collisions after 2 x 90 days, est 9.47368 days between (1.09224e+19 corrected)
      ```
      
       or roughly 10x more often than previously thought (still extremely rare,
       if not practically impossible), and better than fully random base cache keys
       (with -sck_randomize), though less than 10x better than random:
      
      ```
      31 collisions after 1 x 90 days, est 2.90323 days between (3.34719e+18 corrected)
      ```
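       The "corrected" figures above are consistent with scaling the simulated
       estimate up by 2^(103 - sck_keep_bits), i.e. restoring the ~103 bits of
       session-ID entropy that the simulation truncates away. This reading is an
       inference from the reported numbers, not the tool's documented formula;
       a sketch of it:

       ```cpp
       #include <cmath>

       // Assumed interpretation of cache_bench's "corrected" output: scale the
       // estimated days-between-collisions from the truncated key space
       // (sck_keep_bits bits) back up to the ~103 bits of entropy that a DB
       // session ID provides. The constant 103 is taken from the text; the
       // scaling rule itself is an assumption that happens to reproduce the
       // reported figures.
       double CorrectedDaysBetween(double est_days, int sck_keep_bits) {
         const int kSessionIdEntropyBits = 103;
         return est_days * std::ldexp(1.0, kSessionIdEntropyBits - sck_keep_bits);
       }
       ```

       With est_days = 90 and sck_keep_bits = 43 this reproduces the 1.03763e+20
       figure above, and with est_days = 114 and sck_keep_bits = 39 it reproduces
       the 2.10293e+21 figure further below.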
      
      If we simply fixed this by ignoring DB ID for cache keys, we would
      potentially have a shortage of entropy for some cases, such as small
      file numbers and offsets (e.g. many short-lived processes each using
      SstFileWriter to create a small file), because existing DB session IDs
      only provide ~103 bits of entropy. We could upgrade the entropy in DB
      session IDs to accommodate, but it's not known what all would be
      affected by changing from 20 digit session IDs to something larger.
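       For reference, the ~103-bit figure matches a 20-character ID drawn from a
       base-36 alphabet (the base-36/20-character format is an assumption here,
       sketched to show where the number comes from):

       ```cpp
       #include <cmath>

       // Entropy of an n-character ID over an alphabet of the given size.
       // 20 base-36 characters give 20 * log2(36) ≈ 103.4 bits, matching the
       // "~103 bits" figure above.
       double IdEntropyBits(int num_chars, int alphabet_size) {
         return num_chars * std::log2(static_cast<double>(alphabet_size));
       }
       ```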
      
      Instead, my plan is to
      1) Move to block cache keys derived from SST unique IDs (so that we can
      derive block cache keys from manifest data without reading file on
      storage), and show no significant regression in expected collision
      rate.
      2) Generate better SST unique IDs in format_version=6 (https://github.com/facebook/rocksdb/issues/9058),
      which should have ~100x lower expected/predicted collision rate based
      on simulations with this stress test:
      ```
      ./cache_bench -stress_cache_key -sck_keep_bits=39 -sck_newdb_nreopen=100 -sck_footer_unique_id
      ...
      15 collisions after 19 x 90 days, est 114 days between (2.10293e+21 corrected)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10388
      
      Test Plan: no production changes
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37986714
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e759b2469e3365cb01c6661a69e0ab849ef4c3df
    • Fix a bug in hash linked list (#10401) · 4e007480
      Committed by sdong
      Summary:
       In hash linked list, with a bucket containing only one record, the following sequence can cause users to temporarily miss a record:
       
       Thread 1: Fetch the structure that bucket x points to, which is a node n1 for a key, with a null next pointer.
       Thread 2: Insert a key into bucket x that is larger than the existing key. This makes n1->next point to a new node n2 and updates bucket x to point to n1.
       Thread 1: Sees that n1->next is not null, so it assumes n1 is the header of a linked list and ignores the key of n1.
       
       Fix it by re-fetching the structure that bucket x points to when n1->next is seen to be non-null. This works because if n1->next is not null, bucket x must already point to a linked list or skip list header.
       
       A related change is to reverse the order of testing for linked list and skip list, because after re-fetching the bucket it might end up with a skip list rather than a linked list.
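       A minimal single-bucket sketch of the re-fetch logic described above
       (names and the bucket representation are simplified assumptions, not the
       actual memtable code):

       ```cpp
       #include <atomic>

       // Simplified node; real HashLinkList nodes carry key bytes after this.
       struct Node {
         std::atomic<Node*> next{nullptr};
       };

       // `loaded` is what a reader just fetched from the bucket. A bucket
       // points either directly at a single Node (whose next is null) or at
       // the header of a linked list / skip list. If the loaded Node's next
       // pointer turned non-null, a concurrent insert has converted the
       // bucket, so re-fetch: by then the bucket is guaranteed to point at a
       // proper header.
       void* ResolveBucketEntry(std::atomic<void*>& bucket, Node* loaded) {
         if (loaded != nullptr &&
             loaded->next.load(std::memory_order_acquire) != nullptr) {
           return bucket.load(std::memory_order_acquire);  // the fix: re-fetch
         }
         return loaded;
       }
       ```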
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10401
      
      Test Plan: Run existing tests and make sure at least it doesn't regress.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38064471
      
      fbshipit-source-id: 142bb85e1546c803f47e3357aef3e76debccd8df
    • Lock-free ClockCache (#10390) · 6a160e1f
      Committed by Guido Tagliavini Ponce
      Summary:
       This PR makes ClockCache completely free of locks. As part of it, we have also pushed the clock algorithm functionality out of ClockCacheShard into ClockHandleTable, so that ClockCacheShard acts more as an interface and less as an actual data structure.
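       A toy lock-free clock sweep, to illustrate the general technique (this is
       a generic sketch, not the actual ClockHandleTable, which also packs
       reference counts and handle state into its per-slot atomics):

       ```cpp
       #include <atomic>
       #include <cstddef>
       #include <cstdint>
       #include <vector>

       // Each slot has an atomic "recently used" bit. Lookups set the bit; the
       // evictor sweeps, clearing set bits (second chance) and evicting the
       // first slot found already clear. No mutex anywhere.
       class ClockSketch {
        public:
         explicit ClockSketch(size_t n) : used_(n), hand_(0) {}

         void Touch(size_t i) { used_[i].store(1, std::memory_order_relaxed); }

         // Returns the index of the evicted slot.
         size_t Evict() {
           for (;;) {
             size_t i =
                 hand_.fetch_add(1, std::memory_order_relaxed) % used_.size();
             uint8_t expected = 1;
             // Bit was set: clear it and keep sweeping. Bit was clear: evict.
             if (!used_[i].compare_exchange_strong(expected, 0,
                                                   std::memory_order_relaxed)) {
               return i;
             }
           }
         }

        private:
         std::vector<std::atomic<uint8_t>> used_;
         std::atomic<size_t> hand_;
       };
       ```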
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10390
      
      Test Plan:
      - ``make -j24 check``
      - ``make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache --cache_size=1073741824 --block_size=16384" blackbox_crash_test_with_atomic_flush``
      
      Reviewed By: pdillinger
      
      Differential Revision: D38106945
      
      Pulled By: guidotag
      
      fbshipit-source-id: 6cbf6bd2397dc9f582809ccff5118a8a33ea6cb1
  2. 25 Jul 2022 (1 commit)
    • Support subcmpct using reserved resources for round-robin priority (#10341) · 8860fc90
      Committed by Zichen Zhu
      Summary:
      Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows:
       * Constraint 1: We can only pick consecutive files
         - Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files
         - Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys)
       * Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes`
       * Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)`
       * Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3
      
      More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`.
      
       The above optimization accelerates the process of moving the compaction cursor, which further reduces write-amp. Since a single large compaction may lead to a long write stall, we break such a large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps:
       * Step 1: Initialize the plan against `max_output_file_limit`, the number of input files in the start level, and the range size limit `ranges.size()`
       * Step 2: Call `AcquireSubcompactionResources()` when the max subcompactions limit is not sufficient; we may or may not obtain the desired resources (any additional resources obtained are recorded in `extra_num_subcompaction_threads_reserved_`). The subcompaction limit is then changed and `num_planned_subcompactions` is updated via `GetSubcompactionLimit()`
       * Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources are released (extra resources may exist for round-robin compaction when the actual number of subcompactions is less than the number of planned subcompactions)
      
      More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`.
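       The plan/acquire/shrink flow in the steps above can be sketched as follows
       (a toy model with hypothetical names; the real logic lives in the
       functions named above and tracks thread-pool state):

       ```cpp
       #include <algorithm>
       #include <cstdint>

       // Step 1: the initial plan is bounded by the output-file limit, the
       // number of input files in the start level, and the number of ranges.
       uint64_t InitialPlannedSubcompactions(uint64_t max_output_file_limit,
                                             uint64_t start_level_input_files,
                                             uint64_t range_count) {
         return std::min(
             {max_output_file_limit, start_level_input_files, range_count});
       }

       struct SubcompactionPlan {
         uint64_t limit;           // current subcompaction limit
         uint64_t extra_reserved;  // extra threads reserved beyond the limit
       };

       // Step 2: if the plan exceeds the current limit, try to reserve extra
       // threads; we may obtain fewer than requested.
       void AcquireExtra(SubcompactionPlan& plan, uint64_t planned,
                         uint64_t available_threads) {
         if (planned > plan.limit) {
           plan.extra_reserved = std::min(planned - plan.limit, available_threads);
           plan.limit += plan.extra_reserved;
         }
       }

       // Step 3: release any reservation beyond what actually runs.
       void ShrinkExtra(SubcompactionPlan& plan, uint64_t actual) {
         if (actual < plan.limit && plan.extra_reserved > 0) {
           uint64_t release = std::min(plan.extra_reserved, plan.limit - actual);
           plan.extra_reserved -= release;
           plan.limit -= release;
         }
       }
       ```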
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341
      
      Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc`
      
      Reviewed By: ajkr, hx235
      
      Differential Revision: D37792644
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
  3. 24 Jul 2022 (1 commit)
    • Improve SubCompaction Partitioning (#10393) · 252bea40
      Committed by sdong
      Summary:
       Unit tests still haven't been fixed, and more tests need to be added. But I ran some simple fillrandom db_bench runs and the partitioning feels reasonable.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393
      
      Test Plan:
       1. Make sure existing tests pass; this should cover some basic subcompaction logic being correct and the partitioning result being reasonable;
       2. Add a new unit test for ApproximateKeyAnchors();
       3. Run some db_bench with max_subcompactions = 4 and verify that the compaction is indeed partitioned evenly.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38043783
      
      fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f
  4. 23 Jul 2022 (6 commits)
  5. 22 Jul 2022 (3 commits)
  6. 20 Jul 2022 (3 commits)
    • Fix explanation of XOR usage in KV checksum blog post (#10392) · a0c63083
      Committed by Andrew Kryczka
      Summary:
      Thanks pdillinger for reminding us that we are protected from swapping corruptions due to independent seeds (and for suggesting that approach in the first place).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10392
      
      Reviewed By: cbi42
      
      Differential Revision: D37981819
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3ed32982ae1dbc88eb92569010f9f2e8d190c962
    • Stop operating on DB in a stress test background thread (#10373) · b443d24f
      Committed by Yanqin Jin
      Summary:
       Stress test background threads do not coordinate with test worker
       threads on db reopen in the middle of a test run, so accessing the db
       object from a stress test background thread can race with test workers.
       Remove the TimestampedSnapshotThread.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10373
      
      Test Plan:
      ```
      ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=1 \
      --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 \
      --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 \
      --block_size=16384 --bloom_bits=7.580319535285394 --bottommost_compression_type=disable \
      --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache \
      --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 \
      --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \
      --compact_files_one_in=1000000 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=0 \
      --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 \
      --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 \
      --continuous_verification_interval=0 --create_timestamped_snapshot_one_in=20 --data_block_index_type=0 \
      --db=/dev/shm/rocksdb/ --db_write_buffer_size=0 --delpercent=5 --delrangepercent=0 --destroy_db_initially=1 \
      --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=1 --enable_pipelined_write=0 \
      --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=2 \
      --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \
      --get_sorted_wal_files_one_in=0 --index_block_restart_interval=11 --index_type=0 --ingest_external_file_one_in=0 \
      --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \
      --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 \
      --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 \
      --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 \
      --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.5 \
      --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True \
      --nooverwritepercent=1 --open_files=500000 --open_metadata_write_fault_one_in=0 \
      --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000 \
      --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=2 \
      --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=1 \
      --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=1000 \
      --readpercent=55 --recycle_log_file_num=0 --reopen=100 --ribbon_starting_level=8 \
      --secondary_cache_fault_one_in=0 --secondary_cache_uri= --snapshot_hold_ops=100000 \
      --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 \
      --subcompactions=3 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 \
      --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=1 \
      --txn_write_policy=0 --unordered_write=0 --unpartitioned_pinning=0 \
      --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=1 \
      --use_merge=1 --use_multiget=0 --use_txn=1 --user_timestamp_size=0 --value_size_mult=32 \
      --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 \
      --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none \
      --write_buffer_size=4194304 --write_dbid_to_manifest=0 --writepercent=35
      ```
      make crash_test_with_txn
      make crash_test_with_multiops_wc_txn
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37903189
      
      Pulled By: riversand963
      
      fbshipit-source-id: cd1728ad7ba4ce4cf47af23c4f65dda0956744f9
    • Fix race conditions in GenericRateLimiter (#10374) · e576f2ab
      Committed by Andrew Kryczka
      Summary:
      Made locking strict for all accesses of `GenericRateLimiter` internal state.
      
      `SetBytesPerSecond()` was the main problem since it had no locking, while the two updates it makes need to be done as one atomic operation.
      
      The test case, "ConfigOptionsTest.ConfiguringOptionsDoesNotRevertRateLimiterBandwidth", is for the issue fixed in https://github.com/facebook/rocksdb/issues/10378, but I forgot to include the test there.
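       A minimal sketch of the idea behind the fix (a hypothetical class; the
       real GenericRateLimiter has much more state): both fields derived from
       the rate must be updated as one atomic step under the same mutex, so a
       reader can never observe a new rate paired with a stale refill amount.

       ```cpp
       #include <cstdint>
       #include <mutex>
       #include <utility>

       class RateLimiterSketch {
        public:
         // Without the mutex, a concurrent reader could see the new
         // rate_bytes_per_sec_ with the old refill_bytes_per_period_.
         void SetBytesPerSecond(int64_t bytes_per_second) {
           std::lock_guard<std::mutex> lock(mu_);
           rate_bytes_per_sec_ = bytes_per_second;
           refill_bytes_per_period_ =
               bytes_per_second * kRefillPeriodMicros / 1000000;
         }

         std::pair<int64_t, int64_t> GetRateAndRefill() {
           std::lock_guard<std::mutex> lock(mu_);
           return {rate_bytes_per_sec_, refill_bytes_per_period_};
         }

        private:
         static constexpr int64_t kRefillPeriodMicros = 100000;  // assumed 100ms
         std::mutex mu_;
         int64_t rate_bytes_per_sec_ = 0;
         int64_t refill_bytes_per_period_ = 0;
       };
       ```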
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10374
      
      Reviewed By: pdillinger
      
      Differential Revision: D37906367
      
      Pulled By: ajkr
      
      fbshipit-source-id: ccde620d2a7f96d1401bdafd2bdb685cbefbafa5
  7. 19 Jul 2022 (6 commits)
  8. 17 Jul 2022 (2 commits)
  9. 16 Jul 2022 (8 commits)
  10. 15 Jul 2022 (4 commits)
    • DB::PutEntity() shouldn't be defined as =0 (#10364) · 00e68e7a
      Committed by sdong
      Summary:
       DB::PutEntity() is declared pure virtual (`= 0`), but it is actually implemented in db/db_impl/db_impl_write.cc. This is inconsistent, and might cause problems when users implement class DB themselves.
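       A minimal illustration of the mismatch (with hypothetical types, not the
       real DB/Status API): a pure-virtual declaration forces every user
       subclass of DB to implement the method, even when a sensible default
       exists; a plain virtual with a default body keeps user subclasses
       compiling.

       ```cpp
       #include <string>

       struct StatusSketch {
         bool ok;
         std::string msg;
       };

       class DBLike {
        public:
         virtual ~DBLike() = default;
         // Declaring `virtual StatusSketch PutEntity(...) = 0;` while also
         // providing a body elsewhere is inconsistent: user subclasses would
         // still be forced to override it. A regular virtual with a default:
         virtual StatusSketch PutEntity(const std::string& /*key*/) {
           return {false, "PutEntity not supported by this DB implementation"};
         }
       };
       ```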
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10364
      
      Test Plan: See existing tests pass
      
      Reviewed By: riversand963
      
      Differential Revision: D37874886
      
      fbshipit-source-id: b81713ddb707720b52d57a15de56a59414c24f66
    • Add seqno to time mapping (#10338) · a3acf2ef
      Committed by Jay Zhuang
      Summary:
       This will be used by tiered storage to preclude hot data from
       compacting to the cold tier (the last level).
       Internally, this adds a seqno-to-time mapping. A periodic_task is scheduled
       to record the current_seqno -> current_time mapping at a certain cadence.
       When a memtable is flushed, the mapping information is stored in an SST
       property. During compaction, the mapping information is merged to get the
       approximate time of a sequence number, which is used to determine whether
       a key was recently inserted and to preclude it from the last level if it
       was inserted recently (within `preclude_last_level_data_seconds`).
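       A toy version of the mapping described above (hypothetical names; in the
       real implementation the pairs are recorded by a periodic task and
       persisted in table properties, then merged during compaction):

       ```cpp
       #include <algorithm>
       #include <cstdint>
       #include <limits>
       #include <utility>
       #include <vector>

       class SeqnoToTimeSketch {
        public:
         // Called at a fixed cadence: record (current_seqno, current_time).
         void Append(uint64_t seqno, uint64_t time_secs) {
           pairs_.emplace_back(seqno, time_secs);  // seqnos arrive in order
         }

         // Approximate write time of `seqno`: the time recorded at the largest
         // recorded seqno <= seqno (0 if seqno predates all records).
         uint64_t GetApproximateTime(uint64_t seqno) const {
           auto it = std::upper_bound(
               pairs_.begin(), pairs_.end(),
               std::make_pair(seqno, std::numeric_limits<uint64_t>::max()));
           return it == pairs_.begin() ? 0 : std::prev(it)->second;
         }

        private:
         std::vector<std::pair<uint64_t, uint64_t>> pairs_;  // sorted by seqno
       };
       ```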
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338
      
      Test Plan: CI
      
      Reviewed By: siying
      
      Differential Revision: D37810187
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f
    • Fix HISTORY.md for misplaced items (#10362) · 66685d6a
      Committed by Siying Dong
      Summary:
       Some items were misplaced under 7.4 even though they are unreleased.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10362
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37859426
      
      fbshipit-source-id: e2ad099227309ed2e0f3ca450a9a43986d681c7c
    • Make InternalKeyComparator not configurable (#10342) · c8b20d46
      Committed by sdong
      Summary:
       InternalKeyComparator is an internal class that is a simple wrapper around Comparator. https://github.com/facebook/rocksdb/pull/8336 made Comparator customizable. As a side effect, the internal key comparator became configurable too. This introduces overhead to this simple wrapper: for example, every InternalKeyComparator has an std::vector attached to it, which consumes memory and possibly incurs allocation overhead.
       We stop InternalKeyComparator from being customizable by making it no longer a subclass of Comparator.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10342
      
      Test Plan: Run existing CI tests and make sure it doesn't fail
      
      Reviewed By: riversand963
      
      Differential Revision: D37771351
      
      fbshipit-source-id: 917256ee04b2796ed82974549c734fb6c4d8ccee
  11. 14 Jul 2022 (3 commits)