1. 24 June 2022, 2 commits
    • Add suggest_compact_range() and suggest_compact_range_cf() to C API. (#10175) · 2a3792ed
      Yueh-Hsuan Chiang committed
      Summary:
      Add suggest_compact_range() and suggest_compact_range_cf() to C API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10175
      
      Test Plan:
      As verifying the result requires SyncPoint, which is not available in c_test.c,
      the test currently just invokes the functions and makes sure they do not crash.
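
      A minimal usage sketch (not part of the PR; the exact signatures are assumed to follow the `rocksdb_compact_range()`/`rocksdb_compact_range_cf()` pattern plus an error out-parameter, so double-check c.h):
      ```c
      #include <stdlib.h>
      #include "rocksdb/c.h"

      /* Hypothetical helper: hint that the range ["a", "z") may be worth compacting. */
      static void hint_compaction(rocksdb_t* db, rocksdb_column_family_handle_t* cf) {
        char* err = NULL;
        rocksdb_suggest_compact_range(db, "a", 1, "z", 1, &err);           /* default CF */
        if (err == NULL) {
          rocksdb_suggest_compact_range_cf(db, cf, "a", 1, "z", 1, &err);  /* explicit CF */
        }
        if (err != NULL) free(err);
      }
      ```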
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37305191
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0fe257b45914f6c9aeb985d8b1820dafc57a20db
    • Dynamically changeable `MemPurge` option (#10011) · 5879053f
      Baptiste Lemaire committed
      Summary:
      **Summary**
      Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled.
      
      **Motivation**
      RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not currently seem possible.
      Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement.
      
      **Content of this PR**
      This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the ability to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes.
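
      As an illustration (not taken from the PR), a caller could flip the threshold at runtime through the existing `rocksdb_set_options_cf()` entry point of the C API; the 0.0/1.0 values mirror the ones exercised in `db_stress`:
      ```c
      #include <stdlib.h>
      #include "rocksdb/c.h"

      /* Sketch: toggle mempurge for one column family without reopening the DB. */
      static void toggle_mempurge(rocksdb_t* db, rocksdb_column_family_handle_t* cf,
                                  int enable) {
        const char* keys[] = {"experimental_mempurge_threshold"};
        const char* vals[] = {enable ? "1.0" : "0.0"};
        char* err = NULL;
        rocksdb_set_options_cf(db, cf, 1, keys, vals, &err);
        if (err != NULL) free(err);  /* e.g. option name not recognized */
      }
      ```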
      
      **Benchmarking**
      I will add numbers within the next 12 hours to show that there is no performance impact.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011
      
      Reviewed By: pdillinger
      
      Differential Revision: D36462357
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802
  2. 23 June 2022, 1 commit
    • Add get_column_family_metadata() and related functions to C API (#10207) · e103b872
      Yueh-Hsuan Chiang committed
      Summary:
      * Add metadata related structs and functions in C API, including
        - `rocksdb_get_column_family_metadata()` and `rocksdb_get_column_family_metadata_cf()`
           that returns `rocksdb_column_family_metadata_t`.
        - `rocksdb_column_family_metadata_t` and its get functions & destroy function.
        - `rocksdb_level_metadata_t` and its get functions & destroy function.
        - `rocksdb_file_metadata_t` and its get functions & destroy function.
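
      A minimal sketch of how these might be used (getter names are inferred from the struct names above and may not match c.h exactly):
      ```c
      #include <stdio.h>
      #include "rocksdb/c.h"

      /* Sketch: print the size and file count of the default column family. */
      static void dump_cf_metadata(rocksdb_t* db) {
        rocksdb_column_family_metadata_t* md = rocksdb_get_column_family_metadata(db);
        printf("size=%llu files=%llu\n",
               (unsigned long long)rocksdb_column_family_metadata_get_size(md),
               (unsigned long long)rocksdb_column_family_metadata_get_file_count(md));
        rocksdb_column_family_metadata_destroy(md);
      }
      ```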
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10207
      
      Test Plan:
      Extend the existing c_test.c to include additional checks for column_family_metadata
      inside CheckCompaction.
      
      Reviewed By: riversand963
      
      Differential Revision: D37305209
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0a5183206353acde145f5f9b632c3bace670aa6e
  3. 15 June 2022, 1 commit
  4. 03 June 2022, 1 commit
    • Make it possible to enable blob files starting from a certain LSM tree level (#10077) · e6432dfd
      Gang Liao committed
      Summary:
      Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written, at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.
      
      In order to achieve this, we would like to do the following:
      - Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions`) and extend the related logic
      - Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
      - Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
      - Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
      - Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
      - Ideally extend the C and Java bindings with the new option
      - Update the BlobDB wiki to document the new option.
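
      For illustration, and assuming the C binding from the list above landed as `rocksdb_options_set_blob_file_starting_level()` (verify against c.h), the option could be configured like this:
      ```c
      #include "rocksdb/c.h"

      /* Sketch: keep flush/L0 output blob-free; extract large values only for
       * compactions whose output level is >= 1. */
      static void blobs_from_l1_up(rocksdb_options_t* opts) {
        rocksdb_options_set_enable_blob_files(opts, 1);
        rocksdb_options_set_blob_file_starting_level(opts, 1);
      }
      ```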
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077
      
      Reviewed By: ltamasi
      
      Differential Revision: D36884156
      
      Pulled By: gangliao
      
      fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
  5. 27 May 2022, 1 commit
  6. 26 May 2022, 2 commits
    • Expose DisableManualCompaction and EnableManualCompaction to C api (#10052) · 4cf2f672
      Jie Liang Ang committed
      Summary:
      Add `rocksdb_disable_manual_compaction` and `rocksdb_enable_manual_compaction`.
      
      Note that `rocksdb_enable_manual_compaction` should be used with care and must not be called more times than `rocksdb_disable_manual_compaction` has been called.
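
      A small sketch of the intended pairing (assuming both wrappers take just the DB handle, mirroring DB::DisableManualCompaction()):
      ```c
      #include "rocksdb/c.h"

      /* Sketch: pause manual compactions around a critical section, keeping the
       * disable/enable calls balanced as noted above. */
      static void with_manual_compaction_paused(rocksdb_t* db,
                                                void (*critical_section)(rocksdb_t*)) {
        rocksdb_disable_manual_compaction(db);
        critical_section(db);
        rocksdb_enable_manual_compaction(db);
      }
      ```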
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10052
      
      Reviewed By: ajkr
      
      Differential Revision: D36665496
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: a4ae6e34694066feb21302ca1a5c365fb9de0ec7
    • Improve transaction C-API (#9252) · b71466e9
      Yiyuan Liu committed
      Summary:
      This PR improves transaction support in the C API:
      * Support two-phase commit.
      * Support `get_pinned` and `multi_get` in transaction.
      * Add `rocksdb_transactiondb_flush`
      * Support get writebatch from transaction and rebuild transaction from writebatch.
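
      A rough sketch of how the new pieces might fit together (the function names follow the summary above, but the exact signatures here are assumptions; check c.h):
      ```c
      #include <stdlib.h>
      #include "rocksdb/c.h"

      /* Sketch: pinned read inside a transaction, then a two-phase commit.
       * Assumes the transaction was created against a TransactionDB set up for 2PC. */
      static void read_then_two_phase_commit(rocksdb_transaction_t* txn,
                                             const rocksdb_readoptions_t* ro) {
        char* err = NULL;
        rocksdb_pinnableslice_t* v =
            rocksdb_transaction_get_pinned(txn, ro, "key", 3, &err);
        if (v != NULL) rocksdb_pinnableslice_destroy(v);
        if (err != NULL) { free(err); return; }
        rocksdb_transaction_prepare(txn, &err);                   /* phase one */
        if (err == NULL) rocksdb_transaction_commit(txn, &err);   /* phase two */
        if (err != NULL) free(err);
      }
      ```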
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9252
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D36459007
      
      Pulled By: riversand963
      
      fbshipit-source-id: 47371d527be821c496353a7fe2fd18d628069a98
  7. 21 May 2022, 2 commits
    • Seek parallelization (#9994) · 2db6a4a1
      Akanksha Mahajan committed
      Summary:
      The RocksDB iterator is a hierarchy of iterators. MergingIterator maintains a heap of LevelIterators, one for each L0 file and for each non-zero level. The Seek() operation naturally lends itself to parallelization, as it involves positioning every LevelIterator on the correct data block in the correct SST file. It looks up a level for a target key to find the first key that is >= the target key. This typically involves reading one data block that is likely to contain the target key, and scanning forward to find the first valid key. The forward scan may read more data blocks. In order to find the right data block, the iterator may read some metadata blocks (required for opening a file and searching the index).
      This flow can be parallelized.
      
      Design: Seek will be called two times under the async_io option. The first seek sends asynchronous requests to prefetch the data blocks at each level, and the second seek follows the normal flow; in FilePrefetchBuffer::TryReadFromCacheAsync it waits for Poll() to return the results and adds the iterator to min_heap.
      - Status::TryAgain is passed down from FilePrefetchBuffer::PrefetchAsync to block_iter_.Status, indicating an asynchronous request has been submitted.
      - If for some reason the asynchronous request fails to be submitted, it falls back to sequential reading of blocks in one pass.
      - If the data already exists in prefetch_buffer, it is returned without further prefetching and is treated as a single pass of seek.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9994
      
      Test Plan:
      - **Run Regressions.**
      ```
      ./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216
      ```
      i) Previous release 7.0 run for normal prefetching with async_io disabled:
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.0
      Date:       Thu Mar 17 13:11:34 2022
      CPU:        24 * Intel Core Processor (Broadwell)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
      ```
      
      ii) Normal prefetching after changes with async_io disabled:
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Set seed to 1652922591315307 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.3
      Date:       Wed May 18 18:09:51 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483080.466 micros/op 2 ops/sec 120.287 seconds 249 operations;  340.8 MB/s (249 of 249 found)
      ```
      iii) db_bench with async_io enabled completed successfully
      
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 -async_io=1 -adaptive_readahead=1
      Set seed to 1652924062021732 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.3
      Date:       Wed May 18 18:34:22 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  553913.576 micros/op 1 ops/sec 120.199 seconds 217 operations;  293.6 MB/s (217 of 217 found)
      ```
      
      - db_stress with async_io disabled completed successfully
      ```
       export CRASH_TEST_EXT_ARGS=" --async_io=0"
       make crash_test -j
      ```
      
      **In Progress**: db_stress with async_io is failing; debugging/fixing it is in progress.
      
      Reviewed By: anand1976
      
      Differential Revision: D36459323
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: abb1cd944abe712bae3986ae5b16704b3338917c
    • Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857) · cc23b46d
      Changyu Bi committed
      Summary:
      An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API, ZDICT_finalizeDictionary(), can improve such a dictionary's effectiveness at low cost. This PR changes how the dictionary is created: it calls the ZSTD ZDICT_finalizeDictionary() API instead of creating a raw content dictionary (when max_dict_buffer_bytes > 0), and passes in all buffered uncompressed data blocks as samples.
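
      As a configuration sketch, the "FinalizeDictionary" mode used in the benchmarks below corresponds roughly to the following; the last setter's name is assumed to mirror the new `use_zstd_dict_trainer` option, so check c.h:
      ```c
      #include "rocksdb/c.h"

      /* Sketch: ZSTD with dictionary compression, using ZDICT_finalizeDictionary
       * instead of the full trainer (matches the FinalizeDictionary runs below). */
      static void use_finalized_dictionary(rocksdb_options_t* opts) {
        rocksdb_options_set_compression(opts, rocksdb_zstd_compression);
        /* window_bits, level, strategy, max_dict_bytes */
        rocksdb_options_set_compression_options(opts, -14, 3, 0, 16384);
        rocksdb_options_set_compression_options_zstd_max_train_bytes(opts, 1048576);
        rocksdb_options_set_compression_options_use_zstd_dict_trainer(opts, 0);
      }
      ```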
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857
      
      Test Plan:
      #### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data:
      Set up: change the parameter [here](https://github.com/facebook/rocksdb/blob/fb9a167a55e0970b1ef6f67c1600c8d9c4c6114f/tools/db_bench_tool.cc#L1766) to 16384 to make synthetic data more compressible.
      ```
      # linked local ZSTD with version 1.5.2
      # DEBUG_LEVEL=0 ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_ZSTD=1  EXTRA_CXXFLAGS="-DZSTD_STATIC_LINKING_ONLY -DZSTD -I/data/users/changyubi/install/include/" EXTRA_LDFLAGS="-L/data/users/changyubi/install/lib/ -l:libzstd.a" make -j32 db_bench
      
      dict_bytes=16384
      train_bytes=1048576
      echo "========== No Dictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== Raw Content Dictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench_main -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== FinalizeDictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== TrainDictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      # Result: TrainDictionary is much better on space saving, but FinalizeDictionary seems to use less memory.
      # before compression data size: 1.2GB
      dict_bytes=16384
      max_dict_buffer_bytes =  1048576
                          space   cpu/memory
      No Dictionary       468M    14.93user 1.00system 0:15.92elapsed 100%CPU (0avgtext+0avgdata 23904maxresident)k
      Raw Dictionary      251M    15.81user 0.80system 0:16.56elapsed 100%CPU (0avgtext+0avgdata 156808maxresident)k
      FinalizeDictionary  236M    11.93user 0.64system 0:12.56elapsed 100%CPU (0avgtext+0avgdata 89548maxresident)k
      TrainDictionary     84M     7.29user 0.45system 0:07.75elapsed 100%CPU (0avgtext+0avgdata 97288maxresident)k
      ```
      
      #### Benchmark on 10 sample SST files for space saving and CPU time on compression:
      FinalizeDictionary is comparable to TrainDictionary in terms of space saving, and takes less time in compression.
      ```
      dict_bytes=16384
      train_bytes=1048576
      
      for sst_file in `ls ../temp/myrock-sst/`
      do
        echo "********** $sst_file **********"
        echo "========== No Dictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD
      
        echo "========== Raw Content Dictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes
      
        echo "========== FinalizeDictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes --compression_use_zstd_finalize_dict
      
        echo "========== TrainDictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes
      done
      
                               010240.sst (Size/Time) 011029.sst              013184.sst              021552.sst              185054.sst              185137.sst              191666.sst              7560381.sst             7604174.sst             7635312.sst
      No Dictionary           28165569 / 2614419      32899411 / 2976832      32977848 / 3055542      31966329 / 2004590      33614351 / 1755877      33429029 / 1717042      33611933 / 1776936      33634045 / 2771417      33789721 / 2205414      33592194 / 388254
      Raw Content Dictionary  28019950 / 2697961      33748665 / 3572422      33896373 / 3534701      26418431 / 2259658      28560825 / 1839168      28455030 / 1846039      28494319 / 1861349      32391599 / 3095649      33772142 / 2407843      33592230 / 474523
      FinalizeDictionary      27896012 / 2650029      33763886 / 3719427      33904283 / 3552793      26008225 / 2198033      28111872 / 1869530      28014374 / 1789771      28047706 / 1848300      32296254 / 3204027      33698698 / 2381468      33592344 / 517433
      TrainDictionary         28046089 / 2740037      33706480 / 3679019      33885741 / 3629351      25087123 / 2204558      27194353 / 1970207      27234229 / 1896811      27166710 / 1903119      32011041 / 3322315      32730692 / 2406146      33608631 / 570593
      ```
      
      #### Decompression/Read test:
      With FinalizeDictionary/TrainDictionary, some data structures used for decompression are stored in the dictionary, so decompression/reads are expected to be faster.
      ```
      dict_bytes=16384
      train_bytes=1048576
      echo "No Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=0 > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=0 2>&1 | grep MB/s
      
      echo "Raw Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd  -compression_max_dict_bytes=$dict_bytes 2>&1 | grep MB/s
      
      echo "FinalizeDict"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false  > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false 2>&1 | grep MB/s
      
      echo "Train Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes 2>&1 | grep MB/s
      
      No Dictionary
      readrandom   :      12.183 micros/op 82082 ops/sec 12.183 seconds 1000000 operations;    9.1 MB/s (1000000 of 1000000 found)
      Raw Dictionary
      readrandom   :      12.314 micros/op 81205 ops/sec 12.314 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
      FinalizeDict
      readrandom   :       9.787 micros/op 102180 ops/sec 9.787 seconds 1000000 operations;   11.3 MB/s (1000000 of 1000000 found)
      Train Dictionary
      readrandom   :       9.698 micros/op 103108 ops/sec 9.699 seconds 1000000 operations;   11.4 MB/s (1000000 of 1000000 found)
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D35720026
      
      Pulled By: cbi42
      
      fbshipit-source-id: 24d230fdff0fd28a1bb650658798f00dfcfb2a1f
  8. 13 May 2022, 1 commit
    • Port the batched version of MultiGet() to RocksDB's C API (#9952) · bcb12872
      Yueh-Hsuan Chiang committed
      Summary:
      The batched version of MultiGet() is not available in RocksDB's C API.
      This PR implements rocksdb_batched_multi_get_cf, a C wrapper function
      that invokes the batched version of MultiGet() for a single column family.
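
      A usage sketch (the parameter order is assumed from the summary above; check c.h for the exact signature):
      ```c
      #include <stdlib.h>
      #include "rocksdb/c.h"

      /* Sketch: fetch two keys from one column family in a single batched call. */
      static void batched_lookup(rocksdb_t* db, const rocksdb_readoptions_t* ro,
                                 rocksdb_column_family_handle_t* cf) {
        const char* keys[2] = {"k1", "k2"};
        const size_t key_sizes[2] = {2, 2};
        rocksdb_pinnableslice_t* values[2] = {NULL, NULL};
        char* errs[2] = {NULL, NULL};
        rocksdb_batched_multi_get_cf(db, ro, cf, 2, keys, key_sizes, values, errs,
                                     /*sorted_input=*/0);
        for (int i = 0; i < 2; ++i) {
          if (values[i] != NULL) rocksdb_pinnableslice_destroy(values[i]);
          if (errs[i] != NULL) free(errs[i]);
        }
      }
      ```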
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9952
      
      Test Plan: Added a new test case under "columnfamilies" test case in c_test.cc
      
      Reviewed By: riversand963
      
      Differential Revision: D36302888
      
      Pulled By: ajkr
      
      fbshipit-source-id: fa134c4a1c8e7d72dd4ae8649a74e3797b5cf4e6
  9. 20 April 2022, 1 commit
  10. 24 March 2022, 1 commit
    • Fix a major performance bug in 7.0 re: filter compatibility (#9736) · 91687d70
      Peter Dillinger committed
      Summary:
      Bloom filters generated by pre-7.0 releases are not read by
      7.0.x releases (and vice-versa) due to changes to FilterPolicy::Name()
      in https://github.com/facebook/rocksdb/issues/9590. This can severely impact read performance and read I/O on
      upgrade or downgrade with existing DB, but not data correctness.
      
      To fix, we go back to using the old, unified name in SST metadata but (for
      a while anyway) recognize the aliases that could be generated by early
      7.0.x releases. This unfortunately requires a public API change to avoid
      interfering with all the good changes from https://github.com/facebook/rocksdb/issues/9590, but the API change
      only affects users with custom FilterPolicy, which should be very few.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9736
      
      Test Plan:
      manual
      
      Generate DBs with
      ```
      ./db_bench.7.0 -db=/dev/shm/rocksdb.7.0 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0
      ```
      and similar. Compare with
      ```
      for IMPL in 6.29 7.0 fixed; do for DB in 6.29 7.0 fixed; do echo "Testing $IMPL on $DB:"; ./db_bench.$IMPL -db=/dev/shm/rocksdb.$DB -use_existing_db -readonly -bloom_bits=10 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=10 2>&1 | grep micros/op; done; done
      ```
      
      Results:
      ```
      Testing 6.29 on 6.29:
      readrandom   :      34.381 micros/op 29085 ops/sec;    3.2 MB/s (291999 of 291999 found)
      Testing 6.29 on 7.0:
      readrandom   :     190.443 micros/op 5249 ops/sec;    0.6 MB/s (52999 of 52999 found)
      Testing 6.29 on fixed:
      readrandom   :      40.148 micros/op 24907 ops/sec;    2.8 MB/s (249999 of 249999 found)
      Testing 7.0 on 6.29:
      readrandom   :     229.430 micros/op 4357 ops/sec;    0.5 MB/s (43999 of 43999 found)
      Testing 7.0 on 7.0:
      readrandom   :      33.348 micros/op 29986 ops/sec;    3.3 MB/s (299999 of 299999 found)
      Testing 7.0 on fixed:
      readrandom   :     152.734 micros/op 6546 ops/sec;    0.7 MB/s (65999 of 65999 found)
      Testing fixed on 6.29:
      readrandom   :      32.024 micros/op 31224 ops/sec;    3.5 MB/s (312999 of 312999 found)
      Testing fixed on 7.0:
      readrandom   :      33.990 micros/op 29390 ops/sec;    3.3 MB/s (294999 of 294999 found)
      Testing fixed on fixed:
      readrandom   :      28.714 micros/op 34825 ops/sec;    3.9 MB/s (348999 of 348999 found)
      ```
      
      Just paying attention to the order of magnitude of ops/sec (short test
      durations, lots of noise), it's clear that with the fix we can read both <= 6.29
      and >= 7.0 DBs at full speed, which neither 6.29 nor 7.0 can do. And the 6.29
      release can properly read the fixed DB at full speed.
      
      Reviewed By: siying, ajkr
      
      Differential Revision: D35057844
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a46893a6af4bf084375ebe4728066d00eb08f050
  11. 02 March 2022, 1 commit
  12. 12 February 2022, 1 commit
    • Fix failure in c_test (#9547) · 5c53b900
      Akanksha Mahajan committed
      Summary:
      When tests are run with TMPD, c_test may fail because TMPD
      is not created by the test. It results in IO error: No such file
      or directory: While mkdir if missing:
      /tmp/rocksdb_test_tmp/rocksdb_c_test-0: No such file or directory
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9547
      
      Test Plan:
      make -j32 c_test;
       TEST_TMPDIR=/tmp/rocksdb_test  ./c_test
      
      Reviewed By: riversand963
      
      Differential Revision: D34173298
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 5b5a01f5b842c2487b05b0708c8e9532241db7f8
  13. 09 February 2022, 1 commit
    • FilterPolicy API changes for 7.0 (#9501) · 68a9c186
      Peter Dillinger committed
      Summary:
      * Inefficient block-based filter is no longer customizable in the public
      API, though (for now) it can still be enabled.
        * Removed deprecated FilterPolicy::CreateFilter() and
        FilterPolicy::KeyMayMatch()
        * Removed `rocksdb_filterpolicy_create()` from C API
      * Change meaning of nullptr return from GetBuilderWithContext() from "use
      block-based filter" to "generate no filter in this case." This is a
      cleaner solution to the proposal in https://github.com/facebook/rocksdb/issues/8250.
        * Also, when user specifies bits_per_key < 0.5, we now round this down
        to "no filter" because we expect a filter with >= 80% FP rate is
        unlikely to be worth the CPU cost of accessing it (esp with
        cache_index_and_filter_blocks=1 or partition_filters=1).
        * bits_per_key >= 0.5 and < 1.0 is still rounded up to 1.0 (for 62% FP
        rate)
        * This also gives us some support for configuring filters from OPTIONS
        file as currently saved: `filter_policy=rocksdb.BuiltinBloomFilter`.
        Opening from such an options file will enable reading filters (an
        improvement) but not writing new ones. (See Customizable follow-up
        below.)
      * Also removed deprecated functions
        * FilterBitsBuilder::CalculateNumEntry()
        * FilterPolicy::GetFilterBitsBuilder()
        * NewExperimentalRibbonFilterPolicy()
      * Remove default implementations of
        * FilterBitsBuilder::EstimateEntriesAdded()
        * FilterBitsBuilder::ApproximateNumEntries()
        * FilterPolicy::GetBuilderWithContext()
      * Remove support for "filter_policy=experimental_ribbon" configuration
      string.
      * Allow "filter_policy=bloomfilter:n" without bool to discourage use of
      block-based filter.
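
      With `rocksdb_filterpolicy_create()` gone, a sketch of the remaining supported way to configure a Bloom filter through the C API (bits per key is a double since the Ribbon/Bloom hybrid change):
      ```c
      #include "rocksdb/c.h"

      /* Sketch: attach a ~10 bits/key Bloom filter via block-based table options. */
      static void set_bloom_filter(rocksdb_options_t* opts) {
        rocksdb_block_based_table_options_t* bbto =
            rocksdb_block_based_options_create();
        rocksdb_block_based_options_set_filter_policy(
            bbto, rocksdb_filterpolicy_create_bloom_full(10.0));
        rocksdb_options_set_block_based_table_factory(opts, bbto);
        rocksdb_block_based_options_destroy(bbto);
      }
      ```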
      
      Some pieces for https://github.com/facebook/rocksdb/issues/9389
      
      Likely follow-up (later PRs):
      * Refactoring toward FilterPolicy Customizable, so that we can generate
      filters with same configuration as before when configuring from options
      file.
      * Remove support for user enabling block-based filter (ignore `bool
      use_block_based_builder`)
        * Some months after this change, we could even remove read support for
        block-based filter, because it is not critical to DB data
        preservation.
      * Make FilterBitsBuilder::FinishV2 to avoid `using
      FilterBitsBuilder::Finish` mess and add support for specifying a
      MemoryAllocator (for cache warming)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9501
      
      Test Plan:
      A number of obsolete tests deleted and new tests or test
      cases added or updated.
      
      Reviewed By: hx235
      
      Differential Revision: D34008011
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a39a720457c354e00d5b59166b686f7f59e392aa
  14. 04 February 2022, 1 commit
  15. 29 January 2022, 1 commit
    • Remove deprecated API AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds (#9455) · 42cca28e
      Hui Xiao committed
      Summary:
      **Context/Summary:**
      AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds has been marked as deprecated and it's time to actually remove the code.
      - Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an options file that still contains these options (e.g., an old options file generated by RocksDB before the deprecation)
      - Keep `soft_rate_limit`/`hard_rate_limit` under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9455
      
      Test Plan: Rely on my eyeball and CI
      
      Reviewed By: ajkr
      
      Differential Revision: D33811664
      
      Pulled By: hx235
      
      fbshipit-source-id: 866859427fe710354a90f1095057f80116365ff0
  16. 28 January 2022, 5 commits
  17. 27 January 2022, 2 commits
  18. 13 January 2022, 1 commit
  19. 31 December 2021, 1 commit
    • Fix a bug in C-binding causing iterator to return incorrect result (#9343) · 677d2b4a
      Yanqin Jin committed
      Summary:
      Fixes https://github.com/facebook/rocksdb/issues/9339
      
      When writing an SST file, the name, computed as `prefix_extractor->GetId()`, will be written to the properties block.
      When the SST is opened again in the future, `CreateFromString()` will take the name as an argument and try
      to create a prefix extractor object. Without this fix, the C API will pass a `Wrapper` pointer to the underlying
      DB's `prefix_extractor`. `Wrapper::GetId()`, in this case, will be missing the prefix length component, causing a
      prefix extractor of length 0 to be silently created and used.
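
      For context, a sketch of the configuration the bug affected; before the fix, the ID recorded in the SST properties for such an extractor lacked the prefix length:
      ```c
      #include "rocksdb/c.h"

      /* Sketch: a fixed-length prefix extractor set through the C API.
       * The options object takes ownership of the slice transform. */
      static void set_prefix_extractor(rocksdb_options_t* opts) {
        rocksdb_slicetransform_t* st = rocksdb_slicetransform_create_fixed_prefix(3);
        rocksdb_options_set_prefix_extractor(opts, st);
      }
      ```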
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9343
      
      Test Plan:
      ```
      make c_test
      ./c_test
      ```
      
      Reviewed By: mrambacher
      
      Differential Revision: D33355549
      
      Pulled By: riversand963
      
      fbshipit-source-id: c92c3acd8be262c3bff8794b4229e42b9ee31203
  20. 01 December 2021, 1 commit
  21. 20 November 2021, 1 commit
    • Support readahead during compaction for blob files (#9187) · dc5de45a
      Levi Tamasi committed
      Summary:
      The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
      that can be used to enable prefetching data from blob files during compaction.
      This is important when using storage with higher latencies like HDDs or remote filesystems.
      If enabled, prefetching is used for all cases when blobs are read during compaction,
      namely garbage collection, compaction filters (when the existing value has to be read from
      a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
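
      A configuration sketch, assuming the matching C binding exists (the option itself is part of the column family options):
      ```c
      #include "rocksdb/c.h"

      /* Sketch: read ahead 2 MiB when fetching blob values during compaction,
       * which helps on HDDs and remote filesystems. */
      static void enable_blob_compaction_readahead(rocksdb_options_t* opts) {
        rocksdb_options_set_blob_compaction_readahead_size(opts, 2 * 1024 * 1024);
      }
      ```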
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187
      
      Test Plan: Ran `make check` and the stress/crash test.
      
      Reviewed By: riversand963
      
      Differential Revision: D32565512
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
  22. 12 October 2021, 1 commit
    • Make it possible to force the garbage collection of the oldest blob files (#8994) · 3e1bf771
      Levi Tamasi committed
      Summary:
      The current BlobDB garbage collection logic works by relocating the valid
      blobs from the oldest blob files as they are encountered during compaction,
      and cleaning up blob files once they contain nothing but garbage. However,
      with sufficiently skewed workloads, it is theoretically possible to end up in a
      situation when few or no compactions get scheduled for the SST files that contain
      references to the oldest blob files, which can lead to increased space amp due
      to the lack of GC.
      
      In order to efficiently handle such workloads, the patch adds a new BlobDB
      configuration option called `blob_garbage_collection_force_threshold`,
      which signals to BlobDB to schedule targeted compactions for the SST files
      that keep alive the oldest batch of blob files if the overall ratio of garbage in
      the given blob files meets the threshold *and* all the given blob files are
      eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
      if the new option is set to 0.9, targeted compactions will get scheduled if the
      sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
      oldest blob files, assuming all affected blob files are below the age-based cutoff.)
      The net result of these targeted compactions is that the valid blobs in the oldest
      blob files are relocated and the oldest blob files themselves cleaned up (since
      *all* SST files that rely on them get compacted away).
      
      These targeted compactions are similar to periodic compactions in the sense
      that they force certain SST files that otherwise would not get picked up to undergo
      compaction and also in the sense that instead of merging files from multiple levels,
      they target a single file. (Note: such compactions might still include neighboring files
      from the same level due to the need of having a "clean cut" boundary but they never
      include any files from any other level.)
      
      This functionality is currently only supported with the leveled compaction style
      and is inactive by default (since the default value is set to 1.0, i.e. 100%).
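
      A configuration sketch (binding names assumed to follow the blob options already exposed in the C API; check c.h):
      ```c
      #include "rocksdb/c.h"

      /* Sketch: force targeted compactions once the oldest eligible blob files are
       * at least 90% garbage. */
      static void enable_forced_blob_gc(rocksdb_options_t* opts) {
        rocksdb_options_set_enable_blob_gc(opts, 1);
        rocksdb_options_set_blob_gc_age_cutoff(opts, 0.25);      /* oldest 25% eligible */
        rocksdb_options_set_blob_gc_force_threshold(opts, 0.9);  /* the new option */
      }
      ```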
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994
      
      Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.
      
      Reviewed By: riversand963
      
      Differential Revision: D31489850
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
  23. 21 August 2021, 1 commit
    • Add Bloom/Ribbon hybrid API support (#8679) · 2a383f21
      Peter Dillinger committed
      Summary:
      This is essentially resurrection and fixing of the part of
      https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
      when configuring Ribbon filter, you can specify an LSM level before which
      Bloom will be used instead of Ribbon. But Bloom is only considered for
      Leveled and Universal compaction styles and files going into a known LSM
      level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
      you would expect with NewRibbonFilterPolicy.
      
      So that this can be controlled with a single int value and so that flushes
      can be distinguished from intra-L0, we consider flush to go to level -1 for
      the purposes of this option. (Explained in API comment.)
      
      I also expect the most common and recommended Ribbon configuration to
      use Bloom during flush, to minimize slowing down writes and because according
      to my estimates, Ribbon only pays off if the structure lives in memory for
      more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
      to be this mild hybrid configuration. I don't really want to add something like
      NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
      flush, Ribbon otherwise) should be considered a natural choice.
      
      C APIs also updated, but because they don't support overloading,
      rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
      rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
      configuration. While touching C API, I changed bits per key options from
      int to double.
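
      A sketch of the hybrid configuration through the C API (argument order assumed: bloom-equivalent bits per key, then the level before which Bloom is used):
      ```c
      #include "rocksdb/c.h"

      /* Sketch: ~10 bloom-equivalent bits/key; Bloom for levels < 0 (i.e. flushes),
       * Ribbon for L0 and above. */
      static rocksdb_filterpolicy_t* make_hybrid_policy(void) {
        return rocksdb_filterpolicy_create_ribbon_hybrid(10.0, 0);
      }
      ```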
      
      BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
      unused fields from BloomFilterPolicy.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679
      
      Test Plan: new + updated tests, including crash test
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D30445797
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
  24. 11 August 2021, 1 commit
    • Memtable sampling for mempurge heuristic. (#8628) · e3a96c48
      Baptiste Lemaire committed
      Summary:
      Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
      This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then comparing this useful payload estimate with the `double experimental_mempurge_threshold` value.
      Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
      At the moment, a `double experimental_mempurge_threshold` value other than 0.0 or `DBL_MAX` is only supported with the `SkipList` memtable representation.
      Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
      The unit tests have been readapted to support this new API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628
      
      Reviewed By: pdillinger
      
      Differential Revision: D30149315
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
  25. 07 August 2021, 1 commit
  26. 10 July 2021, 1 commit
  27. 02 July 2021, 1 commit
    • Memtable "MemPurge" prototype (#8454) · 9dc887ec
      Baptiste Lemaire committed
      Summary:
      Implement an experimental feature called "MemPurge", which consists of purging "garbage" bytes out of a memtable and reusing the memtable struct instead of making it immutable and eventually flushing its content to storage.
      The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads. Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time.
      Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and check that the previously mentioned operations are fully supported and thoroughly tested).
      One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454
      
      Reviewed By: anand1976
      
      Differential Revision: D29433971
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
  28. 18 May 2021, 1 commit
  29. 28 April 2021, 2 commits
  30. 23 April 2021, 1 commit
  31. 16 April 2021, 1 commit
    • Add Blob Options to C API (#8148) · 4c41e51c
      mrambacher committed
      Summary:
      Added the Blob option settings from AdvancedColumnFamilyOptions to the C API.

      There are no tests for getting/setting options in the C API currently, hence no specific test plan. Should there be one?
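
      A sketch of the kind of setters this adds (names follow the C++ blob options; verify the exact spellings in c.h):
      ```c
      #include "rocksdb/c.h"

      /* Sketch: store values >= 4 KiB in blob files, LZ4-compressed, with GC on. */
      static void configure_blob_files(rocksdb_options_t* opts) {
        rocksdb_options_set_enable_blob_files(opts, 1);
        rocksdb_options_set_min_blob_size(opts, 4096);
        rocksdb_options_set_blob_file_size(opts, 256 * 1024 * 1024);
        rocksdb_options_set_blob_compression_type(opts, rocksdb_lz4_compression);
        rocksdb_options_set_enable_blob_gc(opts, 1);
        rocksdb_options_set_blob_gc_age_cutoff(opts, 0.25);
      }
      ```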
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8148
      
      Reviewed By: ltamasi
      
      Differential Revision: D27568495
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 3a52b784467ea2c4bc58be5f75c5d41f0a5c55d6