  1. 28 Jan 2023 (2 commits)
    • Remove RocksDB LITE (#11147) · 4720ba43
      Committed by sdong
      Summary:
      We haven't been actively maintaining RocksDB LITE recently, and its size must have gone up significantly. We are removing the support.
      
      Most of the changes were made with the following command, run by Peter Dillinger:
      
      unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
      
      Other changes were applied manually to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used inside an expression, and the file db_stress_test_base.cc.
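      
      For illustration, here is a minimal, hypothetical example of the kind of rewrite `unifdef` performs (the file and function names here are made up):
      
      ```cpp
      // before.cc -- hypothetical ROCKSDB_LITE guard
      #include <cstdio>
      
      #ifndef ROCKSDB_LITE
      void FeatureStatus() { std::printf("full build\n"); }
      #else
      void FeatureStatus() { std::printf("LITE stub\n"); }
      #endif  // ROCKSDB_LITE
      
      int main() { FeatureStatus(); }
      ```
      
      Running `unifdef -m -UROCKSDB_LITE before.cc` rewrites the file in place with ROCKSDB_LITE treated as undefined: the `#ifndef` branch is kept, while the `#else` branch and the preprocessor directives themselves are deleted.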
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
      
      Test Plan: See CI
      
      Reviewed By: pdillinger
      
      Differential Revision: D42796341
      
      fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
    • Remove deprecated util functions in options_util.h (#11126) · 6943ff6e
      Committed by Yu Zhang
      Summary:
      Remove the util functions in options_util.h that have previously been marked deprecated.
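      
      For context, a minimal sketch of using the surviving `ConfigOptions`-based overload (the helper name `LoadExample` is made up):
      
      ```cpp
      #include <string>
      #include <vector>
      #include "rocksdb/utilities/options_util.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      // Load the most recent OPTIONS file for a DB path; this overload, which
      // takes ConfigOptions, is the one that remains after this cleanup.
      Status LoadExample(const std::string& db_path) {
        ConfigOptions config_options;
        DBOptions db_opts;
        std::vector<ColumnFamilyDescriptor> cf_descs;
        return LoadLatestOptions(config_options, db_path, &db_opts, &cf_descs);
      }
      ```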
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/11126
      
      Test Plan: `make check`
      
      Reviewed By: ltamasi
      
      Differential Revision: D42757496
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 2a138a3c207d0e0e0bbb4d99548cf2cadb44bcfb
  2. 25 Jan 2023 (1 commit)
  3. 05 Jan 2023 (1 commit)
  4. 31 Dec 2022 (1 commit)
  5. 03 Nov 2022 (1 commit)
    • Ran clang-format on db/ directory (#10910) · 5cf6ab6f
      Committed by Andrew Kryczka
      Summary:
      Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under the db/ directory. Included minor manual touchups mentioned in the PR commit history.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910
      
      Reviewed By: riversand963
      
      Differential Revision: D40880683
      
      Pulled By: ajkr
      
      fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
  6. 23 Sep 2022 (1 commit)
  7. 03 Sep 2022 (1 commit)
    • Skip swaths of range tombstone covered keys in merging iterator (2022 edition) (#10449) · 30bc495c
      Committed by Changyu Bi
      Summary:
      Delete range logic is moved from `DBIter` to `MergingIterator`, and `MergingIterator` will seek to the end of a range deletion if possible, instead of scanning through each key and checking it against `RangeDelAggregator`.
      
      With the invariant that a key in level L (consider the memtable as the first level, and each immutable memtable and each L0 file as a separate level) has a larger sequence number than all keys in any level >L, a range tombstone `[start, end)` from level L covers all keys in its range in any level >L. This property motivates the following optimizations in the iterator:
      - in `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do Seek() on `end` instead of `target` to skip some range tombstone covered keys.
      - in `Next()/Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can do `Seek` to `end` for all levels > L.
      
      This PR implements the above optimizations in `MergingIterator`. As all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to https://github.com/facebook/rocksdb/issues/7317, but this PR leaves `InternalIterator` interface mostly unchanged. **Credit**: the cascading seek optimization and the sentinel key (discussed below) are inspired by [Pebble](https://github.com/cockroachdb/pebble/blob/master/merging_iter.go) and suggested by ajkr in https://github.com/facebook/rocksdb/issues/7317. The two optimizations are mostly implemented in `SeekImpl()/SeekForPrevImpl()` and `IsNextDeleted()/IsPrevDeleted()` in `merging_iterator.cc`. See comments for each method for more detail.
      
      One notable change is that the minHeap/maxHeap used by `MergingIterator` now contains range tombstone end keys besides point key iterators. This helps to reduce the number of key comparisons. For example, for a range tombstone `[start, end)`, a `start` and an `end` `HeapItem` are inserted into the heap. When a `HeapItem` for a range tombstone start key is popped from the minHeap, we know this range tombstone becomes "active" in the sense that, before the range tombstone's end key is popped from the minHeap, all the keys popped from this heap are covered by the range tombstone's internal key range `[start, end)`.
      
      Another major change, *delete range sentinel key*, is made to `LevelIterator`. Before this PR, when all point keys in an SST file were iterated through in `MergingIterator`, a level iterator would advance to the next SST file in its level. In the case when an SST file has a range tombstone that covers keys beyond the SST file's last point key, advancing to the next SST file would lose this range tombstone. Consequently, `MergingIterator` could return keys that should have been deleted by some range tombstone. We prevent this by pretending that file boundaries in each SST file are sentinel keys. A `LevelIterator` now only advances the file iterator once the sentinel key is processed.
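      
      To make the cascading seek concrete, here is a self-contained toy model (not RocksDB code; all types and names here are made up) of the idea that a covering tombstone found at level L pushes the seek target forward for all deeper levels:
      
      ```cpp
      #include <map>
      #include <string>
      #include <utility>
      #include <vector>
      
      // Toy level: sorted point keys plus [start, end) range tombstones.
      struct Level {
        std::map<std::string, std::string> keys;
        std::vector<std::pair<std::string, std::string>> dels;
      };
      
      // End key of a tombstone in `lvl` covering `target`, or "" if none.
      static std::string CoveringTombstoneEnd(const Level& lvl,
                                              const std::string& target) {
        for (const auto& [start, end] : lvl.dels) {
          if (start <= target && target < end) return end;
        }
        return "";
      }
      
      // Position every level for Seek(target). A tombstone at level L does not
      // affect level L's own seek, but advances the target for all levels > L.
      static std::vector<std::map<std::string, std::string>::const_iterator>
      SeekAll(const std::vector<Level>& levels, std::string target) {
        std::vector<std::map<std::string, std::string>::const_iterator> pos;
        for (const auto& lvl : levels) {
          pos.push_back(lvl.keys.lower_bound(target));
          std::string end = CoveringTombstoneEnd(lvl, target);
          if (!end.empty() && end > target) target = end;  // cascade downward
        }
        return pos;
      }
      ```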
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10449
      
      Test Plan:
      - Added many unit tests in db_range_del_test
      - Stress test: `./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 -delpercent=10 --iterpercent=44 --delrangepercent=2`
      - An additional iterator stress test was added to verify iterators against the expected state: https://github.com/facebook/rocksdb/issues/10538. This is based on ajkr's previous attempt https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913.
      
      ```
      python3 ./tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --max_background_compactions=8 --value_size_mult=33 --max_key=5000000 --interval=10 --duration=7200 --delrangepercent=3 --delpercent=9 --iterpercent=25 --writepercent=60 --readpercent=3 --prefixpercent=0 --num_iterations=1000 --range_deletion_width=100 --verify_iterator_with_expected_state_one_in=1
      ```
      
      - Performance benchmark: I used a similar setup as in the blog [post](http://rocksdb.org/blog/2018/11/21/delete-range.html) that introduced DeleteRange, "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written".  As expected, the performance with this PR depends on the range tombstone width.
      ```
      # Setup:
      TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000
      TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50
      
      # Scan entire DB
      TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true
      
      # Short range scan (10 Next())
      TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true
      
      # Long range scan(1000 Next())
      TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true
      ```
      Average over 10 runs (some slower tests had fewer runs):
      
      For the first column (tombstone), 0 means no range tombstone, 100-10000 means the width of the 10k range tombstones, and 1 means there is a single range tombstone in the entire DB (width is 1000). The 1-tombstone case tests for regression when there are very few range tombstones in the DB, as having no range tombstone at all likely takes a different code path than having some.
      
      - Scan entire DB
      
      | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
      | ------------- | ------------- | ------------- |  ------------- |
      | 0 range tombstone    |2525600 (± 43564)    |2486917 (± 33698)    |-1.53%               |
      | 100   |1853835 (± 24736)    |2073884 (± 32176)    |+11.87%              |
      | 1000  |422415 (± 7466)      |1115801 (± 22781)    |+164.15%             |
      | 10000 |22384 (± 227)        |227919 (± 6647)      |+918.22%             |
      | 1 range tombstone      |2176540 (± 39050)    |2434954 (± 24563)    |+11.87%              |
      - Short range scan
      
      | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
      | ------------- | ------------- | ------------- |  ------------- |
      | 0  range tombstone   |35398 (± 533)        |35338 (± 569)        |-0.17%               |
      | 100   |28276 (± 664)        |31684 (± 331)        |+12.05%              |
      | 1000  |7637 (± 77)          |25422 (± 277)        |+232.88%             |
      | 10000 |1367                 |28667                |+1997.07%            |
      | 1 range tombstone      |32618 (± 581)        |32748 (± 506)        |+0.4%                |
      
      - Long range scan
      
      | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
      | ------------- | ------------- | ------------- |  ------------- |
      | 0 range tombstone     |2262 (± 33)          |2353 (± 20)          |+4.02%               |
      | 100   |1696 (± 26)          |1926 (± 18)          |+13.56%              |
      | 1000  |410 (± 6)            |1255 (± 29)          |+206.1%              |
      | 10000 |25                   |414                  |+1556.0%             |
      | 1 range tombstone   |1957 (± 30)          |2185 (± 44)          |+11.65%              |
      
      - Microbench does not show significant regression: https://gist.github.com/cbi42/59f280f85a59b678e7e5d8561e693b61
      
      Reviewed By: ajkr
      
      Differential Revision: D38450331
      
      Pulled By: cbi42
      
      fbshipit-source-id: b5ef12e8d8c289ed2e163ccdf277f5039b511fca
  8. 25 Aug 2022 (1 commit)
  9. 24 Aug 2022 (1 commit)
  10. 17 Jul 2022 (2 commits)
  11. 01 Jul 2022 (1 commit)
  12. 30 Jun 2022 (1 commit)
  13. 29 Jun 2022 (1 commit)
  14. 24 Jun 2022 (2 commits)
    • Add suggest_compact_range() and suggest_compact_range_cf() to C API. (#10175) · 2a3792ed
      Committed by Yueh-Hsuan Chiang
      Summary:
      Add suggest_compact_range() and suggest_compact_range_cf() to C API.
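      
      A minimal sketch of the underlying C++ call that the new C wrappers are presumably forwarding to (the wrapper `SuggestRange` is made up; `experimental::SuggestCompactRange` is the existing C++ API):
      
      ```cpp
      #include "rocksdb/db.h"
      #include "rocksdb/experimental.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      // Mark [begin, end) as eligible for compaction without forcing one.
      Status SuggestRange(DB* db, const Slice& begin, const Slice& end) {
        return experimental::SuggestCompactRange(db, &begin, &end);
      }
      ```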
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10175
      
      Test Plan:
      As verifying the result requires SyncPoint, which is not available in c_test.c,
      the test currently just invokes the functions and makes sure they do not crash.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D37305191
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0fe257b45914f6c9aeb985d8b1820dafc57a20db
    • Dynamically changeable `MemPurge` option (#10011) · 5879053f
      Committed by Baptiste Lemaire
      Summary:
      **Summary**
      Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled.
      
      **Motivation**
      RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not currently seem possible.
      Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement.
      
      **Content of this PR**
      This PR removes the `experimental_mempurge_threshold` flag as a DB option flag and re-introduces it as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the ability of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes.
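      
      A minimal sketch of toggling the flag at runtime via `DB::SetOptions()` (the helper name is made up; the option string follows this PR):
      
      ```cpp
      #include <unordered_map>
      #include "rocksdb/db.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      // Toggle mempurge per column family on a live DB; no reopen required.
      Status ToggleMemPurge(DB* db, ColumnFamilyHandle* cf, bool on) {
        return db->SetOptions(
            cf, {{"experimental_mempurge_threshold", on ? "1.0" : "0.0"}});
      }
      ```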
      
      **Benchmarking**
      Within the next 12 hours, I will add numbers showing that there is no performance impact.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011
      
      Reviewed By: pdillinger
      
      Differential Revision: D36462357
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802
  15. 23 Jun 2022 (1 commit)
    • Add get_column_family_metadata() and related functions to C API (#10207) · e103b872
      Committed by Yueh-Hsuan Chiang
      Summary:
      * Add metadata-related structs and functions to the C API (a C++ comparison sketch follows this list), including
        - `rocksdb_get_column_family_metadata()` and `rocksdb_get_column_family_metadata_cf()`
           that return `rocksdb_column_family_metadata_t`.
        - `rocksdb_column_family_metadata_t` and its get functions & destroy function.
        - `rocksdb_level_metadata_t` and its get functions & destroy function.
        - `rocksdb_file_metadata_t` and its get functions & destroy functions.
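      
      The new C structs mirror the existing C++ API, which walks per-CF -> per-level -> per-file metadata; a minimal C++ sketch for comparison (the helper name is made up):
      
      ```cpp
      #include <cstdio>
      #include "rocksdb/db.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      void DumpCfMetadata(DB* db) {
        ColumnFamilyMetaData meta;
        db->GetColumnFamilyMetaData(db->DefaultColumnFamily(), &meta);
        std::printf("cf=%s size=%llu files=%zu\n", meta.name.c_str(),
                    (unsigned long long)meta.size, meta.file_count);
        for (const auto& level : meta.levels) {
          for (const auto& file : level.files) {
            std::printf("  L%d %s %llu bytes\n", level.level,
                        file.name.c_str(), (unsigned long long)file.size);
          }
        }
      }
      ```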
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10207
      
      Test Plan:
      Extend the existing c_test.c to include additional checks for column_family_metadata
      inside CheckCompaction.
      
      Reviewed By: riversand963
      
      Differential Revision: D37305209
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0a5183206353acde145f5f9b632c3bace670aa6e
  16. 15 Jun 2022 (1 commit)
  17. 03 Jun 2022 (1 commit)
    • Make it possible to enable blob files starting from a certain LSM tree level (#10077) · e6432dfd
      Committed by Gang Liao
      Summary:
      Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written, at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.
      
      In order to achieve this, we would like to do the following:
      - Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions`) and extend the related logic (see the configuration sketch after this list)
      - Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
      - Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
      - Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
      - Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
      - Ideally extend the C and Java bindings with the new option
      - Update the BlobDB wiki to document the new option.
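      
      A minimal configuration sketch, assuming the option landed as described above (the helper name is made up):
      
      ```cpp
      #include "rocksdb/options.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      // Keep short-lived large values inline in L0/L1; extract them to blob
      // files only once they are compacted into L2 or deeper.
      Options MakeBlobOptions() {
        Options options;
        options.enable_blob_files = true;
        options.min_blob_size = 1024;          // values >= 1 KiB are candidates
        options.blob_file_starting_level = 2;  // option introduced in this PR
        return options;
      }
      ```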
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077
      
      Reviewed By: ltamasi
      
      Differential Revision: D36884156
      
      Pulled By: gangliao
      
      fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
  18. 27 May 2022 (1 commit)
  19. 26 May 2022 (2 commits)
    • Expose DisableManualCompaction and EnableManualCompaction to C api (#10052) · 4cf2f672
      Committed by Jie Liang Ang
      Summary:
      Add `rocksdb_disable_manual_compaction` and `rocksdb_enable_manual_compaction`.
      
      Note that `rocksdb_enable_manual_compaction` should be used with care and must not be called more times than `rocksdb_disable_manual_compaction` has been called.
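      
      A minimal usage sketch, assuming the wrappers take only the DB handle like their C++ counterparts (the helper name is made up):
      
      ```cpp
      #include "rocksdb/c.h"
      
      // Pause manual compactions (e.g. around shutdown-critical work), then
      // re-enable. Calls must be balanced, as noted above.
      void PauseManualCompactionAround(rocksdb_t* db) {
        rocksdb_disable_manual_compaction(db);
        /* ... work that must not race with CompactRange() ... */
        rocksdb_enable_manual_compaction(db);
      }
      ```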
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10052
      
      Reviewed By: ajkr
      
      Differential Revision: D36665496
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: a4ae6e34694066feb21302ca1a5c365fb9de0ec7
    • Improve transaction C-API (#9252) · b71466e9
      Committed by Yiyuan Liu
      Summary:
      This PR improves transaction support in the C API (a two-phase-commit sketch follows the list below):
      * Support two-phase commit.
      * Support `get_pinned` and `multi_get` in transactions.
      * Add `rocksdb_transactiondb_flush`.
      * Support getting a write batch from a transaction and rebuilding a transaction from a write batch.
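      
      For reference, the C++ two-phase-commit flow that the new C bindings expose looks roughly like this (the helper name and transaction name are made up):
      
      ```cpp
      #include "rocksdb/utilities/transaction_db.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      Status TwoPhaseWrite(TransactionDB* txn_db) {
        Transaction* txn = txn_db->BeginTransaction(WriteOptions());
        Status s = txn->SetName("xid-1");    // a name is required for 2PC
        if (s.ok()) s = txn->Put("key", "value");
        if (s.ok()) s = txn->Prepare();      // phase 1: persist to the WAL
        if (s.ok()) s = txn->Commit();       // phase 2: make it visible
        delete txn;
        return s;
      }
      ```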
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9252
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D36459007
      
      Pulled By: riversand963
      
      fbshipit-source-id: 47371d527be821c496353a7fe2fd18d628069a98
  20. 21 May 2022 (2 commits)
    • Seek parallelization (#9994) · 2db6a4a1
      Committed by Akanksha Mahajan
      Summary:
      The RocksDB iterator is a hierarchy of iterators. MergingIterator maintains a heap of LevelIterators, one for each L0 file and for each non-zero level. The Seek() operation naturally lends itself to parallelization, as it involves positioning every LevelIterator on the correct data block in the correct SST file. Each level is looked up for the target key to find the first key that is >= the target key. This typically involves reading one data block that is likely to contain the target key, and scanning forward to find the first valid key. The forward scan may read more data blocks. In order to find the right data block, the iterator may read some metadata blocks (required for opening a file and searching the index).
      This flow can be parallelized.
      
      Design: Seek() is called twice under the async_io option. The first seek sends asynchronous requests to prefetch the data blocks at each level, and the second seek follows the normal flow; in FilePrefetchBuffer::TryReadFromCacheAsync it waits for Poll() to get the results and adds the iterator to min_heap. (A usage sketch follows these bullets.)
      - Status::TryAgain is passed down from FilePrefetchBuffer::PrefetchAsync to block_iter_.Status, indicating that an asynchronous request has been submitted.
      - If for some reason the asynchronous request fails during submission, it falls back to sequential reading of blocks in one pass.
      - If the data already exists in prefetch_buffer, it returns the data without prefetching further, and this is treated as a single-pass seek.
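      
      A minimal sketch of opting an iterator into this path (the helper name is made up; `async_io` and `adaptive_readahead` are the `ReadOptions` fields exercised in the benchmarks below):
      
      ```cpp
      #include <memory>
      #include "rocksdb/db.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      void AsyncSeekScan(DB* db, const Slice& target) {
        ReadOptions ro;
        ro.async_io = true;            // enables the two-phase Seek() above
        ro.adaptive_readahead = true;
        std::unique_ptr<Iterator> it(db->NewIterator(ro));
        for (it->Seek(target); it->Valid(); it->Next()) {
          // consume it->key() / it->value()
        }
      }
      ```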
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9994
      
      Test Plan:
      - **Run Regressions.**
      ```
      ./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216
      ```
      i) Previous release 7.0 run for normal prefetching with async_io disabled:
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.0
      Date:       Thu Mar 17 13:11:34 2022
      CPU:        24 * Intel Core Processor (Broadwell)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
      ```
      
      ii) normal prefetching after changes with async_io disable:
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Set seed to 1652922591315307 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.3
      Date:       Wed May 18 18:09:51 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483080.466 micros/op 2 ops/sec 120.287 seconds 249 operations;  340.8 MB/s (249 of 249 found)
      ```
      iii) db_bench with async_io enabled completed successfully
      
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 -async_io=1 -adaptive_readahead=1
      Set seed to 1652924062021732 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.3
      Date:       Wed May 18 18:34:22 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  553913.576 micros/op 1 ops/sec 120.199 seconds 217 operations;  293.6 MB/s (217 of 217 found)
      ```
      
      - db_stress with async_io disabled completed successfully
      ```
       export CRASH_TEST_EXT_ARGS=" --async_io=0"
       make crash_test -j
      ```
      
      **In Progress**: db_stress with async_io is failing; debugging/fixing it is in progress.
      
      Reviewed By: anand1976
      
      Differential Revision: D36459323
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: abb1cd944abe712bae3986ae5b16704b3338917c
    • Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857) · cc23b46d
      Committed by Changyu Bi
      Summary:
      An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API ZDICT_finalizeDictionary() can improve such a dictionary's effectiveness at low cost. This PR changes how the dictionary is created: instead of building a raw content dictionary (when max_dict_buffer_bytes > 0), it calls ZDICT_finalizeDictionary(), passing in all buffered uncompressed data blocks as samples.
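      
      A minimal configuration sketch (the helper name is made up; the field names follow this PR's db_bench flags, so treat them as assumptions):
      
      ```cpp
      #include "rocksdb/options.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      Options MakeZstdDictOptions() {
        Options options;
        options.compression = kZSTD;
        options.compression_opts.max_dict_bytes = 16 << 10;       // 16 KiB dict
        options.compression_opts.zstd_max_train_bytes = 1 << 20;  // 1 MiB samples
        // false selects ZDICT_finalizeDictionary over the full trainer.
        options.compression_opts.use_zstd_dict_trainer = false;
        return options;
      }
      ```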
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857
      
      Test Plan:
      #### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data:
      Set up: change the parameter [here](https://github.com/facebook/rocksdb/blob/fb9a167a55e0970b1ef6f67c1600c8d9c4c6114f/tools/db_bench_tool.cc#L1766) to 16384 to make synthetic data more compressible.
      ```
      # linked local ZSTD with version 1.5.2
      # DEBUG_LEVEL=0 ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_ZSTD=1  EXTRA_CXXFLAGS="-DZSTD_STATIC_LINKING_ONLY -DZSTD -I/data/users/changyubi/install/include/" EXTRA_LDFLAGS="-L/data/users/changyubi/install/lib/ -l:libzstd.a" make -j32 db_bench
      
      dict_bytes=16384
      train_bytes=1048576
      echo "========== No Dictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== Raw Content Dictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench_main -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== FinalizeDictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== TrainDictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      # Result: TrainDictionary is much better on space saving, but FinalizeDictionary seems to use less memory.
      # before compression data size: 1.2GB
      dict_bytes=16384
      max_dict_buffer_bytes =  1048576
                          space   cpu/memory
      No Dictionary       468M    14.93user 1.00system 0:15.92elapsed 100%CPU (0avgtext+0avgdata 23904maxresident)k
      Raw Dictionary      251M    15.81user 0.80system 0:16.56elapsed 100%CPU (0avgtext+0avgdata 156808maxresident)k
      FinalizeDictionary  236M    11.93user 0.64system 0:12.56elapsed 100%CPU (0avgtext+0avgdata 89548maxresident)k
      TrainDictionary     84M     7.29user 0.45system 0:07.75elapsed 100%CPU (0avgtext+0avgdata 97288maxresident)k
      ```
      
      #### Benchmark on 10 sample SST files for space saving and CPU time on compression:
      FinalizeDictionary is comparable to TrainDictionary in terms of space saving, and takes less time in compression.
      ```
      dict_bytes=16384
      train_bytes=1048576
      
      for sst_file in `ls ../temp/myrock-sst/`
      do
        echo "********** $sst_file **********"
        echo "========== No Dictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD
      
        echo "========== Raw Content Dictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes
      
        echo "========== FinalizeDictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes --compression_use_zstd_finalize_dict
      
        echo "========== TrainDictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes
      done
      
                               010240.sst (Size/Time) 011029.sst              013184.sst              021552.sst              185054.sst              185137.sst              191666.sst              7560381.sst             7604174.sst             7635312.sst
      No Dictionary           28165569 / 2614419      32899411 / 2976832      32977848 / 3055542      31966329 / 2004590      33614351 / 1755877      33429029 / 1717042      33611933 / 1776936      33634045 / 2771417      33789721 / 2205414      33592194 / 388254
      Raw Content Dictionary  28019950 / 2697961      33748665 / 3572422      33896373 / 3534701      26418431 / 2259658      28560825 / 1839168      28455030 / 1846039      28494319 / 1861349      32391599 / 3095649      33772142 / 2407843      33592230 / 474523
      FinalizeDictionary      27896012 / 2650029      33763886 / 3719427      33904283 / 3552793      26008225 / 2198033      28111872 / 1869530      28014374 / 1789771      28047706 / 1848300      32296254 / 3204027      33698698 / 2381468      33592344 / 517433
      TrainDictionary         28046089 / 2740037      33706480 / 3679019      33885741 / 3629351      25087123 / 2204558      27194353 / 1970207      27234229 / 1896811      27166710 / 1903119      32011041 / 3322315      32730692 / 2406146      33608631 / 570593
      ```
      
      #### Decompression/Read test:
      With FinalizeDictionary/TrainDictionary, some data structures used for decompression are stored in the dictionary, so decompression/reads are expected to be faster.
      ```
      dict_bytes=16384
      train_bytes=1048576
      echo "No Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=0 > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=0 2>&1 | grep MB/s
      
      echo "Raw Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd  -compression_max_dict_bytes=$dict_bytes 2>&1 | grep MB/s
      
      echo "FinalizeDict"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false  > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false 2>&1 | grep MB/s
      
      echo "Train Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes 2>&1 | grep MB/s
      
      No Dictionary
      readrandom   :      12.183 micros/op 82082 ops/sec 12.183 seconds 1000000 operations;    9.1 MB/s (1000000 of 1000000 found)
      Raw Dictionary
      readrandom   :      12.314 micros/op 81205 ops/sec 12.314 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
      FinalizeDict
      readrandom   :       9.787 micros/op 102180 ops/sec 9.787 seconds 1000000 operations;   11.3 MB/s (1000000 of 1000000 found)
      Train Dictionary
      readrandom   :       9.698 micros/op 103108 ops/sec 9.699 seconds 1000000 operations;   11.4 MB/s (1000000 of 1000000 found)
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D35720026
      
      Pulled By: cbi42
      
      fbshipit-source-id: 24d230fdff0fd28a1bb650658798f00dfcfb2a1f
  21. 13 May 2022 (1 commit)
    • Port the batched version of MultiGet() to RocksDB's C API (#9952) · bcb12872
      Committed by Yueh-Hsuan Chiang
      Summary:
      The batched version of MultiGet() is not available in RocksDB's C API.
      This PR implements rocksdb_batched_multi_get_cf, a C wrapper function
      that invokes the batched version of MultiGet() which takes a single column family.
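      
      For reference, a minimal sketch of the C++ overload being wrapped (the helper name is made up):
      
      ```cpp
      #include <array>
      #include "rocksdb/db.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      // Batched MultiGet(): one column family, C-style arrays in and out.
      void BatchedLookup(DB* db, ColumnFamilyHandle* cf) {
        std::array<Slice, 2> keys{Slice("k1"), Slice("k2")};
        std::array<PinnableSlice, 2> values;
        std::array<Status, 2> statuses;
        db->MultiGet(ReadOptions(), cf, keys.size(), keys.data(),
                     values.data(), statuses.data());
      }
      ```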
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9952
      
      Test Plan: Added a new test under the "columnfamilies" test case in c_test.cc
      
      Reviewed By: riversand963
      
      Differential Revision: D36302888
      
      Pulled By: ajkr
      
      fbshipit-source-id: fa134c4a1c8e7d72dd4ae8649a74e3797b5cf4e6
  22. 20 Apr 2022 (1 commit)
  23. 24 Mar 2022 (1 commit)
    • Fix a major performance bug in 7.0 re: filter compatibility (#9736) · 91687d70
      Committed by Peter Dillinger
      Summary:
      Bloom filters generated by pre-7.0 releases are not read by 7.0.x releases (and vice versa) due to changes to FilterPolicy::Name() in https://github.com/facebook/rocksdb/issues/9590. This can severely impact read performance and read I/O on upgrade or downgrade with an existing DB, but not data correctness.
      
      To fix, we go back to using the old, unified name in SST metadata but (for a while anyway) recognize the aliases that could be generated by early 7.0.x releases. This unfortunately requires a public API change to avoid interfering with all the good changes from https://github.com/facebook/rocksdb/issues/9590, but the API change only affects users with a custom FilterPolicy, which should be very few.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9736
      
      Test Plan:
      manual
      
      Generate DBs with
      ```
      ./db_bench.7.0 -db=/dev/shm/rocksdb.7.0 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0
      ```
      and similar. Compare with
      ```
      for IMPL in 6.29 7.0 fixed; do for DB in 6.29 7.0 fixed; do echo "Testing $IMPL on $DB:"; ./db_bench.$IMPL -db=/dev/shm/rocksdb.$DB -use_existing_db -readonly -bloom_bits=10 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=10 2>&1 | grep micros/op; done; done
      ```
      
      Results:
      ```
      Testing 6.29 on 6.29:
      readrandom   :      34.381 micros/op 29085 ops/sec;    3.2 MB/s (291999 of 291999 found)
      Testing 6.29 on 7.0:
      readrandom   :     190.443 micros/op 5249 ops/sec;    0.6 MB/s (52999 of 52999 found)
      Testing 6.29 on fixed:
      readrandom   :      40.148 micros/op 24907 ops/sec;    2.8 MB/s (249999 of 249999 found)
      Testing 7.0 on 6.29:
      readrandom   :     229.430 micros/op 4357 ops/sec;    0.5 MB/s (43999 of 43999 found)
      Testing 7.0 on 7.0:
      readrandom   :      33.348 micros/op 29986 ops/sec;    3.3 MB/s (299999 of 299999 found)
      Testing 7.0 on fixed:
      readrandom   :     152.734 micros/op 6546 ops/sec;    0.7 MB/s (65999 of 65999 found)
      Testing fixed on 6.29:
      readrandom   :      32.024 micros/op 31224 ops/sec;    3.5 MB/s (312999 of 312999 found)
      Testing fixed on 7.0:
      readrandom   :      33.990 micros/op 29390 ops/sec;    3.3 MB/s (294999 of 294999 found)
      Testing fixed on fixed:
      readrandom   :      28.714 micros/op 34825 ops/sec;    3.9 MB/s (348999 of 348999 found)
      ```
      
      Just paying attention to the order of magnitude of ops/sec (short test durations, lots of noise), it's clear that with the fix we can read both <= 6.29 and >= 7.0 DBs at full speed, whereas neither 6.29 nor 7.0 can do both. And the 6.29 release can properly read the fixed DB at full speed.
      
      Reviewed By: siying, ajkr
      
      Differential Revision: D35057844
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a46893a6af4bf084375ebe4728066d00eb08f050
  24. 02 Mar 2022 (1 commit)
  25. 12 Feb 2022 (1 commit)
    • Fix failure in c_test (#9547) · 5c53b900
      Committed by Akanksha Mahajan
      Summary:
      When tests are run with TMPD, c_test may fail because TMPD is not created by the test. It results in an IO error: No such file or directory: While mkdir if missing: /tmp/rocksdb_test_tmp/rocksdb_c_test-0: No such file or directory
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9547
      
      Test Plan:
      make -j32 c_test;
       TEST_TMPDIR=/tmp/rocksdb_test  ./c_test
      
      Reviewed By: riversand963
      
      Differential Revision: D34173298
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 5b5a01f5b842c2487b05b0708c8e9532241db7f8
  26. 09 Feb 2022 (1 commit)
    • FilterPolicy API changes for 7.0 (#9501) · 68a9c186
      Committed by Peter Dillinger
      Summary:
      * The inefficient block-based filter is no longer customizable in the public API, though (for now) it can still be enabled.
        * Removed deprecated FilterPolicy::CreateFilter() and
        FilterPolicy::KeyMayMatch()
        * Removed `rocksdb_filterpolicy_create()` from C API
      * Change meaning of nullptr return from GetBuilderWithContext() from "use
      block-based filter" to "generate no filter in this case." This is a
      cleaner solution to the proposal in https://github.com/facebook/rocksdb/issues/8250.
        * Also, when the user specifies bits_per_key < 0.5, we now round this down
        to "no filter" because we expect a filter with >= 80% FP rate is
        unlikely to be worth the CPU cost of accessing it (esp. with
        cache_index_and_filter_blocks=1 or partition_filters=1).
        * bits_per_key >= 0.5 and < 1.0 is still rounded up to 1.0 (for 62% FP
        rate)
        * This also gives us some support for configuring filters from OPTIONS
        file as currently saved: `filter_policy=rocksdb.BuiltinBloomFilter`.
        Opening from such an options file will enable reading filters (an
        improvement) but not writing new ones. (See Customizable follow-up
        below.)
      * Also removed deprecated functions
        * FilterBitsBuilder::CalculateNumEntry()
        * FilterPolicy::GetFilterBitsBuilder()
        * NewExperimentalRibbonFilterPolicy()
      * Remove default implementations of
        * FilterBitsBuilder::EstimateEntriesAdded()
        * FilterBitsBuilder::ApproximateNumEntries()
        * FilterPolicy::GetBuilderWithContext()
      * Remove support for "filter_policy=experimental_ribbon" configuration
      string.
      * Allow "filter_policy=bloomfilter:n" without the bool to discourage use of the block-based filter (see the configuration sketch after this list).
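      
      A minimal sketch of the post-7.0 way to configure a Bloom filter (the helper name is made up):
      
      ```cpp
      #include "rocksdb/filter_policy.h"
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      Options MakeFilteredOptions() {
        Options options;
        BlockBasedTableOptions table_options;
        // 10 bits/key gives roughly a 1% false-positive rate.
        table_options.filter_policy.reset(NewBloomFilterPolicy(10.0));
        options.table_factory.reset(NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```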
      
      Some pieces for https://github.com/facebook/rocksdb/issues/9389
      
      Likely follow-up (later PRs):
      * Refactoring toward FilterPolicy Customizable, so that we can generate
      filters with same configuration as before when configuring from options
      file.
      * Remove support for user enabling block-based filter (ignore `bool
      use_block_based_builder`)
        * Some months after this change, we could even remove read support for
        block-based filter, because it is not critical to DB data
        preservation.
      * Make FilterBitsBuilder::FinishV2 to avoid `using
      FilterBitsBuilder::Finish` mess and add support for specifying a
      MemoryAllocator (for cache warming)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9501
      
      Test Plan:
      A number of obsolete tests deleted and new tests or test
      cases added or updated.
      
      Reviewed By: hx235
      
      Differential Revision: D34008011
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a39a720457c354e00d5b59166b686f7f59e392aa
  27. 04 Feb 2022 (1 commit)
  28. 29 Jan 2022 (5 commits)
    • Remove deprecated API AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds (#9455) · 42cca28e
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds has been marked as deprecated, and it's time to actually remove the code.
      - Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an options file that still contains these options (e.g., an old options file generated by RocksDB before the deprecation); a parsing sketch follows this list.
      - Keep `soft_rate_limit`/`hard_rate_limit` under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.
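      
      A minimal sketch of the tolerated-no-op behavior described above (the helper name is made up):
      
      ```cpp
      #include <string>
      #include <unordered_map>
      #include "rocksdb/convenience.h"
      
      using namespace ROCKSDB_NAMESPACE;
      
      // An old options map that still names the removed knobs parses cleanly:
      // the names stay registered as no-ops instead of raising InvalidArgument.
      Status ParseOldMap(ColumnFamilyOptions* out) {
        ConfigOptions config_options;
        ColumnFamilyOptions base;
        std::unordered_map<std::string, std::string> opts_map = {
            {"soft_rate_limit", "0.0"}, {"hard_rate_limit", "0.0"}};
        return GetColumnFamilyOptionsFromMap(config_options, base, opts_map, out);
      }
      ```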
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9455
      
      Test Plan: Rely on my eyeball and CI
      
      Reviewed By: ajkr
      
      Differential Revision: D33811664
      
      Pulled By: hx235
      
      fbshipit-source-id: 866859427fe710354a90f1095057f80116365ff0
  29. 28 Jan 2022 (5 commits)
  30. 27 Jan 2022 (2 commits)