1. 26 May 2022 (3 commits)
  2. 25 May 2022 (4 commits)
    • Skip ZSTD dict tests if the version doesn't support it (#10046) · 23f34c7a
      Authored by Jay Zhuang
      Summary:
      For example, the default ZSTD version on Ubuntu 20.04 is 1.4.4, which
      causes the test `PresetCompressionDict` to fail:
      
      ```
      db/db_test_util.cc:607: Failure
      Invalid argument: zstd finalizeDictionary cannot be used because ZSTD 1.4.5+ is not linked with the binary.
      terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
      ```
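
      The guard can be sketched as a simple version check. This is a hedged illustration, not the PR's actual test code; ZSTD's `ZSTD_versionNumber()` encodes the version as major*10000 + minor*100 + patch, so 1.4.5 reports 10405:

      ```cpp
      // Sketch (not the PR's code): gate a test on the linked ZSTD version.
      // ZSTD_versionNumber() returns major*10000 + minor*100 + patch, so
      // ZSTD 1.4.5 reports 10405 and Ubuntu 20.04's 1.4.4 reports 10404.
      #include <cassert>

      bool SupportsZstdFinalizeDictionary(unsigned version_number) {
        return version_number >= 10405;  // ZDICT_finalizeDictionary needs 1.4.5+
      }

      int main() {
        assert(!SupportsZstdFinalizeDictionary(10404));  // default 1.4.4: skip
        assert(SupportsZstdFinalizeDictionary(10502));   // e.g. 1.5.2: run test
        return 0;
      }
      ```

      A test would call such a predicate with the real `ZSTD_versionNumber()` and skip itself when the predicate returns false.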
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10046
      
      Test Plan: tests pass with an old ZSTD version
      
      Reviewed By: cbi42
      
      Differential Revision: D36640067
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: b1c49fb7295f57f4515ce4eb3a52ae7d7e45da86
    • Add timestamp support to CompactedDBImpl (#10030) · d4081bf0
      Authored by Yu Zhang
      Summary:
      This PR is the second and final part of adding user-defined timestamp support to read-only DBs. Specifically, the change in this PR includes:
      
      - `options.timestamp` is respected by `CompactedDBImpl::Get` and `CompactedDBImpl::MultiGet`, which return results visible up to that timestamp.
      - `CompactedDBImpl::Get(..., std::string* timestamp)` and `CompactedDBImpl::MultiGet(std::vector<std::string>* timestamps)` return the timestamp(s) associated with the key(s).
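
      The visibility rule can be illustrated with a toy model (an illustration only, using toy types rather than CompactedDBImpl's actual data structures): each key maps to versions ordered by timestamp, and a read at timestamp `t` returns the newest version whose timestamp is <= `t`.

      ```cpp
      // Toy model of timestamp visibility (illustration only, not RocksDB code):
      // a read at timestamp t sees the newest version whose timestamp is <= t.
      #include <cassert>
      #include <cstdint>
      #include <map>
      #include <string>

      class ToyTimestampDb {
       public:
        void Put(const std::string& key, uint64_t ts, const std::string& value) {
          versions_[key][ts] = value;
        }
        // Fills value/ts_out with the newest version visible at read_ts,
        // mirroring Get(..., std::string* timestamp) in spirit.
        bool Get(const std::string& key, uint64_t read_ts, std::string* value,
                 uint64_t* ts_out) const {
          auto it = versions_.find(key);
          if (it == versions_.end()) return false;
          // upper_bound(read_ts) is the first version strictly newer than read_ts.
          auto v = it->second.upper_bound(read_ts);
          if (v == it->second.begin()) return false;  // nothing visible yet
          --v;
          *value = v->second;
          *ts_out = v->first;
          return true;
        }

       private:
        std::map<std::string, std::map<uint64_t, std::string>> versions_;
      };

      int main() {
        ToyTimestampDb db;
        db.Put("k", 10, "v10");
        db.Put("k", 20, "v20");
        std::string value;
        uint64_t ts;
        assert(db.Get("k", 15, &value, &ts) && value == "v10" && ts == 10);
        assert(db.Get("k", 25, &value, &ts) && value == "v20" && ts == 20);
        assert(!db.Get("k", 5, &value, &ts));  // too early; nothing visible
        return 0;
      }
      ```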
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10030
      
      Test Plan:
      ```
      COMPILE_WITH_ASAN=1 make -j24 all
      ./db_readonly_with_timestamp_test --gtest_filter="DBReadOnlyTestWithTimestamp.CompactedDB*"
      ./db_basic_test --gtest_filter="DBBasicTest.CompactedDB*"
      make all check
      ```
      
      Reviewed By: riversand963
      
      Differential Revision: D36613926
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 5b7ed7fef822708c12e2caf7a8d2deb6a696f0f0
    • Support read rate-limiting in SequentialFileReader (#9973) · 8515bd50
      Authored by Changyu Bi
      Summary:
      Added rate limiter and read rate-limiting support to SequentialFileReader. I've updated call sites to SequentialFileReader::Read with appropriate IO priority (or left a TODO and specified IO_TOTAL for now).
      
      The PR is separated into four commits: the first added the rate-limiting support, with some fixes in the unit test because the number of bytes SequentialFileReader requests from the rate limiter is not accurate (there is overcharge at EOF). The second commit fixed this by allowing SequentialFileReader to check the file size and determine how many bytes are left in the file to read. The third commit added benchmark-related code. The fourth commit moved the logic of using the file size to avoid overcharging the rate limiter into the backup engine (the main user of SequentialFileReader).
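
      The EOF overcharge fix can be sketched as follows (a simplified model with toy types, not the actual SequentialFileReader code): before charging the rate limiter for a read request, cap the charged amount by the bytes actually left in the file.

      ```cpp
      // Simplified sketch (not the actual SequentialFileReader code): charge the
      // rate limiter only for bytes that can actually be read, avoiding the EOF
      // overcharge where a full-sized request is billed but a short read returns.
      #include <algorithm>
      #include <cassert>
      #include <cstddef>

      struct ToyRateLimiter {
        size_t charged = 0;
        void Request(size_t bytes) { charged += bytes; }
      };

      // Returns bytes "read"; file_size/offset stand in for real file state.
      size_t RateLimitedRead(ToyRateLimiter& limiter, size_t file_size,
                             size_t offset, size_t requested) {
        size_t remaining = file_size > offset ? file_size - offset : 0;
        size_t to_read = std::min(requested, remaining);  // cap at EOF
        limiter.Request(to_read);  // charge only what will actually be read
        return to_read;
      }

      int main() {
        ToyRateLimiter limiter;
        // 100-byte file, reader asks for 64-byte chunks.
        size_t n1 = RateLimitedRead(limiter, 100, 0, 64);   // full chunk
        size_t n2 = RateLimitedRead(limiter, 100, 64, 64);  // only 36 bytes left
        assert(n1 == 64 && n2 == 36);
        assert(limiter.charged == 100);  // no overcharge past the file size
        return 0;
      }
      ```

      Without the `std::min` cap, the second call would bill 64 bytes for a 36-byte read, which is the overcharge the second commit removes.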
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9973
      
      Test Plan:
      - `make check`, backup_engine_test covers usage of SequentialFileReader with rate limiter.
      - Run db_bench to check if rate limiting is throttling as expected: Verified that reads and writes are together throttled at 2MB/s, and at 0.2MB chunks that are 100ms apart.
        - Set up: `./db_bench --benchmarks=fillrandom -db=/dev/shm/test_rocksdb`
        - Benchmark:
      ```
      strace -ttfe read,write ./db_bench --benchmarks=backup -db=/dev/shm/test_rocksdb --backup_rate_limit=2097152 --use_existing_db
      strace -ttfe read,write ./db_bench --benchmarks=restore -db=/dev/shm/test_rocksdb --restore_rate_limit=2097152 --use_existing_db
      ```
      - db bench on backup and restore to ensure no performance regression.
        - backup (avg over 50 runs): pre-change: 1.90443e+06 micros/op; post-change: 1.8993e+06 micros/op (improve by 0.2%)
        - restore (avg over 50 runs): pre-change: 1.79105e+06 micros/op; post-change: 1.78192e+06 micros/op (improve by 0.5%)
      
      ```
      # Set up
      ./db_bench --benchmarks=fillrandom -db=/tmp/test_rocksdb -num=10000000
      
      # benchmark
      TEST_TMPDIR=/tmp/test_rocksdb
      NUM_RUN=50
      for ((j=0;j<$NUM_RUN;j++))
      do
         ./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=backup -use_existing_db | egrep 'backup'
        # Restore
        #./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=restore -use_existing_db
      done > rate_limit.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' rate_limit.txt >> rate_limit_2.txt
      ```
      
      Reviewed By: hx235
      
      Differential Revision: D36327418
      
      Pulled By: cbi42
      
      fbshipit-source-id: e75d4307cff815945482df5ba630c1e88d064691
    • Fix failed VerifySstUniqueIds unittests (#10043) · fd24e447
      Authored by Jay Zhuang
      Summary:
      The tests should use `UniqueId64x2` instead of a string.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10043
      
      Test Plan: unit tests
      
      Reviewed By: pdillinger
      
      Differential Revision: D36620366
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: cf937a1da362018472fa4396848225e48893848b
  3. 24 May 2022 (2 commits)
  4. 21 May 2022 (3 commits)
    • Seek parallelization (#9994) · 2db6a4a1
      Authored by Akanksha Mahajan
      Summary:
      The RocksDB iterator is a hierarchy of iterators. MergingIterator maintains a heap of LevelIterators, one for each L0 file and one for each non-zero level. The Seek() operation naturally lends itself to parallelization, as it involves positioning every LevelIterator on the correct data block in the correct SST file. Seeking a level for a target key means finding the first key that is >= the target key. This typically involves reading one data block that is likely to contain the target key and scanning forward to find the first valid key. The forward scan may read more data blocks. In order to find the right data block, the iterator may also read some metadata blocks (required for opening a file and searching the index).
      This flow can be parallelized.
      
      Design: Seek will be called twice under the async_io option. The first seek sends asynchronous requests to prefetch the data blocks at each level, and the second seek follows the normal flow; in FilePrefetchBuffer::TryReadFromCacheAsync it waits for Poll() to get the results and adds the iterator to the min_heap.
      - Status::TryAgain is passed down from FilePrefetchBuffer::PrefetchAsync to block_iter_.Status indicating asynchronous request has been submitted.
      - If for some reason asynchronous request returns error in submitting the request, it will fallback to sequential reading of blocks in one pass.
      - If the data already exists in prefetch_buffer, it will return the data without prefetching further and it will be treated as single pass of seek.
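
      The two-pass flow above can be modeled roughly as a small state machine (an illustration of the described design, not the actual FilePrefetchBuffer code): the first Seek() kicks off an async prefetch and reports TryAgain; the second Seek() finds the data ready and completes.

      ```cpp
      // Rough state-machine model of the two-pass async seek (illustration only):
      // pass 1 submits a prefetch and returns kTryAgain; pass 2 finds the block
      // ready (where Poll() would wait for the async result) and returns kOk.
      #include <cassert>

      enum class Status { kOk, kTryAgain };

      class ToyAsyncSeeker {
       public:
        Status Seek() {
          if (!prefetch_submitted_) {
            prefetch_submitted_ = true;  // submit async read of the data block
            return Status::kTryAgain;    // caller will come back for pass 2
          }
          data_ready_ = true;  // stand-in for Poll() waiting on the result
          return Status::kOk;
        }
        bool data_ready() const { return data_ready_; }

       private:
        bool prefetch_submitted_ = false;
        bool data_ready_ = false;
      };

      int main() {
        ToyAsyncSeeker seeker;
        assert(seeker.Seek() == Status::kTryAgain);  // pass 1: prefetch submitted
        assert(seeker.Seek() == Status::kOk);        // pass 2: block available
        assert(seeker.data_ready());
        return 0;
      }
      ```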
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9994
      
      Test Plan:
      - **Run Regressions.**
      ```
      ./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216
      ```
      i) Previous release 7.0 run for normal prefetching with async_io disabled:
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.0
      Date:       Thu Mar 17 13:11:34 2022
      CPU:        24 * Intel Core Processor (Broadwell)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
      ```
      
      ii) normal prefetching after changes with async_io disabled:
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
      Set seed to 1652922591315307 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.3
      Date:       Wed May 18 18:09:51 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  483080.466 micros/op 2 ops/sec 120.287 seconds 249 operations;  340.8 MB/s (249 of 249 found)
      ```
      iii) db_bench with async_io enabled completed successfully
      
      ```
      ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 -async_io=1 -adaptive_readahead=1
      Set seed to 1652924062021732 because --seed was 0
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 7.3
      Date:       Wed May 18 18:34:22 2022
      CPU:        32 * Intel Xeon Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     512 bytes each (256 bytes after compression)
      Entries:    5000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    2594.0 MB (estimated)
      FileSize:   1373.3 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: Snappy
      Compression sampling rate: 0
      Memtablerep: SkipListFactory
      Perf Level: 1
      ------------------------------------------------
      DB path: [/tmp/prefix_scan_prefetch_main]
      seekrandom   :  553913.576 micros/op 1 ops/sec 120.199 seconds 217 operations;  293.6 MB/s (217 of 217 found)
      ```
      
      - db_stress with async_io disabled completed successfully
      ```
       export CRASH_TEST_EXT_ARGS=" --async_io=0"
       make crash_test -j
      ```
      
      **In Progress**: db_stress with async_io is failing; debugging/fixing it is in progress.
      
      Reviewed By: anand1976
      
      Differential Revision: D36459323
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: abb1cd944abe712bae3986ae5b16704b3338917c
    • Fix crash due to MultiGet async IO and direct IO (#10024) · e015206d
      Authored by anand76
      Summary:
      MultiGet with async IO is not officially supported with Posix yet. Avoid a crash by using synchronous MultiRead when direct IO is enabled.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10024
      
      Test Plan: Run db_crashtest.py manually
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D36551053
      
      Pulled By: anand1976
      
      fbshipit-source-id: 72190418fa92dd0397e87825df618b12c9bdecda
    • Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857) · cc23b46d
      Authored by Changyu Bi
      Summary:
      An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API ZDICT_finalizeDictionary() can improve such a dictionary's effectiveness at low cost. This PR changes how the dictionary is created: it calls the ZSTD ZDICT_finalizeDictionary() API instead of creating a raw content dictionary (when max_dict_buffer_bytes > 0), passing in all buffered uncompressed data blocks as samples.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857
      
      Test Plan:
      #### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data:
      Set up: change the parameter [here](https://github.com/facebook/rocksdb/blob/fb9a167a55e0970b1ef6f67c1600c8d9c4c6114f/tools/db_bench_tool.cc#L1766) to 16384 to make synthetic data more compressible.
      ```
      # linked local ZSTD with version 1.5.2
      # DEBUG_LEVEL=0 ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_ZSTD=1  EXTRA_CXXFLAGS="-DZSTD_STATIC_LINKING_ONLY -DZSTD -I/data/users/changyubi/install/include/" EXTRA_LDFLAGS="-L/data/users/changyubi/install/lib/ -l:libzstd.a" make -j32 db_bench
      
      dict_bytes=16384
      train_bytes=1048576
      echo "========== No Dictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== Raw Content Dictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench_main -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== FinalizeDictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      echo "========== TrainDictionary =========="
      TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
      TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 2>&1 | grep elapsed
      du -hc /dev/shm/dbbench/*sst | grep total
      
      # Result: TrainDictionary is much better on space saving, but FinalizeDictionary seems to use less memory.
      # before compression data size: 1.2GB
      dict_bytes=16384
      max_dict_buffer_bytes =  1048576
                          space   cpu/memory
      No Dictionary       468M    14.93user 1.00system 0:15.92elapsed 100%CPU (0avgtext+0avgdata 23904maxresident)k
      Raw Dictionary      251M    15.81user 0.80system 0:16.56elapsed 100%CPU (0avgtext+0avgdata 156808maxresident)k
      FinalizeDictionary  236M    11.93user 0.64system 0:12.56elapsed 100%CPU (0avgtext+0avgdata 89548maxresident)k
      TrainDictionary     84M     7.29user 0.45system 0:07.75elapsed 100%CPU (0avgtext+0avgdata 97288maxresident)k
      ```
      
      #### Benchmark on 10 sample SST files for space saving and CPU time on compression:
      FinalizeDictionary is comparable to TrainDictionary in terms of space saving, and takes less time in compression.
      ```
      dict_bytes=16384
      train_bytes=1048576
      
      for sst_file in `ls ../temp/myrock-sst/`
      do
        echo "********** $sst_file **********"
        echo "========== No Dictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD
      
        echo "========== Raw Content Dictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes
      
        echo "========== FinalizeDictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes --compression_use_zstd_finalize_dict
      
        echo "========== TrainDictionary =========="
        ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes
      done
      
                               010240.sst (Size/Time) 011029.sst              013184.sst              021552.sst              185054.sst              185137.sst              191666.sst              7560381.sst             7604174.sst             7635312.sst
      No Dictionary           28165569 / 2614419      32899411 / 2976832      32977848 / 3055542      31966329 / 2004590      33614351 / 1755877      33429029 / 1717042      33611933 / 1776936      33634045 / 2771417      33789721 / 2205414      33592194 / 388254
      Raw Content Dictionary  28019950 / 2697961      33748665 / 3572422      33896373 / 3534701      26418431 / 2259658      28560825 / 1839168      28455030 / 1846039      28494319 / 1861349      32391599 / 3095649      33772142 / 2407843      33592230 / 474523
      FinalizeDictionary      27896012 / 2650029      33763886 / 3719427      33904283 / 3552793      26008225 / 2198033      28111872 / 1869530      28014374 / 1789771      28047706 / 1848300      32296254 / 3204027      33698698 / 2381468      33592344 / 517433
      TrainDictionary         28046089 / 2740037      33706480 / 3679019      33885741 / 3629351      25087123 / 2204558      27194353 / 1970207      27234229 / 1896811      27166710 / 1903119      32011041 / 3322315      32730692 / 2406146      33608631 / 570593
      ```
      
      #### Decompression/Read test:
      With FinalizeDictionary/TrainDictionary, some data structures used for decompression are stored in the dictionary, so decompression/reads are expected to be faster.
      ```
      dict_bytes=16384
      train_bytes=1048576
      echo "No Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=0 > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=0 2>&1 | grep MB/s
      
      echo "Raw Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd  -compression_max_dict_bytes=$dict_bytes 2>&1 | grep MB/s
      
      echo "FinalizeDict"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false  > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false 2>&1 | grep MB/s
      
      echo "Train Dictionary"
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes > /dev/null 2>&1
      TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes 2>&1 | grep MB/s
      
      No Dictionary
      readrandom   :      12.183 micros/op 82082 ops/sec 12.183 seconds 1000000 operations;    9.1 MB/s (1000000 of 1000000 found)
      Raw Dictionary
      readrandom   :      12.314 micros/op 81205 ops/sec 12.314 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
      FinalizeDict
      readrandom   :       9.787 micros/op 102180 ops/sec 9.787 seconds 1000000 operations;   11.3 MB/s (1000000 of 1000000 found)
      Train Dictionary
      readrandom   :       9.698 micros/op 103108 ops/sec 9.699 seconds 1000000 operations;   11.4 MB/s (1000000 of 1000000 found)
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D35720026
      
      Pulled By: cbi42
      
      fbshipit-source-id: 24d230fdff0fd28a1bb650658798f00dfcfb2a1f
  5. 20 May 2022 (5 commits)
    • Add timestamp support to DBImplReadOnly (#10004) · 16bdb1f9
      Authored by Yu Zhang
      Summary:
      This PR adds timestamp support to a read-only DB instance opened as `DBImplReadOnly`. A follow-up PR will add the same support to `CompactedDBImpl`.
      
      With this, a read-only database has these timestamp-related APIs:
      
      `ReadOptions.timestamp`: a read should return the latest data visible up to this specified timestamp
      `Iterator::timestamp()`: returns the timestamp associated with the key/value
      `DB::Get(..., std::string* timestamp)`: returns the timestamp associated with the key/value in `timestamp`
      
      Test plan (on devserver):
      
      ```
      COMPILE_WITH_ASAN=1 make -j24 all
      ./db_with_timestamp_basic_test --gtest_filter=DBBasicTestWithTimestamp.ReadOnlyDB*
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10004
      
      Reviewed By: riversand963
      
      Differential Revision: D36434422
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 5d949e65b1ffb845758000e2b310fdd4aae71cfb
    • Multi file concurrency in MultiGet using coroutines and async IO (#9968) · 57997dda
      Authored by anand76
      Summary:
      This PR implements a coroutine version of batched MultiGet in order to concurrently read from multiple SST files in a level using async IO, thus reducing the latency of the MultiGet. The API from the user perspective is still synchronous and single threaded, with the RocksDB part of the processing happening in the context of the caller's thread. In Version::MultiGet, the decision is made whether to call synchronous or coroutine code.
      
      A good way to review this PR is to review the first 4 commits in order - de773b3, 70c2f70, 10b50e1, and 377a597 - before reviewing the rest.
      
      TODO:
      1. Figure out how to build it in CircleCI (requires some dependencies to be installed)
      2. Do some stress testing with coroutines enabled
      
      No regression in synchronous MultiGet between this branch and main -
      ```
      ./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=64 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=true -threads=16 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -statistics
      ```
      Branch - ```multireadrandom :       4.025 micros/op 3975111 ops/sec 60.001 seconds 238509056 operations; 2062.3 MB/s (14767808 of 14767808 found)```
      
      Main - ```multireadrandom :       3.987 micros/op 4013216 ops/sec 60.001 seconds 240795392 operations; 2082.1 MB/s (15231040 of 15231040 found)```
      
      More benchmarks in various scenarios are given below. The measurements were taken with ```async_io=false``` (no coroutines) and ```async_io=true``` (use coroutines). For an IO bound workload (with every key requiring an IO), the coroutines version shows a clear benefit, being ~2.6X faster. For CPU bound workloads, the coroutines version has ~6-15% higher CPU utilization, depending on how many keys overlap an SST file.
      
      1. Single thread IO bound workload on remote storage with sparse MultiGet batch keys (~1 key overlap/file) -
      No coroutines - ```multireadrandom :     831.774 micros/op 1202 ops/sec 60.001 seconds 72136 operations;    0.6 MB/s (72136 of 72136 found)```
      Using coroutines - ```multireadrandom :     318.742 micros/op 3137 ops/sec 60.003 seconds 188248 operations;    1.6 MB/s (188248 of 188248 found)```
      
      2. Single thread CPU bound workload (all data cached) with ~1 key overlap/file -
      No coroutines - ```multireadrandom :       4.127 micros/op 242322 ops/sec 60.000 seconds 14539384 operations;  125.7 MB/s (14539384 of 14539384 found)```
      Using coroutines - ```multireadrandom :       4.741 micros/op 210935 ops/sec 60.000 seconds 12656176 operations;  109.4 MB/s (12656176 of 12656176 found)```
      
      3. Single thread CPU bound workload with ~2 key overlap/file -
      No coroutines - ```multireadrandom :       3.717 micros/op 269000 ops/sec 60.000 seconds 16140024 operations;  139.6 MB/s (16140024 of 16140024 found)```
      Using coroutines - ```multireadrandom :       4.146 micros/op 241204 ops/sec 60.000 seconds 14472296 operations;  125.1 MB/s (14472296 of 14472296 found)```
      
      4. CPU bound multi-threaded (16 threads) with ~4 key overlap/file -
      No coroutines - ```multireadrandom :       4.534 micros/op 3528792 ops/sec 60.000 seconds 211728728 operations; 1830.7 MB/s (12737024 of 12737024 found) ```
      Using coroutines - ```multireadrandom :       4.872 micros/op 3283812 ops/sec 60.000 seconds 197030096 operations; 1703.6 MB/s (12548032 of 12548032 found) ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9968
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D36348563
      
      Pulled By: anand1976
      
      fbshipit-source-id: c0ce85a505fd26ebfbb09786cbd7f25202038696
    • Address comments for PR #9988 and #9996 (#10020) · 5be1579e
      Authored by Bo Wang
      Summary:
      1. The latest change of DecideRateLimiterPriority in https://github.com/facebook/rocksdb/pull/9988 is reverted.
      2. For https://github.com/facebook/rocksdb/blob/main/db/builder.cc#L345-L349
        2.1. Remove `we will regrad this verification as user reads` from the comments.
        2.2. Do not set `read_options.rate_limiter_priority` to `Env::IO_USER`. Flush should be a background job.
        2.3. Update db_rate_limiter_test.cc.
      3. In IOOptions, mark `prio` as deprecated for future removal.
      4. In `file_system.h`, mark `IOPriority` as deprecated for future removal.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10020
      
      Test Plan: Unit tests.
      
      Reviewed By: ajkr
      
      Differential Revision: D36525317
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 011ba421822f8a124e6d25a2661c4e242df6ad36
    • Fix auto_prefix_mode performance with partitioned filters (#10012) · 280b9f37
      Authored by Peter Dillinger
      Summary:
      Essentially refactored the RangeMayExist implementation in
      FullFilterBlockReader to FilterBlockReaderCommon so that it applies to
      partitioned filters as well. (The function is not called for the
      block-based filter case.) RangeMayExist is essentially a series of checks
      around a possible PrefixMayExist, and I'm confident those checks should
      be the same for partitioned as for full filters. (I think it's likely
      that bugs remain in those checks, but this change is overall a simplifying
      one.)
      
      Added auto_prefix_mode support to db_bench
      
      Other small fixes as well
      
      Fixes https://github.com/facebook/rocksdb/issues/10003
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10012
      
      Test Plan:
      Expanded unit test that uses statistics to check for filter
      optimization, fails without the production code changes here
      
      Performance: populate two DBs with
      ```
      TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
      TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters
      ```
      
      Observe no measurable change in non-partitioned performance
      ```
      TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20
      ```
      Before: seekrandom [AVG 15 runs] : 11798 (± 331) ops/sec
      After: seekrandom [AVG 15 runs] : 11724 (± 315) ops/sec
      
      Observe big improvement with partitioned (also supported by bloom use statistics)
      ```
      TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20
      ```
      Before: seekrandom [AVG 12 runs] : 2942 (± 57) ops/sec
      After: seekrandom [AVG 12 runs] : 7489 (± 184) ops/sec
      
      Reviewed By: siying
      
      Differential Revision: D36469796
      
      Pulled By: pdillinger
      
      fbshipit-source-id: bcf1e2a68d347b32adb2b27384f945434e7a266d
      280b9f37
    • J
      Track SST unique id in MANIFEST and verify (#9990) · c6d326d3
      Jay Zhuang committed
      Summary:
      Start tracking the SST unique id in the MANIFEST, which is used to verify
      against SST properties to make sure the SST file is not overwritten or
      misplaced. A DB option `try_verify_sst_unique_id` (default: false) is
      introduced to enable/disable the verification. If enabled, it opens all
      SST files during DB open to read the unique_id from the table properties,
      so it's recommended to use it with `max_open_files = -1` to
      pre-open the files.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990
      
      Test Plan: unit tests, format-compatible test, mini-crash test
      
      Reviewed By: anand1976
      
      Differential Revision: D36381863
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f
      c6d326d3
  6. 19 May 2022, 2 commits
    • G
      Set Read rate limiter priority dynamically and pass it to FS (#9996) · 4da34b97
      gitbw95 committed
      Summary:
      ### Context:
      Background compactions and flushes generate large reads and writes and can be long-running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.
      
      ### Solution
      User, Flush, and Compaction reads share some code path. For this task, we update the rate_limiter_priority in ReadOptions for code paths (e.g. FindTable (mainly in BlockBasedTable::Open()) and various iterators), and eventually update the rate_limiter_priority in IOOptions for FSRandomAccessFile.
      
      **This PR is for the Read path.** The dynamic **Read** priorities for the different states are listed as follows:
      
      | State | Normal | Delayed | Stalled |
      | ----- | ------ | ------- | ------- |
      |  Flush (verification read in BuildTable()) | IO_USER | IO_USER | IO_USER |
      |  Compaction | IO_LOW  | IO_USER | IO_USER |
      |  User | User provided | User provided | User provided |
      
      We will respect the read_options that the user provided and will not override them.
      The only SST read for Flush is the verification read in BuildTable(), which the code treats as a user read.
      
      **Details**
      1. Set read_options.rate_limiter_priority dynamically:
      - User: Do not update the read_options. Use the read_options that the user provided.
      - Compaction: Update read_options in CompactionJob::ProcessKeyValueCompaction().
      - Flush: Update read_options in BuildTable().
      
      2. Pass the rate limiter priority to FSRandomAccessFile functions:
      - After calling the FindTable(), read_options is passed through GetTableReader(table_cache.cc), BlockBasedTableFactory::NewTableReader(block_based_table_factory.cc), and BlockBasedTable::Open(). The Open() needs some updates for the ReadOptions variable and the updates are also needed for the called functions,  including PrefetchTail(), PrepareIOOptions(), ReadFooterFromFile(), ReadMetaIndexblock(), ReadPropertiesBlock(), PrefetchIndexAndFilterBlocks(), and ReadRangeDelBlock().
      - In RandomAccessFileReader, the functions to be updated include Read(), MultiRead(), ReadAsync(), and Prefetch().
      - Update the downstream functions of NewIndexIterator(), NewDataBlockIterator(), and BlockBasedTableIterator().
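      The read-priority table above can be sketched as a small, self-contained decision function. The enum values and function name below are illustrative stand-ins, not RocksDB's actual code:

      ```cpp
      // Illustrative stand-ins for Env::IOPriority and the write-controller
      // state; this mirrors the table above but is NOT the real RocksDB code.
      enum class IOPriority { kIOLow, kIOHigh, kIOUser };
      enum class DBState { kNormal, kDelayed, kStalled };

      // Flush verification reads are always IO_USER; compaction reads are
      // IO_LOW when normal and IO_USER when delayed/stalled; user reads keep
      // whatever priority the caller provided.
      IOPriority DecideReadPriority(bool is_flush, bool is_compaction,
                                    DBState state, IOPriority user_provided) {
        if (is_flush) return IOPriority::kIOUser;
        if (is_compaction) {
          return state == DBState::kNormal ? IOPriority::kIOLow
                                           : IOPriority::kIOUser;
        }
        return user_provided;  // user read: respect the caller's choice
      }
      ```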
      
      ### Test Plans
      Add unit tests.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9996
      
      Reviewed By: anand1976
      
      Differential Revision: D36452483
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 60978204a4f849bb9261cb78d9bc1cb56d6008cf
      4da34b97
    • S
      Log error message when LinkFile() is not supported when ingesting files (#10010) · a74f14b5
      sdong committed
      Summary:
      Right now, it is opaque to users whether moving a file was skipped because LinkFile() is not supported. Add a log message to help users debug.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10010
      
      Test Plan: Run existing tests. Manually verified that the log message is printed out.
      
      Reviewed By: riversand963
      
      Differential Revision: D36463237
      
      fbshipit-source-id: b00bd5041bd5c11afa4e326819c8461ee2c98a91
      a74f14b5
  7. 18 May 2022, 4 commits
    • G
      Set Write rate limiter priority dynamically and pass it to FS (#9988) · 05c678e1
      gitbw95 committed
      Summary:
      ### Context:
      Background compactions and flushes generate large reads and writes and can be long-running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.
      
      From the RocksDB perspective, there can be two kinds of rate limiters, the internal (native) one and the external one.
      - The internal (native) rate limiter is introduced in [the wiki](https://github.com/facebook/rocksdb/wiki/Rate-Limiter). Currently, only IO_LOW and IO_HIGH are used and they are set statically.
      - For the external rate limiter, in FSWritableFile functions,  IOOptions is open for end users to set and get rate_limiter_priority for their own rate limiter. Currently, RocksDB doesn’t pass the rate_limiter_priority through IOOptions to the file system.
      
      ### Solution
      During the User Read, Flush write, Compaction read/write, the WriteController is used to determine whether DB writes are stalled or slowed down. The rate limiter priority (Env::IOPriority) can be determined accordingly. We decided to always pass the priority in IOOptions. What the file system does with it should be a contract between the user and the file system. We would like to set the rate limiter priority at file level, since the Flush/Compaction job level may be too coarse with multiple files and block IO level is too granular.
      
      **This PR is for the Write path.** The dynamic **Write** priorities for the different states are listed as follows:
      
      | State | Normal | Delayed | Stalled |
      | ----- | ------ | ------- | ------- |
      |  Flush | IO_HIGH | IO_USER | IO_USER |
      |  Compaction | IO_LOW | IO_USER | IO_USER |
      
      Flush and Compaction writes share the same call path through BlockBasedTableBuilder, WritableFileWriter, and FSWritableFile. When a new FSWritableFile object is created, its io_priority_ can be set dynamically based on the state of the WriteController. In WritableFileWriter, before the call sites of FSWritableFile functions, WritableFileWriter::DecideRateLimiterPriority() determines the rate_limiter_priority. The options (IOOptions) argument of FSWritableFile functions will be updated with the rate_limiter_priority.
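      The decision in the write-priority table can be sketched as follows. This is a toy illustration of the table, not the actual WritableFileWriter::DecideRateLimiterPriority() implementation:

      ```cpp
      // Illustrative stand-ins, mirroring the Write table above: any delayed
      // or stalled state maps to IO_USER; otherwise Flush writes get IO_HIGH
      // and Compaction writes get IO_LOW.
      enum class IOPriority { kIOLow, kIOHigh, kIOUser };
      enum class DBState { kNormal, kDelayed, kStalled };

      IOPriority DecideWritePriority(bool is_flush, DBState state) {
        if (state != DBState::kNormal) return IOPriority::kIOUser;
        return is_flush ? IOPriority::kIOHigh : IOPriority::kIOLow;
      }
      ```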
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9988
      
      Test Plan: Add unit tests.
      
      Reviewed By: anand1976
      
      Differential Revision: D36395159
      
      Pulled By: gitbw95
      
      fbshipit-source-id: a7c82fc29759139a1a07ec46c37dbf7e753474cf
      05c678e1
    • J
      Add table_properties_collector_factories override (#9995) · b84e3363
      Jay Zhuang committed
      Summary:
      Add table_properties_collector_factories override on the remote
      side.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9995
      
      Test Plan: unittest added
      
      Reviewed By: ajkr
      
      Differential Revision: D36392623
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 3ba031294d90247ca063d7de7b43178d38e3f66a
      b84e3363
    • H
      Rewrite memory-charging feature's option API (#9926) · 3573558e
      Hui Xiao committed
      Summary:
      **Context:**
      Previous PRs https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, and https://github.com/facebook/rocksdb/pull/8428 added a separate flag for each charged memory area. Such an API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache-usage-related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`.
      
      Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags.
      
      **Summary:**
      - Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that
      - Added missing db bench/stress test for some memory charging features
      - Renamed related test suite to indicate they are under the same theme of memory charging
      - Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication
      - Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926
      
      Test Plan:
      - New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)`
      - New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)`
      - CI
      - db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR  -charge_compression_dictionary_building_buffer=1(remove this for comparison)  -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
      
      #-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
      -- | -- | -- | -- | -- | --
      10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
      20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465**
      40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078**
      
      - db_stress: `python3 tools/db_crashtest.py blackbox  -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal
      
      Reviewed By: ajkr
      
      Differential Revision: D36054712
      
      Pulled By: hx235
      
      fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af
      3573558e
    • M
      Use STATIC_AVOID_DESTRUCTION for static objects with non-trivial destructors (#9958) · b11ff347
      mrambacher committed
      Summary:
      Changed the static objects that had non-trivial destructors to use the STATIC_AVOID_DESTRUCTION construct.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9958
      
      Reviewed By: pdillinger
      
      Differential Revision: D36442982
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 029d47b1374d30d198bfede369a4c0ae7a4eb519
      b11ff347
  8. 17 May 2022, 1 commit
  9. 14 May 2022, 1 commit
  10. 13 May 2022, 2 commits
    • M
      Option type info functions (#9411) · bfc6a8ee
      mrambacher committed
      Summary:
      Add methods to set the various functions (Parse, Serialize, Equals) to the OptionTypeInfo.  These methods simplify the number of constructors required for OptionTypeInfo and make the code a little clearer.
      
      Add functions to the OptionTypeInfo for Prepare and Validate.  These methods allow types other than Configurable and Customizable to have Prepare and Validate logic.  These methods could be used by an option to guarantee that its settings were in a range or that a value was initialized.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9411
      
      Reviewed By: pdillinger
      
      Differential Revision: D36174849
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 72517d8c6bab4723788a4c1a9e16590bff870125
      bfc6a8ee
    • Y
      Port the batched version of MultiGet() to RocksDB's C API (#9952) · bcb12872
      Yueh-Hsuan Chiang committed
      Summary:
      The batched version of MultiGet() is not available in RocksDB's C API.
      This PR implements rocksdb_batched_multi_get_cf which is a C wrapper function
      that invokes the batched version of MultiGet() which takes one single column family.
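      The batching idea — resolving many keys against one column family in a single call instead of N point lookups — can be illustrated with a toy, RocksDB-free sketch. The function below is hypothetical; the real rocksdb_batched_multi_get_cf C function has a different signature and reports per-key status:

      ```cpp
      #include <map>
      #include <string>
      #include <vector>

      // Toy stand-in for a batched multi-get: look up all keys against one
      // "column family" (here just a std::map) in a single call, returning
      // one result slot per key. Missing keys yield an empty string here.
      std::vector<std::string> BatchedMultiGet(
          const std::map<std::string, std::string>& cf,
          const std::vector<std::string>& keys) {
        std::vector<std::string> values;
        values.reserve(keys.size());
        for (const auto& k : keys) {
          auto it = cf.find(k);
          values.push_back(it == cf.end() ? std::string() : it->second);
        }
        return values;
      }
      ```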
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9952
      
      Test Plan: Added a new test case under "columnfamilies" test case in c_test.cc
      
      Reviewed By: riversand963
      
      Differential Revision: D36302888
      
      Pulled By: ajkr
      
      fbshipit-source-id: fa134c4a1c8e7d72dd4ae8649a74e3797b5cf4e6
      bcb12872
  11. 12 May 2022, 1 commit
    • A
      Update WAL corruption test so that it fails without fix (#9942) · 6442a62e
      Akanksha Mahajan committed
      Summary:
      In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't
      flush the data from WAL to L0 for all column families if possible. As a
      result, not all column families can increase their log_numbers, and
      min_log_number_to_keep won't change.
      For a transaction DB (allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions, and min_log_number_to_keep won't change.
      If we persist a new MANIFEST with
      advanced log_numbers for some column families, then during a second
      crash after persisting the MANIFEST, RocksDB will see some column
      families' log_numbers larger than the corrupted WAL, and the "column family inconsistency" error will be hit, causing recovery to fail.
      
      This PR updates unit tests to emulate the errors; the tests fail without a fix.
      
      Error:
      ```
      [ RUN      ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/0
      db/corruption_test.cc:1190: Failure
      DB::Open(options, dbname_, cf_descs, &handles, &db_)
      Corruption: SST file is ahead of WALs in CF test_cf
      [  FAILED  ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/0, where GetParam() = (true, false) (91 ms)
      [ RUN      ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/1
      db/corruption_test.cc:1190: Failure
      DB::Open(options, dbname_, cf_descs, &handles, &db_)
      Corruption: SST file is ahead of WALs in CF test_cf
      [  FAILED  ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/1, where GetParam() = (false, false) (92 ms)
      [ RUN      ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/2
      db/corruption_test.cc:1190: Failure
      DB::Open(options, dbname_, cf_descs, &handles, &db_)
      Corruption: SST file is ahead of WALs in CF test_cf
      [  FAILED  ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/2, where GetParam() = (true, true) (95 ms)
      [ RUN      ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/3
      db/corruption_test.cc:1190: Failure
      DB::Open(options, dbname_, cf_descs, &handles, &db_)
      Corruption: SST file is ahead of WALs in CF test_cf
      [  FAILED  ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/3, where GetParam() = (false, true) (92 ms)
      [ RUN      ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/0
      db/corruption_test.cc:1354: Failure
      TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db)
      Corruption: SST file is ahead of WALs in CF default
      [  FAILED  ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/0, where GetParam() = (true, false) (94 ms)
      [ RUN      ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/1
      db/corruption_test.cc:1354: Failure
      TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db)
      Corruption: SST file is ahead of WALs in CF default
      [  FAILED  ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/1, where GetParam() = (false, false) (97 ms)
      [ RUN      ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/2
      db/corruption_test.cc:1354: Failure
      TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db)
      Corruption: SST file is ahead of WALs in CF default
      [  FAILED  ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/2, where GetParam() = (true, true) (94 ms)
      [ RUN      ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/3
      db/corruption_test.cc:1354: Failure
      TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db)
      Corruption: SST file is ahead of WALs in CF default
      [  FAILED  ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/3, where GetParam() = (false, true) (91 ms)
      [ RUN      ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/0
      db/corruption_test.cc:1483: Failure
      DB::Open(options, dbname_, cf_descs, &handles, &db_)
      Corruption: SST file is ahead of WALs in CF default
      [  FAILED  ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/0, where GetParam() = (true, false) (93 ms)
      [ RUN      ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/1
      db/corruption_test.cc:1483: Failure
      DB::Open(options, dbname_, cf_descs, &handles, &db_)
      Corruption: SST file is ahead of WALs in CF default
      [  FAILED  ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/1, where GetParam() = (false, false) (94 ms)
      [ RUN      ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/2
      db/corruption_test.cc:1483: Failure
      DB::Open(options, dbname_, cf_descs, &handles, &db_)
      Corruption: SST file is ahead of WALs in CF default
      [  FAILED  ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/2, where GetParam() = (true, true) (90 ms)
      [ RUN      ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/3
      db/corruption_test.cc:1483: Failure
      DB::Open(options, dbname_, cf_descs, &handles, &db_)
      Corruption: SST file is ahead of WALs in CF default
      [  FAILED  ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/3, where GetParam() = (false, true) (93 ms)
      [----------] 12 tests from CorruptionTest/CrashDuringRecoveryWithCorruptionTest (1116 ms total)
      
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9942
      
      Test Plan: Not needed
      
      Reviewed By: riversand963
      
      Differential Revision: D36324112
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: cab2075ac4ebe48f5ef93a6ea162558aa4fc334d
      6442a62e
  12. 10 May 2022, 1 commit
  13. 07 May 2022, 1 commit
    • S
      Remove own ToString() (#9955) · 736a7b54
      sdong committed
      Summary:
      ToString() was created because some platforms don't support std::to_string(). However, we've already been using std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just removes ToString().
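      With the custom helper gone, call sites use the standard function directly. A minimal sketch of the pattern (the function below is illustrative, not a RocksDB call site):

      ```cpp
      #include <cstdint>
      #include <string>

      // Before this change a call site would use the custom ToString(n);
      // now it uses std::to_string(n) from the standard library.
      std::string FileSizeSummary(uint64_t bytes) {
        return "size: " + std::to_string(bytes) + " bytes";
      }
      ```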
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
      
      Test Plan: Watch CI tests
      
      Reviewed By: riversand963
      
      Differential Revision: D36176799
      
      fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
      736a7b54
  14. 06 May 2022, 2 commits
    • O
      Fix various spelling errors still found in code (#9653) · b7aaa987
      Otto Kekäläinen committed
      Summary:
      dont -> don't
      refered -> referred
      
      This is a re-run of PR#7785 and acc9679c since these typos keep coming back.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9653
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D34879593
      
      fbshipit-source-id: d7631fb779ea0129beae92abfb838038e60790f8
      b7aaa987
    • S
      Use std::numeric_limits<> (#9954) · 49628c9a
      sdong committed
      Summary:
      Right now we still don't fully use std::numeric_limits and instead use a macro, mainly to support VS 2013. We now only support VS 2017 and up, so that is no longer a problem. The code comment claims that MinGW still needs it, but we don't have CI running MinGW, so that is hard to validate. Since we now require C++17, it's hard to imagine that MinGW would still build RocksDB but not support std::numeric_limits<>.
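      The replacement pattern, sketched (the constant names below are illustrative, not the exact identifiers in the codebase):

      ```cpp
      #include <cstdint>
      #include <limits>

      // Old style: a handcrafted constant/macro for pre-VS2017 toolchains.
      static const uint64_t kMaxUint64 = 0xffffffffffffffffULL;

      // New style: the standard trait, available anywhere C++17 is.
      constexpr uint64_t kMaxUint64Std = std::numeric_limits<uint64_t>::max();

      static_assert(kMaxUint64Std == 0xffffffffffffffffULL,
                    "trait and handcrafted constant agree");
      ```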
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954
      
      Test Plan: See CI Runs.
      
      Reviewed By: riversand963
      
      Differential Revision: D36173954
      
      fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
      49628c9a
  15. 05 May 2022, 3 commits
    • Y
      Rename kRemoveWithSingleDelete to kPurge (#9951) · 9d634dd5
      Yanqin Jin committed
      Summary:
      PR 9929 adds a new CompactionFilter::Decision, i.e.
      kRemoveWithSingleDelete so that CompactionFilter can indicate to
      CompactionIterator that a PUT can only be removed with SD. However, how
      CompactionIterator handles such a key is implementation detail which
      should not be implied in the public API. In fact,
      such a PUT can just be dropped. This is an optimization which we will apply in the near future.
      
      Discussion thread: https://github.com/facebook/rocksdb/pull/9929#discussion_r863198964
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9951
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D36156590
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7b7d01f47bba4cad7d9cca6ca52984f27f88b372
      9d634dd5
    • S
      Printing IO Error in DumpDBFileSummary (#9940) · 68ac507f
      sdong committed
      Summary:
      Right now, IO errors aren't printed out in DumpDBFileSummary, but they are sometimes helpful. Print them out instead.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9940
      
      Test Plan: Watch existing tests to pass.
      
      Reviewed By: riversand963
      
      Differential Revision: D36113016
      
      fbshipit-source-id: 13002080fa4dc76589e2c1c5a1079df8a3c9391c
      68ac507f
    • J
      do not call DeleteFile for not-created sst files (#9920) · 95663ff7
      jsteemann committed
      Summary:
      When a memtable is flushed and the flush would lead to a 0-byte .sst
      file being created, RocksDB does not write out the empty .sst file to
      disk.
      However, it still calls Env::DeleteFile() on the file as part of a
      cleanup procedure at the end of BuildTable().
      Because the to-be-deleted file does not exist, this requires
      implementors of the DeleteFile() API to check whether the file exists
      in their own code, or otherwise risk running into PathNotFound errors
      when DeleteFile() is invoked on non-existing files.
      This PR fixes the situation so that when no .sst file is created,
      DeleteFile() will not be called either.
      TableFileCreationStarted() will still be called as before.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9920
      
      Reviewed By: ajkr
      
      Differential Revision: D36107102
      
      Pulled By: riversand963
      
      fbshipit-source-id: 15881ba3fa3192dd448f906280a1cfc7a68a114a
      95663ff7
  16. 04 May 2022, 1 commit
    • P
      Fork and simplify LRUCache for developing enhancements (#9917) · bb87164d
      Peter Dillinger committed
      Summary:
      To support a project to prototype and evaluate algorithmic
      enhancements and alternatives to LRUCache, here I have separated out
      LRUCache into internal-only "FastLRUCache" and cut it down to
      essentials, so that details like secondary cache handling and
      priorities do not interfere with prototyping. These can be
      re-integrated later as needed, along with refactoring to minimize code
      duplication (which would slow down prototyping for now).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9917
      
      Test Plan:
      unit tests updated to ensure basic functionality has (likely)
      been preserved
      
      Reviewed By: anand1976
      
      Differential Revision: D35995554
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d67b20b7ada3b5d3bfe56d897a73885894a1d9db
      bb87164d
  17. 03 May 2022, 2 commits
    • Y
      Fix a bug of CompactionIterator/CompactionFilter using `Delete` (#9929) · 06394ff4
      Yanqin Jin committed
      Summary:
      When compaction filter determines that a key should be removed, it updates the internal key's type
      to `Delete`. If this internal key is preserved in current compaction but seen by a later compaction
      together with `SingleDelete`, it will cause compaction iterator to return Corruption.
      
      To fix the issue, compaction filter should return more information in addition to the intention of removing
      a key. Therefore, we add a new `kRemoveWithSingleDelete` to `CompactionFilter::Decision`. Seeing
      `kRemoveWithSingleDelete`, compaction iterator will update the op type of the internal key to `kTypeSingleDelete`.
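      The new decision handling can be sketched as a small mapping from filter decision to internal-key op type. The enums below are abridged stand-ins based on the names in this PR, not the full CompactionIterator logic:

      ```cpp
      // Abridged sketch: kRemove rewrites the key's op type to a Delete,
      // while the new kRemoveWithSingleDelete rewrites it to a SingleDelete
      // so that a later SingleDelete for the same user key is consistent.
      enum class Decision { kKeep, kRemove, kRemoveWithSingleDelete };
      enum class ValueType { kTypeValue, kTypeDeletion, kTypeSingleDeletion };

      ValueType ApplyFilterDecision(Decision d, ValueType current) {
        switch (d) {
          case Decision::kRemove:
            return ValueType::kTypeDeletion;
          case Decision::kRemoveWithSingleDelete:
            return ValueType::kTypeSingleDeletion;
          default:
            return current;  // kKeep: leave the key untouched
        }
      }
      ```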
      
      In addition, I updated db_stress_shared_state.[cc|h] so that `no_overwrite_ids_` becomes `const`. It is easier to
      reason about thread-safety if accessed from multiple threads. This information is passed to `PrepareTxnDBOptions()`
      when calling from `Open()` so that we can set up the rollback deletion type callback for transactions.
      
      Finally, disable compaction filter for multiops_txn because the key removal logic of `DbStressCompactionFilter` does
      not quite work with `MultiOpsTxnsStressTest`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9929
      
      Test Plan:
      make check
      make crash_test
      make crash_test_with_txn
      
      Reviewed By: anand1976
      
      Differential Revision: D36069678
      
      Pulled By: riversand963
      
      fbshipit-source-id: cedd2f1ba958af59ad3916f1ba6f424307955f92
      06394ff4
    • C
      Specify largest_seqno in VerifyChecksum (#9919) · 37f49083
      Changyu Bi committed
      Summary:
      `VerifyChecksum()` does not specify `largest_seqno` when creating a `TableReader`. As a result, the `TableReader` uses the `TableReaderOptions` default value (0) for `largest_seqno`. This causes the following error when the file has a nonzero global seqno in its properties:
      ```
      Corruption: An external sst file with version 2 have global seqno property with value , while largest seqno in the file is 0
      ```
      This PR fixes this by specifying `largest_seqno` in `VerifyChecksumInternal` with `largest_seqno` from the file metadata.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9919
      
      Test Plan: `make check`
      
      Reviewed By: ajkr
      
      Differential Revision: D36028824
      
      Pulled By: cbi42
      
      fbshipit-source-id: 428d028a79386f46ef97bb6b6051dc76c83e1f2b
      37f49083
  18. 29 April 2022, 2 commits
    • Y
      Enforce the contract of SingleDelete (#9888) · 2b5c29f9
      Yanqin Jin committed
      Summary:
      Enforce the contract of SingleDelete so that SingleDelete is not mixed with
      Delete for the same key. Otherwise, it will lead to undefined behavior.
      See https://github.com/facebook/rocksdb/wiki/Single-Delete#notes.
      
      Also fix unit tests and write-unprepared.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9888
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D35837817
      
      Pulled By: riversand963
      
      fbshipit-source-id: acd06e4dcba8cb18df92b44ed18c57e10e5a7635
      2b5c29f9
    • A
      Update protection info on recovered logs data (#9875) · aafb377b
      Anvesh Komuravelli committed
      Summary:
      Update protection info on recovered logs data
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9875
      
      Test Plan:
      - Benchmark setup: `TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576000`
      - Benchmark command: `TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=overwrite -write_buffer_size=1048576000 -writes=1 -report_open_timing=true`
      - Results before this PR
      ```
      OpenDb:     2350.14 milliseconds
      OpenDb:     2296.94 milliseconds
      OpenDb:     2184.29 milliseconds
      OpenDb:     2167.59 milliseconds
      OpenDb:     2231.24 milliseconds
      OpenDb:     2109.57 milliseconds
      OpenDb:     2197.71 milliseconds
      OpenDb:     2120.8 milliseconds
      OpenDb:     2148.12 milliseconds
      OpenDb:     2207.95 milliseconds
      ```
      - Results after this PR
      ```
      OpenDb:     2424.52 milliseconds
      OpenDb:     2359.84 milliseconds
      OpenDb:     2317.68 milliseconds
      OpenDb:     2339.4 milliseconds
      OpenDb:     2325.36 milliseconds
      OpenDb:     2321.06 milliseconds
      OpenDb:     2353.98 milliseconds
      OpenDb:     2344.64 milliseconds
      OpenDb:     2384.09 milliseconds
      OpenDb:     2428.58 milliseconds
      ```
      
      Mean regressed 7.2% (2201.4 -> 2359.9)
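      The summary numbers can be reproduced from the runs above (a quick self-check, not part of the PR):

      ```cpp
      #include <numeric>
      #include <vector>

      // OpenDb timings (ms) copied from the two result blocks above.
      const std::vector<double> kBefore{2350.14, 2296.94, 2184.29, 2167.59,
                                        2231.24, 2109.57, 2197.71, 2120.8,
                                        2148.12, 2207.95};
      const std::vector<double> kAfter{2424.52, 2359.84, 2317.68, 2339.4,
                                       2325.36, 2321.06, 2353.98, 2344.64,
                                       2384.09, 2428.58};

      double Mean(const std::vector<double>& v) {
        return std::accumulate(v.begin(), v.end(), 0.0) /
               static_cast<double>(v.size());
      }

      // (2359.9 - 2201.4) / 2201.4 * 100 ≈ 7.2%, matching the summary line.
      double RegressionPct() {
        return (Mean(kAfter) - Mean(kBefore)) / Mean(kBefore) * 100.0;
      }
      ```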
      
      Reviewed By: ajkr
      
      Differential Revision: D36012787
      
      Pulled By: akomurav
      
      fbshipit-source-id: d2aba09f29c6beb2fd0fe8e1e359be910b4ef02a
      aafb377b