1. 09 Sep 2020 (1 commit)
    • Fix compile error for old gcc-4.8 (#7358) · 8a8a01c6
      Jay Zhuang committed
      Summary:
gcc-4.8 reports an error when using the constructor. It is unclear whether this is a compiler bug/limitation or a code issue:
      ```
      table/block_based/block_based_table_reader.cc:3183:67: error: use of deleted function ‘rocksdb::WritableFileStringStreamAdapter::WritableFileStringStreamAdapter(rocksdb::WritableFileStringStreamAdapter&&)’
      ```
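
A minimal standalone reproduction of this class of error, assuming the adapter owns (or derives from) a stream type: gcc-4.8's libstdc++ predates movable streams (stream move support only landed in GCC 5), so the adapter's implicit move constructor ends up deleted:

```
#include <sstream>

// The stream member is never copyable; on gcc-4.8 it is not movable
// either, so Adapter's implicit move constructor is defined as deleted.
struct Adapter {
  std::stringstream ss_;
};

Adapter Make() {
  // Even with copy elision, C++11 requires an accessible copy or move
  // constructor here, so gcc-4.8 rejects this line with
  // "use of deleted function 'Adapter::Adapter(Adapter&&)'".
  return Adapter();
}

int main() { Adapter a = Make(); }
```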
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7358
      
      Reviewed By: pdillinger
      
      Differential Revision: D23577651
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: b0197e3d3538da61a6f3866410d88d2047fb9695
2. 05 Sep 2020 (1 commit)
3. 28 Aug 2020 (1 commit)
4. 26 Aug 2020 (1 commit)
    • Get() to fail with underlying failures in PartitionIndexReader::CacheDependencies() (#7297) · 722814e3
      sdong committed
      Summary:
Right now, all I/O failures under PartitionIndexReader::CacheDependencies() are swallowed. This doesn't impact correctness, but we have decided that any I/O error on the read path should now be returned to users for awareness. Return errors in those cases instead.
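
A hedged sketch of the behavioral change (the loop and types below are simplified stand-ins for the real partition-reading code):

```
#include <vector>

struct Status {
  bool ok_ = true;
  bool ok() const { return ok_; }
};

// Before: the status of each partition read was computed and dropped.
// After: the first failure is returned, so the enclosing Get() fails
// instead of silently proceeding without the cached partitions.
Status CacheDependencies(const std::vector<Status>& partition_reads) {
  for (const Status& s : partition_reads) {
    if (!s.ok()) {
      return s;  // previously swallowed
    }
  }
  return Status{};
}
```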
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297
      
Test Plan: Add a new unit test that injects errors in this code path and verifies that Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes were attempted but could not trigger other pread paths; it is not clear whether other failure cases are even possible. We will rely on the continuous stress test to validate this.
      
      Reviewed By: anand1976
      
      Differential Revision: D23257950
      
      fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03
5. 08 Aug 2020 (1 commit)
6. 08 Jul 2020 (1 commit)
    • Separate internal and user key comparators in `BlockIter` (#6944) · dd29ad42
      Andrew Kryczka committed
      Summary:
      Replace `BlockIter::comparator_` and `IndexBlockIter::user_comparator_wrapper_` with a concrete `UserComparatorWrapper` and `InternalKeyComparator`. The motivation for this change was the inconvenience of not knowing the concrete type of `BlockIter::comparator_`, which prevented calling specialized internal key comparison functions to optimize comparison of keys with global seqno applied.
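
A generic illustration of the devirtualization idea behind the change (the types below are stand-ins, not RocksDB's actual classes):

```
#include <cstring>

// Abstract comparator: every comparison is a virtual call, and the
// concrete type is unknown at the call site.
struct AbstractComparator {
  virtual ~AbstractComparator() = default;
  virtual int Compare(const char* a, const char* b) const = 0;
};

// Concrete comparator stored by value: calls can be resolved statically,
// inlined, and swapped for specialized variants (e.g. comparing keys
// with a global seqno applied).
struct ConcreteComparator {
  int Compare(const char* a, const char* b) const {
    return std::strcmp(a, b);
  }
};

struct IterBefore {
  const AbstractComparator* comparator_;  // virtual dispatch per key
};

struct IterAfter {
  ConcreteComparator comparator_;  // concrete type known; specializable
};
```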
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6944
      
      Test Plan:
      benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
      created by regular flush; "ingestion_db" created by ingesting a file. Both
      DBs have same contents.
      
      ```
      $ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
      $ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
      $ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
      ```
      
      benchmark run command:
      
      ```
      $ TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=$SEEK_NEXT -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=0 -threads=1 -reads=200000000 -mmap_read=1 -verify_checksum=false
      ```
      
      results: perf improved marginally for ingestion_db and did not change significantly for normal_db:
      
      SEEK_NEXT | DB | code | ops/sec | % change
      -- | -- | -- | -- | --
      0 | normal_db | master | 350880 |  
      0 | normal_db | PR6944 | 351040 | 0.0
      0 | ingestion_db | master | 343255 |  
      0 | ingestion_db | PR6944 | 349424 | 1.8
      10 | normal_db | master | 218711 |  
      10 | normal_db | PR6944 | 217892 | -0.4
      10 | ingestion_db | master | 220334 |  
      10 | ingestion_db | PR6944 | 226437 | 2.8
      
      Reviewed By: pdillinger
      
      Differential Revision: D21924676
      
      Pulled By: ajkr
      
      fbshipit-source-id: ea4288a2eefa8112eb6c651a671c1de18c12e538
7. 30 Jun 2020 (1 commit)
    • Extend Get/MultiGet deadline support to table open (#6982) · 9a5886bd
      Anand Ananthabhotla committed
      Summary:
The current implementation of the `read_options.deadline` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open.

The main changes are in `BlockBasedTable`, the partitioned index and filter readers, and `TableCache`, which now take ReadOptions as an additional parameter. In `BlockBasedTable::Open`, in order to retain existing behavior w.r.t. checksum verification and block cache usage, we filter out most of the options in `ReadOptions` except `deadline`. However, having the `ReadOptions` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future.

There are additional changes in call sites due to function signature changes in `NewTableReader()` and `FilePrefetchBuffer`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982
      
      Test Plan: Add new unit tests in db_basic_test
      
      Reviewed By: riversand963
      
      Differential Revision: D22219515
      
      Pulled By: anand1976
      
      fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b
8. 20 Jun 2020 (1 commit)
    • Fix block checksum for >=4GB, refactor (#6978) · 25a0d0ca
      Peter Dillinger committed
      Summary:
Although RocksDB falls over in various other ways with KVs
around 4GB or more, this change fixes how XXH32 and XXH64 were being
called by the block checksum code to support >= 4GB, in case that should
ever happen or in case the code is copied for other uses.
      
This change is not a schema compatibility issue because the checksum
verification code would checksum the first (block_size + 1) mod 2^32
bytes, while the checksum construction code would checksum the first
block_size mod 2^32 bytes plus the compression type byte, meaning the
XXH32/64 checksums for a >=4GB block would not match about 255/256 times.
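
A hedged sketch of computing such a checksum without any 32-bit length truncation, using xxhash's streaming API (the function shape is illustrative, not the exact RocksDB code):

```
#include <cstddef>
#include <cstdint>

#include "xxhash.h"  // https://github.com/Cyan4973/xxHash

// Checksums an arbitrarily large block followed by the trailing
// compression-type byte; the length stays size_t throughout, so blocks
// of 4GB or more are hashed in full rather than mod 2^32.
uint64_t BlockChecksum64(const char* data, size_t block_size,
                         char compression_type) {
  XXH64_state_t* state = XXH64_createState();
  XXH64_reset(state, /*seed=*/0);
  XXH64_update(state, data, block_size);
  XXH64_update(state, &compression_type, 1);
  uint64_t h = XXH64_digest(state);
  XXH64_freeState(state);
  return h;
}
```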
      
      While touching this code, I refactored to consolidate redundant
      implementations, improving diagnostics and performance tracking in some
      cases. Also used less confusing language in those diagnostics.
      
      Makes https://github.com/facebook/rocksdb/issues/6875 obsolete.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6978
      
      Test Plan:
      I was able to write a test for this using an SST file writer
      and VerifyChecksum in a reader. The test fails before the fix, though
      I'm leaving the test disabled because I don't think it's worth the
      expense of running regularly.
      
      Reviewed By: gg814
      
      Differential Revision: D22143260
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 982993d16134e8c50bea2269047f901c1783726e
9. 18 Jun 2020 (1 commit)
10. 14 Jun 2020 (1 commit)
    • Fix persistent cache on windows (#6932) · 9c24a5cb
      Zhen Li committed
      Summary:
The persistent cache feature caused a RocksDB crash on Windows. I posted an issue for it: https://github.com/facebook/rocksdb/issues/6919. I found that this is because no "persistent_cache_key_prefix" is generated for the persistent cache. Looking at the repo history, "GetUniqueIdFromFile" is not implemented on Windows. My fix adds a "NewId()" function to "persistent_cache" and uses it to generate a prefix for the persistent cache. In this PR, I also re-enable the related test cases defined in "db_test2" and "persistent_cache_test" for Windows.
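
A hedged sketch of the fix's core idea (the method name comes from the summary; the class layout is simplified): derive unique cache-key prefixes from a per-cache counter instead of GetUniqueIdFromFile, which is unimplemented on Windows.

```
#include <atomic>
#include <cstdint>

class PersistentCacheStub {
 public:
  // Monotonically increasing id, safe under concurrent callers; used to
  // build a "persistent_cache_key_prefix" for each cache instance.
  uint64_t NewId() {
    return last_id_.fetch_add(1, std::memory_order_relaxed) + 1;
  }

 private:
  std::atomic<uint64_t> last_id_{0};
};
```
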
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6932
      
      Test Plan:
1. Run the related test cases in "db_test2" and "persistent_cache_test" on Windows and see that they pass.
2. Manually run db_bench.exe with "read_cache_path" and verify.
      
      Reviewed By: riversand963
      
      Differential Revision: D21911608
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: cdfd938d54a385edbb2836b13aaa1d39b0a6f1c2
11. 10 Jun 2020 (1 commit)
12. 06 Jun 2020 (1 commit)
    • Check iterator status BlockBasedTableReader::VerifyChecksumInBlocks() (#6909) · 98b0cbea
      anand76 committed
      Summary:
The `for` loop in `VerifyChecksumInBlocks` only checks `index_iter->Valid()`, which could be `false` either due to reaching the end of the index or, in the case of a partitioned index, due to a checksum mismatch error when reading a 2nd-level index block. Instead of throwing away the index iterator status, we need to return any errors back to the caller.
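
A hedged, simplified sketch of the fix; the key point is consulting the iterator's own status once Valid() turns false:

```
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for the index iterator: Valid() can turn false either at the
// end of the index or because reading an index partition failed.
struct IndexIterStub {
  std::vector<std::string> entries;
  size_t pos = 0;
  bool io_error = false;
  bool Valid() const { return !io_error && pos < entries.size(); }
  void Next() { ++pos; }
  bool StatusOk() const { return !io_error; }
};

bool VerifyChecksumInBlocks(IndexIterStub* index_iter) {
  for (; index_iter->Valid(); index_iter->Next()) {
    // ... verify the data block this index entry points to ...
  }
  // Previously the loop simply exited here; now the iterator's own
  // status is checked, so a checksum mismatch in a 2nd-level index
  // block is reported instead of being treated as end-of-index.
  return index_iter->StatusOk();
}
```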
      
      Tests:
      Add a test in block_based_table_reader_test.cc.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6909
      
      Reviewed By: pdillinger
      
      Differential Revision: D21833922
      
      Pulled By: anand1976
      
      fbshipit-source-id: bc778ebf1121dbbdd768689de5183f07a9f0beae
13. 03 Jun 2020 (2 commits)
    • For ApproximateSizes, pro-rate table metadata size over data blocks (#6784) · 14eca6bf
      Peter Dillinger committed
      Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including them and sometimes not. This was at its worst when a large
portion of a table file was used by filters and a small range crossing a
table boundary was queried: the size estimate would include the large filter size.
      
It's conceivable that someone might want to know only the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function ApproximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.
      
So to fix this, GetApproximateSizes (and the implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of the table file metadata (incl. filters
and indexes) based on the size proportion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, the returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on the number of keys, but that's not computationally efficient with
the data available and rarely a meaningful difference.
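
A self-contained sketch of the pro-rating arithmetic described above (function and parameter names are illustrative):

```
#include <cstdint>

// If the queried range covers X% (by size) of the table's data blocks,
// report X% of the total file size, so metadata (filters, indexes) is
// attributed proportionally to the data blocks in range.
uint64_t ApproximateSizeInRange(uint64_t data_bytes_in_range,
                                uint64_t total_data_bytes,
                                uint64_t total_file_bytes) {
  if (total_data_bytes == 0) {
    return 0;
  }
  double fraction =
      static_cast<double>(data_bytes_in_range) / total_data_bytes;
  return static_cast<uint64_t>(fraction * total_file_bytes);
}
```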
      
      Also includes miscellaneous comment improvements / clarifications.
      
      Also included is a new approximatesizerandom benchmark for db_bench.
      No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784
      
      Test Plan:
      Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
      Old code running new test...
      
          [ RUN      ] DBTest.ApproximateSizesFilesWithErrorMargin
          db/db_test.cc:1562: Failure
          Expected: (size) <= (11 * 100), actual: 9478 vs 1100
      
      Other tests updated to reflect consistent accounting of metadata.
      
      Reviewed By: siying
      
      Differential Revision: D21334706
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
    • Reduce dependency on gtest dependency in release code (#6907) · 298b00a3
      sdong committed
      Summary:
Release code now depends on gtest, indirectly through including "test_util/testharness.h". This creates multiple problems. One important problem is the definition of IGNORE_STATUS_IF_ERROR() in test_util/testharness.h; move it to sync_point.h instead.
Note that utilities/cassandra/format.h still depends on "test_util/testharness.h". This will be resolved in a separate diff.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6907
      
      Test Plan: Run all existing tests.
      
      Reviewed By: ajkr
      
      Differential Revision: D21829884
      
      fbshipit-source-id: 9253c19ffde2936f3ae68998210f8e54f645a6e6
14. 02 Jun 2020 (2 commits)
    • Avoid unnecessary reads of uncompression dictionary in MultiGet (#6906) · 66942e81
      anand76 committed
      Summary:
We may sometimes read the uncompression dictionary when it is not
necessary: when we look up a key in an SST file but the index indicates
the key is not present. This can happen with index_type 3.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6906
      
      Test Plan: make check
      
      Reviewed By: cheng-chang
      
      Differential Revision: D21828944
      
      Pulled By: anand1976
      
      fbshipit-source-id: 7aef4f0a39548d0874eafefd2687006d2652f9bb
    • Explicitly free allocated buffer when status is not ok (#6903) · bcb9e410
      Cheng Chang committed
      Summary:
Currently we rely on `BlockContents` to implicitly free the allocated scratch buffer, but when an IO error happens, it doesn't make sense to construct a `BlockContents` that might be corrupted. In the stress test, we found that `assert(req.result.size() == block_size(handle));` fails because of potential IO errors.

In this PR, we explicitly free the scratch buffer on error without constructing `BlockContents`.
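
A hedged sketch of the ownership change (the read call and names are simplified stand-ins):

```
// Before: `scratch` was always handed to BlockContents, whose destructor
// freed it, even when the read failed and the bytes were garbage.
// After: on error the buffer is released directly and no BlockContents
// is constructed.
char* scratch = new char[total_read_len];
Status s = DoMultiRead(reqs, scratch);  // hypothetical helper
if (!s.ok()) {
  delete[] scratch;  // explicit free on the error path
  return s;
}
// Only on success does a BlockContents take ownership of the buffer.
```
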
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6903
      
      Test Plan: watch stress test
      
      Reviewed By: anand1976
      
      Differential Revision: D21823869
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: 5603fc80e9bf3f44a9d7250ddebd871afe1eb89f
15. 27 May 2020 (1 commit)
16. 15 May 2020 (1 commit)
    • Enable IO Uring in MultiGet in direct IO mode (#6815) · 91b75532
      Cheng Chang committed
      Summary:
Currently, in direct IO mode, `MultiGet` retrieves the data blocks one by one instead of in parallel; see `BlockBasedTable::RetrieveMultipleBlocks`.

Since direct IO is supported in `RandomAccessFileReader::MultiRead` in https://github.com/facebook/rocksdb/pull/6446, this PR applies `MultiRead` to `MultiGet` so that the data blocks can be retrieved in parallel.

Also, in direct IO mode, when data blocks are compressed and need to be uncompressed, this PR allocates only one contiguous aligned buffer to hold the data blocks, then uncompresses the blocks directly for insertion into the block cache; there are no longer intermediate copies to scratch buffers.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6815
      
      Test Plan:
      1. added a new unit test `BlockBasedTableReaderTest::MultiGet`.
      2. existing unit tests and stress tests  contain tests against `MultiGet` in direct IO mode.
      
      Reviewed By: anand1976
      
      Differential Revision: D21426347
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: b8446ae0e74152444ef9111e97f8e402ac31b24f
17. 13 May 2020 (1 commit)
    • sst_dump to reduce number of file reads (#6836) · 4a4b8a13
      sdong committed
      Summary:
sst_dump can issue many file reads from the file system. This doesn't work well with file systems without an OS cache, especially remote file systems. In order to mitigate this problem, several improvements are made:
1. --readahead_size is added, so that users can specify the readahead size when scanning the data.
2. Force a 512KB tail readahead, which prevents three I/Os for the footer, meta index and property blocks, and hopefully the index and filter blocks too.
3. Consolidate sst_dump's I/Os before opening the file for read. Use the same file prefetch buffer.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6836
      
      Test Plan: Add a test that covers this new feature.
      
      Reviewed By: pdillinger
      
      Differential Revision: D21516607
      
      fbshipit-source-id: 3ae43526286f67b2f4a5bdedfbc92719d579b87e
18. 08 May 2020 (1 commit)
    • Fix false NotFound from batched MultiGet with kHashSearch (#6821) · b27a1448
      Peter Dillinger committed
      Summary:
The error was assigning NotFound status to KeyContext::s in a
table reader for a "not found in this table" case, which skips searching
in later tables, as only a delete should. (The hash search index iterator
is the only one that can return status NotFound even if Valid() == false.)
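
A hedged sketch of the resulting logic (names simplified; the real control flow differs):

```
// NotFound from the kHashSearch index iterator merely means "not in
// this table". Recording it as the key's status would wrongly end the
// search before later tables and levels are consulted; only genuine
// errors should be recorded.
Status index_status = index_iter->status();
if (!index_status.ok() && !index_status.IsNotFound()) {
  key_context->s = index_status;  // real I/O or corruption error
}
// else: leave key_context->s untouched so the lookup continues
```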
      
      This was detected by intermittent failure in
      MultiThreadedDBTest.MultiThreaded/5, a kHashSearch configuration.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6821
      
      Test Plan: modified existing unit test to reproduce problem
      
      Reviewed By: anand1976
      
      Differential Revision: D21450469
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7478003684d637dbd491cdac81468041a791be2c
19. 01 May 2020 (1 commit)
    • Pass a timeout to FileSystem for random reads (#6751) · ab13d43e
      anand76 committed
      Summary:
Calculate `IOOptions::timeout` using `ReadOptions::deadline` and pass it to `FileSystem::Read`/`FileSystem::MultiRead`. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystems/Envs that support IO timeouts. Even on those that don't, check in `RandomAccessFileReader::Read` and `MultiRead` and return `Status::TimedOut()` if the deadline is exceeded.

For now, TableReader creation, which might do file opens and reads, is not covered. It will be implemented in another PR.
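
A self-contained sketch of deriving the IO timeout from the read deadline (the type shapes are assumptions based on the summary):

```
#include <chrono>

struct IOOptionsStub {
  std::chrono::microseconds timeout{0};
};

// Returns false if the deadline has already passed, in which case the
// caller would return Status::TimedOut() without issuing the read.
bool ComputeTimeout(std::chrono::microseconds deadline,
                    std::chrono::microseconds now,
                    IOOptionsStub* opts) {
  if (deadline.count() == 0) {
    return true;  // no deadline configured; leave timeout unset
  }
  if (now >= deadline) {
    return false;  // deadline exceeded before the IO was issued
  }
  opts->timeout = deadline - now;  // tighter bound passed to FileSystem
  return true;
}
```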
      
      Tests:
      Update existing unit tests to verify the correct timeout value is being passed
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751
      
      Reviewed By: riversand963
      
      Differential Revision: D21285631
      
      Pulled By: anand1976
      
      fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567
20. 29 Apr 2020 (1 commit)
    • Basic MultiGet support for partitioned filters (#6757) · bae6f586
      Peter Dillinger committed
      Summary:
      In MultiGet, access each applicable filter partition only once
      per batch, rather than for each applicable key. Also,
      
* Fix Bloom stats for MultiGet
* Fix/refactor MultiGetContext::Range::KeysLeft, including:
  * Add an efficient BitsSetToOne implementation (see the sketch below)
* Assert that MultiGetContext::Range does not go beyond shift range
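
A hedged sketch of an efficient population-count helper of the kind named above (the implementation shown, via compiler intrinsics, is an assumption):

```
#include <cstdint>

#ifdef _MSC_VER
#include <intrin.h>
#endif

// Counts the 1-bits in v; compiles down to a single POPCNT instruction
// on modern x86-64. Useful for counting the keys still pending in a
// MultiGet range represented as a bitmask.
inline int BitsSetToOne(uint64_t v) {
#ifdef _MSC_VER
  return static_cast<int>(__popcnt64(v));
#else
  return __builtin_popcountll(v);
#endif
}
```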
      
      Performance test: Generate db:
      
          $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true
          ...
      
      Before (middle performing run of three; note some missing Bloom stats):
      
          $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
          multireadrandom :      26.403 micros/op 597517 ops/sec; (548427 of 671968 found)
          rocksdb.block.cache.filter.hit COUNT : 83443275
          rocksdb.bloom.filter.useful COUNT : 0
          rocksdb.bloom.filter.full.positive COUNT : 0
          rocksdb.bloom.filter.full.true.positive COUNT : 7931450
          rocksdb.number.multiget.get COUNT : 385984
          rocksdb.number.multiget.keys.read COUNT : 12351488
          rocksdb.number.multiget.bytes.read COUNT : 793145000
          rocksdb.number.multiget.keys.found COUNT : 7931450
      
      After (middle performing run of three):
      
          $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
          multireadrandom :      21.024 micros/op 752963 ops/sec; (705188 of 863968 found)
          rocksdb.block.cache.filter.hit COUNT : 49856682
          rocksdb.bloom.filter.useful COUNT : 45684579
          rocksdb.bloom.filter.full.positive COUNT : 10395458
          rocksdb.bloom.filter.full.true.positive COUNT : 9908456
          rocksdb.number.multiget.get COUNT : 481984
          rocksdb.number.multiget.keys.read COUNT : 15423488
          rocksdb.number.multiget.bytes.read COUNT : 990845600
          rocksdb.number.multiget.keys.found COUNT : 9908456
      
So that's about 25% higher throughput even for random keys.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757
      
      Test Plan: unit test included
      
      Reviewed By: anand1976
      
      Differential Revision: D21243256
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
21. 28 Apr 2020 (1 commit)
    • Stats for redundant insertions into block cache (#6681) · 249eff0f
      Peter Dillinger committed
      Summary:
      Since read threads do not coordinate on loading data into block
      cache, two threads between Lookup and Insert can end up loading and
      inserting the same data. This is particularly concerning with
      cache_index_and_filter_blocks since those are hot and more likely to
      be race targets if ejected from (or not pre-populated in) the cache.
      
      Particularly with moves toward disaggregated / network storage, the cost
      of redundant retrieval might be high, and we should at least have some
      hard statistics from which we can estimate impact.
      
      Example with full filter thrashing "cliff":
      
          $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10
          ...
          $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
          rocksdb.block.cache.add COUNT : 14181
          rocksdb.block.cache.add.failures COUNT : 0
          rocksdb.block.cache.add.redundant COUNT : 476
          rocksdb.block.cache.data.add COUNT : 12749
          rocksdb.block.cache.data.add.redundant COUNT : 18
          rocksdb.block.cache.filter.add COUNT : 1003
          rocksdb.block.cache.filter.add.redundant COUNT : 217
          rocksdb.block.cache.index.add COUNT : 429
          rocksdb.block.cache.index.add.redundant COUNT : 241
          $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
          rocksdb.block.cache.add COUNT : 1182223
          rocksdb.block.cache.add.failures COUNT : 0
          rocksdb.block.cache.add.redundant COUNT : 302728
          rocksdb.block.cache.data.add COUNT : 31425
          rocksdb.block.cache.data.add.redundant COUNT : 12
          rocksdb.block.cache.filter.add COUNT : 795455
          rocksdb.block.cache.filter.add.redundant COUNT : 130238
          rocksdb.block.cache.index.add COUNT : 355343
          rocksdb.block.cache.index.add.redundant COUNT : 172478
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681
      
Test Plan: Some manual testing (above); a unit test covering key metrics is included
      
      Reviewed By: ltamasi
      
      Differential Revision: D21134113
      
      Pulled By: pdillinger
      
      fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87
22. 25 Apr 2020 (1 commit)
23. 24 Apr 2020 (1 commit)
24. 22 Apr 2020 (1 commit)
25. 21 Apr 2020 (1 commit)
    • C++20 compatibility (#6697) · 31da5e34
      Peter Dillinger committed
      Summary:
Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:

* Implicit capture of this via [=] is deprecated in C++20, and [=,this] is not standard before C++20 -> now using explicit capture lists (see the sketch below)
* Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
* std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in the RocksDB random.h API
* Add the ability to build with different std versions through -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
* The minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
* Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
* Added GCC 9 C++11 & GCC 9 C++20 in Travis
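
A small standalone illustration of the capture-list item above (example code, not from the PR):

```
#include <functional>

struct Example {
  int member_ = 0;

  std::function<int()> MakeGetter() {
    // [=] captures `this` implicitly -- deprecated in C++20.
    // [=, this] is not standard before C++20.
    // An explicit capture list is valid across C++11 through C++20:
    return [this]() { return member_; };
  }
};
```
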
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697
      
      Test Plan: make check and CI
      
      Reviewed By: cheng-chang
      
      Differential Revision: D21020318
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
26. 16 Apr 2020 (1 commit)
    • Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621) · e45673de
      Mike Kolupaev committed
      Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for the sst file iterator to sometimes report a key from the index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report an error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and the kBinarySearchWithFirstKey implementation was considered a prototype.
      
      Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
      
      It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
      
      Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
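
A hedged sketch of the resulting call pattern (iterator creation is elided; `allow_unprepared_value` and `PrepareValue()` are the names given in the summary):

```
// With allow_unprepared_value=true, Seek()/Next() may position the
// iterator using only the index's first key, deferring the data-block
// read; value() is legal only after a successful PrepareValue().
for (iter->Seek(target); iter->Valid(); iter->Next()) {
  if (!iter->PrepareValue()) {
    // The deferred block load failed (IO error / Incomplete);
    // iter->status() carries the details instead of an empty Slice.
    break;
  }
  Slice value = iter->value();  // safe: the value is now loaded
  // ... use value ...
}
```
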
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
      
      Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
      
      Reviewed By: siying
      
      Differential Revision: D20786930
      
      Pulled By: al13n321
      
      fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
27. 11 Apr 2020 (1 commit)
    • Fault injection in db_stress (#6538) · 5c19a441
      anand76 committed
      Summary:
This PR implements a fault injection mechanism for injecting errors in reads in db_stress. The FaultInjectionTestFS is used for this purpose. A thread-local structure is used to track the errors, so that each db_stress thread can independently enable/disable error injection and verify observed errors against expected errors. This is initially enabled only for Get and MultiGet, but can be extended to iterators as well once it's proven stable.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6538
      
      Test Plan:
      crash_test
      make check
      
      Reviewed By: riversand963
      
      Differential Revision: D20714347
      
      Pulled By: anand1976
      
      fbshipit-source-id: d7598321d4a2d72bda0ced57411a337a91d87dc7
28. 01 Apr 2020 (1 commit)
29. 27 Mar 2020 (2 commits)
    • Use function objects as deleters in the block cache (#6545) · 6301dbe7
      Levi Tamasi committed
      Summary:
      As the first step of reintroducing eviction statistics for the block
      cache, the patch switches from using simple function pointers as deleters
      to function objects implementing an interface. This will enable using
      deleters that have state, like a smart pointer to the statistics object
      that is to be updated when an entry is removed from the cache. For now,
      the patch adds a deleter template class `SimpleDeleter`, which simply
      casts the `value` pointer to its original type and calls `delete` or
      `delete[]` on it as appropriate. Note: to prevent object lifecycle
      issues, deleters must outlive the cache entries referring to them;
      `SimpleDeleter` ensures this by using the ("leaky") Meyers singleton
      pattern.
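
A hedged sketch of the deleter interface and the `SimpleDeleter` singleton described above (simplified shapes, not the exact RocksDB declarations):

```
// Function-object deleter: unlike a raw function pointer, an
// implementation can carry state (e.g. a pointer to a statistics
// object to update on eviction).
class Deleter {
 public:
  virtual ~Deleter() = default;
  virtual void operator()(void* value) const = 0;
};

template <typename T>
class SimpleDeleter : public Deleter {
 public:
  // Meyers singleton: one deleter instance per type, alive for the
  // whole program, guaranteeing it outlives any cache entry that
  // refers to it.
  static const SimpleDeleter& GetInstance() {
    static const SimpleDeleter instance;
    return instance;
  }

  void operator()(void* value) const override {
    delete static_cast<T*>(value);  // cast back to the original type
  }

 private:
  SimpleDeleter() = default;
};
```
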
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6545
      
      Test Plan: `make asan_check`
      
      Reviewed By: siying
      
      Differential Revision: D20475823
      
      Pulled By: ltamasi
      
      fbshipit-source-id: fe354c33dd96d9bafc094605462352305449a22a
    • Fix iterator reading filter block despite read_tier == kBlockCacheTier (#6562) · 963af52f
      Mike Kolupaev committed
      Summary:
      We're seeing iterators with `ReadOptions::read_tier == kBlockCacheTier` sometimes doing file reads. Stack trace:
      
      ```
      rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*, bool) const
      rocksdb::BlockFetcher::ReadBlockContents()
      rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const
      rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool) const
      rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*)
      rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*) const
      rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*) const
      rocksdb::FullFilterBlockReader::RangeMayExist(rocksdb::Slice const*, rocksdb::Slice const&, rocksdb::SliceTransform const*, rocksdb::Comparator const*, rocksdb::Slice const*, bool*, bool, rocksdb::BlockCacheLookupContext*)
      rocksdb::BlockBasedTable::PrefixMayMatch(rocksdb::Slice const&, rocksdb::ReadOptions const&, rocksdb::SliceTransform const*, bool, rocksdb::BlockCacheLookupContext*) const
      rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::SeekImpl(rocksdb::Slice const*)
      rocksdb::ForwardIterator::SeekInternal(rocksdb::Slice const&, bool)
      rocksdb::DBIter::Seek(rocksdb::Slice const&)
      ```
      
      `BlockBasedTableIterator::CheckPrefixMayMatch` was missing a check for `kBlockCacheTier`. This PR adds it.
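
A hedged sketch of where such a check plausibly sits (member names are stand-ins; the exact placement in the tree may differ):

```
// Probing the prefix filter can trigger a filter-block read from the
// file when the block is not cached. Under kBlockCacheTier that IO is
// forbidden, so conservatively treat the prefix as a possible match.
bool CheckPrefixMayMatch(const Slice& ikey) {
  if (!check_filter_ || read_options_.read_tier == kBlockCacheTier) {
    return true;  // may match; no filter block is read
  }
  return table_->PrefixMayMatch(ikey, read_options_, prefix_extractor_,
                                need_upper_bound_check_, &lookup_context_);
}
```
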
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6562
      
      Test Plan: deployed it to a logdevice test cluster and looked at logdevice's IO tracing.
      
      Reviewed By: siying
      
      Differential Revision: D20529368
      
      Pulled By: al13n321
      
      fbshipit-source-id: 65bf33964b1951464415c900336635fb20919611
30. 21 Mar 2020 (1 commit)
    • Support direct IO in RandomAccessFileReader::MultiRead (#6446) · 4fc21664
      Cheng Chang committed
      Summary:
By supporting direct IO in RandomAccessFileReader::MultiRead, the benefits of parallel IO (IO uring) and direct IO can be combined.

In direct IO mode, read requests are aligned and merged together before being issued to RandomAccessFile::MultiRead, so blocks in the original requests might share the same underlying buffer; the shared buffers are returned in `aligned_bufs`, which is a new parameter of the `MultiRead` API.

For example, suppose the alignment requirement for direct IO is 4KB. If one request is (offset: 1KB, len: 1KB) and another request is (offset: 3KB, len: 1KB), then since they both belong to the page (offset: 0, len: 4KB), `MultiRead` reads only that page with direct IO into a heap buffer and returns 2 Slices referencing regions in that same buffer. See `random_access_file_reader_test.cc` for more examples.
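
A self-contained sketch of the alignment arithmetic behind the 4KB example above:

```
#include <cstdint>

constexpr uint64_t kAlignment = 4096;  // direct-IO alignment requirement

uint64_t AlignDown(uint64_t offset) { return offset & ~(kAlignment - 1); }
uint64_t AlignUp(uint64_t offset) {
  return (offset + kAlignment - 1) & ~(kAlignment - 1);
}

// Request A (offset 1KB, len 1KB) spans the aligned range [0, 4096);
// request B (offset 3KB, len 1KB) spans [0, 4096) as well, so the two
// are merged into one direct-IO read of that page, and each result
// Slice points into the shared buffer at offsets 1024 and 3072.
bool CanMerge(uint64_t off_a, uint64_t len_a, uint64_t off_b) {
  // Mergeable if B starts within (or at the edge of) A's aligned range.
  return AlignDown(off_b) < AlignUp(off_a + len_a);
}
```
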
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6446
      
      Test Plan: Added a new test `random_access_file_reader_test.cc`.
      
      Reviewed By: anand1976
      
      Differential Revision: D20097518
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: ca48a8faf9c3af146465c102ef6b266a363e78d1
31. 17 Mar 2020 (1 commit)
    • De-template block based table iterator (#6531) · d6690809
      sdong committed
      Summary:
Right now the block based table iterator is used both for iterating over data in a block based table and as the index iterator for the partitioned index. This was initially convenient for introducing a new iterator and block type for the new index format while reducing code changes. However, these two usages don't fit together very well. For example, Prev() is never called for the partitioned index iterator, and some other complexity maintained in block based iterators is not needed for the index iterator, yet maintainers always need to reason about it. Furthermore, the template usage does not follow the Google C++ Style Guide, which we otherwise follow, and it makes a large chunk of code tangled together. This commit separates the two iterators. Here is what is done:
1. Copy the block based iterator code into the partitioned index iterator, and de-template them.
2. Remove some code not needed for the partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when the partitioned index was enabled in the first place. A performance regression is unlikely, as new partitioned index blocks are created much more rarely than data blocks.
3. Separate out the prefetch logic into a helper class that both classes call.
      
This commit will enable future follow-ups. One direction is that we might separate the index iterator interfaces for data blocks and index blocks, as they are quite different.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
      
Test Plan: Build using make and cmake. Also build release.
      
      Differential Revision: D20473108
      
      fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
32. 13 Mar 2020 (1 commit)
    • Divide block_based_table_reader.cc (#6527) · 674cf417
      sdong committed
      Summary:
block_based_table_reader.cc is a giant file, which makes it hard for users to navigate the code. Divide it into multiple files.
Some class templates cannot be moved to a .cc file; they are moved to .h files instead. That is still better than including them all in block_based_table_reader.cc.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6527
      
      Test Plan: "make all check" and "make release". Also build using cmake.
      
      Differential Revision: D20428455
      
      fbshipit-source-id: ca713c698469f07f35bc0c271358c0874ed4eb28
33. 07 Mar 2020 (1 commit)
    • Iterator with timestamp (#6255) · d93812c9
      Yanqin Jin committed
      Summary:
Preliminary support for iterators with user timestamp. The current implementation does not consider the merge operator or reverse iteration. Auto compaction is also disabled in unit tests.
      
      Create an iterator with timestamp.
      ```
      ...
      read_opts.timestamp = &ts;
      auto* iter = db->NewIterator(read_opts);
      // target is key without timestamp.
      for (iter->Seek(target); iter->Valid(); iter->Next()) {}
      for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
      delete iter;
      read_opts.timestamp = &ts1;
      // lower_bound and upper_bound are without timestamp.
      read_opts.iterate_lower_bound = &lower_bound;
      read_opts.iterate_upper_bound = &upper_bound;
      auto* iter1 = db->NewIterator(read_opts);
      // Do Seek or SeekToFirst()
      delete iter1;
      ```
      
      Test plan (dev server)
      ```
      $make check
      ```
      
      Simple benchmarking (dev server)
      1. The overhead introduced by this PR even when timestamp is disabled.
      key size: 16 bytes
      value size: 100 bytes
      Entries: 1000000
      Data reside in main memory, and try to stress iterator.
      Repeated three times on master and this PR.
      - Seek without next
      ```
      ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
      ```
master: 159047.0 ops/sec
this PR: 158922.3 ops/sec (~0.1% drop in throughput)
      - Seek and next 10 times
      ```
      ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
      ```
      master: 109539.3 ops/sec
      this PR: 107519.7 ops/sec (2% drop in throughput)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255
      
      Differential Revision: D19438227
      
      Pulled By: riversand963
      
      fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
34. 26 Feb 2020 (1 commit)
    • Fix range deletion tombstone ingestion with global seqno (#6429) · 69679e73
      Andrew Kryczka committed
      Summary:
      Original author: jeffrey-xiao
      
      If we are writing a global seqno for an ingested file, the range
      tombstone metablock gets accessed and put into the cache during
      ingestion preparation. At the time, the global seqno of the ingested
      file has not yet been determined, so the cached block will not have a
      global seqno. When the file is ingested and we read its range tombstone
      metablock, it will be returned from the cache with no global seqno. In
      that case, we use the actual seqnos stored in the range tombstones,
      which are all zero, so the tombstones cover nothing.
      
This commit removes the global_seqno_ variable from Block. When iterating
over a block, the global seqno for the block is determined by the
iterator instead of being stored as a mutable attribute of Block.
Additionally, this commit adds a regression test to check that keys are
deleted when ingesting a file with a global seqno and range deletion
tombstones.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429
      
      Differential Revision: D19961563
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7
35. 22 Feb 2020 (1 commit)
36. 21 Feb 2020 (1 commit)
    • Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      sdong committed
      Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To give users a tool to solve the problem, the RocksDB namespace is changed to a macro that can be overridden at build time.
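
A sketch of the mechanism (the macro name is from the summary; the default-and-override pattern shown is the standard one):

```
// The namespace becomes a macro with a default value, overridable at
// build time, e.g. -DROCKSDB_NAMESPACE=my_vendored_rocksdb, so two
// differently-built copies of RocksDB no longer collide when linked.
#ifndef ROCKSDB_NAMESPACE
#define ROCKSDB_NAMESPACE rocksdb
#endif

namespace ROCKSDB_NAMESPACE {
// ... all RocksDB declarations live here ...
}  // namespace ROCKSDB_NAMESPACE
```
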
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE overridden to another value.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
      fdf882de
37. 12 Feb 2020 (1 commit)