1. 28 4月, 2020 1 次提交
    • P
      Stats for redundant insertions into block cache (#6681) · 249eff0f
      Peter Dillinger 提交于
      Summary:
      Since read threads do not coordinate on loading data into block
      cache, two threads between Lookup and Insert can end up loading and
      inserting the same data. This is particularly concerning with
      cache_index_and_filter_blocks since those are hot and more likely to
      be race targets if ejected from (or not pre-populated in) the cache.
      
      Particularly with moves toward disaggregated / network storage, the cost
      of redundant retrieval might be high, and we should at least have some
      hard statistics from which we can estimate impact.
      
      Example with full filter thrashing "cliff":
      
          $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10
          ...
          $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
          rocksdb.block.cache.add COUNT : 14181
          rocksdb.block.cache.add.failures COUNT : 0
          rocksdb.block.cache.add.redundant COUNT : 476
          rocksdb.block.cache.data.add COUNT : 12749
          rocksdb.block.cache.data.add.redundant COUNT : 18
          rocksdb.block.cache.filter.add COUNT : 1003
          rocksdb.block.cache.filter.add.redundant COUNT : 217
          rocksdb.block.cache.index.add COUNT : 429
          rocksdb.block.cache.index.add.redundant COUNT : 241
          $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
          rocksdb.block.cache.add COUNT : 1182223
          rocksdb.block.cache.add.failures COUNT : 0
          rocksdb.block.cache.add.redundant COUNT : 302728
          rocksdb.block.cache.data.add COUNT : 31425
          rocksdb.block.cache.data.add.redundant COUNT : 12
          rocksdb.block.cache.filter.add COUNT : 795455
          rocksdb.block.cache.filter.add.redundant COUNT : 130238
          rocksdb.block.cache.index.add COUNT : 355343
          rocksdb.block.cache.index.add.redundant COUNT : 172478
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681
      
      Test Plan: Some manual testing (above) and unit test covering key metrics is included
      
      Reviewed By: ltamasi
      
      Differential Revision: D21134113
      
      Pulled By: pdillinger
      
      fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87
      249eff0f
  2. 25 4月, 2020 1 次提交
  3. 24 4月, 2020 1 次提交
  4. 22 4月, 2020 1 次提交
  5. 21 4月, 2020 1 次提交
    • P
      C++20 compatibility (#6697) · 31da5e34
      Peter Dillinger 提交于
      Summary:
      Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:
      
      * Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists
      * Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
      * std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API
      * Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
      * Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
      * Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
      * Added GCC 9 C++11 & GCC9 C++20 in Travis
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697
      
      Test Plan: make check and CI
      
      Reviewed By: cheng-chang
      
      Differential Revision: D21020318
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
      31da5e34
  6. 16 4月, 2020 1 次提交
    • M
      Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621) · e45673de
      Mike Kolupaev 提交于
      Summary:
      Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
      
      Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
      
      It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
      
      Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
      
      Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
      
      Reviewed By: siying
      
      Differential Revision: D20786930
      
      Pulled By: al13n321
      
      fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
      e45673de
  7. 11 4月, 2020 1 次提交
    • A
      Fault injection in db_stress (#6538) · 5c19a441
      anand76 提交于
      Summary:
      This PR implements a fault injection mechanism for injecting errors in reads in db_stress. The FaultInjectionTestFS is used for this purpose. A thread local structure is used to track the errors, so that each db_stress thread can independently enable/disable error injection and verify observed errors against expected errors. This is initially enabled only for Get and MultiGet, but can be extended to iterator as well once its proven stable.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6538
      
      Test Plan:
      crash_test
      make check
      
      Reviewed By: riversand963
      
      Differential Revision: D20714347
      
      Pulled By: anand1976
      
      fbshipit-source-id: d7598321d4a2d72bda0ced57411a337a91d87dc7
      5c19a441
  8. 01 4月, 2020 1 次提交
  9. 27 3月, 2020 2 次提交
    • L
      Use function objects as deleters in the block cache (#6545) · 6301dbe7
      Levi Tamasi 提交于
      Summary:
      As the first step of reintroducing eviction statistics for the block
      cache, the patch switches from using simple function pointers as deleters
      to function objects implementing an interface. This will enable using
      deleters that have state, like a smart pointer to the statistics object
      that is to be updated when an entry is removed from the cache. For now,
      the patch adds a deleter template class `SimpleDeleter`, which simply
      casts the `value` pointer to its original type and calls `delete` or
      `delete[]` on it as appropriate. Note: to prevent object lifecycle
      issues, deleters must outlive the cache entries referring to them;
      `SimpleDeleter` ensures this by using the ("leaky") Meyers singleton
      pattern.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6545
      
      Test Plan: `make asan_check`
      
      Reviewed By: siying
      
      Differential Revision: D20475823
      
      Pulled By: ltamasi
      
      fbshipit-source-id: fe354c33dd96d9bafc094605462352305449a22a
      6301dbe7
    • M
      Fix iterator reading filter block despite read_tier == kBlockCacheTier (#6562) · 963af52f
      Mike Kolupaev 提交于
      Summary:
      We're seeing iterators with `ReadOptions::read_tier == kBlockCacheTier` sometimes doing file reads. Stack trace:
      
      ```
      rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*, bool) const
      rocksdb::BlockFetcher::ReadBlockContents()
      rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const
      rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool) const
      rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*)
      rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*) const
      rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*) const
      rocksdb::FullFilterBlockReader::RangeMayExist(rocksdb::Slice const*, rocksdb::Slice const&, rocksdb::SliceTransform const*, rocksdb::Comparator const*, rocksdb::Slice const*, bool*, bool, rocksdb::BlockCacheLookupContext*)
      rocksdb::BlockBasedTable::PrefixMayMatch(rocksdb::Slice const&, rocksdb::ReadOptions const&, rocksdb::SliceTransform const*, bool, rocksdb::BlockCacheLookupContext*) const
      rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::SeekImpl(rocksdb::Slice const*)
      rocksdb::ForwardIterator::SeekInternal(rocksdb::Slice const&, bool)
      rocksdb::DBIter::Seek(rocksdb::Slice const&)
      ```
      
      `BlockBasedTableIterator::CheckPrefixMayMatch` was missing a check for `kBlockCacheTier`. This PR adds it.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6562
      
      Test Plan: deployed it to a logdevice test cluster and looked at logdevice's IO tracing.
      
      Reviewed By: siying
      
      Differential Revision: D20529368
      
      Pulled By: al13n321
      
      fbshipit-source-id: 65bf33964b1951464415c900336635fb20919611
      963af52f
  10. 21 3月, 2020 1 次提交
    • C
      Support direct IO in RandomAccessFileReader::MultiRead (#6446) · 4fc21664
      Cheng Chang 提交于
      Summary:
      By supporting direct IO in RandomAccessFileReader::MultiRead, the benefits of parallel IO (IO uring) and direct IO can be combined.
      
      In direct IO mode, read requests are aligned and merged together before being issued to RandomAccessFile::MultiRead, so blocks in the original requests might share the same underlying buffer, the shared buffers are returned in `aligned_bufs`, which is a new parameter of the `MultiRead` API.
      
      For example, suppose alignment requirement for direct IO is 4KB, one request is (offset: 1KB, len: 1KB), another request is (offset: 3KB, len: 1KB), then since they all belong to page (offset: 0, len: 4KB), `MultiRead` only reads the page with direct IO into a buffer on heap, and returns 2 Slices referencing regions in that same buffer. See `random_access_file_reader_test.cc` for more examples.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6446
      
      Test Plan: Added a new test `random_access_file_reader_test.cc`.
      
      Reviewed By: anand1976
      
      Differential Revision: D20097518
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: ca48a8faf9c3af146465c102ef6b266a363e78d1
      4fc21664
  11. 17 3月, 2020 1 次提交
    • S
      De-template block based table iterator (#6531) · d6690809
      sdong 提交于
      Summary:
      Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
      1. Copy the block based iterator code into partitioned index iterator, and de-template them.
      2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
      3. Separate out the prefetch logic to a helper class and both classes call them.
      
      This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
      
      Test Plan: build using make and cmake. And build release
      
      Differential Revision: D20473108
      
      fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
      d6690809
  12. 13 3月, 2020 1 次提交
    • S
      Divide block_based_table_reader.cc (#6527) · 674cf417
      sdong 提交于
      Summary:
      block_based_table_reader.cc is a giant file, which makes it hard for users to navigate the code. Divide the files to multiple files.
      Some class templates cannot be moved to .cc file. They are moved to .h files. It is still better than including them all in block_based_table_reader.cc.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6527
      
      Test Plan: "make all check" and "make release". Also build using cmake.
      
      Differential Revision: D20428455
      
      fbshipit-source-id: ca713c698469f07f35bc0c271358c0874ed4eb28
      674cf417
  13. 07 3月, 2020 1 次提交
    • Y
      Iterator with timestamp (#6255) · d93812c9
      Yanqin Jin 提交于
      Summary:
      Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.
      
      Create an iterator with timestamp.
      ```
      ...
      read_opts.timestamp = &ts;
      auto* iter = db->NewIterator(read_opts);
      // target is key without timestamp.
      for (iter->Seek(target); iter->Valid(); iter->Next()) {}
      for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
      delete iter;
      read_opts.timestamp = &ts1;
      // lower_bound and upper_bound are without timestamp.
      read_opts.iterate_lower_bound = &lower_bound;
      read_opts.iterate_upper_bound = &upper_bound;
      auto* iter1 = db->NewIterator(read_opts);
      // Do Seek or SeekToFirst()
      delete iter1;
      ```
      
      Test plan (dev server)
      ```
      $make check
      ```
      
      Simple benchmarking (dev server)
      1. The overhead introduced by this PR even when timestamp is disabled.
      key size: 16 bytes
      value size: 100 bytes
      Entries: 1000000
      Data reside in main memory, and try to stress iterator.
      Repeated three times on master and this PR.
      - Seek without next
      ```
      ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
      ```
      master: 159047.0 ops/sec
      this PR: 158922.3 ops/sec (2% drop in throughput)
      - Seek and next 10 times
      ```
      ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
      ```
      master: 109539.3 ops/sec
      this PR: 107519.7 ops/sec (2% drop in throughput)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255
      
      Differential Revision: D19438227
      
      Pulled By: riversand963
      
      fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
      d93812c9
  14. 26 2月, 2020 1 次提交
    • A
      Fix range deletion tombstone ingestion with global seqno (#6429) · 69679e73
      Andrew Kryczka 提交于
      Summary:
      Original author: jeffrey-xiao
      
      If we are writing a global seqno for an ingested file, the range
      tombstone metablock gets accessed and put into the cache during
      ingestion preparation. At the time, the global seqno of the ingested
      file has not yet been determined, so the cached block will not have a
      global seqno. When the file is ingested and we read its range tombstone
      metablock, it will be returned from the cache with no global seqno. In
      that case, we use the actual seqnos stored in the range tombstones,
      which are all zero, so the tombstones cover nothing.
      
      This commit removes global_seqno_ variable from Block. When iterating
      over a block, the global seqno for the block is determined by the
      iterator instead of storing this mutable attribute in Block.
      Additionally, this commit adds a regression test to check that keys are
      deleted when ingesting a file with a global seqno and range deletion
      tombstones.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429
      
      Differential Revision: D19961563
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7
      69679e73
  15. 22 2月, 2020 1 次提交
  16. 21 2月, 2020 1 次提交
    • S
      Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      sdong 提交于
      Summary:
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
      fdf882de
  17. 12 2月, 2020 1 次提交
  18. 11 2月, 2020 1 次提交
  19. 08 2月, 2020 1 次提交
    • A
      Check KeyContext status in MultiGet (#6387) · d70011bc
      anand76 提交于
      Summary:
      Currently, any IO errors and checksum mismatches while reading data
      blocks, are being ignored by the batched MultiGet. Its only looking at
      the GetContext state. Fix that.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6387
      
      Test Plan: Add unit tests
      
      Differential Revision: D19799819
      
      Pulled By: anand1976
      
      fbshipit-source-id: 46133dccbb04e64067b9fe6cda73e282203db969
      d70011bc
  20. 29 1月, 2020 1 次提交
    • S
      Add ReadOptions.auto_prefix_mode (#6314) · 8f2bee67
      sdong 提交于
      Summary:
      Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314
      
      Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run.
      
      Differential Revision: D19458717
      
      fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce
      8f2bee67
  21. 28 1月, 2020 1 次提交
  22. 17 1月, 2020 1 次提交
    • S
      Fix another bug caused by recent hash index fix (#6305) · d87cffae
      sdong 提交于
      Summary:
      Recent bug fix related to hash index introduced a new bug: hash index can return NotFound but it is not handled by BlockBasedTable::Get(). The end result is that Get() stops being executed too early. Fix it by ignoring NotFound code in Get().
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6305
      
      Test Plan: A problematic DB used to return NotFound incorrectly, and now able to return correct result. Will try to construct a unit test too.0
      
      Differential Revision: D19438925
      
      fbshipit-source-id: e751afa8c13728d56511cfeb1bc811ecb99f3217
      d87cffae
  23. 16 1月, 2020 1 次提交
    • S
      Fix kHashSearch bug with SeekForPrev (#6297) · d2b4d42d
      sdong 提交于
      Summary:
      When prefix is enabled the expected behavior when the prefix of the target does not exist is for Seek is to seek to any key larger than target and SeekToPrev to any key less than the target.
      Currently. the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: a prefix of the searched key does not exist, ii) the key is beyond the range of the keys in SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than target). The patch fixes that by returning NotFound status for cases that the prefix does not exist. SeekForPrev in BlockBasedTable accordingly SeekToFirst instead of SeekToLast on the index iterator.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297
      
      Test Plan: SeekForPrev of non-exsiting prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix.
      
      Differential Revision: D19404695
      
      fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3
      d2b4d42d
  24. 20 12月, 2019 1 次提交
  25. 19 12月, 2019 1 次提交
    • S
      Prevent file prefetch when mmap is enabled. (#6206) · 02193ce4
      sdong 提交于
      Summary:
      Right now, sometimes file prefetching is still on when mmap is enabled. This causes bug of reading wrong data. In this commit, we remove all those possible paths.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6206
      
      Test Plan: make crash_test with compaction_readahead_size, which used to fail. RUn all existing tests.
      
      Differential Revision: D19149429
      
      fbshipit-source-id: 9e18ea8c566e416aac9647bdd05afe596634791b
      02193ce4
  26. 18 12月, 2019 1 次提交
  27. 17 12月, 2019 1 次提交
    • Z
      Merge adjacent file block reads in RocksDB MultiGet() and Add uncompressed block to cache (#6089) · cddd6379
      Zhichao Cao 提交于
      Summary:
      In the current MultiGet, if the KV-pairs do not belong to the data blocks in the block cache, multiple blocks are read from a SST. It will trigger one block read for each block request and read them in parallel. In some cases, if some data blocks are adjacent in the SST, the reads for these blocks can be combined to a single large read, which can reduce the system calls and reduce the read latency if possible.
      
      Considering to fill the block cache, if multiple data blocks are in the same memory buffer, we need to copy them to the heap separately. Therefore, only in the case that 1) data block compression is enabled, and 2) compressed block cache is null, we can do combined read. Otherwise, extra memory copy is needed, which may cause extra overhead. In the current case, data blocks will be uncompressed to a new memory space.
      
      Also, in the case that 1) data block compression is enabled, and 2) compressed block cache is null, it is possible the data block is actually not compressed. In the current logic, these data blocks will not be added to the uncompressed_cache. So if memory buffer is shared and the data block is not compressed, the data block are copied to the head and fill the cache.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6089
      
      Test Plan: Added test case to ParallelIO.MultiGet. Pass make asan_check
      
      Differential Revision: D18734668
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 67c5615ed373e51e42635fd74b36f8f3a66d5da4
      cddd6379
  28. 14 12月, 2019 1 次提交
    • A
      Introduce a new storage specific Env API (#5761) · afa2420c
      anand76 提交于
      Summary:
      The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
      
      This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
      
      The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
      
      This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
      
      The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
      
      Differential Revision: D18868376
      
      Pulled By: anand1976
      
      fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
      afa2420c
  29. 12 11月, 2019 1 次提交
    • A
      Fix a buffer overrun problem in BlockBasedTable::MultiGet (#6014) · 03ce7fb2
      anand76 提交于
      Summary:
      The calculation in BlockBasedTable::MultiGet for the required buffer length for reading in compressed blocks is incorrect. It needs to take the 5-byte block trailer into account.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6014
      
      Test Plan: Add a unit test DBBasicTest.MultiGetBufferOverrun that fails in asan_check before the fix, and passes after.
      
      Differential Revision: D18412753
      
      Pulled By: anand1976
      
      fbshipit-source-id: 754dfb66be1d5f161a7efdf87be872198c7e3b72
      03ce7fb2
  30. 08 11月, 2019 1 次提交
  31. 06 11月, 2019 1 次提交
  32. 22 10月, 2019 1 次提交
    • S
      Fix VerifyChecksum readahead with mmap mode (#5945) · 30e2dc02
      sdong 提交于
      Summary:
      A recent change introduced readahead inside VerifyChecksum(). However it is not compatible with mmap mode and generated wrong checksum verification failure. Fix it by not enabling readahead in mmap
       mode.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5945
      
      Test Plan: Add a unit test that used to fail.
      
      Differential Revision: D18021443
      
      fbshipit-source-id: 6f2eb600f81b26edb02222563a4006869d576bff
      30e2dc02
  33. 19 10月, 2019 1 次提交
    • L
      Store the filter bits reader alongside the filter block contents (#5936) · 29ccf207
      Levi Tamasi 提交于
      Summary:
      Amongst other things, PR https://github.com/facebook/rocksdb/issues/5504 refactored the filter block readers so that
      only the filter block contents are stored in the block cache (as opposed to the
      earlier design where the cache stored the filter block reader itself, leading to
      potentially dangling pointers and concurrency bugs). However, this change
      introduced a performance hit since with the new code, the metadata fields are
      re-parsed upon every access. This patch reunites the block contents with the
      filter bits reader to eliminate this overhead; since this is still a self-contained
      pure data object, it is safe to store it in the cache. (Note: this is similar to how
      the zstd digest is handled.)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5936
      
      Test Plan:
      make asan_check
      
      filter_bench results for the old code:
      
      ```
      $ ./filter_bench -quick
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 26.7153
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 33.4258
        Single filter ns/op: 42.5974
        Random filter ns/op: 217.861
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 32.4217
        Single filter ns/op: 50.9855
        Random filter ns/op: 219.167
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      
      $ ./filter_bench -quick -use_full_block_reader
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 26.5172
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 32.3556
        Single filter ns/op: 83.2239
        Random filter ns/op: 370.676
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 32.2265
        Single filter ns/op: 93.5651
        Random filter ns/op: 408.393
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      ```
      
      With the new code:
      
      ```
      $ ./filter_bench -quick
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 25.4285
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 31.0594
        Single filter ns/op: 43.8974
        Random filter ns/op: 226.075
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 31.0295
        Single filter ns/op: 50.3824
        Random filter ns/op: 226.805
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      
      $ ./filter_bench -quick -use_full_block_reader
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 26.5308
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 33.2968
        Single filter ns/op: 58.6163
        Random filter ns/op: 291.434
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 32.1839
        Single filter ns/op: 66.9039
        Random filter ns/op: 292.828
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      ```
      
      Differential Revision: D17991712
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 7ea205550217bfaaa1d5158ebd658e5832e60f29
      29ccf207
  34. 12 10月, 2019 1 次提交
    • A
      Fix block cache ID uniqueness for Windows builds (#5844) · b00761ee
      Andrew Kryczka 提交于
      Summary:
      Since we do not evict a file's blocks from block cache before that file
      is deleted, we require a file's cache ID prefix is both unique and
      non-reusable. However, the Windows functionality we were relying on only
      guaranteed uniqueness. That meant a newly created file could be assigned
      the same cache ID prefix as a deleted file. If the newly created file
      had block offsets matching the deleted file, full cache keys could be
      exactly the same, resulting in obsolete data blocks returned from cache
      when trying to read from the new file.
      
      We noticed this when running on FAT32 where compaction was writing out
      of order keys due to reading obsolete blocks from its input files. The
      functionality is documented as behaving the same on NTFS, although I
      wasn't able to repro it there.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5844
      
      Test Plan:
      we had a reliable repro of out-of-order keys on FAT32 that
      was fixed by this change
      
      Differential Revision: D17752442
      
      fbshipit-source-id: 95d983f9196cf415f269e19293b97341edbf7e00
      b00761ee
  35. 04 10月, 2019 1 次提交
    • A
      Fix data block upper bound checking for iterator reseek case (#5883) · 19a97dd1
      anand76 提交于
      Summary:
      When an iterator reseek happens with the user specifying a new iterate_upper_bound in ReadOptions, and the new seek position is at the end of the same data block, the Seek() ends up using a stale value of data_block_within_upper_bound_ and may return incorrect results.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5883
      
      Test Plan: Added a new test case DBIteratorTest.IterReseekNewUpperBound. Verified that it failed due to the assertion failure without the fix, and passes with the fix.
      
      Differential Revision: D17752740
      
      Pulled By: anand1976
      
      fbshipit-source-id: f9b635ff5d6aeb0e1bef102cf8b2f900efd378e3
      19a97dd1
  36. 21 9月, 2019 1 次提交
  37. 17 9月, 2019 1 次提交
    • S
      Divide file_reader_writer.h and .cc (#5803) · b931f84e
      sdong 提交于
      Summary:
      file_reader_writer.h and .cc contain several files and helper function, and it's hard to navigate. Separate it to multiple files and put them under file/
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803
      
      Test Plan: Build whole project using make and cmake.
      
      Differential Revision: D17374550
      
      fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987
      b931f84e
  38. 23 8月, 2019 1 次提交
    • L
      Revert to storing UncompressionDicts in the cache (#5645) · df8c307d
      Levi Tamasi 提交于
      Summary:
      PR https://github.com/facebook/rocksdb/issues/5584 decoupled the uncompression dictionary object from the underlying block data; however, this defeats the purpose of the digested ZSTD dictionary, since the whole point
      of the digest is to create it once and reuse it over and over again. This patch goes back to
      storing the uncompression dictionary itself in the cache (which should be now safe to do,
      since it no longer includes a Statistics pointer), while preserving the rest of the refactoring.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5645
      
      Test Plan: make asan_check
      
      Differential Revision: D16551864
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 2a7e2d34bb16e70e3c816506d5afe1d842057800
      df8c307d
  39. 22 8月, 2019 1 次提交
    • M
      Refactor MultiGet names in BlockBasedTable (#5726) · 244e6f20
      Maysam Yabandeh 提交于
      Summary:
      To improve code readability, since RetrieveBlock already calls MaybeReadBlockAndLoadToCache, we avoid name similarity of the functions that call RetrieveBlock with MaybeReadBlockAndLoadToCache. The patch thus renames MaybeLoadBlocksToCache to RetrieveMultipleBlock and deletes GetDataBlockFromCache, which contains only two lines.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5726
      
      Differential Revision: D16962535
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 99e8946808ce4eb7857592b9003812e3004f92d6
      244e6f20