1. 16 Jun 2020 · 1 commit
    • Fix uninitialized memory read in table_test (#6980) · aa8f1331
      Committed by Levi Tamasi
      Summary:
      When using parameterized tests, `gtest` sometimes prints the test
      parameters. If no other printing method is available, it essentially
      produces a hex dump of the object. This can cause issues with valgrind
      with types like `TestArgs` in `table_test`, where the object layout has
      gaps (with uninitialized contents) due to the members' alignment
      requirements. The patch fixes the uninitialized reads by providing an
      `operator<<` for `TestArgs` and also makes sure all members are
      initialized (in a consistent order) on all code paths.
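
      As an illustration of the shape of such a fix (hypothetical members; the actual TestArgs in table_test.cc has more fields), a minimal sketch:

          #include <ostream>

          // A struct whose layout has padding gaps, plus an inserter so gtest
          // prints the fields instead of hex-dumping raw (padded) memory.
          struct TestArgs {
            bool reverse_compare = false;  // padding follows before the int below
            int restart_interval = 16;
            double compression_ratio = 1.0;
          };

          std::ostream& operator<<(std::ostream& os, const TestArgs& args) {
            return os << "reverse_compare: " << args.reverse_compare
                      << ", restart_interval: " << args.restart_interval
                      << ", compression_ratio: " << args.compression_ratio;
          }
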
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6980
      
      Test Plan: `valgrind --leak-check=full ./table_test`
      
      Reviewed By: siying
      
      Differential Revision: D22045536
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 6f5920ac28c712d0aa88162fffb80172ed769c32
  2. 13 Jun 2020 · 1 commit
    • Turn HarnessTest in table_test into a parameterized test (#6974) · bacd6edc
      Committed by Levi Tamasi
      Summary:
      `HarnessTest` in `table_test.cc` currently tests many parameter
      combinations sequentially in a loop. This is problematic from
      a testing perspective, since if the test fails, we have no way of
      knowing how many/which combinations have failed. It can also cause timeouts on
      our test system due to the sheer number of combinations tested.
      (Specifically, the parallel compression threads parameter added by
      https://github.com/facebook/rocksdb/pull/6262 seems to have been the last straw.)
      There is some DIY code there that splits the load among eight test cases
      but that does not appear to be sufficient anymore.
      
      Instead, the patch turns `HarnessTest` into a parameterized test, so all the
      parameter combinations can be tested separately and potentially
      concurrently. It also cleans up the tests a little, fixes
      `RandomizedLongDB`, which did not get updated when the parallel
      compression threads parameter was added, and turns `FooterTests` into a
      standalone test case (since it does not actually need a fixture class).
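
      A minimal sketch of the gtest pattern this conversion uses (assuming a helper such as GenerateArgList() that enumerates the combinations):

          #include <gtest/gtest.h>

          // Each parameter combination becomes its own test instance.
          class HarnessTest : public testing::TestWithParam<TestArgs> {};

          TEST_P(HarnessTest, Randomized) {
            const TestArgs args = GetParam();  // exactly one combination here
            // ... build a table with `args` and verify its contents ...
          }

          INSTANTIATE_TEST_CASE_P(TableTest, HarnessTest,
                                  testing::ValuesIn(GenerateArgList()));
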
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6974
      
      Test Plan: `make check`
      
      Reviewed By: siying
      
      Differential Revision: D22029572
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 51baea670771c33928f2eb3902bd69dcf540aa41
  3. 10 Jun 2020 · 1 commit
  4. 08 Jun 2020 · 1 commit
    • Remove unnecessary inclusion of version_edit.h in env (#6952) · 3020df9d
      Committed by Yanqin Jin
      Summary:
      In db_options.cc, we should avoid including header files from the `db` directory, to avoid introducing unnecessary dependencies. The reason `version_edit.h` was included in `db_options.cc` is that we need two constants, `kUnknownChecksum` and `kUnknownChecksumFuncName`. We can put these two constants as `constexpr` in the public header `file_checksum.h`.
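
      A sketch of the approach (constant values illustrative):

          // In the public header include/rocksdb/file_checksum.h, roughly:
          constexpr char kUnknownChecksum[] = "";
          constexpr char kUnknownChecksumFuncName[] = "Unknown";
          // db_options.cc can use these without including db/version_edit.h.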
      
      Test plan (devserver):
      make check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6952
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D21925341
      
      Pulled By: riversand963
      
      fbshipit-source-id: 2902f3b74c97f0cf16c58ad24c095c787c3a40e2
  5. 06 Jun 2020 · 1 commit
  6. 05 Jun 2020 · 1 commit
  7. 04 Jun 2020 · 1 commit
  8. 03 Jun 2020 · 1 commit
    • For ApproximateSizes, pro-rate table metadata size over data blocks (#6784) · 14eca6bf
      Committed by Peter Dillinger
      Summary:
      The implementation of GetApproximateSizes was inconsistent in
      its treatment of the size of non-data blocks of SST files, sometimes
      including it and sometimes not. This was at its worst when filters took
      up a large portion of a table file and a small query range crossed a
      table boundary: the size estimate would include the large filter size.
      
      It's conceivable that someone might want only to know the size in terms
      of data blocks, but I believe that's unlikely enough to ignore for now.
      Similarly, there's no evidence the internal function AppoximateOffsetOf
      is used for anything other than a one-sided ApproximateSize, so I intend
      to refactor to remove redundancy in a follow-up commit.
      
      So to fix this, GetApproximateSizes (and the implementation details
      ApproximateSize and ApproximateOffsetOf) now consistently include in
      their returned sizes a portion of the table file metadata (incl. filters
      and indexes) proportional to the size of the data blocks in range. In
      other words, if a key range covers data blocks that are X% by size of all
      the table's data blocks, the returned approximate size is X% of the total
      file size. It would technically be more accurate to attribute metadata
      based on the number of keys, but that's not computationally efficient
      with the data available, and it rarely makes a meaningful difference.
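
      Expressed as a sketch (illustrative helper, not the actual code):

          #include <cstdint>

          // If the range covers X% of the table's data-block bytes, report
          // X% of the total file size, so metadata is pro-rated rather than
          // lumped into whichever range happens to touch it.
          uint64_t ProRatedApproximateSize(uint64_t data_bytes_in_range,
                                           uint64_t total_data_bytes,
                                           uint64_t total_file_size) {
            if (total_data_bytes == 0) {
              return 0;
            }
            double fraction = static_cast<double>(data_bytes_in_range) /
                              static_cast<double>(total_data_bytes);
            return static_cast<uint64_t>(fraction * total_file_size);
          }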
      
      Also includes miscellaneous comment improvements / clarifications.
      
      Also included is a new approximatesizerandom benchmark for db_bench.
      No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784
      
      Test Plan:
      Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
      Old code running new test...
      
          [ RUN      ] DBTest.ApproximateSizesFilesWithErrorMargin
          db/db_test.cc:1562: Failure
          Expected: (size) <= (11 * 100), actual: 9478 vs 1100
      
      Other tests updated to reflect consistent accounting of metadata.
      
      Reviewed By: siying
      
      Differential Revision: D21334706
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
  9. 02 Jun 2020 · 1 commit
  10. 21 May 2020 · 1 commit
    • Clean up some code related to file checksums (#6861) · c7aedf1b
      Committed by Peter Dillinger
      Summary:
      * Add missing unit test for schema stability of FileChecksumGenCrc32c
        (previously was only comparing to itself)
      * A lot of clarifying comments
      * Add some assertions for preconditions
      * Rename WritableFileWriter::CalculateFileChecksum -> UpdateFileChecksum
      * Simplify FileChecksumGenCrc32c with shared functions
      * Implement EndianSwapValue to replace unused EndianTransform (see the sketch below)
      
      And incidentally, since I had trouble with the 'make check-format' GitHub action disagreeing with a local run:
      * Output full diagnostic information when 'make check-format' fails in CI
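
      A rough sketch of what an EndianSwapValue-style helper does (name and body illustrative; the real code may use intrinsics such as __builtin_bswap64):

          #include <cstdint>

          // Reverse the byte order of a 64-bit value, e.g. to get a
          // platform-independent encoding of a checksum.
          inline uint64_t EndianSwapValue64(uint64_t v) {
            uint64_t r = 0;
            for (int i = 0; i < 8; ++i) {
              r = (r << 8) | (v & 0xffu);
              v >>= 8;
            }
            return r;
          }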
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6861
      
      Test Plan: new unit test passes before & after other changes
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D21667115
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6a99970f87605aa024fa540c78cd519ff322c3e6
  11. 13 May 2020 · 1 commit
    • sst_dump to reduce number of file reads (#6836) · 4a4b8a13
      Committed by sdong
      Summary:
      sst_dump can issue many file reads to the file system. This doesn't work well on file systems without an OS cache, especially remote file systems. To mitigate this problem, several improvements are made:
      1. --readahead_size is added, so that users can specify a readahead size when scanning the data (usage example below).
      2. Force a 512KB tail readahead, which avoids separate I/Os for the footer, metaindex and property blocks, and hopefully covers the index and filter blocks too.
      3. Consolidate sst_dump's I/Os before opening the file for read, using the same file prefetch buffer.
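
      For example (the readahead_size flag is from this patch; the path and value are illustrative):

          sst_dump --file=/path/to/file.sst --command=scan --readahead_size=2097152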
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6836
      
      Test Plan: Add a test that covers this new feature.
      
      Reviewed By: pdillinger
      
      Differential Revision: D21516607
      
      fbshipit-source-id: 3ae43526286f67b2f4a5bdedfbc92719d579b87e
  12. 01 May 2020 · 1 commit
    • Pass a timeout to FileSystem for random reads (#6751) · ab13d43e
      Committed by anand76
      Summary:
      Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystems/Envs that support IO timeouts. Even on those that don't, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded.

      For now, TableReader creation, which might do file opens and reads, is not covered. It will be implemented in another PR.
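
      A sketch of the deadline-to-timeout derivation (the helper name is hypothetical; illustrative, not the exact code):

          // Derive the per-IO timeout from what remains of the absolute
          // deadline (both expressed in microseconds).
          Status SetupIOTimeout(const ReadOptions& read_options, Env* env,
                                IOOptions* io_opts) {
            if (read_options.deadline.count() > 0) {
              auto now = std::chrono::microseconds(env->NowMicros());
              if (now >= read_options.deadline) {
                return Status::TimedOut();  // deadline already exceeded
              }
              io_opts->timeout = read_options.deadline - now;
            }
            return Status::OK();
          }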
      
      Tests:
      Update existing unit tests to verify the correct timeout value is being passed
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751
      
      Reviewed By: riversand963
      
      Differential Revision: D21285631
      
      Pulled By: anand1976
      
      fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567
  13. 16 Apr 2020 · 1 commit
    • Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621) · e45673de
      Committed by Mike Kolupaev
      Summary:
      Context: Index type `kBinarySearchWithFirstKey` added the ability for the sst file iterator to sometimes report a key from the index without reading the corresponding data block. This is useful when sst blocks are cut at meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report an error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and the kBinarySearchWithFirstKey implementation was considered a prototype.
      
      Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
      
      It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
      
      Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
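
      A sketch of the resulting call-site contract (ProcessKey/ProcessValue are hypothetical placeholders):

          // Internal iterator created with allow_unprepared_value == true:
          for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
            ProcessKey(iter->key());  // key() is always safe to call
            if (!iter->PrepareValue()) {
              // The deferred block load failed; the error is in iter->status().
              break;
            }
            ProcessValue(iter->value());  // safe only after PrepareValue()
          }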
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
      
      Test Plan: make -j5 check. Will also deploy to some logdevice test clusters and look at stats.
      
      Reviewed By: siying
      
      Differential Revision: D20786930
      
      Pulled By: al13n321
      
      fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
  14. 04 Apr 2020 · 1 commit
  15. 02 Apr 2020 · 1 commit
    • Add pipelined & parallel compression optimization (#6262) · 03a781a9
      Committed by Ziyue Yang
      Summary:
      This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
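
      For example (illustrative settings):

          Options options;
          options.compression = kZSTD;
          // Values greater than 1 enable pipelined & parallel block compression.
          options.compression_opts.parallel_threads = 4;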
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262
      
      Reviewed By: ajkr
      
      Differential Revision: D20651306
      
      fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
  16. 31 Mar 2020 · 1 commit
  17. 30 Mar 2020 · 1 commit
    • Use FileChecksumGenFactory for SST file checksum (#6600) · e8d332d9
      Committed by Zhichao Cao
      Summary:
      In the current implementation, the sst file checksum is calculated by a shared checksum function object, which makes some checksum functions, such as SHA1, hard to apply here. In this implementation, each sst file has its own checksum generator object, created by FileChecksumGenFactory. Users need to implement their own FileChecksumGenerator and factory to plug in their checksum calculation method.
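
      A hedged sketch of the plugin surface (method names as this change's summary describes them; bodies illustrative):

          // A custom generator; the hashing internals are elided.
          class MySha1Generator : public FileChecksumGenerator {
           public:
            void Update(const char* data, size_t n) override { /* feed hash */ }
            void Finalize() override { /* finish hash, fill checksum_ */ }
            std::string GetChecksum() const override { return checksum_; }
            const char* Name() const override { return "MySha1"; }

           private:
            std::string checksum_;
          };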
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600
      
      Test Plan: tested with make asan_check
      
      Reviewed By: riversand963
      
      Differential Revision: D20717670
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
  18. 18 Mar 2020 · 1 commit
    • Fix regression bug in partitioned index reseek caused by #6531 (#6551) · 712bc4b6
      Committed by sdong
      Summary:
      https://github.com/facebook/rocksdb/pull/6531 removed some code from the partitioned index seek logic. By mistake, the logic that stores the previous index offset was removed while the logic that uses it was preserved, so the code could use a wrong value when determining the reseek condition.
      This triggers a bug in the following sequence: a Seek() that lands on a block other than the last one, followed by SeekToLast(), followed by another Seek() that should position the cursor on a block before the one SeekToLast() used.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6551
      
      Test Plan: Add a unit test that reproduces the bug. The same unit test also covers some reseek cases to guard against regressions.
      
      Reviewed By: pdillinger
      
      Differential Revision: D20493990
      
      fbshipit-source-id: 3919aa4861c0481ec96844e053048da1a934b91d
  19. 07 Mar 2020 · 1 commit
    • Remove memcpy from RandomAccessFileReader::Read in direct IO mode (#6455) · 0a0151fb
      Committed by Cheng Chang
      Summary:
      In direct IO mode, RandomAccessFileReader::Read allocates an internal aligned buffer and then copies the result into the scratch buffer. If the result is only used temporarily inside a function, there is no need to do the memcpy; the result Slice can simply refer to the internally allocated buffer.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6455
      
      Test Plan: make check
      
      Differential Revision: D20106753
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: 44f505843837bba47a56e3fa2c4dd3bd76486b58
  20. 26 Feb 2020 · 1 commit
    • Fix range deletion tombstone ingestion with global seqno (#6429) · 69679e73
      Committed by Andrew Kryczka
      Summary:
      Original author: jeffrey-xiao
      
      If we are writing a global seqno for an ingested file, the range
      tombstone metablock gets accessed and put into the cache during
      ingestion preparation. At the time, the global seqno of the ingested
      file has not yet been determined, so the cached block will not have a
      global seqno. When the file is ingested and we read its range tombstone
      metablock, it will be returned from the cache with no global seqno. In
      that case, we use the actual seqnos stored in the range tombstones,
      which are all zero, so the tombstones cover nothing.
      
      This commit removes the global_seqno_ variable from Block. When iterating
      over a block, the global seqno for the block is determined by the
      iterator instead of storing this mutable attribute in Block.
      Additionally, this commit adds a regression test to check that keys are
      deleted when ingesting a file with a global seqno and range deletion
      tombstones.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429
      
      Differential Revision: D19961563
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7
  21. 21 Feb 2020 · 1 commit
    • Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      Committed by sdong
      Summary:
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To give users a tool to solve the problem, the RocksDB namespace is changed to a macro that can be overridden at build time.
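
      In sketch form (simplified; the build invocation is an illustrative example):

          // The namespace is now a macro with a default:
          #ifndef ROCKSDB_NAMESPACE
          #define ROCKSDB_NAMESPACE rocksdb
          #endif

          // Overriding it at build time, e.g.:
          //   make EXTRA_CXXFLAGS="-DROCKSDB_NAMESPACE=my_vendored_rocksdb"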
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Build release, all and jtest. Also try building with ROCKSDB_NAMESPACE overridden to another value.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
  22. 11 Feb 2020 · 1 commit
    • Checksum for each SST file and stores in MANIFEST (#6216) · 4369f2c7
      Committed by Zhichao Cao
      Summary:
      In the current code base, RocksDB generates a checksum for each block and verifies it at usage. This PR enables SST file checksums: after an SST file is generated by Flush or Compaction, RocksDB generates the SST file checksum and stores the checksum value and checksum method name in the vs_info and MANIFEST as part of the FileMetadata.

      Added enable_sst_file_checksum to Options to enable or disable file checksums. Added sst_file_checksum to Options so that users can plug in their own SST file checksum calculation method by overriding the SstFileChecksum class. The checksum information includes a uint32_t checksum value and a checksum name (string). A new tool is added to LDB so that users can dump a list of file checksum information from the MANIFEST. If the user enables file checksums but does not provide an sst_file_checksum instance, RocksDB uses the default crc32 checksum implemented in table/sst_file_checksum_crc32c.h.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216
      
      Test Plan: Added test cases in table_test and ldb_cmd_test to verify the checksum is correct at different levels. Passes make asan_check.
      
      Differential Revision: D19171461
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
  23. 16 Jan 2020 · 1 commit
    • Fix kHashSearch bug with SeekForPrev (#6297) · d2b4d42d
      Committed by sdong
      Summary:
      When the prefix mode is enabled and the prefix of the target does not exist, the expected behavior is for Seek to land on any key larger than the target, and for SeekForPrev to land on any key less than the target.
      Currently, the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: i) a prefix of the searched key does not exist, and ii) the key is beyond the range of the keys in the SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than the target). The patch fixes that by returning NotFound status for the case where the prefix does not exist. SeekForPrev in BlockBasedTable accordingly calls SeekToFirst instead of SeekToLast on the index iterator.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297
      
      Test Plan: SeekForPrev of a non-existing prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix.
      
      Differential Revision: D19404695
      
      fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3
  24. 14 Dec 2019 · 1 commit
    • Introduce a new storage specific Env API (#5761) · afa2420c
      Committed by anand76
      Summary:
      The current Env API encompasses both storage/file operations and OS related operations. Most of the APIs return a Status, which does not carry enough metadata about an error, such as whether it's retryable or not, or the scope (i.e. fault domain) of the error, that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy, etc.
      
      This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
      
      The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
      
      This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
      
      The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
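
      In code, the configuration described in this summary looks roughly like this (illustrative sketch, not a verified snippet):

          Options options;
          options.env = Env::Default();                 // OS-related operations
          options.file_system = FileSystem::Default();  // storage operations
          // During DB::Open, RocksDB sanitizes this into a CompositeEnvWrapper.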
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
      
      Differential Revision: D18868376
      
      Pulled By: anand1976
      
      fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
  25. 17 Sep 2019 · 1 commit
    • Charge block cache for cache internal usage (#5797) · 638d2395
      Committed by Maysam Yabandeh
      Summary:
      For our default block cache, each additional entry has extra memory overhead: an LRUHandle (72 bytes currently) plus the cache key (two varint64s, file id and offset). The usage is not negligible; for example, with block_size=4k the overhead accounts for an extra 2% memory usage for the cache. The patch charges the cache for this extra usage, reducing untracked memory usage outside the block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor.
      This PR builds on https://github.com/facebook/rocksdb/issues/4258
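
      For example, opting out looks roughly like this (illustrative):

          LRUCacheOptions cache_opts;
          cache_opts.capacity = 1 << 30;  // 1GB
          cache_opts.metadata_charge_policy = kDontChargeCacheMetadata;
          std::shared_ptr<Cache> cache = NewLRUCache(cache_opts);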
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797
      
      Test Plan:
      - Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata.
      - The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata.
      
      Differential Revision: D17396833
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f
  26. 10 Sep 2019 · 1 commit
  27. 24 Aug 2019 · 1 commit
    • Refactor trimming logic for immutable memtables (#5022) · 2f41ecfe
      Committed by Zhongyi Xie
      Summary:
      MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
      We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
      The old option's semantics actually work as both an upper bound and a lower bound: history trimming starts when the number of immutable memtables exceeds the limit, but the count never drops below (limit-1) due to trimming.
      To mimic that behavior with the new option, history trimming stops if dropping the next immutable memtable would cause the total memory usage to go below the size limit. For example, assume the size limit is 64MB and there are three immutable memtables of 20, 30 and 30 MB. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable would reduce the usage to 60MB < 64MB, so in this case no memtable is dropped.
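
      An illustrative configuration:

          Options options;
          options.write_buffer_size = 32 << 20;  // 32MB mutable memtable
          // Keep up to ~64MB of memtable history for conflict checking; this
          // takes precedence over max_write_buffer_number_to_maintain when
          // both are non-zero.
          options.max_write_buffer_size_to_maintain = 64 << 20;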
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022
      
      Differential Revision: D14394062
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
  28. 07 Aug 2019 · 1 commit
    • New API to get all merge operands for a Key (#5604) · d150e014
      Committed by Vijay Nadimpalli
      Summary:
      This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
      1. Update subset of columns and read subset of columns -
      Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
      2. Updating very few attributes in a value which is a JSON-like document -
      Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
      ----------------------------------------------------------------------------------------------------
      API :
      Status GetMergeOperands(
            const ReadOptions& options, ColumnFamilyHandle* column_family,
            const Slice& key, PinnableSlice* merge_operands,
            GetMergeOperandsOptions* get_merge_operands_options,
            int* number_of_operands)
      
      Example usage :
      int size = 100;
      int number_of_operands = 0;
      std::vector<PinnableSlice> values(size);
      GetMergeOperandsOptions merge_operands_info;
      merge_operands_info.expected_max_number_of_operands = size;
      db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), &merge_operands_info, &number_of_operands);
      
      Description :
      Returns all the merge operands corresponding to the key. If the number of merge operands in the DB is greater than merge_operands_options.expected_max_number_of_operands, no merge operands are returned and the status is Incomplete. Merge operands returned are in the order of insertion.
      merge_operands -> Points to an array of at least merge_operands_options.expected_max_number_of_operands PinnableSlices, which the caller is responsible for allocating. If the status returned is Incomplete, then number_of_operands will contain the total number of merge operands found in the DB for the key.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604
      
      Test Plan:
      Added unit test and perf test in db_bench that can be run using the command:
      ./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist
      
      Differential Revision: D16657366
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
  29. 24 Jul 2019 · 1 commit
    • Move the uncompression dictionary object out of the block cache (#5584) · 092f4170
      Committed by Levi Tamasi
      Summary:
      RocksDB has historically stored uncompression dictionary objects in the block
      cache as opposed to storing just the block contents. This necessitated
      evicting the object upon table close. With the new code, only the raw blocks
      are stored in the cache, eliminating the need for eviction.
      
      In addition, the patch makes the following improvements:
      
      1) Compression dictionary blocks are now prefetched/pinned similarly to
      index/filter blocks.
      2) A copy operation got eliminated when the uncompression dictionary is
      retrieved.
      3) Errors related to retrieving the uncompression dictionary are propagated as
      opposed to silently ignored.
      
      Note: the patch temporarily breaks the compression dictionary eviction stats.
      They will be fixed in a separate phase.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5584
      
      Test Plan: make asan_check
      
      Differential Revision: D16344151
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 2962b295f5b19628f9da88a3fcebbce5a5017a7b
  30. 18 Jul 2019 · 1 commit
  31. 17 Jul 2019 · 1 commit
    • Move the filter readers out of the block cache (#5504) · 3bde41b5
      Committed by Levi Tamasi
      Summary:
      Currently, when the block cache is used for the filter block, it is not
      really the block itself that is stored in the cache but a FilterBlockReader
      object. Since this object is not pure data (it has, for instance, pointers that
      might dangle, including in one case a back pointer to the TableReader), it's not
      really sharable. To avoid the issues around this, the current code erases the
      cache entries when the TableReader is closed (which, BTW, is not sufficient
      since a concurrent TableReader might have picked up the object in the meantime).
      Instead of doing this, the patch moves the FilterBlockReader out of the cache
      altogether, and decouples the filter reader object from the filter block.
      In particular, instead of the TableReader owning, or caching/pinning the
      FilterBlockReader (based on the customer's settings), with the change the
      TableReader unconditionally owns the FilterBlockReader, which in turn
      owns/caches/pins the filter block. This change also enables us to reuse the code
      paths historically used for data blocks for filters as well.
      
      Note:
      Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
      separate phase.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504
      
      Test Plan: make asan_check
      
      Differential Revision: D16036974
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
  32. 04 Jul 2019 · 1 commit
  33. 25 Jun 2019 · 1 commit
    • Add an option to put first key of each sst block in the index (#5289) · b4d72094
      Committed by Mike Kolupaev
      Summary:
      The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.
      
      Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
      
      So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
      
      Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
      
      This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
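
      Enabling the new index type looks roughly like this (illustrative):

          BlockBasedTableOptions table_opts;
          table_opts.index_type = BlockBasedTableOptions::kBinarySearchWithFirstKey;
          Options options;
          options.table_factory.reset(NewBlockBasedTableFactory(table_opts));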
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
      
      Differential Revision: D15256423
      
      Pulled By: al13n321
      
      fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
  34. 21 Jun 2019 · 1 commit
    • Add more callers for table reader. (#5454) · 705b8eec
      Committed by haoyuhuang
      Summary:
      This PR adds more callers for table readers. This information is only used for block cache analysis, so that we can know which caller accesses a block.
      1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name.
      2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable.
      
      This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454
      
      Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32.
      
      Differential Revision: D15819451
      
      Pulled By: HaoyuHuang
      
      fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d
  35. 20 Jun 2019 · 1 commit
  36. 19 Jun 2019 · 1 commit
  37. 31 May 2019 · 4 commits
    • Move some memory related files from util/ to memory/ (#5382) · 8843129e
      Committed by Siying Dong
      Summary:
      Move arena, allocator, and memory tools under util to a separate memory/ directory.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382
      
      Differential Revision: D15564655
      
      Pulled By: siying
      
      fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
    • Organizing rocksdb/table directory by format · 50e47079
      Committed by Vijay Nadimpalli
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373
      
      Differential Revision: D15559425
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55
    • Move the index readers out of the block cache (#5298) · 1e355842
      Committed by Levi Tamasi
      Summary:
      Currently, when the block cache is used for index blocks as well, it is
      not really the index block that is stored in the cache but an
      IndexReader object. Since this object is not pure data (it has, for
      instance, pointers that might dangle), it's not really sharable. To
      avoid the issues around this, the current code uses a dummy unique cache
      key for each TableReader to store the IndexReader, and erases the
      IndexReader entry when the TableReader is closed. Instead of doing this,
      the new code moves the IndexReader out of the cache altogether. In
      particular, instead of the TableReader owning, or caching/pinning the
      IndexReader based on the customer's settings, the TableReader
      unconditionally owns the IndexReader, which in turn owns/caches/pins
      the index block (which is itself sharable and thus can be safely put in
      the cache without any hacks).
      
      Note: the change has two side effects:
      1) Partitions of partitioned indexes no longer affect the read
      amplification statistics.
      2) Eviction statistics for index blocks are temporarily broken. We plan to fix
      this in a separate phase.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298
      
      Differential Revision: D15303203
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47
    • Move test related files under util/ to test_util/ (#5377) · e9e0101c
      Committed by Siying Dong
      Summary:
      There are too many types of files under util/. Some test related files don't belong there or are only loosely related. Move them to a new directory test_util/, so that util/ is cleaner.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377
      
      Differential Revision: D15551366
      
      Pulled By: siying
      
      fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c