1. 16 6月, 2018 3 次提交
  2. 13 6月, 2018 1 次提交
    • S
      Fix regression bug of Prev() with upper bound (#3989) · d82f1421
      Siying Dong 提交于
      Summary:
      A recent change pushed down the upper bound checking to child iterators. However, this causes the logic of following sequence wrong:
        Seek(key);
        if (!Valid()) SeekToLast();
      Because !Valid() may be caused by upper bounds, rather than the end of the iterator. In this case SeekToLast() points to totally wrong places. This can cause wrong results, infinite loops, or segfault in some cases.
      This sequence is called when changing direction from forward to backward. And this by itself also implicitly happen during reseeking optimization in Prev().
      
      Fix this bug by using SeekForPrev() rather than this sequuence, as what is already done in prefix extrator case.
      Closes https://github.com/facebook/rocksdb/pull/3989
      
      Differential Revision: D8385422
      
      Pulled By: siying
      
      fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4
      d82f1421
  3. 07 6月, 2018 1 次提交
  4. 06 6月, 2018 2 次提交
  5. 05 6月, 2018 1 次提交
    • M
      Extend some tests to format_version=3 (#3942) · d0c38c0c
      Maysam Yabandeh 提交于
      Summary:
      format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3.
      Closes https://github.com/facebook/rocksdb/pull/3942
      
      Differential Revision: D8238413
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
      d0c38c0c
  6. 02 6月, 2018 3 次提交
    • Z
      Fix test for rocksdb_lite: hide incompatible option kDirectIO · 50d7ac0e
      Zhongyi Xie 提交于
      Summary:
      Previous commit https://github.com/facebook/rocksdb/pull/3935 unhide a few test options which includes kDirectIO. However it's not supported by RocksDB lite. Need to hide this option from the lite build.
      Closes https://github.com/facebook/rocksdb/pull/3943
      
      Differential Revision: D8242757
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 1edfad3a5d01a46bfb7eedee765981ebe02c500a
      50d7ac0e
    • A
      Copy Get() result when file reads use mmap · fea2b1df
      Andrew Kryczka 提交于
      Summary:
      For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.
      
      For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped.   The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`).
      
      This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied.
      
      This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime.
      Closes https://github.com/facebook/rocksdb/pull/3881
      
      Differential Revision: D8076288
      
      Pulled By: ajkr
      
      fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08
      fea2b1df
    • S
      add c api rocksdb_sstfilewriter_file_size · 89b37081
      straw 提交于
      Summary: Closes https://github.com/facebook/rocksdb/pull/3922
      
      Differential Revision: D8208528
      
      Pulled By: ajkr
      
      fbshipit-source-id: d384fe53cf526f2aadc7b79a423ce36dbd3ff224
      89b37081
  7. 01 6月, 2018 4 次提交
  8. 31 5月, 2018 1 次提交
  9. 30 5月, 2018 2 次提交
    • A
      Delete triggered compaction for universal style · a736255d
      Anand Ananthabhotla 提交于
      Summary:
      This is still WIP, but I'm hoping for early feedback on the overall approach.
      
      This patch implements deletion triggered compaction, which till now only
      worked for leveled, for universal style. SST files are marked for
      compaction by the CompactOnDeletionCollertor table property. This is
      expected to be used when free disk space is low and the user wants to
      reclaim space by deleting a bunch of keys. The deletions are expected to
      be dense. In such a situation, we want to avoid a full compaction due to
      its space overhead.
      
      The strategy used in this case is similar to leveled. We pick one file
      from the set of files marked for compaction. We then expand the inputs
      to a clean cut on the same level, and then pick overlapping files from
      the next non-mepty level. Picking files from the next level can cause
      the key range to expand, and we opportunistically expand inputs in the
      source level to include files wholly in this key range.
      
      The main side effect of this is that it breaks the property of no time
      range overlap between levels. This shouldn't break any functionality.
      Closes https://github.com/facebook/rocksdb/pull/3860
      
      Differential Revision: D8124397
      
      Pulled By: anand1976
      
      fbshipit-source-id: bfa2a9dd6817930e991b35d3a8e7e61304ed3dcf
      a736255d
    • Y
      Fix compilation error when OPT="-DROCKSDB_LITE". · cf826de3
      Yanqin Jin 提交于
      Summary: Closes https://github.com/facebook/rocksdb/pull/3917
      
      Differential Revision: D8187733
      
      Pulled By: riversand963
      
      fbshipit-source-id: e4aa179cd0791ca77167e357f99de9afd4aef910
      cf826de3
  10. 29 5月, 2018 1 次提交
  11. 26 5月, 2018 2 次提交
    • M
      Exclude seq from index keys · 402b7aa0
      Maysam Yabandeh 提交于
      Summary:
      Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator.
      The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s
      Closes https://github.com/facebook/rocksdb/pull/3894
      
      Differential Revision: D8118775
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
      402b7aa0
    • Y
      Fix segfault caused by object premature destruction · aa53579d
      Yanqin Jin 提交于
      Summary:
      Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609).
      There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change.
      
      To summarize the cause of the problem. Upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, causing the destructor of `std::shared_ptr`. Since there is no other `std::shared_ptr`, the underlying memory is also freed.
      Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised.
      Previous workaround is to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a force release of table readers of the dropped column family. However this does not work when the user disables file deletion.
      
      Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler.
      
      Test plan
      ```
      $ make -j16
      $ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion
      ```
      
      Expected behavior:
      All tests should pass.
      Closes https://github.com/facebook/rocksdb/pull/3898
      
      Differential Revision: D8149421
      
      Pulled By: riversand963
      
      fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438
      aa53579d
  12. 25 5月, 2018 2 次提交
  13. 24 5月, 2018 1 次提交
  14. 23 5月, 2018 1 次提交
    • A
      Avoid sleep in DBTest.GroupCommitTest to fix flakiness · 7db721b9
      Andrew Kryczka 提交于
      Summary:
      DBTest.GroupCommitTest would often fail when run under valgrind because its sleeps were insufficient to guarantee a group commit had multiple entries. Instead we can use sync point to force a leader to wait until a non-leader thread has enqueued its work, thus guaranteeing a leader can do group commit work for multiple threads.
      Closes https://github.com/facebook/rocksdb/pull/3883
      
      Differential Revision: D8079429
      
      Pulled By: ajkr
      
      fbshipit-source-id: 61dc50fad29d2c85547842f681288de60fa29049
      7db721b9
  15. 22 5月, 2018 4 次提交
    • S
      PersistRocksDBOptions() to use WritableFileWriter · 3db1ada3
      Siying Dong 提交于
      Summary:
      By using WritableFileWriter rather than WritableFile directly, we can buffer multiple Append() calls to one write() file system call, which will be expensive to underlying Env without its own write buffering.
      Closes https://github.com/facebook/rocksdb/pull/3882
      
      Differential Revision: D8080673
      
      Pulled By: siying
      
      fbshipit-source-id: e0db900cb3c178166aa738f3985db65e3ae2cf1b
      3db1ada3
    • Z
      Move prefix_extractor to MutableCFOptions · c3ebc758
      Zhongyi Xie 提交于
      Summary:
      Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
      This PR aims to make it possible to dynamically change bloom filter config.
      Closes https://github.com/facebook/rocksdb/pull/3601
      
      Differential Revision: D7253114
      
      Pulled By: miasantreble
      
      fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
      c3ebc758
    • Y
      Update ColumnFamilyTest for multi-CF verification · 263ef52b
      Yanqin Jin 提交于
      Summary:
      Change `keys_` from `set<string>` to `vector<set<string>>` so that each column
      family's keys are stored in one set.
      
      ajkr When you have a chance, can you PTAL? Thanks!
      Closes https://github.com/facebook/rocksdb/pull/3871
      
      Differential Revision: D8056447
      
      Pulled By: riversand963
      
      fbshipit-source-id: 650d0f9cad02b1bc005fc329ad76edbf053e6386
      263ef52b
    • A
      Assert keys/values pinned by range deletion meta-block iterators · 7b655214
      Andrew Kryczka 提交于
      Summary:
      `RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion.
      Closes https://github.com/facebook/rocksdb/pull/3875
      
      Differential Revision: D8063667
      
      Pulled By: ajkr
      
      fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41
      7b655214
  16. 19 5月, 2018 1 次提交
  17. 18 5月, 2018 2 次提交
    • S
      Implement key shortening functions in ReverseBytewiseComparator · 17af09fc
      Siying Dong 提交于
      Summary:
      Right now ReverseBytewiseComparator::FindShortestSeparator() doesn't really shorten key, and ReverseBytewiseComparator::FindShortestSuccessor() seems to return wrong results. The code is confusing too as it uses BytewiseComparatorImpl::FindShortestSeparator() but the function actually won't do anything if the the first key is larger than the second.
      
      Implement ReverseBytewiseComparator::FindShortestSeparator() and override ReverseBytewiseComparator::FindShortestSuccessor() to be empty.
      Closes https://github.com/facebook/rocksdb/pull/3836
      
      Differential Revision: D7959762
      
      Pulled By: siying
      
      fbshipit-source-id: 93acb621c16ce6f23e087ae4e19f7d84d1254683
      17af09fc
    • Z
      add override to virtual functions · 1d7ca20f
      Zhongyi Xie 提交于
      Summary:
      this will fix the failing clang_check test
      Closes https://github.com/facebook/rocksdb/pull/3868
      
      Differential Revision: D8050880
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 749932e2e4025f835c961c068d601e522a126da6
      1d7ca20f
  18. 17 5月, 2018 2 次提交
    • M
      Change and clarify the relationship between Valid(), status() and Seek*() for... · 8bf555f4
      Mike Kolupaev 提交于
      Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs
      
      Summary:
      Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
       * If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
       * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
      
      However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
      
      This PR changes the convention to:
       * If status() is not ok, Valid() always returns false.
       * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
      
      This does sacrifice the two use cases listed above, but siying said it's ok.
      
      Overview of the changes:
       * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
       * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
       * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
      
      To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
      
      Iterators that didn't need changes:
       * status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
       * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
      
      Iterators with changes (see inline comments for details):
       * DBIter - an overhaul:
          - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
          - It had a few code paths silently discarding subiterator's status. The stress test caught a few.
          - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
          - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
          - It used to not reset status on seek for some types of errors.
          - Some simplifications and better comments.
          - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
       * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
       * ForwardIterator - changed to the new convention, also slightly simplified.
       * ForwardLevelIterator - fixed some bugs and simplified.
       * LevelIterator - simplified.
       * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
       * BlockBasedTableIterator - minor changes.
       * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
       * PlainTableIterator - some seeks used to not reset status.
       * CuckooTableIterator - tiny code cleanup.
       * ManagedIterator - fixed some bugs.
       * BaseDeltaIterator - changed to the new convention and fixed a bug.
       * BlobDBIterator - seeks used to not reset status.
       * KeyConvertingIterator - some small change.
      Closes https://github.com/facebook/rocksdb/pull/3810
      
      Differential Revision: D7888019
      
      Pulled By: al13n321
      
      fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
      8bf555f4
    • M
      Fix race condition between log_.erase and log_.back · 46fde6b6
      Maysam Yabandeh 提交于
      Summary:
      log_ contract specifies that it should not be modified unless both mutex_ and log_write_mutex_ are held. log_.erase however does that with only holding mutex_. This causes a race condition with two_write_queues since logs_.back is read with holding only log_write_mutex_ (which is correct according to logs_ contract) but logs_.erase is called concurrently. This is probably the cause of logs_.back returning nullptr in https://github.com/facebook/rocksdb/issues/3852 although I could not reproduce it.
      Fixes https://github.com/facebook/rocksdb/issues/3852
      Closes https://github.com/facebook/rocksdb/pull/3859
      
      Differential Revision: D8026103
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: ee394e00fe4aa520d884c5ef87981e9d6b5ccb28
      46fde6b6
  19. 15 5月, 2018 3 次提交
    • M
      Suppress tsan lock-order-inversion on FlushWAL · 12ad7112
      Maysam Yabandeh 提交于
      Summary:
      TSAN reports a false alarm for lock-order-inversion in DBWriteTest.IOErrorOnWALWritePropagateToWriteThreadFollower but Open and FlushWAL are not run concurrently. Suppressing the error by skipping FlushWAL in the test until TSAN is fixed.
      
      The alternative would be to use
      ```
      TSAN_OPTIONS="suppressions=tsan-suppressions.txt" ./db_write_test
      ```
      but it does not seem straightforward to integrate it to our test infra.
      Closes https://github.com/facebook/rocksdb/pull/3854
      
      Differential Revision: D8000202
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: fde33483d963a7ad84d3145123821f64960a4802
      12ad7112
    • A
      Bottommost level-based compactions in bottom-pri pool · 3d7dc75b
      Andrew Kryczka 提交于
      Summary:
      This feature was introduced for universal compaction in cc01985d. At that point we thought it'd be used only to prevent long-running universal full compactions from blocking short-lived upper-level compactions. Now we have a level compaction user who could benefit from it since they use more expensive compression algorithm in the bottom level. So enable it for level.
      Closes https://github.com/facebook/rocksdb/pull/3835
      
      Differential Revision: D7957179
      
      Pulled By: ajkr
      
      fbshipit-source-id: 177285d2cef3b650b6a4d81dc5db84bc441c9fe4
      3d7dc75b
    • M
      Pass manual_wal_flush also to the first wal file · 718c1c9c
      Maysam Yabandeh 提交于
      Summary:
      Currently manual_wal_flush if set in the options will be used only for the wal files created during wal switch. The configuration thus does not affect the first wal file. The patch fixes that and also update the related unit tests.
      This PR is built on top of https://github.com/facebook/rocksdb/pull/3756
      Closes https://github.com/facebook/rocksdb/pull/3824
      
      Differential Revision: D7909153
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 024ed99d2555db06bf096c902b998e432bb7b9ce
      718c1c9c
  20. 12 5月, 2018 1 次提交
  21. 10 5月, 2018 2 次提交
    • A
      Apply use_direct_io_for_flush_and_compaction to writes only · 072ae671
      Andrew Kryczka 提交于
      Summary:
      Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.
      
      This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously.
      Closes https://github.com/facebook/rocksdb/pull/3829
      
      Differential Revision: D7915443
      
      Pulled By: ajkr
      
      fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279
      072ae671
    • D
      Introduce and use the option to disable stall notifications structures · f92cd2fe
      Dmitri Smirnov 提交于
      Summary:
      and code. Removing this helps with insert performance.
      Closes https://github.com/facebook/rocksdb/pull/3830
      
      Differential Revision: D7921030
      
      Pulled By: siying
      
      fbshipit-source-id: 84e80d50a7ef96f5441c51c9a0d089c50217cce2
      f92cd2fe