1. May 16, 2019 (7 commits)
  2. May 15, 2019 (1 commit)
    • Fix a bug in db_stress and an incorrect assertion in FilePickerMultiGet (#5301) · 6492430e
      anand76 committed
      Summary:
      This PR has two fixes for crash test failures -
      1. Fix a bug in TestMultiGet() in db_stress that was passing the list of keys to MultiGet() in the wrong order, which guaranteed that the actual values would not match the expected values (see the sketch after this list)
      2. Remove an incorrect assertion in FilePickerMultiGet::GetNextFileInLevelWithKeys() that checks that files in a level are visited in sorted order. This does not hold for MultiGet(), especially if there are duplicate keys and we may have to go back one file for the next key. Furthermore, such an assertion makes more sense when a new version is created, rather than at lookup time
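      
      The ordering contract the first fix restores, sketched with a plain `DB::MultiGet` call (illustrative code, not the db_stress source): results are positional, so the i-th status and value correspond to the i-th key.
      ```
      #include <string>
      #include <vector>
      #include "rocksdb/db.h"

      void CheckBatch(rocksdb::DB* db, const std::vector<std::string>& keys) {
        std::vector<rocksdb::Slice> key_slices;
        key_slices.reserve(keys.size());
        for (const std::string& k : keys) {
          key_slices.emplace_back(k);  // Slice i refers to keys[i]
        }
        std::vector<std::string> values;
        std::vector<rocksdb::Status> statuses =
            db->MultiGet(rocksdb::ReadOptions(), key_slices, &values);
        for (size_t i = 0; i < keys.size(); ++i) {
          // statuses[i] and values[i] correspond to keys[i]; checking them in any
          // other order reports spurious mismatches.
          if (statuses[i].ok()) { /* compare values[i] with the expected value */ }
        }
      }
      ```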
      
      Test -
      asan_crash and ubsan_crash tests
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5301
      
      Differential Revision: D15337383
      
      Pulled By: anand1976
      
      fbshipit-source-id: 35092cb15bbc1700e5e823cbe07bfa62f1e9e6c6
  3. May 14, 2019 (2 commits)
    • Unordered Writes (#5218) · f383641a
      Maysam Yabandeh committed
      Summary:
      Perform unordered writes in RocksDB when the unordered_write option is set to true. When enabled, writes to the memtable are done without joining any write thread, which offers much higher write throughput since upcoming writes no longer have to wait for the slowest memtable write to finish. The tradeoff is that the set of writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanism to work around it. Using TransactionDB with the WRITE_PREPARED write policy is one way to achieve that; doing so increases the max throughput by 2.2x without compromising the snapshot guarantees.
      The patch is based on an original draft by siying.
      Existing unit tests are extended to cover the unordered_write option.
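      
      A minimal sketch of how the option is meant to be used (the surrounding setup is illustrative; unordered_write, TransactionDBOptions::write_policy, and WRITE_PREPARED are the knobs named above):
      ```
      #include <string>
      #include "rocksdb/options.h"
      #include "rocksdb/utilities/transaction_db.h"

      rocksdb::Options MakeUnorderedWriteOptions() {
        rocksdb::Options options;
        options.unordered_write = true;  // memtable writes skip the write thread
        return options;
      }

      // Keep snapshot guarantees by layering a WRITE_PREPARED TransactionDB on top.
      rocksdb::Status OpenWritePreparedDb(const std::string& path,
                                          rocksdb::TransactionDB** txn_db) {
        rocksdb::Options options = MakeUnorderedWriteOptions();
        options.create_if_missing = true;
        rocksdb::TransactionDBOptions txn_db_options;
        txn_db_options.write_policy = rocksdb::TxnDBWritePolicy::WRITE_PREPARED;
        return rocksdb::TransactionDB::Open(options, txn_db_options, path, txn_db);
      }
      ```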
      
      Benchmark Results:
      ```
      TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
      ```
      With WAL
      - Vanilla RocksDB: 78.6 MB/s
      - WRITE_PREPARED with unordered_write: 177.8 MB/s (2.2x)
      - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
      
      Without WAL
      - Vanilla RocksDB: 111.3 MB/s
      - WRITE_PREPARED with unordered_write: 259.3 MB/s (2.3x)
      - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
      
      - WRITE_PREPARED with unordered_write and concurrency control disabled: 185.3 MB/s (2.35x)
      
      Limitations:
      - The feature is not yet extended to `max_successive_merges` > 0. It is also incompatible with `enable_pipelined_write` = true and with `allow_concurrent_memtable_write` = false.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
      
      Differential Revision: D15219029
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
    • db_bench: fix hang on IO error (#5300) · 92c60547
      Yi Wu committed
      Summary:
      db_bench would wait indefinitely if there is a background error. Fix by passing `abs_time_us` to the condition variable.
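      
      The general pattern, sketched with a standard condition variable (the db_bench fix uses RocksDB's own port abstraction; this only illustrates waiting with an absolute deadline instead of indefinitely):
      ```
      #include <chrono>
      #include <condition_variable>
      #include <mutex>

      // Wait with an absolute deadline so the waiter wakes up even if the
      // notification never arrives (e.g. because of a background error).
      bool WaitForDone(std::mutex& mu, std::condition_variable& cv, bool& done,
                       std::chrono::steady_clock::time_point abs_deadline) {
        std::unique_lock<std::mutex> lock(mu);
        return cv.wait_until(lock, abs_deadline, [&done] { return done; });
      }
      ```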
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5300
      
      Differential Revision: D15319945
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 0034fb7f6ec7c3303c4ccf26e54c20fbdac8ab44
  4. May 11, 2019 (3 commits)
    • Fix a race condition caused by unlocking db mutex (#5294) · e6260165
      Yanqin Jin committed
      Summary:
      Previous code may call `~ColumnFamilyData` in `DBImpl::AtomicFlushMemTablesToOutputFiles` if the column family is dropped or `cfd->IsFlushPending() == false`. In `~ColumnFamilyData`, the db mutex is released briefly and re-acquired. This can cause a correctness issue. The reason is as follows.
      
      Assume there are multiple bg flush threads. After bg_flush_thr1 releases the db mutex, bg_flush_thr2 can grab it and pop an element from the flush queue. This will cause bg_flush_thr2 to accidentally pick some memtables which should have been picked by bg_flush_thr1. To make matters worse, bg_flush_thr2 can clear the `flush_requested_` flag for the memtable list, causing a subsequent call to `MemTableList::IsFlushPending()` by bg_flush_thr1 to return false, which is wrong.
      
      The fix is to delay `ColumnFamilyData::Unref` and `~ColumnFamilyData` for column families not selected for flush until `AtomicFlushMemTablesToOutputFiles` returns. Furthermore, a bg flush thread should not clear `MemTableList::flush_requested_` in `MemTableList::PickMemtablesToFlush` unless atomic flush is not used **or** the memtable list does not have unpicked memtables.
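      
      A hedged sketch of the general pattern behind the first part of the fix (generic code, not the actual DBImpl/ColumnFamilyData internals): defer dropping the last reference until the mutex-protected work is finished, so the mutex is never released mid-loop by a destructor.
      ```
      #include <mutex>
      #include <vector>

      struct RefCounted {
        int refs = 1;
        bool Unref() { return --refs == 0; }  // true when the last reference is gone
        virtual ~RefCounted() = default;
      };

      void ProcessUnderMutex(std::mutex& db_mutex, std::vector<RefCounted*>& cfds) {
        std::vector<RefCounted*> defer_delete;
        {
          std::lock_guard<std::mutex> lock(db_mutex);
          for (RefCounted* cfd : cfds) {
            // ... decide whether this column family is selected for flush ...
            if (cfd->Unref()) {
              // Do not destroy here: the destructor may need to release and
              // re-acquire db_mutex, opening a window for another flush thread.
              defer_delete.push_back(cfd);
            }
          }
        }
        for (RefCounted* cfd : defer_delete) {
          delete cfd;  // safe: the critical section is over
        }
      }
      ```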
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5294
      
      Differential Revision: D15295297
      
      Pulled By: riversand963
      
      fbshipit-source-id: 03b101205ca22c242647cbf488bcf0ed80b2ecbd
    • Fix crash in BlockBasedTableIterator::Seek() (#5291) · 6a6aef25
      Mike Kolupaev committed
      Summary:
      https://github.com/facebook/rocksdb/pull/5256 broke it: `block_iter_.user_key()` may not be valid even if `block_iter_points_to_real_block_` is true, e.g. if there was an IO error or the status is Incomplete.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5291
      
      Differential Revision: D15273324
      
      Pulled By: al13n321
      
      fbshipit-source-id: 442e5b09f9884a58f92a6ac1ca93af719c219886
    • Turn CachableEntry into a proper resource handle (#5252) · f0bf3bf3
      Levi Tamasi committed
      Summary:
      CachableEntry is used in a variety of contexts: it may refer to a cached
      object (i.e. an object in the block cache), an owned object, or an
      unowned object; also, in some cases (most notably with iterators), the
      responsibility of managing the pointed-to object gets handed off to
      another object. Each of the above scenarios has different implications
      for the lifecycle of the referenced object. For the most part, the patch
      does not change the lifecycle of managed objects; however, it makes
      these relationships explicit, and it also enables us to eliminate some
      hacks and accident-prone code around releasing cache handles and
      deleting/cleaning up objects. (The only places where the patch changes
      how objects are managed are the partitions of partitioned indexes and
      filters.)
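      
      A minimal sketch of what a "proper resource handle" for such an entry looks like (hypothetical names and a simplified interface, not the actual CachableEntry class): move-only ownership, with cleanup that depends on whether the object is owned or merely referenced.
      ```
      // Hypothetical, simplified handle: exactly one handle is responsible for
      // releasing the referenced object (a real cached entry would also release
      // its block cache handle in Reset()).
      template <class T>
      class EntryHandle {
       public:
        EntryHandle() = default;
        EntryHandle(T* value, bool owned) : value_(value), owned_(owned) {}
        ~EntryHandle() { Reset(); }

        // Move-only, so responsibility for cleanup cannot be duplicated by accident.
        EntryHandle(const EntryHandle&) = delete;
        EntryHandle& operator=(const EntryHandle&) = delete;
        EntryHandle(EntryHandle&& rhs) noexcept
            : value_(rhs.value_), owned_(rhs.owned_) {
          rhs.value_ = nullptr;
          rhs.owned_ = false;
        }

        // Hand the pointed-to object off to another owner (e.g. an iterator's
        // cleanup function), leaving this handle empty.
        T* TransferTo() {
          T* v = value_;
          value_ = nullptr;
          owned_ = false;
          return v;
        }

        void Reset() {
          if (owned_) {
            delete value_;   // owned object: delete it
          }
          value_ = nullptr;  // unowned object: just forget the pointer
          owned_ = false;
        }

        T* GetValue() const { return value_; }

       private:
        T* value_ = nullptr;
        bool owned_ = false;
      };
      ```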
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5252
      
      Differential Revision: D15101358
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 9eb59e9ae5a7230e3345789762d0ba1f189485be
  5. May 10, 2019 (4 commits)
    • Add C bindings for LowerThreadPoolIO/CPUPriority (#5285) · 6451673f
      Jelte Fennema committed
      Summary:
      There were no C bindings for lowering thread pool IO/CPU priority. This adds them.
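      
      The new C functions expose existing Env functionality; a minimal C++ sketch of the underlying calls (the exact C symbol names are defined in the PR and not reproduced here):
      ```
      #include "rocksdb/env.h"

      // Lower both IO and CPU priority of the low-priority (compaction) thread pool.
      void LowerBackgroundPoolPriorities(rocksdb::Env* env) {
        env->LowerThreadPoolIOPriority(rocksdb::Env::LOW);
        env->LowerThreadPoolCPUPriority(rocksdb::Env::LOW);
      }
      ```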
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5285
      
      Differential Revision: D15290050
      
      Pulled By: siying
      
      fbshipit-source-id: b2ed94d0c39d27434ace2204829a242b53d0d67a
    • Merging iterator to avoid child iterator reseek for some cases (#5286) · 9fad3e21
      Siying Dong committed
      Summary:
      When a reseek happens in the merging iterator, reseeking a child iterator can be avoided if:
      (1) the child iterator represents immutable data,
      (2) the reseek() target is larger than the current key, and
      (3) the child iterator's current key is larger than the seek key,
      because in that case the child iterator is guaranteed to end up in the same position.
      
      This optimization will be useful for use cases where users keep seeking to keys nearby in ascending order.
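      
      A simplified sketch of the condition (illustrative, not the verbatim MergingIterator code):
      ```
      #include "rocksdb/comparator.h"
      #include "rocksdb/slice.h"

      // Reseeking this child can be skipped if its data is immutable, the seek
      // moves forward, and the child already sits beyond the target, so a reseek
      // would land it on the same entry anyway.
      bool CanSkipChildReseek(bool child_is_immutable, bool seek_moves_forward,
                              const rocksdb::Slice& child_current_key,
                              const rocksdb::Slice& seek_target,
                              const rocksdb::Comparator* cmp) {
        return child_is_immutable && seek_moves_forward &&
               cmp->Compare(child_current_key, seek_target) > 0;
      }
      ```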
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5286
      
      Differential Revision: D15283635
      
      Pulled By: siying
      
      fbshipit-source-id: 35f79ffd5ce3609146faa8cd55f2bfd733502f83
    • Fix bugs in FilePickerMultiGet (#5292) · 181bb43f
      anand76 committed
      Summary:
      This PR fixes a couple of bugs in FilePickerMultiGet that were causing db_stress test failures. The failures were caused by -
      1. Improper handling of a key that matches the user key portion of an L0 file's largest key. In this case, the curr_index_in_curr_level file index in L0 for that key was getting incremented, but batch_iter_ was not advanced. By design, all keys in a batch are supposed to be checked against an L0 file before advancing to the next L0 file. Not advancing to the next key in the batch was causing a double increment of curr_index_in_curr_level due to the same key being processed again
      2. Improper handling of a key that matches the user key portion of the largest key in the last file of L1 and higher levels. This resulted in a premature end to the processing of the batch for that level when the next key in the batch is a duplicate. Typically, the keys in MultiGet will not be duplicates, but it's good to handle that case correctly
      
      Test -
      asan_crash
      make check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5292
      
      Differential Revision: D15282530
      
      Pulled By: anand1976
      
      fbshipit-source-id: d1a6a86e0af273169c3632db22a44d79c66a581f
    • DBIter::Next() can skip user key checking if previous entry's seqnum is 0 (#5244) · 25d81e45
      Siying Dong committed
      Summary:
      Right now, DBIter::Next() always checks whether an entry is for the same user key as the previous entry to see whether the key should be hidden from the user. However, if the previous entry's sequence number is 0, the check is not needed because 0 is the oldest possible sequence number.
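      
      A simplified sketch of the check (illustrative; the actual DBIter code differs):
      ```
      #include <cstdint>

      // If the previously emitted entry had sequence number 0, no older version of
      // that user key can exist, so the user-key comparison for the next entry can
      // be skipped entirely.
      bool NeedUserKeyCompare(uint64_t prev_entry_seqno) {
        return prev_entry_seqno != 0;  // 0 is the oldest possible sequence number
      }
      ```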
      
      We could extend this from the seqnum 0 case to simply prev_seqno >= current_seqno. However, that is less robust against bugs or unexpected situations, while the gain is relatively low. We can always extend it later if needed.
      
      In a readseq benchmark against a fully formed LSM-tree, the number of key comparisons is reduced from 2.981 to 2.165. In readseq against a fully compacted DB, no key comparisons are made at all. Performance in this benchmark didn't show an obvious improvement, which is expected because key comparisons take only a small percentage of CPU. The change may prove more effective if users have an expensive customized comparator.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5244
      
      Differential Revision: D15067257
      
      Pulled By: siying
      
      fbshipit-source-id: b7e1ef3ec4fa928cba509683d2b3246e35d270d9
  6. May 9, 2019 (1 commit)
  7. May 8, 2019 (1 commit)
    • avoid updating index type during iterator creation (#5288) · eea1cad8
      Zhongyi Xie committed
      Summary:
      Right now there is a potential race condition when two threads are created to iterate through the DB (https://gist.github.com/miasantreble/88f5798a397ee7cb8e7baff9db2d9e85). The problem is that in `BlockBasedTable::NewIndexIterator`, if both threads fail to find the index_reader in the block cache, they will call `CreateIndexReader->UpdateIndexType()`, which creates a race to update `index_type` in the shared rep_ object. By checking the code, we realized the index type is always populated by `PrefetchIndexAndFilterBlocks` during the table `Open` call, so there is no need to update the index type every time during iterator creation. This PR attempts to fix the race condition by removing the unnecessary call to `UpdateIndexType`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5288
      
      Differential Revision: D15252509
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 6e3258652121d5c76d267f7ac457e15c5e84756e
  8. May 7, 2019 (1 commit)
  9. May 4, 2019 (3 commits)
    • Refresh snapshot list during long compactions (2nd attempt) (#5278) · 6a40ee5e
      Maysam Yabandeh committed
      Summary:
      Part of compaction CPU goes to processing the snapshot list; the larger the list, the bigger the overhead. Although the lifetime of most snapshots is much shorter than the lifetime of compactions, a compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long, which lets the compaction continue more efficiently with a much smaller snapshot list.
      For simplicity, the feature is disabled in two cases: i) when more than one sub-compaction shares the same snapshot list, and ii) when range deletes are used, since the range delete aggregator keeps its own copy of the snapshot list.
      This fixes the issue with range deletes that caused https://github.com/facebook/rocksdb/pull/5099 to be reverted.
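      
      A conceptual sketch of the refresh pattern (generic code, not the RocksDB compaction or callback API):
      ```
      #include <cstddef>
      #include <cstdint>
      #include <functional>
      #include <vector>

      using SnapshotList = std::vector<uint64_t>;  // snapshot sequence numbers

      // Periodically replace the snapshot list with a fresh copy so that keys
      // processed later in a long compaction see a smaller, up-to-date list.
      void ProcessKeys(size_t num_keys, SnapshotList snapshots,
                       const std::function<SnapshotList()>& refresh_snapshots,
                       size_t refresh_every_n_keys) {
        for (size_t i = 0; i < num_keys; ++i) {
          // ... decide key i's visibility against `snapshots` ...
          if ((i + 1) % refresh_every_n_keys == 0) {
            snapshots = refresh_snapshots();  // released snapshots drop out here
          }
        }
      }
      ```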
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278
      
      Differential Revision: D15203291
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258
    • multiget: fix memory issues due to vector auto resizing (#5279) · 5d27d65b
      Zhongyi Xie committed
      Summary:
      This PR fixes three memory issues found by ASAN
      * in db_stress, the key vector for MultiGet is created using `emplace_back`, which can invalidate references to the underlying storage (vector<string>) due to auto resizing. Fixed by calling reserve in advance (see the sketch after this list).
      * A similar issue exists in the construction of the GetContext autovector in version_set.cc.
      * In multiget_context.h, use the T[] specialization for the unique_ptr that holds a char array.
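      
      A minimal illustration of the two C++ pitfalls involved (generic code, not the db_stress or version_set sources):
      ```
      #include <memory>
      #include <string>
      #include <vector>

      int main() {
        std::vector<std::string> keys;
        keys.reserve(100);  // reserve up front: otherwise emplace_back may
                            // reallocate and invalidate Slices/references that
                            // point into the stored strings
        for (int i = 0; i < 100; ++i) {
          keys.emplace_back("key" + std::to_string(i));
        }

        // For heap-allocated char arrays, use the T[] specialization so that
        // delete[] (not delete) runs on destruction.
        std::unique_ptr<char[]> buf(new char[4096]);
        return 0;
      }
      ```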
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5279
      
      Differential Revision: D15202893
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 14cc2cda0ed64d29f2a1e264a6bfdaa4294ee75d
    • fix implicit conversion error reported by clang check (#5277) · 3e994809
      Zhongyi Xie committed
      Summary:
      Fix the following clang check errors:
      ```
      tools/db_stress.cc:3609:30: error: implicit conversion loses integer precision: 'std::vector::size_type' (aka 'unsigned long') to 'int' [-Werror,-Wshorten-64-to-32]
          int num_keys = rand_keys.size();
              ~~~~~~~~   ~~~~~~~~~~^~~~~~
      tools/db_stress.cc:3888:30: error: implicit conversion loses integer precision: 'std::vector::size_type' (aka 'unsigned long') to 'int' [-Werror,-Wshorten-64-to-32]
          int num_keys = rand_keys.size();
              ~~~~~~~~   ~~~~~~~~~~^~~~~~
      2 errors generated.
      make: *** [tools/db_stress.o] Error 1
      ```
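      
      A typical fix for `-Wshorten-64-to-32` (the actual patch may differ): keep the wider type, or make the narrowing explicit.
      ```
      #include <cstddef>
      #include <string>
      #include <vector>

      void Example(const std::vector<std::string>& rand_keys) {
        // Option 1: keep the wider type.
        std::size_t num_keys = rand_keys.size();
        // Option 2: make the narrowing conversion explicit.
        int num_keys_int = static_cast<int>(rand_keys.size());
        (void)num_keys;
        (void)num_keys_int;
      }
      ```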
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5277
      
      Differential Revision: D15196620
      
      Pulled By: miasantreble
      
      fbshipit-source-id: d56b1420d4a9f1df875fc52877a5fbb342bc7cae
  10. May 3, 2019 (1 commit)
  11. May 2, 2019 (5 commits)
  12. May 1, 2019 (5 commits)
  13. April 30, 2019 (3 commits)
    • Disable pipelined write in atomic flush stress test (#5266) · 210b49ca
      Yanqin Jin committed
      Summary:
      Since pipelined write currently allows one thread to perform memtable writes
      while another thread is traversing the `flush_scheduler_`, it can cause an
      assertion failure in `FlushScheduler::Clear`. To unblock crash recovery tests,
      we temporarily disable pipelined write when atomic flush is enabled.
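      
      A minimal sketch of the option combination the stress test now enforces (both fields exist in Options; the stress-test wiring itself is omitted):
      ```
      #include "rocksdb/options.h"

      rocksdb::Options MakeAtomicFlushStressOptions() {
        rocksdb::Options options;
        options.atomic_flush = true;             // feature under test
        options.enable_pipelined_write = false;  // temporarily incompatible, as noted above
        return options;
      }
      ```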
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5266
      
      Differential Revision: D15142285
      
      Pulled By: riversand963
      
      fbshipit-source-id: a0c20fe4ac543e08feaed602414f982054df7831
    • CMake has stock FindZLIB in upper case. (#5261) · 18864567
      Tongliang Liao committed
      Summary:
      More details in https://cmake.org/cmake/help/v3.14/module/FindZLIB.html
      
      This resolves the CMake configuration error of not finding `Findzlib` on Linux (CentOS 7 + CMake 3.14.3 + gcc-8).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5261
      
      Differential Revision: D15138052
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 2f4445f49a36c16e6f1e05c090018c02379c0de4
    • Fix a bug when trigger atomic flush and close db (#5254) · 35e6ba73
      Yanqin Jin committed
      Summary:
      With atomic flush, RocksDB background flush will flush memtables of a column family up to the largest memtable id in the immutable memtable list. This can introduce a bug in the following scenario. A user thread inserts into a column family until the memtable is full and triggers a flush. This will add the column family to flush_scheduler_. Then the user thread writes another record to the column family. In the PreprocessWrite function, the user thread picks the column family from flush_scheduler_ and schedules a flush request. The flush request guarantees to flush all the memtables up to the current largest memtable ID of the immutable memtable list. Then the user thread writes new data to the newly-created active memtable. After the write returns, the user thread closes the db. This can cause an assertion failure when the background flush thread tries to install the superversion for the column family. The solution is to not install flush results if the db has already set `shutting_down_` to true.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5254
      
      Differential Revision: D15124149
      
      Pulled By: riversand963
      
      fbshipit-source-id: 0a667a41339dedb5a18bcb01b0bf11c275c04df0
  14. April 27, 2019 (2 commits)
    • Improve explicit user readahead performance (#5246) · 3548e422
      Sagar Vemuri committed
      Summary:
      Improve iterator performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size` (see the sketch after the list below).
      
      1. Stop creating new table readers when the user explicitly sets readahead size.
      2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
      3. Add `readahead_size` to db_bench.
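      
      A minimal sketch of the user-facing knob (illustrative; the db_bench flag from item 3 is assumed to map to this option):
      ```
      #include <memory>
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      // Create an iterator with an explicit 1 MB readahead buffer.
      std::unique_ptr<rocksdb::Iterator> NewReadaheadIterator(rocksdb::DB* db) {
        rocksdb::ReadOptions read_options;
        read_options.readahead_size = 1024 * 1024;
        return std::unique_ptr<rocksdb::Iterator>(db->NewIterator(read_options));
      }
      ```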
      
      **Benchmarks:**
      https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737
      
      For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
      For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.
      
      **Test Plan:**
      Updated `DBIteratorTest.ReadAhead` test to make sure that:
      - no new table readers are created for iterators on setting ReadOptions.readahead_size
      - At least "readahead" number of bytes are actually getting read on each iterator read.
      
      TODO later:
      - Use similar logic for compactions as well.
      - This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAccessFile later.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246
      
      Differential Revision: D15107946
      
      Pulled By: sagar0
      
      fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
    • Fix ubsan failure in snapshot refresh (#5257) · 8c7eb598
      Maysam Yabandeh committed
      Summary:
      The newly added test CompactionJobTest.SnapshotRefresh sets the snapshot refresh period to 0 to stress the feature. This results in a large number of refresh events, which in turn results in a UBSAN failure when a bitwise shift operand goes beyond the uint64_t size.
      The patch fixes that by simplifying the shift logic to shift by only 2 bits after each refresh. Furthermore, it verifies that the shift operation does not decrease the refresh period.
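      
      A hedged sketch of the overflow-safe backoff described above (illustrative names, not the actual CompactionJob code):
      ```
      #include <cstdint>

      // Grow the refresh period by shifting 2 bits per refresh, but never let it
      // wrap around and become smaller than before.
      uint64_t NextSnapshotRefreshPeriod(uint64_t period) {
        uint64_t next = period << 2;
        return next > period ? next : period;
      }
      ```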
      
      Testing:
      COMPILE_WITH_UBSAN=1 make -j32 compaction_job_test
      ./compaction_job_test --gtest_filter=CompactionJobTest.SnapshotRefresh
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5257
      
      Differential Revision: D15106463
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f2718898ea7ba4fa9f7e87b70cf98fe647c0de80
  15. April 26, 2019 (1 commit)
    • Refresh snapshot list during long compactions (#5099) · 506e8448
      Maysam Yabandeh committed
      Summary:
      Part of compaction CPU goes to processing the snapshot list; the larger the list, the bigger the overhead. Although the lifetime of most snapshots is much shorter than the lifetime of compactions, a compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long, which lets the compaction continue more efficiently with a much smaller snapshot list.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099
      
      Differential Revision: D15086710
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12