1. 19 Jul 2019 (1 commit)
  2. 17 Jul 2019 (1 commit)
    • Move the filter readers out of the block cache (#5504) · 3bde41b5
      Committed by Levi Tamasi
      Summary:
      Currently, when the block cache is used for the filter block, it is not
      really the block itself that is stored in the cache but a FilterBlockReader
      object. Since this object is not pure data (it has, for instance, pointers that
      might dangle, including in one case a back pointer to the TableReader), it's not
      really sharable. To avoid the issues around this, the current code erases the
      cache entries when the TableReader is closed (which, BTW, is not sufficient
      since a concurrent TableReader might have picked up the object in the meantime).
      Instead of doing this, the patch moves the FilterBlockReader out of the cache
      altogether, and decouples the filter reader object from the filter block.
      In particular, instead of the TableReader owning, or caching/pinning the
      FilterBlockReader (based on the customer's settings), with the change the
      TableReader unconditionally owns the FilterBlockReader, which in turn
      owns/caches/pins the filter block. This change also enables us to reuse the code
      paths historically used for data blocks for filters as well.
      
      Note:
      Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
      separate phase.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504
      
      Test Plan: make asan_check
      
      Differential Revision: D16036974
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
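The ownership model above can be sketched in miniature (a toy Python model for illustration; the class shapes, the cache-key tuple, and the byte payload are assumptions of this sketch, not RocksDB's C++ types):

```python
class FilterBlockReader:
    """Lives outside the block cache, owned by a TableReader; only the
    pure-data filter block (which is safely shareable) enters the cache."""
    def __init__(self, block_cache, cache_key, load_block):
        if cache_key not in block_cache:
            block_cache[cache_key] = load_block()  # load the block once
        self.block = block_cache[cache_key]        # owned/pinned reference

class TableReader:
    def __init__(self, name, block_cache):
        # Unconditionally owns its FilterBlockReader.
        self.filter_reader = FilterBlockReader(
            block_cache, ("filter", name), lambda: b"\x01\x02")

cache = {}
t1 = TableReader("sst1", cache)
t2 = TableReader("sst1", cache)  # a second reader for the same file
```

Both readers end up sharing the same cached block, while the reader objects themselves (which may hold dangling-prone pointers) never enter the cache.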
  3. 10 Jul 2019 (1 commit)
    • Allow ldb to open DB as secondary (#5537) · aa0367aa
      Committed by sdong
      Summary:
      Right now ldb can open a running DB through a read-only DB. However, it might leave info log files in the read-only DB directory. Add an option to open the DB as secondary to avoid this.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5537
      
      Test Plan:
      Run
      ./ldb scan  --max_keys=10 --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp --no_value --hex
      and
      ./ldb get 0x00000000000000103030303030303030 --hex --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp
      against a normal db_bench run and observe the output changes. Also observe that no new info log files are created under /tmp/rocksdbtest-2491/dbbench.
      Run without --secondary_path and observe that new info log files are created under /tmp/rocksdbtest-2491/dbbench.
      
      Differential Revision: D16113886
      
      fbshipit-source-id: 4e09dec47c2528f6ca08a9e7a7894ba2d9daebbb
  4. 08 Jul 2019 (1 commit)
    • Support GetAllKeyVersions() for non-default cf (#5544) · 7c76a7fb
      Committed by Yanqin Jin
      Summary:
      Previously `GetAllKeyVersions()` supported the default column family only. This PR adds support for other column families.
      
      Test plan (devserver):
      ```
      $make clean && COMPILE_WITH_ASAN=1 make -j32 db_basic_test
      $./db_basic_test --gtest_filter=DBBasicTest.GetAllKeyVersions
      ```
      All other unit tests must pass.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5544
      
      Differential Revision: D16147551
      
      Pulled By: riversand963
      
      fbshipit-source-id: 5a61aece2a32d789e150226a9b8d53f4a5760168
  5. 07 Jul 2019 (1 commit)
  6. 04 Jul 2019 (1 commit)
  7. 03 Jul 2019 (1 commit)
  8. 01 Jul 2019 (1 commit)
    • MultiGet parallel IO (#5464) · 7259e28d
      Committed by anand76
      Summary:
      Enhancement to MultiGet batching to read the data blocks required for keys in a batch in parallel from disk. It uses the Env::MultiRead() API to read multiple blocks together and reduce latency.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5464
      
      Test Plan:
      1. make check
      2. make asan_check
      3. make asan_crash
      
      Differential Revision: D15911771
      
      Pulled By: anand1976
      
      fbshipit-source-id: 605036b9af0f90ca0020dc87c3a86b4da6e83394
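The batching idea can be sketched as follows (Python threads standing in for Env::MultiRead; the toy block reader is an assumption of this sketch, not RocksDB's read path):

```python
from concurrent.futures import ThreadPoolExecutor

def multi_read(read_block, block_offsets):
    """Issue all block reads of a MultiGet batch as one parallel batch
    (the role Env::MultiRead plays) instead of one serial read per key."""
    with ThreadPoolExecutor(max_workers=max(1, len(block_offsets))) as pool:
        return list(pool.map(read_block, block_offsets))

# Toy "device": the block at offset i holds the value i * 10.
values = multi_read(lambda off: off * 10, [1, 2, 3])
```

The latency win comes from overlapping the per-block waits rather than paying them back to back.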
  9. 28 Jun 2019 (1 commit)
    • LRU Cache to enable mid-point insertion by default (#5508) · 15fd3be0
      Committed by sdong
      Summary:
      Mid-point insertion is a useful feature and is mature now. Make it the default. Also change cache_index_and_filter_blocks_with_high_priority=true to be the default accordingly, so that index and filter blocks are not evicted more easily after the change, to avoid too many surprises to users.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5508
      
      Test Plan: Run all existing tests.
      
      Differential Revision: D16021179
      
      fbshipit-source-id: ce8456e8d43b3bfb48df6c304b5290a9d19817eb
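The mid-point idea can be sketched with two pools (a toy Python cache for illustration, not RocksDB's LRUCache; the 50% high-priority ratio is an assumption of this sketch):

```python
from collections import OrderedDict

class MidpointLRU:
    """New entries start in a low-priority pool and are promoted to the
    high-priority pool only on a second access, so a one-time scan
    cannot flush out hot entries."""

    def __init__(self, capacity, high_ratio=0.5):
        self.capacity = capacity
        self.high_capacity = max(1, int(capacity * high_ratio))
        self.high = OrderedDict()  # hot entries, MRU at the end
        self.low = OrderedDict()   # newly inserted entries, MRU at the end

    def get(self, key):
        if key in self.high:
            self.high.move_to_end(key)
            return self.high[key]
        if key not in self.low:
            return None
        value = self.low.pop(key)      # second touch: promote
        self.high[key] = value
        while len(self.high) > self.high_capacity:
            k, v = self.high.popitem(last=False)
            self.low[k] = v            # demote the coldest hot entry
        return value

    def put(self, key, value):
        self.low[key] = value          # insert at the mid-point
        while len(self.high) + len(self.low) > self.capacity:
            self.low.popitem(last=False)  # evict the coldest low-pri entry
```

With cache_index_and_filter_blocks_with_high_priority, index/filter blocks would sit in the high-priority pool from the start.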
  10. 27 Jun 2019 (1 commit)
  11. 25 Jun 2019 (1 commit)
    • Add an option to put first key of each sst block in the index (#5289) · b4d72094
      Committed by Mike Kolupaev
      Summary:
      The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.
      
      Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
      
      So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in a small number of sst files; in this case, the cost of the initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
      
      Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
      
      This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
      
      Differential Revision: D15256423
      
      Pulled By: al13n321
      
      fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
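The deferred-load behavior can be sketched as follows (a toy Python merge over non-overlapping block ranges, matching the log-structured workload described above; not the real merging iterator):

```python
import heapq

def short_scan(files, limit):
    """The index gives each block's first key, so a block is read only
    when its first key reaches the top of the merging heap.  Assumes
    block key ranges do not overlap across files."""
    blocks_read = []                    # record which (file, block) got read

    heap = []
    for name, blocks in files.items():  # blocks: list of sorted key lists
        if blocks:
            heapq.heappush(heap, (blocks[0][0], name, 0))

    out = []
    while heap and len(out) < limit:
        _, name, idx = heapq.heappop(heap)
        blocks_read.append((name, idx))       # the deferred read happens here
        for key in files[name][idx]:
            if len(out) < limit:
                out.append(key)
        if idx + 1 < len(files[name]):
            nxt = files[name][idx + 1]
            heapq.heappush(heap, (nxt[0], name, idx + 1))
    return out, blocks_read
```

In a short scan, files whose first keys never reach the top of the heap contribute no block reads at all, which is the read-amplification win.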
  12. 22 Jun 2019 (1 commit)
  13. 19 Jun 2019 (3 commits)
    • Replace Corruption with TryAgain status when new tail is not visible to TransactionLogIterator (#5474) · fe90ed7a
      Committed by Simon Grätzer
      
      Summary:
      When tailing the WAL with TransactionLogIterator, it used to return a Corruption status to indicate that the WAL has a new tail that is not visible to the iterator, which is misleading. The patch replaces it with TryAgain, which is more descriptive, indicating that the user needs to create a new iterator to fetch the recent tail.
      Fixes https://github.com/facebook/rocksdb/issues/5455
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5474
      
      Differential Revision: D15898953
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 40966f6457cb539e1aeb104daeada6b0e46059fc
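From the caller's side, the new contract looks roughly like this (a hypothetical sketch; `FakeLogIterator` and the string statuses stand in for TransactionLogIterator and rocksdb::Status):

```python
def tail_wal(open_iterator, max_reopens=2):
    """TryAgain now means "the WAL has a newer tail this iterator cannot
    see", so the caller recreates the iterator instead of treating the
    WAL as corrupt."""
    records, reopens = [], 0
    it = open_iterator()
    while True:
        status, record = it.next()
        if status == "OK":
            records.append(record)
        elif status == "TryAgain" and reopens < max_reopens:
            reopens += 1
            it = open_iterator()  # a fresh iterator sees the new tail
        else:
            return records

class FakeLogIterator:
    """Hypothetical stand-in for a tailing log iterator."""
    def __init__(self, batch):
        self._records = iter(batch)
    def next(self):
        try:
            return "OK", next(self._records)
        except StopIteration:
            return "TryAgain", None
```

Under the old behavior, the same situation would surface as Corruption and the caller could not distinguish it from real damage.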
    • Make the 'block read count' performance counters consistent (#5484) · 5355e527
      Committed by Levi Tamasi
      Summary:
      The patch brings the semantics of per-block-type read performance
      context counters in sync with the generic block_read_count by only
      incrementing the counter if the block was actually read from the file.
      It also fixes index_block_read_count, which fell victim to the
      refactoring in PR https://github.com/facebook/rocksdb/issues/5298.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5484
      
      Test Plan: Extended the unit tests.
      
      Differential Revision: D15887431
      
      Pulled By: ltamasi
      
      fbshipit-source-id: a3889759d0ac5759d56625d692cd828d1b9207a6
    • Fix a bug caused by secondary not skipping the beginning of new MANIFEST (#5472) · f287f8dc
      Committed by Yanqin Jin
      Summary:
      While the secondary is replaying after the primary, the primary may switch to a new MANIFEST. The secondary is already able to detect and follow the primary to the new MANIFEST. However, the current implementation has a bug, described as follows.
      The new MANIFEST's first records have been generated by VersionSet::WriteSnapshot to describe the current state of the column families and the db as of the MANIFEST creation. Since the secondary instance has already finished recovering upon start, there is no need for the secondary to process these records. Actually, if the secondary were to replay these records, the secondary may end up adding the same SST files **again** to each column family, causing consistency checks done by VersionBuilder to fail. Therefore, we record the number of records to skip at the beginning of the new MANIFEST and ignore them.
      
      Test plan (on dev server)
      ```
      $make clean && make -j32 all
      $./db_secondary_test
      ```
      All existing unit tests must pass as well.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5472
      
      Differential Revision: D15866771
      
      Pulled By: riversand963
      
      fbshipit-source-id: a1eec4837fb2ad13059398efb0f437e74fd53bed
  14. 15 Jun 2019 (1 commit)
    • Validate CF Options when creating a new column family (#5453) · f1219644
      Committed by Sagar Vemuri
      Summary:
      It seems like CF Options are not properly validated when creating a new column family with the `CreateColumnFamily` API; only a selected few checks are done. Calling `ColumnFamilyData::ValidateOptions`, which is the single source for all CFOptions validations, will help fix this. (`ColumnFamilyData::ValidateOptions` is already called at the time of `DB::Open`.)
      
      **Test Plan:**
      Added a new test: `DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions`
      ```
      TEST_TMPDIR=/dev/shm ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions
      ```
      Also ran gtest-parallel to make sure the new test is not flaky.
      ```
      TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions --repeat=10000
      [10000/10000] DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions (15 ms)
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5453
      
      Differential Revision: D15816851
      
      Pulled By: sagar0
      
      fbshipit-source-id: 9e702b9850f5c4a7e0ef8d39e1e6f9b81e7fe1e5
  15. 12 Jun 2019 (1 commit)
  16. 11 Jun 2019 (1 commit)
    • Improve memtable earliest seqno assignment for secondary instance (#5413) · 6ce55808
      Committed by Yanqin Jin
      Summary:
      In a regular RocksDB instance, `MemTable::earliest_seqno_` is "the db sequence number at the time of creation". However, we cannot use the db sequence number to set the value of `MemTable::earliest_seqno_` for a secondary instance, i.e. `DBImplSecondary`, due to the logic of MANIFEST and WAL replay.
      When replaying the log files of the primary, the secondary instance first replays MANIFEST and updates the db sequence number if necessary. Next, the secondary replays WAL files, creates new memtables if necessary and inserts key-value pairs into memtables. The following can occur when the db has two or more column families.
      Assume the db has column families "default" and "cf1". At a certain point in time, both "default" and "cf1" have data in memtables.
      1. Primary triggers a flush and flushes "cf1". "default" is **not** flushed.
      2. Secondary replays the MANIFEST and updates its db sequence number to the latest value learned from the MANIFEST.
      3. Secondary starts to replay WAL that contains the writes to "default". It is possible that the write batches' sequence numbers are smaller than the db sequence number. In this case, these write batches will be skipped, and these updates will not be visible to reader until "default" is later flushed.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5413
      
      Differential Revision: D15637407
      
      Pulled By: riversand963
      
      fbshipit-source-id: 3de3fe35cfc6f1b9f844f3f926f0df29717b6580
  17. 07 Jun 2019 (1 commit)
    • Refactor the handling of cache related counters and statistics (#5408) · bee2f48a
      Committed by Levi Tamasi
      Summary:
      The patch cleans up the handling of cache hit/miss/insertion related
      performance counters, get context counters, and statistics by
      eliminating some code duplication and factoring out the affected logic
      into separate methods. In addition, it makes the semantics of cache hit
      metrics more consistent by changing the code so that accessing a
      partition of partitioned indexes/filters through a pinned reference no
      longer counts as a cache hit.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5408
      
      Differential Revision: D15610883
      
      Pulled By: ltamasi
      
      fbshipit-source-id: ee749c18965077aca971d8f8bee8b24ed8fa76f1
  18. 06 Jun 2019 (1 commit)
    • Add support for timestamp in Get/Put (#5079) · 340ed4fa
      Committed by Yanqin Jin
      Summary:
      It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).
      
      Test plan (on devserver):
      ```
      $COMPILE_WITH_ASAN=1 make -j32 all
      $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
      $make check
      ```
      All tests must pass.
      
      We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
      ```
      $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
      $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
      ```
      Repeat for 6 times for both versions.
      
      Results are as follows:
      ```
      |        | readrandom | fillrandom |
      | master | 16.77 MB/s | 47.05 MB/s |
      | PR5079 | 16.44 MB/s | 47.03 MB/s |
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079
      
      Differential Revision: D15132946
      
      Pulled By: riversand963
      
      fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
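The <key, timestamp> ordering that the user-supplied comparator must provide can be sketched like this (a toy Python model; the descending-timestamp tie-break is an assumption of this sketch, not a statement about RocksDB's built-in comparators):

```python
import functools

def compare_ts(a, b):
    """Order (user_key, timestamp) tuples: user keys ascending,
    timestamps descending, so the newest version of a key sorts first."""
    (ka, ta), (kb, tb) = a, b
    if ka != kb:
        return -1 if ka < kb else 1
    if ta != tb:
        return -1 if ta > tb else 1  # newer timestamps sort first
    return 0

def get_at(rows, key, read_ts):
    """Return the newest value for `key` whose timestamp <= read_ts."""
    ordered = sorted(rows.items(),
                     key=functools.cmp_to_key(lambda x, y: compare_ts(x[0], y[0])))
    for (k, ts), value in ordered:
        if k == key and ts <= read_ts:
            return value
    return None
```

With this ordering, a point lookup at a given read timestamp is a forward scan that stops at the first version not exceeding it.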
  19. 05 Jun 2019 (1 commit)
  20. 01 Jun 2019 (1 commit)
    • Auto roll logger to enforce options.keep_log_file_num immediately after a new file is created (#5370) · cb094e13
      Committed by Siying Dong
      
      Summary:
      Right now, with the auto roll logger, options.keep_log_file_num enforcement is triggered by events like DB reopen or a full obsolete-file scan. In the meantime, the size and number of log files can grow without limit. We add stronger enforcement of the option, so that the number of log files is always under control.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5370
      
      Differential Revision: D15570413
      
      Pulled By: siying
      
      fbshipit-source-id: 0916c3c4d42ab8fdd29389ee7fd7e1557b03176e
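The stronger enforcement amounts to pruning at file-creation time; a minimal sketch (names are hypothetical):

```python
def roll_log(log_files, new_file, keep_log_file_num):
    """Prune to the limit right when a new info log is created, instead
    of waiting for a DB reopen or a full obsolete-file scan."""
    log_files.append(new_file)
    deleted = []
    while len(log_files) > keep_log_file_num:
        deleted.append(log_files.pop(0))  # drop the oldest log first
    return deleted
```

The invariant after every roll is `len(log_files) <= keep_log_file_num`, so the file count can no longer drift between scans.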
  21. 31 May 2019 (2 commits)
    • Fix WAL replay by skipping old write batches (#5170) · b9f59006
      Committed by Yanqin Jin
      Summary:
      1. Fix a bug in WAL replay in which write batches with old sequence numbers are mistakenly inserted into memtables.
      2. Add support for benchmarking secondary instance to db_bench_tool.
      With changes made in this PR, we can start benchmarking secondary instance
      using two processes. It is also possible to vary the frequency at which the
      secondary instance tries to catch up with the primary. The info log of the
      secondary can be found in a directory whose path can be specified with
      '-secondary_path'.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5170
      
      Differential Revision: D15564608
      
      Pulled By: riversand963
      
      fbshipit-source-id: ce97688ed3d33f69d3a0b9266ebbbbf887aa0ec8
    • Move the index readers out of the block cache (#5298) · 1e355842
      Committed by Levi Tamasi
      Summary:
      Currently, when the block cache is used for index blocks as well, it is
      not really the index block that is stored in the cache but an
      IndexReader object. Since this object is not pure data (it has, for
      instance, pointers that might dangle), it's not really sharable. To
      avoid the issues around this, the current code uses a dummy unique cache
      key for each TableReader to store the IndexReader, and erases the
      IndexReader entry when the TableReader is closed. Instead of doing this,
      the new code moves the IndexReader out of the cache altogether. In
      particular, instead of the TableReader owning, or caching/pinning the
      IndexReader based on the customer's settings, the TableReader
      unconditionally owns the IndexReader, which in turn owns/caches/pins
      the index block (which is itself sharable and thus can be safely put in
      the cache without any hacks).
      
      Note: the change has two side effects:
      1) Partitions of partitioned indexes no longer affect the read
      amplification statistics.
      2) Eviction statistics for index blocks are temporarily broken. We plan to fix
      this in a separate phase.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298
      
      Differential Revision: D15303203
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47
  22. 24 May 2019 (1 commit)
    • Provide an option so that SST ingestion won't fall back to copy after hard linking fails (#5333) · 74a334a2
      Committed by haoyuhuang
      Summary:
      RocksDB always tries to perform a hard link operation on the external SST file to ingest. This operation can fail if the external SST resides on a different device/FS, or the underlying FS does not support hard links. Currently RocksDB assumes that if the link fails, the user is willing to perform a file copy, which is not true according to the post. This commit provides an option named 'failed_move_fall_back_to_copy' for users to choose which behavior they want.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5333
      
      Differential Revision: D15457597
      
      Pulled By: HaoyuHuang
      
      fbshipit-source-id: f3626e13f845db4f7ed970a53ec8a2b1f0d62214
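The option's semantics can be sketched as follows (the flag name follows the commit message; the link-then-copy logic is this sketch's assumption, not RocksDB's ingestion code):

```python
import os
import shutil

def ingest_file(src, dst, failed_move_fall_back_to_copy=True):
    """Try a hard link first; if linking fails (e.g. across filesystems),
    either fall back to a copy or propagate the error, per the option."""
    try:
        os.link(src, dst)
        return "linked"
    except OSError:
        if not failed_move_fall_back_to_copy:
            raise                 # the user opted out of the copy fallback
        shutil.copy(src, dst)
        return "copied"
```

Setting the flag to False turns a silent (and possibly expensive) copy into an explicit error the caller can handle.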
  23. 22 May 2019 (1 commit)
    • LogWriter to only flush after finish generating whole record (#5328) · b2274da0
      Committed by Siying Dong
      Summary:
      Right now, in the log writer, we call flush after writing each physical record. I don't see the necessity of this. The underlying writer has a buffer, so there isn't a concern that the write request is too large either. On the other hand, in an Env where every flush is expensive, the current approach is significantly slower than flushing only after a whole record finishes, when the record is very large.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5328
      
      Differential Revision: D15425032
      
      Pulled By: siying
      
      fbshipit-source-id: 440ebef002dfbb60c59d8388c9ddfc83d79700aa
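The change can be sketched as: fragment the record as before, but flush once per logical record (a toy Python model; the fragment size and names are assumptions of this sketch):

```python
class RecordWriter:
    """Splits a logical record into fixed-size physical fragments but
    calls flush() once per record, not once per fragment."""

    def __init__(self, flush, fragment_size=4):
        self.flush = flush
        self.fragment_size = fragment_size
        self.fragments = []

    def add_record(self, data):
        for i in range(0, len(data), self.fragment_size):
            self.fragments.append(data[i:i + self.fragment_size])
        self.flush()  # single flush after the whole record
```

In an Env where flush is expensive, this reduces the flush count for a large record from one per fragment to exactly one.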
  24. 21 May 2019 (1 commit)
  25. 20 May 2019 (1 commit)
    • WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313) · 5c0e3041
      Committed by Maysam Yabandeh
      Summary:
      WritePrepared transactions, when configured with two_write_queues=true, offer higher throughput with the unordered_write feature without compromising the RocksDB guarantees. This is because ordering among writes is performed in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step is only provided when we use two_write_queues=true, where WritePrepared, after performing the writes, uses the 2nd queue to assign order to the writes.
      The patch clarifies the need for two_write_queues=true in HISTORY and in the inline comments of unordered_write. Moreover, it extends the stress tests of WritePrepared to unordered_write.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313
      
      Differential Revision: D15379977
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00
  26. 18 May 2019 (2 commits)
    • Log replay integration for secondary instance (#5305) · fb4c6a31
      Committed by Yanqin Jin
      Summary:
      RocksDB secondary can replay both MANIFEST and WAL now.
      On the one hand, the memory used by memtables will grow after replaying the WAL for some time. On the other hand, replaying the MANIFEST can bring the database persistent data to a more recent point in time, giving us the opportunity to discard some memtables containing outdated data.
      This PR coordinates the MANIFEST and WAL replay, using the updates from MANIFEST replay to update the active memtable and immutable memtable list of each column family.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5305
      
      Differential Revision: D15386512
      
      Pulled By: riversand963
      
      fbshipit-source-id: a3ea6fc415f8382d8cf624f52a71ebdcffa3e355
    • Reduce iterator key comparison for upper/lower bound check (#5111) · f3a78475
      Committed by yiwu-arbug
      Summary:
      Previously, if an iterator upper/lower bound is present, `DBIter` checks the bound for every key. This patch turns the check into a per-file or per-data-block check when applicable, by checking against either the file's largest/smallest key or the block index key.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111
      
      Differential Revision: D15330061
      
      Pulled By: siying
      
      fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024
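The per-block shortcut can be sketched like this (a toy Python scan; in RocksDB the block boundary would come from the index key or the file's largest key):

```python
def scan_with_upper_bound(blocks, upper_bound):
    """If a block's largest key is below the upper bound, every key in
    the block passes, so the per-key comparison is skipped for that
    whole block; only the boundary block is checked key by key."""
    out = []
    comparisons = 0
    for block in blocks:                 # each block is a sorted key list
        comparisons += 1                 # one check covers the block
        if block[-1] < upper_bound:
            out.extend(block)
            continue
        for key in block:                # boundary block: per-key check
            comparisons += 1
            if key >= upper_bound:
                return out, comparisons
            out.append(key)
    return out, comparisons
```

A naive scan would compare every key against the bound; here the count collapses to roughly one comparison per block plus the boundary block.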
  27. 16 May 2019 (1 commit)
  28. 14 May 2019 (1 commit)
    • Unordered Writes (#5218) · f383641a
      Committed by Maysam Yabandeh
      Summary:
      Perform unordered writes in RocksDB when the unordered_write option is set to true. When enabled, writes to the memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around it. Using TransactionDB with the WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without compromising the snapshot guarantees.
      The patch is prepared based on an original by siying.
      Existing unit tests are extended to include unordered_write option.
      
      Benchmark Results:
      ```
      TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
      ```
      With WAL
      - Vanilla RocksDB: 78.6 MB/s
      - WRITE_PREPARED with unordered_write: 177.8 MB/s (2.2x)
      - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
      
      Without WAL
      - Vanilla RocksDB: 111.3 MB/s
      - WRITE_PREPARED with unordered_write: 259.3 MB/s (2.3x)
      - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
      
      - WRITE_PREPARED with unordered_write and concurrency control disabled: 185.3 MB/s (2.35x)
      
      Limitations:
      - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
      
      Differential Revision: D15219029
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
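The core trade can be sketched as follows: writers insert into the memtable independently instead of joining a write group, so no global order between threads is defined (a toy Python model, not RocksDB's write path):

```python
import threading

def unordered_writes(batches):
    """Each writer thread applies its own batch to the memtable without
    waiting for other writers; the interleaving between threads is
    unspecified, which is exactly the relaxed guarantee unordered_write
    trades for throughput."""
    memtable = []
    lock = threading.Lock()  # protects the toy memtable, not the ordering

    def write(batch):
        for kv in batch:
            with lock:
                memtable.append(kv)

    threads = [threading.Thread(target=write, args=(b,)) for b in batches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return memtable
```

All writes land, but their relative order across threads is not deterministic, which is why a snapshot's view can change unless a second ordering step (e.g. WRITE_PREPARED's commit phase) is layered on top.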
  29. 11 May 2019 (1 commit)
    • Fix a race condition caused by unlocking db mutex (#5294) · e6260165
      Committed by Yanqin Jin
      Summary:
      Previous code may call `~ColumnFamilyData` in `DBImpl::AtomicFlushMemTablesToOutputFiles` if the column family is dropped or `cfd->IsFlushPending() == false`. In `~ColumnFamilyData`, the db mutex is released briefly and re-acquired. This can cause a correctness issue. The reason is as follows.
      
      Assume there is more than one bg flush thread. After bg_flush_thr1 releases the db mutex, bg_flush_thr2 can grab it and pop an element from the flush queue. This will cause bg_flush_thr2 to accidentally pick some memtables which should have been picked by bg_flush_thr1. To make the matter worse, bg_flush_thr2 can clear the `flush_requested_` flag for the memtable list, causing a subsequent call to `MemTableList::IsFlushPending()` by bg_flush_thr1 to return false, which is wrong.
      
      The fix is to delay `ColumnFamilyData::Unref` and `~ColumnFamilyData` for column families not selected for flush until `AtomicFlushMemTablesToOutputFiles` returns. Furthermore, a bg flush thread should not clear `MemTableList::flush_requested_` in `MemTableList::PickMemtablesToFlush` unless atomic flush is not used **or** the memtable list does not have unpicked memtables.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5294
      
      Differential Revision: D15295297
      
      Pulled By: riversand963
      
      fbshipit-source-id: 03b101205ca22c242647cbf488bcf0ed80b2ecbd
  30. 10 May 2019 (2 commits)
    • Merging iterator to avoid child iterator reseek for some cases (#5286) · 9fad3e21
      Committed by Siying Dong
      Summary:
      When a reseek happens in the merging iterator, reseeking a child iterator can be avoided if:
      (1) the child iterator represents immutable data,
      (2) the reseek target is larger than the previous seek key, and
      (3) the child iterator's current key is larger than the seek target,
      because the result is guaranteed to fall into the same position.
      
      This optimization will be useful for use cases where users keep seeking to keys nearby in ascending order.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5286
      
      Differential Revision: D15283635
      
      Pulled By: siying
      
      fbshipit-source-id: 35f79ffd5ce3609146faa8cd55f2bfd733502f83
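The skip condition can be sketched as follows (a toy Python model of the child iterators; it assumes a forward reseek to a key no smaller than the previous seek target):

```python
import bisect

def reseek_children(children, target):
    """An immutable child whose current key is already >= the forward
    seek target must land in the same position, so its reseek is
    skipped; only the remaining children are reseeked."""
    reseeked = []
    for child in children:
        cur = child["cur"]
        if child["immutable"] and cur is not None and cur >= target:
            continue                      # position guaranteed unchanged
        i = bisect.bisect_left(child["keys"], target)
        child["cur"] = child["keys"][i] if i < len(child["keys"]) else None
        reseeked.append(child["name"])
    return reseeked
```

For workloads seeking to nearby keys in ascending order, most children satisfy the condition and their binary searches disappear.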
    • DBIter::Next() can skip user key checking if previous entry's seqnum is 0 (#5244) · 25d81e45
      Committed by Siying Dong
      Summary:
      Right now, DBIter::Next() always checks whether an entry is for the same user key as the previous entry, to see whether the key should be hidden from the user. However, if the previous entry's sequence number is 0, the check is not needed because 0 is the oldest possible sequence number.
      
      We could extend this from the seqnum 0 case to simply prev_seqno >= current_seqno. However, that is less robust against bugs or unexpected situations, while the gain is relatively low. We can always extend it later when needed.
      
      In a readseq benchmark with a fully formed LSM-tree, the number of key comparisons called is reduced from 2.981 to 2.165. In readseq against a fully compacted DB, no key comparison is called. Performance in this benchmark didn't show obvious improvement, which is expected because key comparisons take only a small percentage of CPU. But it may turn out to be more effective if users have an expensive customized comparator.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5244
      
      Differential Revision: D15067257
      
      Pulled By: siying
      
      fbshipit-source-id: b7e1ef3ec4fa928cba509683d2b3246e35d270d9
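The shortcut can be sketched over a stream of (user_key, seqno) pairs sorted by key ascending and seqno descending (a toy Python model of the hiding logic):

```python
def visible_entries(entries):
    """Emit the newest version of each user key.  When the previously
    seen entry has seqno 0 (the oldest possible), the next entry must
    be a different user key, so the user-key comparison is skipped."""
    out = []
    comparisons = 0
    prev_key, prev_seq = None, None
    for key, seq in entries:
        if prev_seq == 0:
            same = False              # no comparison needed
        else:
            comparisons += 1
            same = (key == prev_key)
        if not same:
            out.append(key)           # newest version of this user key
        prev_key, prev_seq = key, seq
    return out, comparisons
```

In a fully compacted DB every entry has seqno 0, so the comparison count drops to zero, matching the readseq observation above.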
  31. 04 May 2019 (1 commit)
    • Refresh snapshot list during long compactions (2nd attempt) (#5278) · 6a40ee5e
      Committed by Maysam Yabandeh
      Summary:
      Part of compaction CPU goes to processing the snapshot list; the larger the list, the bigger the overhead. Although the lifetime of most snapshots is much shorter than the lifetime of compactions, a compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list.
      For simplicity, the feature is disabled in two cases: i) when more than one sub-compaction shares the same snapshot list, and ii) when Range Delete is used, in which case the range delete aggregator has its own copy of the snapshot list.
      This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278
      
      Differential Revision: D15203291
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258
  32. 02 May 2019 (3 commits)
    • Reduce binary search when reseek into the same data block (#5256) · 4479dff2
      Committed by Siying Dong
      Summary:
      Right now, when Seek() is called again, RocksDB always does a binary search against the files and index blocks, even if the seek ends up in the same file/block. Improve it as follows:
      1. in LevelIterator, reseek first checks the boundary of the current file. If the target falls into the same file, skip the binary search to find the file.
      2. in the block based table iterator, reseek skips re-seeking the index iterator if the seek key is larger than the current key and smaller than the index key (the boundary of the current block and the next block).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5256
      
      Differential Revision: D15105072
      
      Pulled By: siying
      
      fbshipit-source-id: 39634bdb4a881082451fa39cecd7ecf12160bf80
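Step 1 can be sketched as a boundary check before the binary search (a toy Python model over (smallest, largest) file boundaries; names are illustrative):

```python
import bisect

def reseek_file(files, cur_idx, target):
    """files: sorted list of (smallest, largest) key boundaries.
    Check the current file's boundaries first; only fall back to a
    binary search when the target leaves the current file."""
    smallest, largest = files[cur_idx]
    if smallest <= target <= largest:
        return cur_idx, False        # same file, binary search skipped

    largest_keys = [lg for _, lg in files]
    idx = bisect.bisect_left(largest_keys, target)
    return min(idx, len(files) - 1), True
```

Repeated seeks that stay within one file therefore cost a pair of key comparisons instead of a full search over the level.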
    • DB::Close() to fail when there are unreleased snapshots (#5272) · 4e0f2aad
      Committed by Siying Dong
      Summary:
      Sometimes, users might make the mistake of not releasing snapshots before closing the DB. This is undocumented use of RocksDB and the behavior is unknown. We make DB::Close() return a status to provide a way for users to check for it: Aborted() will be returned when they call DB::Close() with unreleased snapshots.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5272
      
      Differential Revision: D15159713
      
      Pulled By: siying
      
      fbshipit-source-id: 39369def612398d9f239d83d396b5a28e5af65cd
    • Revert snap_refresh_nanos feature (#5269) · 521d234b
      Committed by Maysam Yabandeh
      Summary:
      Our daily stress tests are failing after this feature. Reverting temporarily until we figure out the reason for the test failures.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269
      
      Differential Revision: D15151285
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e
  33. 01 May 2019 (1 commit)