1. 07 Jun 2019 · 1 commit
    • Refactor the handling of cache related counters and statistics (#5408) · bee2f48a
      Committed by Levi Tamasi
      Summary:
      The patch cleans up the handling of cache hit/miss/insertion related
      performance counters, get context counters, and statistics by
      eliminating some code duplication and factoring out the affected logic
      into separate methods. In addition, it makes the semantics of cache hit
      metrics more consistent by changing the code so that accessing a
      partition of partitioned indexes/filters through a pinned reference no
      longer counts as a cache hit.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5408
      
      Differential Revision: D15610883
      
      Pulled By: ltamasi
      
      fbshipit-source-id: ee749c18965077aca971d8f8bee8b24ed8fa76f1
  2. 06 Jun 2019 · 1 commit
    • Add support for timestamp in Get/Put (#5079) · 340ed4fa
      Committed by Yanqin Jin
      Summary:
      It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exists in DBBasicTestWithTimestampWithParam. Timestamp support requires the user to provide the timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps within a database must share the same length and format, and the user is responsible for providing a comparator function (Comparator) that orders the <key, timestamp> tuples (a sketch of such a comparator follows at the end of this entry). Once created, the format and length of the timestamp cannot change (at least for now).
      
      Test plan (on devserver):
      ```
      $ COMPILE_WITH_ASAN=1 make -j32 all
      $ ./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
      $ make check
      ```
      All tests must pass.
      
      We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
      ```
      $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
      $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
      ```
      Repeated 6 times for each version.
      
      Results are as follows:
      ```
      |        | readrandom | fillrandom |
      | master | 16.77 MB/s | 47.05 MB/s |
      | PR5079 | 16.44 MB/s | 47.03 MB/s |
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079
      
      Differential Revision: D15132946
      
      Pulled By: riversand963
      
      fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
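      The comparator contract above is easier to see in code. Below is a minimal, self-contained sketch (not RocksDB's actual Comparator interface) of the ordering rule such a user-provided comparator needs to implement, assuming an 8-byte big-endian timestamp appended to each user key; the key layout and helper names are illustrative only.
      ```
      #include <cassert>
      #include <cstddef>
      #include <string>

      // Assumed composite key layout: user_key bytes followed by an 8-byte
      // big-endian timestamp. Ordering rule: user key ascending, timestamp
      // descending, so the newest version of a key sorts first.
      constexpr std::size_t kTsSize = 8;

      int CompareKeyWithTs(const std::string& a, const std::string& b) {
        const std::string user_a = a.substr(0, a.size() - kTsSize);
        const std::string user_b = b.substr(0, b.size() - kTsSize);
        int r = user_a.compare(user_b);
        if (r != 0) return r < 0 ? -1 : 1;
        // Same user key: compare timestamps in reverse so newer sorts first.
        const std::string ts_a = a.substr(a.size() - kTsSize);
        const std::string ts_b = b.substr(b.size() - kTsSize);
        r = ts_b.compare(ts_a);
        return r < 0 ? -1 : (r > 0 ? 1 : 0);
      }

      int main() {
        const std::string ts_old(kTsSize, '\x01');
        const std::string ts_new(kTsSize, '\x02');
        // "foo" written at the newer timestamp sorts before its older version.
        assert(CompareKeyWithTs("foo" + ts_new, "foo" + ts_old) < 0);
        assert(CompareKeyWithTs("bar" + ts_old, "foo" + ts_new) < 0);
        return 0;
      }
      ```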
  3. 05 Jun 2019 · 1 commit
  4. 01 Jun 2019 · 1 commit
    • Auto roll logger to enforce options.keep_log_file_num immediately after a new file is created (#5370) · cb094e13
      Committed by Siying Dong
      
      Summary:
      Right now, with the auto roll logger, options.keep_log_file_num is only enforced by events such as a DB reopen or a full obsolete-file scan. In the meantime, the size and number of info log files can grow without limit. This patch enforces the option more aggressively, so that the number of log files always stays under control. A configuration sketch follows at the end of this entry.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5370
      
      Differential Revision: D15570413
      
      Pulled By: siying
      
      fbshipit-source-id: 0916c3c4d42ab8fdd29389ee7fd7e1557b03176e
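      A minimal sketch of the info-log options this entry is about; the path and the numeric values are illustrative only, and the fields shown (max_log_file_size, log_file_time_to_roll, keep_log_file_num) are standard DBOptions fields.
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Roll the info log at 10 MB or once a day, and keep at most five old
        // log files; after this patch the limit is enforced as soon as a new
        // log file is created.
        options.max_log_file_size = 10 * 1024 * 1024;
        options.log_file_time_to_roll = 24 * 60 * 60;  // seconds
        options.keep_log_file_num = 5;

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_log_demo", &db);
        if (s.ok()) delete db;
        return s.ok() ? 0 : 1;
      }
      ```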
  5. 31 May 2019 · 2 commits
    • Fix WAL replay by skipping old write batches (#5170) · b9f59006
      Committed by Yanqin Jin
      Summary:
      1. Fix a bug in WAL replay in which write batches with old sequence numbers are mistakenly inserted into memtables.
      2. Add support for benchmarking a secondary instance in db_bench_tool.
      With the changes made in this PR, we can benchmark a secondary instance
      using two processes. It is also possible to vary the frequency at which
      the secondary instance tries to catch up with the primary. The info log
      of the secondary can be found in a directory whose path can be specified
      with '-secondary_path'. A sketch of opening a secondary instance follows
      at the end of this entry.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5170
      
      Differential Revision: D15564608
      
      Pulled By: riversand963
      
      fbshipit-source-id: ce97688ed3d33f69d3a0b9266ebbbbf887aa0ec8
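      A hedged sketch of opening a secondary instance and catching up with the primary, which is the scenario the new db_bench support exercises; the paths and the key are illustrative, and error handling is minimal.
      ```
      #include <string>

      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.max_open_files = -1;  // secondary instances keep table files open

        // Illustrative paths: the primary's DB directory and a separate
        // directory where the secondary keeps its own info log and state.
        const std::string primary_path = "/tmp/rocksdb_primary";
        const std::string secondary_path = "/tmp/rocksdb_secondary";

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::OpenAsSecondary(options, primary_path,
                                                         secondary_path, &db);
        if (!s.ok()) return 1;

        // Replay the primary's recent MANIFEST/WAL updates, then serve reads.
        db->TryCatchUpWithPrimary();
        std::string value;
        db->Get(rocksdb::ReadOptions(), "some_key", &value);  // may be NotFound

        delete db;
        return 0;
      }
      ```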
    • Move the index readers out of the block cache (#5298) · 1e355842
      Committed by Levi Tamasi
      Summary:
      Currently, when the block cache is used for index blocks as well, it is
      not really the index block that is stored in the cache but an
      IndexReader object. Since this object is not pure data (it has, for
      instance, pointers that might dangle), it's not really sharable. To
      avoid the issues around this, the current code uses a dummy unique cache
      key for each TableReader to store the IndexReader, and erases the
      IndexReader entry when the TableReader is closed. Instead of doing this,
      the new code moves the IndexReader out of the cache altogether. In
      particular, instead of the TableReader owning, or caching/pinning the
      IndexReader based on the customer's settings, the TableReader
      unconditionally owns the IndexReader, which in turn owns/caches/pins
      the index block (which is itself sharable and thus can be safely put in
      the cache without any hacks).
      
      Note: the change has two side effects:
      1) Partitions of partitioned indexes no longer affect the read
      amplification statistics.
      2) Eviction statistics for index blocks are temporarily broken. We plan to fix
      this in a separate phase.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298
      
      Differential Revision: D15303203
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47
  6. 24 May 2019 · 1 commit
    • Provide an option so that SST ingestion won't fall back to copy after hard linking fails (#5333) · 74a334a2
      Committed by haoyuhuang
      Summary:
      RocksDB always tries to hard link the external SST file being ingested. This operation can fail if the external SST resides on a different device/filesystem, or if the underlying filesystem does not support hard links. Currently RocksDB assumes that if the link fails, the user is willing to fall back to a file copy, which is not always true. This commit provides an option named 'failed_move_fall_back_to_copy' for users to choose which behavior they want. A usage sketch follows at the end of this entry.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5333
      
      Differential Revision: D15457597
      
      Pulled By: HaoyuHuang
      
      fbshipit-source-id: f3626e13f845db4f7ed970a53ec8a2b1f0d62214
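      A small sketch of the new ingestion option; the helper name IngestWithoutCopy and the SST path are hypothetical, and the exact defaults of IngestExternalFileOptions should be checked against the header.
      ```
      #include <string>

      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      // Hypothetical helper: 'db' is an already-open rocksdb::DB* and
      // 'sst_path' points at an externally generated SST file.
      rocksdb::Status IngestWithoutCopy(rocksdb::DB* db, const std::string& sst_path) {
        rocksdb::IngestExternalFileOptions ifo;
        ifo.move_files = true;                      // try to hard link the file
        ifo.failed_move_fall_back_to_copy = false;  // fail instead of copying
        return db->IngestExternalFile({sst_path}, ifo);
      }
      ```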
  7. 22 May 2019 · 1 commit
    • LogWriter to only flush after finish generating whole record (#5328) · b2274da0
      Committed by Siying Dong
      Summary:
      Right now, in the log writer, we call flush after writing each physical record. I don't see the necessity of this. The underlying writer has a buffer, so there isn't a concern that the write request is too large either. On the other hand, in an Env where every flush is expensive, the current approach is significantly slower than flushing only after a whole record is finished, especially when the record is very large.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5328
      
      Differential Revision: D15425032
      
      Pulled By: siying
      
      fbshipit-source-id: 440ebef002dfbb60c59d8388c9ddfc83d79700aa
  8. 21 May 2019 · 1 commit
  9. 20 May 2019 · 1 commit
    • WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313) · 5c0e3041
      Committed by Maysam Yabandeh
      Summary:
      WritePrepared transactions, when configured with two_write_queues=true, offer higher throughput with the unordered_write feature without compromising the RocksDB guarantees. This is because they perform ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC, where the commit phase does the ordering. Without 2PC, the 2nd step is only provided when we use two_write_queues=true, in which case WritePrepared, after performing the writes, uses the 2nd queue to assign order to the writes.
      The patch clarifies the need for two_write_queues=true in HISTORY and in the inline comments of unordered_write. Moreover, it extends the stress tests of WritePrepared to unordered_write. A configuration sketch follows at the end of this entry.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313
      
      Differential Revision: D15379977
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00
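      A minimal configuration sketch of the combination this entry clarifies: a WRITE_PREPARED TransactionDB with two_write_queues and unordered_write enabled. The path is illustrative and no transactions are actually run.
      ```
      #include "rocksdb/options.h"
      #include "rocksdb/utilities/transaction_db.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.unordered_write = true;   // fast, unordered memtable writes
        options.two_write_queues = true;  // 2nd queue restores write ordering

        rocksdb::TransactionDBOptions txn_db_options;
        txn_db_options.write_policy = rocksdb::TxnDBWritePolicy::WRITE_PREPARED;

        rocksdb::TransactionDB* txn_db = nullptr;
        rocksdb::Status s = rocksdb::TransactionDB::Open(
            options, txn_db_options, "/tmp/rocksdb_wp_unordered", &txn_db);
        if (s.ok()) delete txn_db;
        return s.ok() ? 0 : 1;
      }
      ```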
  10. 18 May 2019 · 2 commits
    • Log replay integration for secondary instance (#5305) · fb4c6a31
      Committed by Yanqin Jin
      Summary:
      RocksDB secondary can replay both MANIFEST and WAL now.
      On the one hand, memory usage by the memtables will grow after replaying the WAL for some time. On the other hand, replaying the MANIFEST can bring the database's persistent data to a more recent point in time, giving us the opportunity to discard some memtables containing outdated data.
      This PR coordinates the MANIFEST and WAL replay, using the updates from MANIFEST replay to update the active memtable and immutable memtable list of each column family.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5305
      
      Differential Revision: D15386512
      
      Pulled By: riversand963
      
      fbshipit-source-id: a3ea6fc415f8382d8cf624f52a71ebdcffa3e355
    • Reduce iterator key comparison for upper/lower bound check (#5111) · f3a78475
      Committed by yiwu-arbug
      Summary:
      Previously, if an iterator upper/lower bound was present, `DBIter` checked the bound for every key. This patch turns the check into a per-file or per-data-block check where applicable, by checking against either the file's largest/smallest key or the block index key. A usage sketch follows at the end of this entry.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111
      
      Differential Revision: D15330061
      
      Pulled By: siying
      
      fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024
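      A short sketch of the kind of bounded scan this optimization speeds up; the key range, the helper name ScanRange, and the assumption of an already-open DB are illustrative.
      ```
      #include <memory>

      #include "rocksdb/db.h"
      #include "rocksdb/iterator.h"
      #include "rocksdb/options.h"
      #include "rocksdb/slice.h"

      // Hypothetical scan of ["user_1000", "user_2000") over an already-open
      // DB; the bound slices must outlive the iterator.
      void ScanRange(rocksdb::DB* db) {
        rocksdb::Slice lower("user_1000");
        rocksdb::Slice upper("user_2000");

        rocksdb::ReadOptions ro;
        ro.iterate_lower_bound = &lower;
        ro.iterate_upper_bound = &upper;

        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
        for (it->Seek(lower); it->Valid(); it->Next()) {
          // With this patch the bound check is done per file / per data block
          // where possible, instead of per key. Use it->key() / it->value().
        }
      }
      ```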
  11. 16 May 2019 · 1 commit
  12. 14 May 2019 · 1 commit
    • Unordered Writes (#5218) · f383641a
      Committed by Maysam Yabandeh
      Summary:
      Perform unordered writes in RocksDB when the unordered_write option is set to true. When enabled, writes to the memtable are done without joining any write thread, which offers much higher write throughput since upcoming writes no longer have to wait for the slowest memtable write to finish. The tradeoff is that the set of writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanism to work around it; using TransactionDB with the WRITE_PREPARED write policy is one way to achieve that, and doing so increases the max throughput by 2.2x without compromising the snapshot guarantees. A configuration sketch follows at the end of this entry.
      The patch is prepared based on an original by siying.
      Existing unit tests are extended to include the unordered_write option.
      
      Benchmark Results:
      ```
      TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
      ```
      With WAL
      - Vanilla RocksDB: 78.6 MB/s
      - WRITE_PREPARED with unordered_write: 177.8 MB/s (2.2x)
      - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
      
      Without WAL
      - Vanilla RocksDB: 111.3 MB/s
      - WRITE_PREPARED with unordered_write: 259.3 MB/s (2.3x)
      - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
      
      - WRITE_PREPARED with unordered_write and concurrency control disabled: 185.3 MB/s (2.35x)
      
      Limitations:
      - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
      
      Differential Revision: D15219029
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
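      A minimal sketch of enabling the feature on a plain DB, with the incompatible options from the summary set explicitly; the path is illustrative and the snapshot-guarantee tradeoff applies as described above.
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Writes skip the joint write thread and go to the memtable
        // concurrently; reads through a snapshot may see writes appear out of
        // order. Applications that need the old guarantee can use a
        // WRITE_PREPARED TransactionDB instead (see the 20 May entry above).
        options.unordered_write = true;
        // Combinations called out as incompatible in the summary:
        options.enable_pipelined_write = false;
        options.allow_concurrent_memtable_write = true;

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_unordered", &db);
        if (s.ok()) delete db;
        return s.ok() ? 0 : 1;
      }
      ```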
  13. 11 May 2019 · 1 commit
    • Fix a race condition caused by unlocking db mutex (#5294) · e6260165
      Committed by Yanqin Jin
      Summary:
      Previous code may call `~ColumnFamilyData` in `DBImpl::AtomicFlushMemTablesToOutputFiles` if the column family is dropped or `cfd->IsFlushPending() == false`. In `~ColumnFamilyData`, the db mutex is released briefly and re-acquired. This can cause a correctness issue, for the following reason.
      
      Assume there are multiple bg flush threads. After bg_flush_thr1 releases the db mutex, bg_flush_thr2 can grab it and pop an element from the flush queue. This will cause bg_flush_thr2 to accidentally pick some memtables which should have been picked by bg_flush_thr1. To make matters worse, bg_flush_thr2 can clear the `flush_requested_` flag for the memtable list, causing a subsequent call to `MemTableList::IsFlushPending()` by bg_flush_thr1 to return false, which is wrong.
      
      The fix is to delay `ColumnFamilyData::Unref` and `~ColumnFamilyData` for column families not selected for flush until `AtomicFlushMemTablesToOutputFiles` returns. Furthermore, a bg flush thread should not clear `MemTableList::flush_requested_` in `MemTableList::PickMemtablesToFlush` unless atomic flush is not used **or** the memtable list does not have unpicked memtables.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5294
      
      Differential Revision: D15295297
      
      Pulled By: riversand963
      
      fbshipit-source-id: 03b101205ca22c242647cbf488bcf0ed80b2ecbd
  14. 10 May 2019 · 2 commits
    • Merging iterator to avoid child iterator reseek for some cases (#5286) · 9fad3e21
      Committed by Siying Dong
      Summary:
      When a reseek happens in the merging iterator, reseeking a child iterator can be avoided if:
      (1) the iterator represents immutable data,
      (2) the reseek() targets a key larger than the current key, and
      (3) the current key of the child iterator is larger than the seek key,
      because in that case it is guaranteed that the result will fall into the same position.
      
      This optimization will be useful for use cases where users keep seeking to keys nearby in ascending order.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5286
      
      Differential Revision: D15283635
      
      Pulled By: siying
      
      fbshipit-source-id: 35f79ffd5ce3609146faa8cd55f2bfd733502f83
    • DBIter::Next() can skip user key checking if previous entry's seqnum is 0 (#5244) · 25d81e45
      Committed by Siying Dong
      Summary:
      Right now, DBIter::Next() always checks whether an entry is for the same user key as the previous entry, to see whether the key should be hidden from the user. However, if the previous entry's sequence number is 0, the check is not needed because 0 is the oldest possible sequence number.
      
      We could extend this from the seqnum 0 case to simply prev_seqno >= current_seqno. However, that is less robust against bugs or unexpected situations, while the gain is relatively low. We can always extend it later when needed.
      
      In a readseq benchmark against a fully formed LSM-tree, the average number of key comparisons is reduced from 2.981 to 2.165. In readseq against a fully compacted DB, no key comparison is called at all. Performance in this benchmark didn't show an obvious improvement, which is expected because key comparisons take only a small percentage of CPU. But the change may turn out to be more effective for users with an expensive customized comparator.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5244
      
      Differential Revision: D15067257
      
      Pulled By: siying
      
      fbshipit-source-id: b7e1ef3ec4fa928cba509683d2b3246e35d270d9
  15. 04 May 2019 · 1 commit
    • Refresh snapshot list during long compactions (2nd attempt) (#5278) · 6a40ee5e
      Committed by Maysam Yabandeh
      Summary:
      Part of compaction CPU goes to processing the snapshot list; the larger the list, the bigger the overhead. Although the lifetime of most snapshots is much shorter than the lifetime of compactions, a compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long, which should let the compaction continue more efficiently with a much smaller snapshot list. A hedged configuration sketch follows at the end of this entry.
      For simplicity, the feature is disabled in two cases: i) when more than one sub-compaction is sharing the same snapshot list, and ii) when Range Delete is used, in which case the range delete aggregator has its own copy of the snapshot list.
      This fixes the issue with range deletes that caused https://github.com/facebook/rocksdb/pull/5099 to be reverted.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278
      
      Differential Revision: D15203291
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258
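      A hedged configuration sketch. The option name snap_refresh_nanos is taken from the related entries in this log (see the revert of #5099 below); the option may not exist in later releases, so treat the field and the 100 ms value as illustrative of this era only.
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Assumed field name (from the snap_refresh_nanos entries in this log):
        // let long-running compactions refresh their snapshot list roughly
        // every 100 ms; 0 disables the refresh.
        options.snap_refresh_nanos = 100ULL * 1000 * 1000;

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_snap_refresh", &db);
        if (s.ok()) delete db;
        return s.ok() ? 0 : 1;
      }
      ```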
  16. 02 May 2019 · 3 commits
    • Reduce binary search when reseek into the same data block (#5256) · 4479dff2
      Committed by Siying Dong
      Summary:
      Right now, when Seek() is called again, RocksDB always does a binary search against the files and index blocks, even if it ends up in the same file/block. Improve this as follows:
      1. In LevelIterator, a reseek first checks the boundary of the current file; if the target falls into the same file, the binary search to find the file is skipped.
      2. In the block-based table iterator, a reseek skips re-seeking the block if the seek key is larger than the current key and smaller than the index key (the boundary between the current block and the next block).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5256
      
      Differential Revision: D15105072
      
      Pulled By: siying
      
      fbshipit-source-id: 39634bdb4a881082451fa39cecd7ecf12160bf80
    • DB::Close() to fail when there are unreleased snapshots (#5272) · 4e0f2aad
      Committed by Siying Dong
      Summary:
      Sometimes users make the mistake of not releasing snapshots before closing the DB. This is an undocumented use of RocksDB and the behavior is undefined. We extend DB::Close() to provide a way for users to check for this: Status::Aborted() will be returned when they call DB::Close() while snapshots are still unreleased. A usage sketch follows at the end of this entry.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5272
      
      Differential Revision: D15159713
      
      Pulled By: siying
      
      fbshipit-source-id: 39369def612398d9f239d83d396b5a28e5af65cd
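      A small sketch of the new behavior; the path is illustrative, and it assumes Close() can simply be retried after the outstanding snapshot is released.
      ```
      #include <cassert>

      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_close_demo", &db);
        assert(s.ok());

        const rocksdb::Snapshot* snap = db->GetSnapshot();

        // Closing while a snapshot is still outstanding is now reported.
        s = db->Close();
        assert(s.IsAborted());

        db->ReleaseSnapshot(snap);
        s = db->Close();  // assumed to succeed once the snapshot is released
        assert(s.ok());

        delete db;
        return 0;
      }
      ```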
    • Revert snap_refresh_nanos feature (#5269) · 521d234b
      Committed by Maysam Yabandeh
      Summary:
      Our daily stress tests are failing after this feature. Reverting temporarily until we figure out the reason for the test failures.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269
      
      Differential Revision: D15151285
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e
  17. 01 May 2019 · 1 commit
  18. 27 Apr 2019 · 1 commit
    • Improve explicit user readahead performance (#5246) · 3548e422
      Committed by Sagar Vemuri
      Summary:
      Improve iterator performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size` (a usage sketch follows at the end of this entry).
      
      1. Stop creating new table readers when the user explicitly sets readahead size.
      2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
      3. Add `readahead_size` to db_bench.
      
      **Benchmarks:**
      https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737
      
      For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
      For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.
      
      **Test Plan:**
      Updated `DBIteratorTest.ReadAhead` test to make sure that:
      - no new table readers are created for iterators on setting ReadOptions.readahead_size
      - At least "readahead" number of bytes are actually getting read on each iterator read.
      
      TODO later:
      - Use similar logic for compactions as well.
      - This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAccessFile later.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246
      
      Differential Revision: D15107946
      
      Pulled By: sagar0
      
      fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
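      A short sketch of setting the explicit readahead this entry optimizes; the 1 MB value matches the benchmark above, and the helper name and the assumption of an already-open DB are illustrative.
      ```
      #include <memory>

      #include "rocksdb/db.h"
      #include "rocksdb/iterator.h"
      #include "rocksdb/options.h"

      // Hypothetical long scan over an already-open DB.
      void ScanWithReadahead(rocksdb::DB* db) {
        rocksdb::ReadOptions ro;
        // Prefetch 1 MB ahead; with this patch the request is served from an
        // internal prefetch buffer instead of a new table reader.
        ro.readahead_size = 1 * 1024 * 1024;

        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // process it->key() / it->value()
        }
      }
      ```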
  19. 26 Apr 2019 · 3 commits
    • Refresh snapshot list during long compactions (#5099) · 506e8448
      Committed by Maysam Yabandeh
      Summary:
      Part of compaction CPU goes to processing the snapshot list; the larger the list, the bigger the overhead. Although the lifetime of most snapshots is much shorter than the lifetime of compactions, a compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long, which should let the compaction continue more efficiently with a much smaller snapshot list.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099
      
      Differential Revision: D15086710
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12
    • Option string/map/file can set env from object registry (#5237) · 6eb317bb
      Committed by Andrew Kryczka
      Summary:
      - By providing the "env" field in any text-based options (i.e., string, map, or file), we can use `NewCustomObject` to deserialize the text value into an actual `Env` object.
      - Currently, factory functions for `Env` registered with the object registry should only return pointers to static `Env` objects. That's because `DBOptions::env` is a raw pointer, so we cannot easily delegate cleanup.
      - Note I did not add `env` to `db_option_type_info`. It wasn't needed for (de)serialization, and I believe we don't want to do verification on `env`, even by checking name. That's because the user should be able to copy their DB from Linux to Windows, change envs, and not see an option verification error.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5237
      
      Differential Revision: D15056360
      
      Pulled By: siying
      
      fbshipit-source-id: 4b5f0b83297a5058f8949ec955dbf27d98d73d7e
    • Close WAL files before deletion (#5233) · da96f2fe
      Committed by Yanqin Jin
      Summary:
      Currently one thread in RocksDB keeps a WAL file open while another thread
      deletes it. Although the first thread never writes to the WAL again, it still
      tries to close it in the end. This is fine on POSIX, but can be problematic on
      other platforms, e.g. HDFS: it will either cause a lot of warning messages or
      throw exceptions. The solution is to let the second thread close the WAL before deleting it.
      
      RocksDB keeps the writers of the logs to delete in `logs_to_free_`, which is passed to `job_context` during `FindObsoleteFiles` (holding mutex). Then in `PurgeObsoleteFiles` (without mutex), these writers should close the logs.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5233
      
      Differential Revision: D15032670
      
      Pulled By: riversand963
      
      fbshipit-source-id: c55e8a612db8cc2306644001a5e6d53842a8f754
  20. 25 Apr 2019 · 1 commit
  21. 23 Apr 2019 · 1 commit
    • Optionally wait on bytes_per_sync to smooth I/O (#5183) · 8272a6de
      Committed by Andrew Kryczka
      Summary:
      The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.
      
      My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.
      
      Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.
      
      There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).
      
      The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`. A configuration sketch follows at the end of this entry.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183
      
      Differential Revision: D14953553
      
      Pulled By: siying
      
      fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9
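      A minimal sketch of the options involved; the path and byte values are illustrative.
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Request writeback every 1 MB of SST writes and every 512 KB of WAL
        // writes, and make the limit strict so writeback can never lag by more
        // than one interval.
        options.bytes_per_sync = 1 * 1024 * 1024;
        options.wal_bytes_per_sync = 512 * 1024;
        options.strict_bytes_per_sync = true;

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_sync_demo", &db);
        if (s.ok()) delete db;
        return s.ok() ? 0 : 1;
      }
      ```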
  22. 22 Apr 2019 · 1 commit
    • Add BlockBasedTableOptions::index_shortening (#5174) · df38c1ce
      Committed by Mike Kolupaev
      Summary:
      Introduce BlockBasedTableOptions::index_shortening to give users control over which key-shortening techniques are used when building index blocks. Before this patch, both separators and successor keys were shortened in indexes. With this patch, the default is set to kShortenSeparators, which shortens only the separators. Since each index block has many separators and only one successor (last key), the change should not have a negative impact on index block size. However, it should prevent many unnecessary block loads where, due to the approximation introduced by the shortened successor, a seek would land on the previous block and then fix it by moving to the next one. A configuration sketch follows at the end of this entry.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174
      
      Differential Revision: D14884185
      
      Pulled By: al13n321
      
      fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534
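      A minimal configuration sketch of the new option; the path is illustrative, and kShortenSeparators is shown explicitly even though the summary says it is the new default.
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      int main() {
        rocksdb::BlockBasedTableOptions table_options;
        // Shorten only the separators, not the successor (last) key of each
        // file, to avoid seeks landing one block too early.
        table_options.index_shortening =
            rocksdb::BlockBasedTableOptions::IndexShorteningMode::kShortenSeparators;

        rocksdb::Options options;
        options.create_if_missing = true;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_index_demo", &db);
        if (s.ok()) delete db;
        return s.ok() ? 0 : 1;
      }
      ```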
  23. 20 Apr 2019 · 1 commit
  24. 19 Apr 2019 · 1 commit
  25. 17 Apr 2019 · 3 commits
    • Avoid double-compacting data in bottom level in manual compactions (#5138) · baa53024
      Committed by Zhongyi Xie
      Summary:
      Depending on the config, a manual compaction (leveled compaction style) does the following compactions:
      L0->L1
      L1->L2
      ...
      Ln-1 -> Ln
      Ln -> Ln
      The final Ln -> Ln compaction is partly unnecessary, as it recompacts all the files that were just generated by the Ln-1 -> Ln compaction. We should avoid recompacting such files. This rule should be applied to Lmax only.
      Resolves issue https://github.com/facebook/rocksdb/issues/4995
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138
      
      Differential Revision: D14940106
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
    • WriteBufferManager's dummy entry size to block cache 1MB -> 256KB (#5175) · beb44ec3
      Committed by Siying Dong
      Summary:
      A dummy cache entry size of 1MB is too large for small block cache sizes. Our GetDefaultCacheShardBits() uses min_shard_size = 512L * 1024L to determine the number of shards, so 1MB will exceed the size of a whole shard and make the cache exceed its budget.
      Change it to 256KB accordingly.
      There shouldn't be an obvious performance impact, since inserting a cache entry for every 256KB of memtable inserts is still infrequent enough.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175
      
      Differential Revision: D14954289
      
      Pulled By: siying
      
      fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94
    • Avoid per-key upper bound check in BlockBasedTableIterator (#5142) · f1239d5f
      Committed by yiwu-arbug
      Summary:
      This is the second attempt at #5101. Original commit message:
      `BlockBasedTableIterator` avoids reading the next block on `Next()` if it detects that the iterator will be out of bound, by checking against the index key. The optimization was added in #2239, and at that time it only checked the bound once per block. It seems a later change made it a per-key check, which introduces unnecessary key comparisons.
      
      This patch comes with two fixes:
      
      Fix 1: To optimize bound checking, we need to compare the bounds with the index key as well. However, BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden in IndexBlockIter.
      
      Fix 2: In #5101 we returned `IsOutOfBound()=true` when the block index key was out of bound. But the index key can be larger than the smallest key of the next file on the level. That file can be within the upper bound and should not be filtered out.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142
      
      Differential Revision: D14907113
      
      Pulled By: siying
      
      fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb
  26. 16 Apr 2019 · 1 commit
  27. 13 Apr 2019 · 2 commits
    • Fix crash with memtable prefix bloom and key out of prefix extractor domain (#5190) · cca141ec
      Committed by yiwu-arbug
      Summary:
      Before using the prefix extractor, `InDomain()` should be checked. None of the uses in memtable.cc checked `InDomain()`. A sketch of a prefix extractor with a non-trivial `InDomain()` follows at the end of this entry.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5190
      
      Differential Revision: D14923773
      
      Pulled By: miasantreble
      
      fbshipit-source-id: b3ad60bcca5f3a1a2b929a6eb34b0b7ba6326f04
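      An illustrative custom prefix extractor whose InDomain() rejects keys that are too short, which is exactly the check the fixed code paths rely on; the class name and the 4-byte prefix layout are hypothetical.
      ```
      #include "rocksdb/slice.h"
      #include "rocksdb/slice_transform.h"

      // Illustrative prefix extractor for keys shaped like "<4-byte shard id>...".
      // InDomain() rejects keys that are too short, which is the check callers
      // must perform before calling Transform().
      class ShardPrefixTransform : public rocksdb::SliceTransform {
       public:
        const char* Name() const override { return "ShardPrefixTransform"; }

        rocksdb::Slice Transform(const rocksdb::Slice& src) const override {
          // Only valid for keys where InDomain() returned true.
          return rocksdb::Slice(src.data(), 4);
        }

        bool InDomain(const rocksdb::Slice& src) const override {
          return src.size() >= 4;
        }
      };

      // Usage (assuming an Options object named options):
      //   options.prefix_extractor.reset(new ShardPrefixTransform());
      //   options.memtable_prefix_bloom_size_ratio = 0.1;
      ```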
    • WritePrepared: fix race condition in reading batch with duplicate keys (#5147) · fe642cbe
      Committed by Maysam Yabandeh
      Summary:
      When ReadOptions doesn't specify a snapshot, WritePrepared::Get uses kMaxSequenceNumber to avoid the cost of creating a new snapshot object (which requires synchronization over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of a duplicate key is inserted with a different sequence number, and depending on the ordering, ::Get might skip the newer one and read the older, obsolete one.
      The patch fixes that by using the last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that max_evicted_seq has not advanced past the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid, since the seq is not backed by an actual snapshot that would let IsInSnapshot handle it properly when an overlapping commit is evicted from the commit cache.
      A unit test is added to reproduce the race condition with duplicate keys.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147
      
      Differential Revision: D14758815
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719
  28. 12 Apr 2019 · 1 commit
    • Change OptimizeForPointLookup() and OptimizeForSmallDb() (#5165) · ed9f5e21
      Committed by Siying Dong
      Summary:
      Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory.
      Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from the prefix extractor as well as hash-based indexing, as they are prone to misuse. A usage sketch follows at the end of this entry.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165
      
      Differential Revision: D14880709
      
      Pulled By: siying
      
      fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d
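      A minimal sketch of applying the changed preset; the 64 MB block cache size and the path are illustrative.
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Tune the column family for point lookups with a 64 MB block cache;
        // after this change the preset relies on the memtable whole-key bloom
        // filter rather than a prefix extractor or hash index.
        options.OptimizeForPointLookup(64 /* block_cache_size_mb */);

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_point_demo", &db);
        if (s.ok()) delete db;
        return s.ok() ? 0 : 1;
      }
      ```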
  29. 11 Apr 2019 · 1 commit
    • Periodic Compactions (#5166) · d3d20dcd
      Committed by Sagar Vemuri
      Summary:
      Introducing Periodic Compactions.
      
      This feature allows all the files in a CF to be periodically compacted. It could help proactively catch any corruption that creeps into the DB, as every file is regularly re-compacted. And, of course, it helps to clean up data older than a certain threshold.
      
      - Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
      - This works across all levels.
      - The files are put in the same level after going through the compaction. (Related files in the same level are picked up together, as `ExpandInputsToCleanCut` is used.)
      - Compaction filters, if any, are invoked as usual.
      - A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
      
      This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on levels 0 through the second-to-last, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time`, files in all but the last level keep getting picked up based on ttl, and almost never based on periodic_compaction_time, while the files in the bottom level get picked up for compaction based on `periodic_compaction_time`. A configuration sketch follows at the end of this entry.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
      
      Differential Revision: D14884441
      
      Pulled By: sagar0
      
      fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
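      A hedged configuration sketch combining ttl with periodic compactions as described above. Note that the field name used here, periodic_compaction_seconds, follows current RocksDB headers; the summary above calls the option periodic_compaction_time, so check the header of your release.
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Compact files older than 1 day on the upper levels via ttl, and have
        // every file, including the bottom level, re-compacted at least weekly.
        options.ttl = 1 * 24 * 60 * 60;
        options.periodic_compaction_seconds = 7 * 24 * 60 * 60;

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_periodic", &db);
        if (s.ok()) delete db;
        return s.ok() ? 0 : 1;
      }
      ```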
  30. 09 Apr 2019 · 1 commit
    • fix reading encrypted files beyond file boundaries (#5160) · 313e8772
      Committed by jsteemann
      Summary:
      This fix should help reading from encrypted files if the file-to-be-read
      is smaller than expected. For example, when using the encrypted env and
      making it read a journal file of exactly 0 bytes size, the encrypted env
      code crashes with SIGSEGV in its Decrypt function, as there is no check
      if the read attempts to read over the file's boundaries (as specified
      originally by the `dataSize` parameter).
      
      The most important problem this patch addresses, however, is that there is
      no size underflow check in `CTREncryptionProvider::CreateCipherStream`:
      
      The stream to be read will be initialized to a size of always
      `prefix.size() - (2 * blockSize)`. If the prefix however is smaller than
      twice the block size, this will obviously assume a _very_ large stream
      and read over the bounds. The patch adds a check here as follows:
      
          // If the prefix is smaller than twice the block size, we would below read a
          // very large chunk of the file (and very likely read over the bounds)
          assert(prefix.size() >= 2 * blockSize);
          if (prefix.size() < 2 * blockSize) {
            return Status::Corruption("Unable to read from file " + fname + ": read attempt would read beyond file bounds");
          }
      
      so embedders can catch the error in their release builds.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5160
      
      Differential Revision: D14834633
      
      Pulled By: sagar0
      
      fbshipit-source-id: 47aa39a6db8977252cede054c7eb9a663b9a3484