1. 27 Oct 2018, 1 commit
  2. 23 Oct 2018, 2 commits
    • Fix user comparator receiving internal key (#4575) · c34cc404
      Committed by Maysam Yabandeh
      Summary:
      There was a bug where the user comparator would receive the internal key instead of the user key. The bug was due to RangeMightExistAfterSortedRun expecting a user key but receiving an internal key when called in GenerateBottommostFiles. The patch augments an existing unit test to reproduce the bug, and fixes it.
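      For context, a minimal sketch of the distinction (mirroring, but not copied from, the ExtractUserKey helper in RocksDB's dbformat.h):

      ```
      #include <cassert>
      #include "rocksdb/slice.h"

      // An internal key is the user key plus an 8-byte footer packing
      // (sequence_number << 8 | value_type). User-supplied comparators must
      // only ever see the user-key prefix.
      inline rocksdb::Slice UserKeyFromInternalKey(const rocksdb::Slice& ikey) {
        assert(ikey.size() >= 8);
        return rocksdb::Slice(ikey.data(), ikey.size() - 8);
      }
      ```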
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4575
      
      Differential Revision: D10500434
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 858346d2fd102cce9e20516d77338c112bdfe366
    • Dynamic level to adjust level multiplier when write is too heavy (#4338) · 70242636
      Committed by Siying Dong
      Summary:
      Level compaction usually performs poorly when writes are so heavy that the level targets can't be guaranteed. With this improvement, we extend level_compaction_dynamic_level_bytes = true so that in write-heavy cases the level multiplier can be slightly adjusted based on the size of L0.
      
      We keep the behavior the same if the number of L0 files is under 2X the compaction trigger and the total size is less than options.max_bytes_for_level_base, so that unless writes are so heavy that compaction cannot keep up, the behavior doesn't change.
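      As a usage sketch, the relevant options (all real option names; the values are illustrative) look like this; the multiplier adjustment itself is internal:

      ```
      #include "rocksdb/options.h"

      rocksdb::Options MakeWriteHeavyOptions() {
        rocksdb::Options options;
        // Derive per-level targets dynamically from the last level's size.
        options.level_compaction_dynamic_level_bytes = true;
        // The multiplier adjustment only kicks in once L0 exceeds roughly
        // 2X this trigger or the total L0 size exceeds the base level target.
        options.level0_file_num_compaction_trigger = 4;
        options.max_bytes_for_level_base = 256 << 20;  // 256 MB
        return options;
      }
      ```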
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4338
      
      Differential Revision: D9636782
      
      Pulled By: siying
      
      fbshipit-source-id: e27fc17a7c29c84b00064cc17536a01dacef7595
  3. 20 Oct 2018, 1 commit
    • Fix WriteBatchWithIndex's SeekForPrev() (#4559) · c17383f9
      Committed by Siying Dong
      Summary:
      WriteBatchWithIndex's SeekForPrev() has a bug: internally we place the position just before the seek key rather than at or after it. This makes the iterator miss a result that is equal to the seek key. Fix it by positioning the iterator at the largest key that is equal to or smaller than the seek key.
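      A small usage sketch of the fixed semantics (the setup is illustrative):

      ```
      #include <memory>
      #include "rocksdb/utilities/write_batch_with_index.h"

      void SeekForPrevExample() {
        rocksdb::WriteBatchWithIndex batch;
        batch.Put("a", "1");
        batch.Put("c", "3");
        std::unique_ptr<rocksdb::WBWIIterator> it(batch.NewIterator());
        // After the fix, SeekForPrev("c") lands on "c" itself (the largest
        // key <= the target); the buggy version skipped past it to "a".
        it->SeekForPrev("c");
      }
      ```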
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4559
      
      Differential Revision: D10468534
      
      Pulled By: siying
      
      fbshipit-source-id: 2fb371ae809c561b60a1c11cef71e1c66fea1f19
  4. 18 Oct 2018, 1 commit
    • Add PerfContextByLevel to provide per level perf context information (#4226) · d6ec2887
      Committed by Zhongyi Xie
      Summary:
      The current implementation of perf context is level-agnostic, making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level.
      This will be helpful when analyzing point and range query performance, as well as when tuning the bloom filter.
      Also replaced __thread with the thread_local keyword for perf_context.
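      A hedged usage sketch (accessor names as they appear in later releases; exact spellings may vary by version):

      ```
      #include "rocksdb/perf_context.h"
      #include "rocksdb/perf_level.h"

      void DumpPerLevelCounters() {
        rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableTimeExceptForMutex);
        rocksdb::get_perf_context()->EnablePerLevelPerfContext();
        // ... run a Get()/Seek() workload here ...
        auto* ctx = rocksdb::get_perf_context();
        if (ctx->level_to_perf_context != nullptr) {
          for (const auto& kv : *ctx->level_to_perf_context) {
            // kv.first is the LSM level; kv.second holds that level's
            // counters (e.g. bloom filter usefulness).
            (void)kv;
          }
        }
      }
      ```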
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226
      
      Differential Revision: D10369509
      
      Pulled By: miasantreble
      
      fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82
  5. 16 Oct 2018, 2 commits
    • Properly determine a truncated CompactRange stop key (#4496) · 1e384580
      Committed by anand1976
      Summary:
      When a CompactRange() call for a level is truncated before the end key
      is reached, because it exceeds max_compaction_bytes, we need to properly
      set the compaction_end parameter to indicate the stop key. The next
      CompactRange will use that as the begin key. We set it to the smallest
      key of the next file in the level after expanding inputs to get a clean
      cut.
      
      Previously, we were setting it before expanding inputs. So we could end
      up recompacting some files. In a pathological case, where a single key
      has many entries spanning all the files in the level (possibly due to
      merge operands without a partial merge operator, thus resulting in
      compaction output identical to the input), this would result in
      an endless loop over the same set of files.
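      A self-contained sketch of the corrected ordering (the FileMeta type and helper are hypothetical stand-ins for the compaction-picker internals):

      ```
      #include <cstdint>
      #include <string>
      #include <vector>

      struct FileMeta {
        std::string smallest_user_key, largest_user_key;
        uint64_t size;
      };

      // Pick a prefix of the level's files within the byte budget, expand it
      // until no user key straddles the cut, and only then record the stop
      // key as the smallest key of the first file left out.
      size_t PickTruncatedEnd(const std::vector<FileMeta>& files,
                              uint64_t max_compaction_bytes,
                              std::string* compaction_end) {
        uint64_t total = 0;
        size_t end = 0;
        while (end < files.size() &&
               total + files[end].size <= max_compaction_bytes) {
          total += files[end].size;
          ++end;
        }
        // Expand inputs for a clean cut so no key is recompacted next round.
        while (end > 0 && end < files.size() &&
               files[end].smallest_user_key == files[end - 1].largest_user_key) {
          ++end;
        }
        if (end < files.size()) {
          *compaction_end = files[end].smallest_user_key;  // next begin key
        }
        return end;
      }
      ```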
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4496
      
      Differential Revision: D10395026
      
      Pulled By: anand1976
      
      fbshipit-source-id: f0c2f89fee29b4b3be53b6467b53abba8e9146a9
    • Avoid per-key linear scan over snapshots in compaction (#4495) · 32b4d4ad
      Committed by Andrew Kryczka
      Summary:
      `CompactionIterator::snapshots_` is ordered by ascending seqnum, just like `DBImpl`'s linked list of snapshots from which it was copied. This PR exploits this ordering to make `findEarliestVisibleSnapshot` do binary search rather than linear scan. This can make flush/compaction significantly faster when many snapshots exist since that function is called on every single key.
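      A self-contained sketch of the idea (simplified from the real CompactionIterator code):

      ```
      #include <algorithm>
      #include <cstdint>
      #include <limits>
      #include <vector>

      // snapshots holds seqnums in ascending order, as in DBImpl's snapshot
      // list. A snapshot sees a key version iff the version's seqnum is <=
      // the snapshot's, so the earliest visible snapshot is the first one
      // with seqnum >= seq, found by binary search instead of a linear scan.
      uint64_t FindEarliestVisibleSnapshot(const std::vector<uint64_t>& snapshots,
                                           uint64_t seq) {
        auto it = std::lower_bound(snapshots.begin(), snapshots.end(), seq);
        return it != snapshots.end() ? *it
                                     : std::numeric_limits<uint64_t>::max();
      }
      ```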
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4495
      
      Differential Revision: D10386470
      
      Pulled By: ajkr
      
      fbshipit-source-id: 29734991631227b6b7b677e156ac567690118a8b
  6. 11 Oct 2018, 1 commit
  7. 10 Oct 2018, 1 commit
    • Handle mixed slowdown/no_slowdown writer properly (#4475) · 854a4be0
      Committed by Anand Ananthabhotla
      Summary:
      There is a bug where, when the write queue leader is blocked on a write
      delay/stop and the queue has writers with WriteOptions::no_slowdown set
      to true, those writers are not woken up until the write stall is cleared.
      
      The fix introduces a dummy writer inserted at the tail to indicate a
      write stall and prevent further inserts into the queue, and a condition
      variable that writers who can tolerate slowdown wait on before adding
      themselves to the queue. The leader calls WriteThread::BeginWriteStall()
      to add the dummy writer and then walks the queue to fail any writers with
      no_slowdown set. Once the stall clears, the leader calls
      WriteThread::EndWriteStall() to remove the dummy writer and signal the
      condition variable.
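      A condensed sketch of the mechanism (hypothetical names; the real logic lives in WriteThread and also links a dummy writer into the lock-free queue):

      ```
      #include <condition_variable>
      #include <mutex>

      struct WriteStallGate {
        std::mutex mu;
        std::condition_variable stall_cleared;
        bool stalled = false;

        void BeginWriteStall() {  // leader: block further queue inserts
          std::lock_guard<std::mutex> lock(mu);
          stalled = true;  // real code also fails queued no_slowdown writers
        }
        void EndWriteStall() {  // leader: stall over, wake all waiters
          { std::lock_guard<std::mutex> lock(mu); stalled = false; }
          stall_cleared.notify_all();
        }
        void WaitIfStalled() {  // slowdown-tolerant writers wait here first
          std::unique_lock<std::mutex> lock(mu);
          stall_cleared.wait(lock, [this] { return !stalled; });
        }
      };
      ```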
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4475
      
      Differential Revision: D10285827
      
      Pulled By: anand1976
      
      fbshipit-source-id: 747465e5e7f07a829b1fb0bc1afcd7b93f4ab1a9
  8. 09 Oct 2018, 2 commits
  9. 03 Oct 2018, 1 commit
  10. 01 Oct 2018, 1 commit
  11. 19 Sep 2018, 1 commit
  12. 11 Sep 2018, 1 commit
    • Skip concurrency control during recovery of pessimistic txn (#4346) · 3f528226
      Committed by Maysam Yabandeh
      Summary:
      TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This can be used as an optimization if the application knows that the transaction will not have any conflicts with concurrent transactions. It is currently used during recovery, assuming that (i) the application guarantees no conflicts between prepared transactions in the WAL, and (ii) the application guarantees that recovered transactions will be rolled back or committed before new transactions start.
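      A usage sketch (the option name comes from this summary; the surrounding setup is illustrative):

      ```
      #include "rocksdb/utilities/transaction_db.h"

      rocksdb::Transaction* BeginRecoveryTxn(rocksdb::TransactionDB* txn_db) {
        rocksdb::WriteOptions write_options;
        rocksdb::TransactionOptions txn_options;
        // Safe only when the application can rule out conflicts, e.g. while
        // replaying prepared transactions from the WAL before accepting new
        // transactions.
        txn_options.skip_concurrency_control = true;
        return txn_db->BeginTransaction(write_options, txn_options);
      }
      ```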
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346
      
      Differential Revision: D9759149
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b
  13. 30 Aug 2018, 1 commit
    • Avoiding write stall caused by manual flushes (#4297) · 927f2749
      Committed by Mikhail Antonov
      Summary:
      Basically, at the moment it is possible to cause a write stall by calling flush (either manually via DB::Flush(), or from the Backup Engine directly calling FlushMemTable()) while a background flush may already be happening.
      
      One way to fix this: in DBImpl::CompactRange() we already check for a possible stall and delay the flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to a separate method and call it from FlushMemTable().
      
      This is a draft patch for a first look; it still needs test/SyncPoint updates, and would most certainly need an allow_write_stall option added to FlushOptions().
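      The proposed knob did land as FlushOptions::allow_write_stall in later releases; a usage sketch:

      ```
      #include "rocksdb/db.h"

      rocksdb::Status FlushWithoutStalling(rocksdb::DB* db) {
        rocksdb::FlushOptions flush_options;
        // When false, the manual flush first waits until it can run without
        // stalling foreground writes, instead of triggering a write stall.
        flush_options.allow_write_stall = false;
        return db->Flush(flush_options);
      }
      ```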
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297
      
      Differential Revision: D9420705
      
      Pulled By: mikhail-antonov
      
      fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc
  14. 29 Aug 2018, 1 commit
    • Sync CURRENT file during checkpoint (#4322) · 42733637
      Committed by Andrew Kryczka
      Summary: For the CURRENT file created during checkpoint, we were forgetting to `fsync` or `fdatasync` it after its creation. This PR fixes it.
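      A POSIX-level sketch of the durability pattern (illustrative; RocksDB routes this through its Env/WritableFile abstraction):

      ```
      #include <fcntl.h>
      #include <unistd.h>
      #include <cstring>

      // Write a small metadata file and make it durable before relying on it.
      bool WriteFileDurably(const char* path, const char* contents) {
        int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return false;
        const ssize_t len = static_cast<ssize_t>(strlen(contents));
        // fsync is the step the checkpoint code had been skipping.
        bool ok = ::write(fd, contents, len) == len && ::fsync(fd) == 0;
        ::close(fd);
        return ok;
      }
      ```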
      
      Differential Revision: D9525939
      
      Pulled By: ajkr
      
      fbshipit-source-id: a505483644026ee3f501cfc0dcbe74832165b2e3
  15. 25 Aug 2018, 1 commit
    • Reduce empty SST creation/deletion during compaction (#4311) · 17f9a181
      Committed by Andrew Kryczka
      Summary:
      I have a PR to start calling `OnTableFileCreated` for empty SSTs: #4307. However, it is a behavior change, so it should not go into a patch release.
      
      This PR adds back a check to make sure range deletions at least exist before starting file creation. This PR should be safe to backport to earlier versions.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4311
      
      Differential Revision: D9493734
      
      Pulled By: ajkr
      
      fbshipit-source-id: f0d43cda4cfd904f133cfe3a6eb622f52a9ccbe8
  16. 24 Aug 2018, 1 commit
  17. 23 Aug 2018, 1 commit
  18. 22 Aug 2018, 1 commit
  19. 17 Aug 2018, 2 commits
  20. 16 Aug 2018, 1 commit
    • Improve point-lookup performance using a data block hash index (#4174) · 19ec44fd
      Committed by Fenggang Wu
      Summary:
      Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with data blocks created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `kDataBlockBinaryAndHash`.
      
      The DB size will be bigger with the hash index option, as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable via `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio improves point-lookup efficiency further, but takes more space too.
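      A configuration sketch using the option names above (values illustrative):

      ```
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      rocksdb::Options MakeHashIndexOptions() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.data_block_index_type =
            rocksdb::BlockBasedTableOptions::kDataBlockBinaryAndHash;
        // Lower ratio: faster point lookups, larger data blocks.
        table_options.data_block_hash_table_util_ratio = 0.75;
        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```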
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174
      
      Differential Revision: D8965914
      
      Pulled By: fgwu
      
      fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
  21. 14 Aug 2018, 1 commit
    • RocksDB Trace Analyzer (#4091) · 999d955e
      Committed by Zhichao Cao
      Summary:
      A framework for trace analysis in RocksDB.
      
      After collecting a trace with the tool from [PR #3837](https://github.com/facebook/rocksdb/pull/3837), the user can use the Trace Analyzer to interpret, analyze, and characterize the collected workload (see the collection sketch after the lists below).
      **Input:**
      1. trace file
      2. Whole keys space file
      
      **Statistics:**
      1. Access count of each operation (Get, Put, Delete, SingleDelete, DeleteRange, Merge) in each column family.
      2. Key hotness (access count) of each one
      3. Key space separation based on given prefix
      4. Key size distribution
      5. Value size distribution, if applicable
      6. Top K accessed keys
      7. QPS statistics including the average QPS and peak QPS
      8. Top K accessed prefix
      9. Query correlation analysis: outputs the number of occurrences of X after Y and the corresponding average time intervals
      
      **Output:**
      1. key access heat map (either in the accessed key space or whole key space)
      2. trace sequence file (interprets the raw trace file into a line-based text file for future use)
      3. Time series (the key space ID and its access time)
      4. Key access count distribution
      5. Key size distribution
      6. Value size distribution (in each interval)
      7. whole key space separation by the prefix
      8. Accessed key space separation by the prefix
      9. QPS of each operation and each column family
      10. Top K QPS and their accessed prefix range
      
      **Test:**
      1. Added unit tests for analyzing Get, Put, Delete, SingleDelete, DeleteRange, and Merge
      2. Generated a trace and analyzed it
      
      **Implemented but not tested (due to the limitation of trace_replay):**
      1. Iterator analysis, supporting Seek() and SeekForPrev()
      2. Analyzing the number of keys found by Get
      
      **Future Work:**
      1.  Support execution-time analysis of each request
      2.  Support analysis of Get cache hits and block reads
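      A hedged sketch of collecting the trace that feeds the analyzer (API of this era; exact signatures may differ by version):

      ```
      #include <memory>
      #include <utility>
      #include "rocksdb/db.h"
      #include "rocksdb/trace_reader_writer.h"

      rocksdb::Status CollectTrace(rocksdb::DB* db) {
        std::unique_ptr<rocksdb::TraceWriter> trace_writer;
        rocksdb::Status s = rocksdb::NewFileTraceWriter(
            rocksdb::Env::Default(), rocksdb::EnvOptions(),
            "/tmp/rocksdb_trace", &trace_writer);
        if (!s.ok()) return s;
        s = db->StartTrace(rocksdb::TraceOptions(), std::move(trace_writer));
        if (!s.ok()) return s;
        // ... run the workload to be characterized ...
        return db->EndTrace();  // the trace file then goes to trace_analyzer
      }
      ```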
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4091
      
      Differential Revision: D9256157
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: f0ceacb7eedbc43a3eee6e85b76087d7832a8fe6
  22. 11 Aug 2018, 1 commit
    • Fix wrong partitioned index size recorded in properties block (#4259) · d511f35e
      Committed by Maysam Yabandeh
      Summary:
      After the refactoring in https://github.com/facebook/rocksdb/pull/4158, the properties block is written after the index block. This breaks the existing logic for estimating the index size in partitioned indexes. The patch fixes that by using the accurate index block size, which is available because, by the time we write the properties block, the index block has already been written.
      The patch also fixes an issue in estimating the partition size with format_version=3, which was resulting in partitions smaller than the configured metadata_block_size.
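      For context, the options involved (real option names; values illustrative):

      ```
      #include "rocksdb/table.h"

      rocksdb::BlockBasedTableOptions MakePartitionedIndexOptions() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.index_type =
            rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
        // Target size per index partition; the bug made actual partitions
        // undershoot this with format_version=3.
        table_options.metadata_block_size = 4096;
        table_options.format_version = 3;
        return table_options;
      }
      ```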
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4259
      
      Differential Revision: D9274454
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: c82d045505cca3e7ed1a44ee1eaa26e4f25a4272
  23. 10 Aug 2018, 1 commit
    • Index value delta encoding (#3983) · caf0f53a
      Committed by Maysam Yabandeh
      Summary:
      Given that an index value is a BlockHandle, which is basically an <offset, size> pair, we can apply delta encoding to the values. The first value at each index restart interval encodes the full BlockHandle, but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the details of the encoding (a simplified sketch follows the benchmark numbers below). This reduces the index size, which helps use the block cache more efficiently. The feature is enabled by setting format_version to 4.
      
      The feature comes with a bit of CPU overhead, which should be paid back by higher cache hit rates due to the smaller index block size.
      Results with sysbench read-only, using 4k blocks and a 16-entry index restart interval:
      Format 2: 19585 rocksdb read-only range=100
      Format 3: 19569 rocksdb read-only range=100
      Format 4: 19352 rocksdb read-only range=100
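      A simplified sketch of the decoding invariant (hypothetical types; see IndexBlockIter::DecodeCurrentValue for the real logic):

      ```
      #include <cstdint>

      // RocksDB data blocks are followed by a 5-byte trailer (compression
      // type + checksum), so within a restart interval only the first entry
      // needs the full (offset, size) handle; each later entry stores just
      // its size and the reader derives the offset.
      constexpr uint64_t kBlockTrailerSize = 5;

      struct Handle { uint64_t offset, size; };

      Handle DecodeNext(const Handle& prev, uint64_t stored_size) {
        return {prev.offset + prev.size + kBlockTrailerSize, stored_size};
      }
      ```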
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983
      
      Differential Revision: D8361343
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
  24. 25 Jul 2018, 1 commit
  25. 18 Jul 2018, 3 commits
    • Release 5.15. (#4148) · 79f009f2
      Committed by Yanqin Jin
      Summary:
      Cut 5.15.fb
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4148
      
      Differential Revision: D8886802
      
      Pulled By: riversand963
      
      fbshipit-source-id: 6b6427ce97f5b323a7eebf92458fda8b24b0cece
    • Fix write get stuck when pipelined write is enabled (#4143) · d538ebdf
      Committed by Yi Wu
      Summary:
      Fix an issue where, when pipelined write is enabled, writers can get stuck indefinitely and never finish the write. It can be shown with the following example. Assume there are 4 writers W1, W2, W3, W4 (W1 is the first, W4 is the last).
      
      T1: all writers pending in WAL writer queue:
      WAL writer queue: W1, W2, W3, W4
      memtable writer queue: empty
      
      T2. W1 finishes the WAL write and moves to the memtable writer queue:
      WAL writer queue: W2, W3, W4,
      memtable writer queue: W1
      
      T3. W2 and W3 finish the WAL write as a batch group. W2 enters ExitAsBatchGroupLeader and moves the group to the memtable writer queue, but has not yet woken up the next leader.
      WAL writer queue: W4
      memtable writer queue: W1, W2, W3
      
      T4. W1, W2, W3 finish the memtable write as a batch group. Note that W2 is still inside the previous ExitAsBatchGroupLeader call, although W1 has done the memtable write for W2.
      WAL writer queue: W4
      memtable writer queue: empty
      
      T5. The thread corresponding to W3 creates another writer W3' at the same address as W3.
      WAL writer queue: W4, W3'
      memtable writer queue: empty
      
      T6. W2 continues with ExitAsBatchGroupLeader. Because the address of W3' is the same as that of W3, the last writer in its group, it thinks there are no pending writers, so it resets newest_writer_ to null, emptying the queue. W4 and W3' are removed from the queue and will never be woken up.
      
      The issue exists since pipelined write was introduced in 5.5.0.
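      A distilled sketch of the address-reuse (ABA) hazard on the lock-free writer list (names simplified from WriteThread):

      ```
      #include <atomic>

      struct Writer { Writer* link_older = nullptr; };

      std::atomic<Writer*> newest_writer{nullptr};

      // If the tail still "equals" the group's last writer, the leader assumes
      // nothing was enqueued meanwhile and clears the queue. The bug: a new
      // writer allocated at the SAME address as last_writer (W3' above) makes
      // the CAS succeed, silently dropping every writer queued after it.
      void ExitAsBatchGroupLeader(Writer* last_writer) {
        Writer* expected = last_writer;
        if (newest_writer.compare_exchange_strong(expected, nullptr)) {
          return;  // queue believed empty -- wrong if the address was reused
        }
        // ... otherwise walk the list and hand leadership to the next writer ...
      }
      ```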
      
      Closes #3704
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4143
      
      Differential Revision: D8871599
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 3502674e51066a954a0660257e24ac588f815e2a
    • Remove managed iterator · ddc07b40
      Committed by Siying Dong
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4124
      
      Differential Revision: D8829910
      
      Pulled By: siying
      
      fbshipit-source-id: f3e952ccf3a631071a5d77c48e327046f8abb560
  26. 17 Jul 2018, 1 commit
  27. 04 Jul 2018, 1 commit
  28. 28 Jun 2018, 2 commits
  29. 27 Jun 2018, 1 commit
    • Add table property tracking number of range deletions (#4016) · 17339dc2
      Committed by Nikhil Benesch
      Summary:
      Add a new table property, rocksdb.num.range-deletions, which tracks the
      number of range deletions in a block-based table. Range deletions are no
      longer counted in rocksdb.num.entries; as discovered in PR #3778, there
      are various code paths that implicitly assume that rocksdb.num.entries
      counts only true keys, not range deletions.
      
      /cc ajkr nvanbenschoten
      Closes https://github.com/facebook/rocksdb/pull/4016
      
      Differential Revision: D8527575
      
      Pulled By: ajkr
      
      fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
  30. 23 Jun 2018, 1 commit
    • Pin top-level index on partitioned index/filter blocks (#4037) · 80ade9ad
      Committed by Maysam Yabandeh
      Summary:
      The top-level index in partitioned index/filter blocks is small and can be pinned in memory. So far we have achieved that by setting cache_index_and_filter_blocks to false. This however makes it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter, which in combination with cache_index_and_filter_blocks=true keeps the top-level index in the cache yet pins it, avoiding both cache misses and cache lookup overhead.
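      A configuration sketch combining the options named above (all real option names):

      ```
      #include "rocksdb/table.h"

      rocksdb::BlockBasedTableOptions MakePinnedTopLevelOptions() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.index_type =
            rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
        table_options.partition_filters = true;
        // Charge the top-level index/filter to the block cache, but pin it
        // there so reads skip both cache misses and cache lookup overhead.
        table_options.cache_index_and_filter_blocks = true;
        table_options.pin_top_level_index_and_filter = true;
        return table_options;
      }
      ```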
      Closes https://github.com/facebook/rocksdb/pull/4037
      
      Differential Revision: D8596218
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2
  31. 22 Jun 2018, 1 commit
    • Improve direct IO range scan performance with readahead (#3884) · 7103559f
      Committed by Sagar Vemuri
      Summary:
      This PR extends the improvements in #3282 to also work when using Direct IO.
      We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash.
      
      **Description:**
      This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead to prefetch additional data on each disk IO and store it in a local buffer. This prefetching is automatically enabled upon noticing more than 2 IOs for the same table file during iteration. The readahead size starts at 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps cut down the number of IOs needed to complete the range scan.
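      A sketch of the sizing policy just described (constants from this summary; logic simplified from FilePrefetchBuffer):

      ```
      #include <cstddef>

      constexpr size_t kInitReadaheadSize = 8 * 1024;    // start at 8 KB
      constexpr size_t kMaxReadaheadSize = 256 * 1024;   // cap at 256 KB

      // Readahead doubles after each additional sequential IO on the same
      // table file; prefetching only starts once a 3rd IO is observed.
      size_t NextReadaheadSize(size_t current) {
        return current * 2 <= kMaxReadaheadSize ? current * 2
                                                : kMaxReadaheadSize;
      }
      ```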
      
      **Implementation Details:**
      - Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
      - `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
      - `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
      - Made sure not to re-read from the device partial chunks of data that were already available in the buffer.
      - Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.
      
      **Constraints:**
      - Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
      - Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously.
      - Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.
      
      **Benchmarks:**
      I used the same benchmark as used in #3282.
      Data fill:
      ```
      TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
      ```
      
      Do a long range scan: Seekrandom with large number of nexts
      ```
      TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
      ```
      
      ```
      Before:
      seekrandom   :   37939.906 micros/op 26 ops/sec;   29.2 MB/s (1636 of 1999 found)
      With this change:
      seekrandom   :   8527.720 micros/op 117 ops/sec;  129.7 MB/s (6530 of 7999 found)
      ```
      ~4.5X perf improvement, averaged over 3 runs.
      Closes https://github.com/facebook/rocksdb/pull/3884
      
      Differential Revision: D8082143
      
      Pulled By: sagar0
      
      fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
  32. 20 Jun 2018, 1 commit
  33. 02 Jun 2018, 1 commit
    • Copy Get() result when file reads use mmap · fea2b1df
      Committed by Andrew Kryczka
      Summary:
      For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.
      
      For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped.   The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`).
      
      This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied.
      
      This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime.
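      A sketch of the decision the fix makes (simplified; `PinSelf` copies into the slice's own buffer, while `PinSlice` keeps a zero-copy reference):

      ```
      #include "rocksdb/slice.h"

      // If the block owns its memory (non-mmap read), the result can point
      // directly into it, with the block's cleanup keeping it alive. If the
      // bytes live in an mmap'd file region, copy them so a post-compaction
      // unmap cannot invalidate the Get() result.
      void SetGetResult(rocksdb::PinnableSlice* result,
                        const rocksdb::Slice& value, bool block_owns_memory,
                        rocksdb::Cleanable* block_cleanup) {
        if (block_owns_memory) {
          result->PinSlice(value, block_cleanup);  // zero-copy
        } else {
          result->PinSelf(value);  // copy; survives file unmap
        }
      }
      ```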
      Closes https://github.com/facebook/rocksdb/pull/3881
      
      Differential Revision: D8076288
      
      Pulled By: ajkr
      
      fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08