1. 24 1月, 2019 1 次提交
  2. 09 1月, 2019 1 次提交
  3. 05 1月, 2019 1 次提交
    • A
      Fix point lookup on range tombstone sentinel endpoint (#4829) · 9e2c804f
      Andrew Kryczka 提交于
      Summary:
      Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this:
      
      - L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1].
      - L1's file2 gets compacted to L2.
      - User issues `Get()` for b#3.
      - L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1.
      - `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`.
      
      The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829
      
      Differential Revision: D13561355
      
      Pulled By: ajkr
      
      fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f
      9e2c804f
  4. 29 12月, 2018 1 次提交
    • S
      Preload some files even if options.max_open_files (#3340) · f0dda35d
      Siying Dong 提交于
      Summary:
      Choose to preload some files if options.max_open_files != -1. This can slightly narrow the gap of performance between options.max_open_files is -1 and a large number. To avoid a significant regression to DB reopen speed if options.max_open_files != -1. Limit the files to preload in DB open time to 16.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340
      
      Differential Revision: D6686945
      
      Pulled By: siying
      
      fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f
      f0dda35d
  5. 18 12月, 2018 2 次提交
  6. 14 12月, 2018 1 次提交
    • Y
      Improve flushing multiple column families (#4708) · 4fce44fc
      Yanqin Jin 提交于
      Summary:
      If one column family is dropped, we should simply skip it and continue to flush
      other active ones.
      Currently we use Status::ShutdownInProgress to notify caller of column families
      being dropped. In the future, we should consider using a different Status code.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708
      
      Differential Revision: D13378954
      
      Pulled By: riversand963
      
      fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca
      4fce44fc
  7. 22 11月, 2018 2 次提交
    • Y
      Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697) · 05d9d821
      Yi Wu 提交于
      Summary:
      …ons (#4676)"
      
      This reverts commit b32d087d.
      
      `MemoryAllocator` needs to be with `Cache`, since cache entry can
      outlive DB and block based table. The cache needs to hold reference to
      memory allocator when deleting cache entry.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697
      
      Differential Revision: D13133490
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a
      05d9d821
    • A
      Introduce RangeDelAggregatorV2 (#4649) · 457f77b9
      Abhishek Madan 提交于
      Summary:
      The old RangeDelAggregator did expensive pre-processing work
      to create a collapsed, binary-searchable representation of range
      tombstones. With FragmentedRangeTombstoneIterator, much of this work is
      now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
      in each iterator to find a covering tombstone in ShouldDelete, while
      doing minimal work in AddTombstones. The old RangeDelAggregator is still
      used during flush/compaction for now, though RangeDelAggregatorV2 will
      support those uses in a future PR.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649
      
      Differential Revision: D13146964
      
      Pulled By: abhimadan
      
      fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
      457f77b9
  8. 21 11月, 2018 1 次提交
    • A
      Fix range tombstone covering short-circuit logic (#4698) · ed5aec5b
      Abhishek Madan 提交于
      Summary:
      Since a range tombstone seen at one level will cover all keys
      in the range at lower levels, there was a short-circuiting check in Get
      that reported a key was not found at most one file after the range
      tombstone was discovered. However, this was incorrect for merge
      operands, since a deletion might only cover some merge operands,
      which implies that the key should be found. This PR fixes this logic in
      the Version portion of Get, and removes the logic from the MemTable
      portion of Get, since the perforamnce benefit provided there is minimal.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4698
      
      Differential Revision: D13142484
      
      Pulled By: abhimadan
      
      fbshipit-source-id: cbd74537c806032f2bfa564724d01a80df7c8f10
      ed5aec5b
  9. 17 11月, 2018 1 次提交
  10. 14 11月, 2018 3 次提交
  11. 10 11月, 2018 1 次提交
    • S
      Update all unique/shared_ptr instances to be qualified with namespace std (#4638) · dc352807
      Sagar Vemuri 提交于
      Summary:
      Ran the following commands to recursively change all the files under RocksDB:
      ```
      find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
      ```
      Running `make format` updated some formatting on the files touched.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
      
      Differential Revision: D12934992
      
      Pulled By: sagar0
      
      fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
      dc352807
  12. 31 10月, 2018 2 次提交
    • Y
      Add test to check if DB can handle atomic group (#4433) · d1118f6f
      Yanqin Jin 提交于
      Summary:
      Add unit tests to demonstrate that `VersionSet::Recover` is able to detect and handle cases in which the MANIFEST has valid atomic group, incomplete trailing atomic group, atomic group mixed with normal version edits and atomic group with incorrect size.
      With this capability, RocksDB identifies non-valid groups of version edits and do not apply them, thus guaranteeing that the db is restored to a state consistent with the most recent successful atomic flush before applying WAL.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4433
      
      Differential Revision: D10079202
      
      Pulled By: riversand963
      
      fbshipit-source-id: a0e0b8bf4da1cf68e044d397588c121b66c68876
      d1118f6f
    • A
      Promote rocksdb.{deleted.keys,merge.operands} to main table properties (#4594) · eaaf1a6f
      Abhishek Madan 提交于
      Summary:
      Since the number of range deletions are reported in
      TableProperties, it is confusing to not report the number of merge
      operands and point deletions as top-level properties; they are
      accessible through the public API, but since they are not the "main"
      properties, they do not appear in aggregated table properties, or the
      string representation of table properties.
      
      This change promotes those two property keys to
      `rocksdb/table_properties.h`, adds corresponding uint64 members for
      them, deprecates the old access methods `GetDeletedKeys()` and
      `GetMergeOperands()` (though they are still usable for now), and removes
      `InternalKeyPropertiesCollector`. The property key strings are the same
      as before this change, so this should be able to read DBs written from older
      versions (though I haven't tested this yet).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594
      
      Differential Revision: D12826893
      
      Pulled By: abhimadan
      
      fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68
      eaaf1a6f
  13. 26 10月, 2018 1 次提交
    • A
      Cache fragmented range tombstones in BlockBasedTableReader (#4493) · 7528130e
      Abhishek Madan 提交于
      Summary:
      This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.
      
      On the same DB used in #4449, running `readrandom` results in the following:
      ```
      readrandom   :       0.983 micros/op 1017076 ops/sec;   78.3 MB/s (63103 of 100000 found)
      ```
      
      Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
      ```
         Tombstones?    | avg micros/op | stddev micros/op |  avg ops/s   | stddev ops/s
      ----------------- | ------------- | ---------------- | ------------ | ------------
      None              |        0.6186 |          0.04637 | 1,625,252.90 | 124,679.41
      500 Expanded      |        0.6019 |          0.03628 | 1,666,670.40 | 101,142.65
      500 Unexpanded    |        0.6435 |          0.03994 | 1,559,979.40 | 104,090.52
      1k Expanded       |        0.6034 |          0.04349 | 1,665,128.10 | 125,144.57
      1k Unexpanded     |        0.6261 |          0.03093 | 1,600,457.50 |  79,024.94
      5k Expanded       |        0.6163 |          0.05926 | 1,636,668.80 | 154,888.85
      5k Unexpanded     |        0.6402 |          0.04002 | 1,567,804.70 | 100,965.55
      10k Expanded      |        0.6036 |          0.05105 | 1,667,237.70 | 142,830.36
      10k Unexpanded    |        0.6128 |          0.02598 | 1,634,633.40 |  72,161.82
      25k Expanded      |        0.6198 |          0.04542 | 1,620,980.50 | 116,662.93
      25k Unexpanded    |        0.5478 |          0.0362  | 1,833,059.10 | 121,233.81
      50k Expanded      |        0.5104 |          0.04347 | 1,973,107.90 | 184,073.49
      50k Unexpanded    |        0.4528 |          0.03387 | 2,219,034.50 | 170,984.32
      ```
      
      After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493
      
      Differential Revision: D10842844
      
      Pulled By: abhimadan
      
      fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
      7528130e
  14. 25 10月, 2018 1 次提交
    • A
      Use only "local" range tombstones during Get (#4449) · 8c78348c
      Abhishek Madan 提交于
      Summary:
      Previously, range tombstones were accumulated from every level, which
      was necessary if a range tombstone in a higher level covered a key in a lower
      level. However, RangeDelAggregator::AddTombstones's complexity is based on
      the number of tombstones that are currently stored in it, which is wasteful in
      the Get case, where we only need to know the highest sequence number of range
      tombstones that cover the key from higher levels, and compute the highest covering
      sequence number at the current level. This change introduces this optimization, and
      removes the use of RangeDelAggregator from the Get path.
      
      In the benchmark results, the following command was used to initialize the database:
      ```
      ./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
      ```
      
      ...and the following command was used to measure read throughput:
      ```
      ./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
      ```
      
      The filluniquerandom command was only run once, and the resulting database was used
      to measure read performance before and after the PR. Both binaries were compiled with
      `DEBUG_LEVEL=0`.
      
      Readrandom results before PR:
      ```
      readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)
      ```
      
      Readrandom results after PR:
      ```
      readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)
      ```
      
      So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).
      
      ----
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449
      
      Differential Revision: D10370575
      
      Pulled By: abhimadan
      
      fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
      8c78348c
  15. 23 10月, 2018 2 次提交
    • M
      Fix user comparator receiving internal key (#4575) · c34cc404
      Maysam Yabandeh 提交于
      Summary:
      There was a bug that the user comparator would receive the internal key instead of the user key. The bug was due to RangeMightExistAfterSortedRun expecting user key but receiving internal key when called in GenerateBottommostFiles. The patch augment an existing unit test to reproduce the bug and fixes it.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4575
      
      Differential Revision: D10500434
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 858346d2fd102cce9e20516d77338c112bdfe366
      c34cc404
    • S
      Dynamic level to adjust level multiplier when write is too heavy (#4338) · 70242636
      Siying Dong 提交于
      Summary:
      Level compaction usually performs poorly when the writes so heavy that the level targets can't be guaranteed. With this improvement, we improve level_compaction_dynamic_level_bytes = true so that in the write heavy cases, the level multiplier can be slightly adjusted based on the size of L0.
      
      We keep the behavior the same if number of L0 files is under 2X compaction trigger and the total size is less than options.max_bytes_for_level_base, so that unless write is so heavy that compaction cannot keep up, the behavior doesn't change.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4338
      
      Differential Revision: D9636782
      
      Pulled By: siying
      
      fbshipit-source-id: e27fc17a7c29c84b00064cc17536a01dacef7595
      70242636
  16. 20 10月, 2018 1 次提交
    • Y
      Add read retry support to log reader (#4394) · da4aa59b
      Yanqin Jin 提交于
      Summary:
      Current `log::Reader` does not perform retry after encountering `EOF`. In the future, we need the log reader to be able to retry tailing the log even after `EOF`.
      
      Current implementation is simple. It does not provide more advanced retry policies. Will address this in the future.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4394
      
      Differential Revision: D9926508
      
      Pulled By: riversand963
      
      fbshipit-source-id: d86d145792a41bd64a72f642a2a08c7b7b5201e1
      da4aa59b
  17. 16 10月, 2018 2 次提交
    • A
      Properly determine a truncated CompactRange stop key (#4496) · 1e384580
      anand1976 提交于
      Summary:
      When a CompactRange() call for a level is truncated before the end key
      is reached, because it exceeds max_compaction_bytes, we need to properly
      set the compaction_end parameter to indicate the stop key. The next
      CompactRange will use that as the begin key. We set it to the smallest
      key of the next file in the level after expanding inputs to get a clean
      cut.
      
      Previously, we were setting it before expanding inputs. So we could end
      up recompacting some files. In a pathological case, where a single key
      has many entries spanning all the files in the level (possibly due to
      merge operands without a partial merge operator, thus resulting in
      compaction output identical to the input), this would result in
      an endless loop over the same set of files.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4496
      
      Differential Revision: D10395026
      
      Pulled By: anand1976
      
      fbshipit-source-id: f0c2f89fee29b4b3be53b6467b53abba8e9146a9
      1e384580
    • Y
      Add support to flush multiple CFs atomically (#4262) · e633983c
      Yanqin Jin 提交于
      Summary:
      Leverage existing `FlushJob` to implement atomic flush of multiple column families.
      
      This PR depends on other PRs and is a subset of #3752 . This PR itself is not sufficient in fulfilling atomic flush.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4262
      
      Differential Revision: D9283109
      
      Pulled By: riversand963
      
      fbshipit-source-id: 65401f913e4160b0a61c0be6cd02adc15dad28ed
      e633983c
  18. 13 10月, 2018 1 次提交
    • Y
      Add listener to sample file io (#3933) · 729a617b
      Yanqin Jin 提交于
      Summary:
      We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running.
      To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933
      
      Differential Revision: D10219571
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
      729a617b
  19. 10 10月, 2018 1 次提交
  20. 05 10月, 2018 1 次提交
  21. 04 10月, 2018 1 次提交
  22. 28 9月, 2018 1 次提交
  23. 14 9月, 2018 1 次提交
  24. 06 9月, 2018 1 次提交
  25. 24 8月, 2018 1 次提交
  26. 21 8月, 2018 1 次提交
  27. 28 7月, 2018 1 次提交
    • Y
      Remove random writes from SST file ingestion (#4172) · 54de5684
      Yanqin Jin 提交于
      Summary:
      RocksDB used to store global_seqno in external SST files written by
      SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the
      `global_seqno`. Since random write is not supported in some non-POSIX compliant
      file systems, external SST file ingestion is not supported on these file
      systems. To address this limitation, we no longer update `global_seqno` during
      file ingestion. Later RocksDB uses the MANIFEST and other information in table
      properties to deduce global seqno for externally-ingested SST files.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172
      
      Differential Revision: D8961465
      
      Pulled By: riversand963
      
      fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
      54de5684
  28. 21 7月, 2018 1 次提交
    • Z
      Avoid unnecessary big for-loop when reporting ticker stats stored in GetContext (#3490) · f95a5b24
      Zhongyi Xie 提交于
      Summary:
      Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker`
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490
      
      Differential Revision: D6969154
      
      Pulled By: miasantreble
      
      fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029
      f95a5b24
  29. 20 7月, 2018 1 次提交
    • Y
      Fix a bug in MANIFEST group commit (#4157) · 2736752b
      Yanqin Jin 提交于
      Summary:
      PR #3944 introduces group commit of `VersionEdit` in MANIFEST. The
      implementation has a bug. When updating the log file number of each column
      family, we must consider only `VersionEdit`s that operate on the same column
      family. Otherwise, a column family may accidentally set its log file number
      higher than actual value, indicating that log files with smaller file number
      will be ignored, thus causing some updates to be lost.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4157
      
      Differential Revision: D8916650
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8f456cf688f17bf35ad87b38e30e899aa162f201
      2736752b
  30. 17 7月, 2018 1 次提交
  31. 14 7月, 2018 1 次提交
    • P
      Relax VersionStorageInfo::GetOverlappingInputs check (#4050) · 90fc4069
      Peter Mattis 提交于
      Summary:
      Do not consider the range tombstone sentinel key as causing 2 adjacent
      sstables in a level to overlap. When a range tombstone's end key is the
      largest key in an sstable, the sstable's end key is so to a "sentinel"
      value that is the smallest key in the next sstable with a sequence
      number of kMaxSequenceNumber. This "sentinel" is guaranteed to not
      overlap in internal-key space with the next sstable. Unfortunately,
      GetOverlappingFiles uses user-keys to determine overlap and was thus
      considering 2 adjacent sstables in a level to overlap if they were
      separated by this sentinel key. This in turn would cause compactions to
      be larger than necessary.
      
      Note that this conflicts with
      https://github.com/facebook/rocksdb/pull/2769 and cases
      `DBRangeDelTest.CompactionTreatsSplitInputLevelDeletionAtomically` to
      fail.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4050
      
      Differential Revision: D8844423
      
      Pulled By: ajkr
      
      fbshipit-source-id: df3f9f1db8f4cff2bff77376b98b83c2ae1d155b
      90fc4069
  32. 29 6月, 2018 2 次提交
    • Z
      fix clang analyzer warnings (#4072) · b3efb1cb
      Zhongyi Xie 提交于
      Summary:
      clang analyze is giving the following warnings:
      > db/compaction_job.cc:1178:16: warning: Called C++ object pointer is null
          } else if (meta->smallest.size() > 0) {
                     ^~~~~~~~~~~~~~~~~~~~~
      db/compaction_job.cc:1201:33: warning: Access to field 'marked_for_compaction' results in a dereference of a null pointer (loaded from variable 'meta')
          meta->marked_for_compaction = sub_compact->builder->NeedCompact();
          ~~~~
      db/version_set.cc:2770:26: warning: Called C++ object pointer is null
              uint32_t cf_id = last_writer->cfd->GetID();
                               ^~~~~~~~~~~~~~~~~~~~~~~~~
      Closes https://github.com/facebook/rocksdb/pull/4072
      
      Differential Revision: D8685852
      
      Pulled By: miasantreble
      
      fbshipit-source-id: b0e2fd9dfc1cbba2317723e09886384b9b1c9085
      b3efb1cb
    • Y
      Support group commits of version edits (#3944) · 26d67e35
      Yanqin Jin 提交于
      Summary:
      This PR supports the group commit of multiple version edit entries corresponding to different column families. Column family drop/creation still cannot be grouped. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
      Closes https://github.com/facebook/rocksdb/pull/3944
      
      Differential Revision: D8432536
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8f11bd05193b6c0d9272d82e44b676abfac113cb
      26d67e35