1. 20 Nov, 2019 2 commits
  2. 19 Nov, 2019 3 commits
    • L
      Mark blob files not needed by any memtables/SSTs obsolete (#6032) · 279c4883
      Levi Tamasi committed
      Summary:
      The patch adds logic to mark no longer needed blob files obsolete upon database open
      and whenever a flush or compaction completes. Unneeded blob files are detected by
      iterating through live immutable non-TTL blob files starting from the lowest-numbered one,
      and stopping when a blob file used by any SSTs or potentially used by memtables is found.
      (The latter is determined by comparing the sequence number at which the blob file
      became immutable with the largest sequence number received in flush notifications.)
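      
      The detection loop described above can be sketched roughly as follows. This is an illustration only, not BlobDB's actual code: `BlobFileInfo`, its fields, and `CollectObsoleteBlobFiles` are hypothetical names, and the "potentially used by memtables" check is assumed to be `immutable_sequence > flush_sequence`.
      
      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <map>
      #include <vector>
      
      // Hypothetical, simplified view of a live immutable non-TTL blob file.
      struct BlobFileInfo {
        uint64_t immutable_sequence;  // sequence number at which the file became immutable
        bool referenced_by_sst;       // true if any SST still uses this blob file
      };
      
      // Walks the files in ascending file-number order and returns the numbers of
      // the leading files that are no longer needed. Stops at the first file that
      // is used by an SST or might still be used by a memtable.
      std::vector<uint64_t> CollectObsoleteBlobFiles(
          const std::map<uint64_t, BlobFileInfo>& files, uint64_t flush_sequence) {
        std::vector<uint64_t> obsolete;
        for (const auto& entry : files) {  // std::map iterates lowest number first
          const BlobFileInfo& info = entry.second;
          if (info.referenced_by_sst) {
            break;
          }
          // Assumption: data newer than the last flush may still live in memtables.
          if (info.immutable_sequence > flush_sequence) {
            break;
          }
          obsolete.push_back(entry.first);
        }
        return obsolete;
      }
      ```
      
      Because the scan stops at the first file that is still needed, a higher-numbered file is never reported in this sketch while a lower-numbered one is still in use.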
      
      In addition, the patch cleans up the logic around closing and obsoleting blob files and
      enforces invariants around this area (blob files are now guaranteed to go through the
      stages mutable-non-obsolete, immutable-non-obsolete, and immutable-obsolete in this
      order).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6032
      
      Test Plan: Extended unit tests and tested using the BlobDB mode of `db_bench`.
      
      Differential Revision: D18495610
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 11825b84af74f3f4abfd9bcae04e80870ae58961
      279c4883
    • S
      db_stress to cover total order seek (#6039) · a150604e
      sdong committed
      Summary:
      Right now, in db_stress, as long as a prefix extractor is defined, TestIterator always uses it. There is value in covering total_order_seek = true when a prefix extractor is defined. Add a small chance that this flag is turned on.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6039
      
      Test Plan: Run the test for a while.
      
      Differential Revision: D18539689
      
      fbshipit-source-id: 568790dd7789c9986b83764b870df0423a122d99
      a150604e
    • A
      Fix a test failure on systems that don't have Snappy compression libraries (#6038) · 5b9233bf
      anand76 committed
      Summary:
      The ParallelIO/DBBasicTestWithParallelIO.MultiGet/11 test fails if the Snappy compression library is not installed, since RocksDB defaults to Snappy if no compression type is specified. So dynamically determine the supported compression types and pick the first one.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6038
      
      Differential Revision: D18532370
      
      Pulled By: anand1976
      
      fbshipit-source-id: a0a735114d1f8892ea09f7c4af8688d7bcc5b075
      5b9233bf
  3. 16 Nov, 2019 1 commit
  4. 15 Nov, 2019 3 commits
  5. 14 Nov, 2019 4 commits
    • P
      More fixes to auto-GarbageCollect in BackupEngine (#6023) · e8e7fb1d
      Peter Dillinger committed
      Summary:
      Production:
      * Fixes GarbageCollect (and auto-GC triggered by PurgeOldBackups, DeleteBackup, or CreateNewBackup) to clean up backup directory independent of current settings (except max_valid_backups_to_open; see issue https://github.com/facebook/rocksdb/issues/4997) and prior settings used with same backup directory.
      * Fixes GarbageCollect (and auto-GC) not to attempt to remove "." and ".." entries from directories.
      * Clarifies contract with users in modifying BackupEngine operations. In short, leftovers from any incomplete operation are cleaned up by any subsequent call to that same kind of operation (PurgeOldBackups and DeleteBackup considered the same kind of operation). GarbageCollect is available to clean up after all kinds. (NB: right now PurgeOldBackups and DeleteBackup will clean up after incomplete CreateNewBackup, but we aren't promising to continue that behavior.)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6023
      
      Test Plan:
      * Refactors open parameters to use an option enum, for readability, etc. (Also fixes an unused parameter bug in the redundant OpenDBAndBackupEngineShareWithChecksum.)
      * Fixes an apparent bug in ShareTableFilesWithChecksumsTransition in which old backup data was destroyed in the transition to be tested. That test is now augmented to ensure GarbageCollect (or auto-GC) does not remove shared files when BackupEngine is opened with share_table_files=false.
      * Augments DeleteTmpFiles test to ensure that CreateNewBackup does auto-GC when an incompletely created backup is detected.
      
      Differential Revision: D18453559
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5e54e7b08d711b161bc9c656181012b69a8feac4
      e8e7fb1d
    • P
      New Bloom filter implementation for full and partitioned filters (#6007) · f059c7d9
      Peter Dillinger committed
      Summary:
      Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.
      
      Speed
      
      The improved speed, at least on recent x86_64, comes from
      * Using fastrange instead of modulo (%)
      * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
      * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
      * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.
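      
      The first bullet above (fastrange instead of modulo) can be illustrated with a short sketch. `FastRange32` is a name chosen here for illustration and is not necessarily the helper used in the patch; the technique maps a 32-bit hash onto `[0, range)` with a multiply and a shift.
      
      ```cpp
      #include <cassert>
      #include <cstdint>
      
      // Maps a 32-bit hash onto [0, range) using a 64-bit multiply and a shift,
      // avoiding the integer division that `hash % range` would need.
      inline uint32_t FastRange32(uint32_t hash, uint32_t range) {
        uint64_t product = static_cast<uint64_t>(hash) * range;
        return static_cast<uint32_t>(product >> 32);
      }
      ```
      
      Unlike modulo, the result is determined mostly by the high bits of the hash, so this approach assumes the hash function mixes its upper bits well.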
      
      Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):
      
      $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
      Build avg ns/key: 47.7135
      Mixed inside/outside queries...
        Single filter net ns/op: 26.2825
        Random filter net ns/op: 150.459
          Average FP rate %: 0.954651
      $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
      Build avg ns/key: 47.2245
      Mixed inside/outside queries...
        Single filter net ns/op: 63.2978
        Random filter net ns/op: 188.038
          Average FP rate %: 1.13823
      
      Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.
      
      The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.
      
      Accuracy
      
      The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.
      
      Accuracy data (generalizes, except old impl gets worse with millions of keys):
      Memory bits per key: FP rate percent old impl -> FP rate percent new impl
      6: 5.70953 -> 5.69888
      8: 2.45766 -> 2.29709
      10: 1.13977 -> 0.959254
      12: 0.662498 -> 0.411593
      16: 0.353023 -> 0.0873754
      24: 0.261552 -> 0.0060971
      50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)
      
      Fixes https://github.com/facebook/rocksdb/issues/5857
      Fixes https://github.com/facebook/rocksdb/issues/4120
      
      Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.
      
      Compatibility
      
      Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007
      
      Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).
      
      Differential Revision: D18294749
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
      f059c7d9
    • F
      fix typo (#6025) · f382f44e
      Fatih Şentürk committed
      Summary:
      Fix a typo in the Java README page.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6025
      
      Differential Revision: D18481232
      
      fbshipit-source-id: 1c70c2435bcd4b02f25e28cd7e35c42273e07be0
      f382f44e
    • S
      Fix a regression bug on total order seek with prefix enabled and range delete (#6028) · bb23bfe6
      sdong committed
      Summary:
      Recent change https://github.com/facebook/rocksdb/pull/5861 mistakenly uses "prefix_extractor_ != nullptr" as the condition to determine whether the prefix bloom filter is used. It fails to consider read_options.total_order_seek, so it is wrong. The result is that an optimization for non-total-order seek is mistakenly applied to total order seek, and introduces a bug in the following corner case:
      Because of RangeDelete(), a file's largest key is extended. Seek key falls into the range deleted file, so level iterator seeks into the previous file without getting any key. The correct behavior is to place the iterator to the first key of the next file. However, an optimization is triggered and invalidates the iterator because it is out of the prefix range, causing wrong results. This behavior is reproduced in the unit test added.
      Fix the bug by setting prefix_extractor to be null if total order seek is used.
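      
      The idea behind the fix can be sketched as follows. This is a minimal illustration, not the patch itself: `SliceTransformStub`, `ReadOptionsStub`, and `EffectivePrefixExtractor` are hypothetical stand-ins for the real RocksDB types and code path.
      
      ```cpp
      #include <cassert>
      
      // Hypothetical stand-ins for the real RocksDB types, for illustration only.
      struct SliceTransformStub {};
      struct ReadOptionsStub {
        bool total_order_seek = false;
      };
      
      // Returns the prefix extractor the iterator should use. When the caller asks
      // for a total order seek, no extractor is returned, so optimizations that
      // only apply to prefix seeks stay disabled.
      const SliceTransformStub* EffectivePrefixExtractor(
          const SliceTransformStub* configured, const ReadOptionsStub& read_options) {
        return read_options.total_order_seek ? nullptr : configured;
      }
      ```
      
      Downstream code that checks only whether the extractor is non-null then behaves correctly without having to consult the read options again.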
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6028
      
      Test Plan: Add a unit test which fails without the fix.
      
      Differential Revision: D18479063
      
      fbshipit-source-id: ac075f013029fcf69eb3a598f14c98cce3e810b3
      bb23bfe6
  6. 13 Nov, 2019 2 commits
    • P
      Fix BloomFilterPolicy changes for unsigned char (ARM) (#6024) · 42b5494e
      Peter Dillinger committed
      Summary:
      Bug in PR https://github.com/facebook/rocksdb/issues/5941 when char is unsigned that should only affect an assertion on unused/invalid filter metadata.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6024
      
      Test Plan: on ARM: ./bloom_test && ./db_bloom_filter_test && ./block_based_filter_block_test && ./full_filter_block_test && ./partitioned_filter_block_test
      
      Differential Revision: D18461206
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 68a7c813a0b5791c05265edc03cdf52c78880e9a
      42b5494e
    • A
      Batched MultiGet API for multiple column families (#5816) · 6c7b1a0c
      anand76 committed
      Summary:
      Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.
      
      As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both the batching and the non-batching versions of MultiGet.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816
      
      Test Plan:
      make check
      make asan_check
      asan_crash_test
      
      Differential Revision: D18408676
      
      Pulled By: anand1976
      
      fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
      6c7b1a0c
  7. 12 Nov, 2019 5 commits
    • S
      db_stress to cover SeekForPrev() (#6022) · a19de78d
      sdong committed
      Summary:
      Right now, db_stress doesn't cover SeekForPrev(). Add the coverage, which mirrors what we do for Seek().
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6022
      
      Test Plan: Run "make crash_test". Manually hack the source code to simulate wrong iterator results and verify that they are caught.
      
      Differential Revision: D18442193
      
      fbshipit-source-id: 879b79000d5e33c625c7e970636de191ccd7776c
      a19de78d
    • A
      Fix a buffer overrun problem in BlockBasedTable::MultiGet (#6014) · 03ce7fb2
      anand76 committed
      Summary:
      The calculation in BlockBasedTable::MultiGet for the required buffer length for reading in compressed blocks is incorrect. It needs to take the 5-byte block trailer into account.
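      
      A minimal sketch of the corrected size calculation, assuming only what the summary states (a 5-byte trailer after each block). `RequiredBufferSize` and `kBlockTrailerSize` are illustrative names here, not necessarily those in the patch.
      
      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <vector>
      
      // Assumption for this sketch: every block on disk is followed by a 5-byte
      // trailer, as stated in the summary above.
      constexpr std::size_t kBlockTrailerSize = 5;
      
      // Total scratch space needed to read the given blocks back to back,
      // counting one trailer per block.
      std::size_t RequiredBufferSize(const std::vector<std::size_t>& block_sizes) {
        std::size_t total = 0;
        for (std::size_t size : block_sizes) {
          total += size + kBlockTrailerSize;
        }
        return total;
      }
      ```
      
      Leaving the trailer out under-sizes the buffer by 5 bytes per block, which is the kind of overrun that asan_check reports.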
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6014
      
      Test Plan: Add a unit test DBBasicTest.MultiGetBufferOverrun that fails in asan_check before the fix, and passes after.
      
      Differential Revision: D18412753
      
      Pulled By: anand1976
      
      fbshipit-source-id: 754dfb66be1d5f161a7efdf87be872198c7e3b72
      03ce7fb2
    • bugfix: MemTableList::RemoveOldMemTables invalid iterator after remov… (#6013) · f29e6b3b
      蔡渠棠 committed
      Summary:
      Fix issue https://github.com/facebook/rocksdb/issues/6012.
      
      I found that it may be caused by the following code in the function _RemoveOldMemTables()_ in **db/memtable_list.cc**:
      ```
        for (auto it = memlist.rbegin(); it != memlist.rend(); ++it) {
          MemTable* mem = *it;
          if (mem->GetNextLogNumber() > log_number) {
            break;
          }
          current_->Remove(mem, to_delete);
      ```
      
      The iterator **it** becomes invalid after `current_->Remove(mem, to_delete);`
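      
      One common way to avoid this class of bug is to collect the candidates first and modify the container afterwards. The sketch below shows that pattern on a plain `std::list`; it is not necessarily how the patch fixes the issue, and `MemTableStub` and `RemoveOldEntries` are hypothetical names.
      
      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <list>
      #include <vector>
      
      struct MemTableStub {  // hypothetical stand-in for MemTable
        uint64_t next_log_number;
      };
      
      // Removes, starting from the back of the list, every entry whose
      // next_log_number is <= log_number. Candidates are collected first so that
      // no iterator is used after the container has been modified.
      void RemoveOldEntries(std::list<MemTableStub*>& memlist, uint64_t log_number,
                            std::vector<MemTableStub*>* to_delete) {
        assert(to_delete != nullptr);
        std::vector<MemTableStub*> old_entries;
        for (auto it = memlist.rbegin(); it != memlist.rend(); ++it) {
          if ((*it)->next_log_number > log_number) {
            break;
          }
          old_entries.push_back(*it);
        }
        for (MemTableStub* mem : old_entries) {
          memlist.remove(mem);  // safe: no live iterator into memlist here
          to_delete->push_back(mem);
        }
      }
      ```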
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6013
      
      Test Plan:
      ```
      make check
      ```
      
      Differential Revision: D18401107
      
      Pulled By: riversand963
      
      fbshipit-source-id: bf0da3b868ed70f7aff24cf7b3e2049c0c5c7a4e
      f29e6b3b
    • S
      Cascade TTL Compactions to move expired key ranges to bottom levels faster (#5992) · c17384fe
      Sagar Vemuri committed
      Summary:
      When users use Level-Compaction-with-TTL by setting `cf_options.ttl`, the ttl-expired data could take n*ttl time to reach the bottom level (where n is the number of levels) due to how the `creation_time` table property was calculated for the newly created files during compaction. The creation time of new files was set to a max of all compaction-input-files-creation-times which essentially resulted in resetting the ttl as the key range moves across levels. This behavior is now fixed by changing the `creation_time` to be based on minimum of all compaction-input-files-creation-times; this will cause cascading compactions across levels for the ttl-expired data to move to the bottom level, resulting in getting rid of tombstones/deleted-data faster.
      
      This will help start cascading compactions to move the expired key range to the bottom-most level faster.
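      
      The effect of switching from max to min can be sketched as follows. `OutputCreationTime` and `IsTtlExpired` are hypothetical helpers written for this illustration, and the exact expiry comparison is an assumption.
      
      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstdint>
      #include <vector>
      
      // Creation time for a compaction output file, taken as the oldest creation
      // time among the inputs so that compaction does not reset the TTL clock.
      uint64_t OutputCreationTime(const std::vector<uint64_t>& input_creation_times) {
        assert(!input_creation_times.empty());
        return *std::min_element(input_creation_times.begin(),
                                 input_creation_times.end());
      }
      
      // Assumption: a file counts as TTL-expired once its age exceeds ttl_seconds.
      bool IsTtlExpired(uint64_t creation_time, uint64_t now, uint64_t ttl_seconds) {
        return now > creation_time && now - creation_time > ttl_seconds;
      }
      ```
      
      With inputs created at times 50, 75, and 100, a TTL of 100, and a current time of 200, the min-based output (50) is already expired and can be picked up by the next TTL compaction, whereas a max-based output (100) would not yet be expired.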
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5992
      
      Test Plan: `make check`
      
      Differential Revision: D18257883
      
      Pulled By: sagar0
      
      fbshipit-source-id: 00df0bb8d0b7e14d9fc239df2cba8559f3e54cbc
      c17384fe
    • L
      BlobDB: Maintain mapping between blob files and SSTs (#6020) · 8e7aa628
      Levi Tamasi committed
      Summary:
      The patch adds logic to BlobDB to maintain the mapping between blob files
      and SSTs for which the blob file in question is the oldest blob file referenced
      by the SST file. The mapping is initialized during database open based on the
      information retrieved using `GetLiveFilesMetaData`, and updated after
      flushes/compactions based on the information received through the `EventListener`
      interface (or, in the case of manual compactions issued through the `CompactFiles`
      API, the `CompactionJobInfo` object).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6020
      
      Test Plan: Added a unit test; also tested using the BlobDB mode of `db_bench`.
      
      Differential Revision: D18410508
      
      Pulled By: ltamasi
      
      fbshipit-source-id: dd9e778af781cfdb0d7056298c54ba9cebdd54a5
      8e7aa628
  8. 09 Nov, 2019 2 commits
    • P
      Auto-GarbageCollect on PurgeOldBackups and DeleteBackup (#6015) · aa63abf6
      Peter Dillinger committed
      Summary:
      Only if there is a crash, power failure, or I/O error in
      DeleteBackup, shared or private files from the backup might be left
      behind that are not cleaned up by PurgeOldBackups or DeleteBackup-- only
      by GarbageCollect. This makes the BackupEngine API "leaky by default."
      Even if it means a modest performance hit, I think we should make
      Delete and Purge do as they say, with ongoing best effort: i.e. future
      calls will attempt to finish any incomplete work from earlier calls.
      
      This change does that by having DeleteBackup and PurgeOldBackups do a
      GarbageCollect, unless (to minimize performance hit) this BackupEngine
      has already done a GarbageCollect and there have been no
      deletion-related I/O errors in that GarbageCollect or since then.
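      
      The bookkeeping described in this paragraph can be modeled with a single flag. The class below is a simplified sketch under that assumption; `AutoGcTracker` and its methods are hypothetical and are not part of the BackupEngine API.
      
      ```cpp
      #include <cassert>
      
      // Hypothetical, simplified model of when DeleteBackup / PurgeOldBackups
      // should run GarbageCollect before doing their own work.
      class AutoGcTracker {
       public:
        // True when a delete or purge call should run GarbageCollect first.
        bool ShouldGarbageCollect() const { return might_need_gc_; }
      
        // Records the outcome of a GarbageCollect pass.
        void OnGarbageCollectDone(bool had_deletion_io_error) {
          might_need_gc_ = had_deletion_io_error;
        }
      
        // Records a deletion-related I/O error seen outside of GarbageCollect.
        void OnDeletionIoError() { might_need_gc_ = true; }
      
       private:
        // Starts true: a fresh engine cannot know what earlier runs left behind.
        bool might_need_gc_ = true;
      };
      ```
      
      After one clean GarbageCollect the flag stays false, so later deletes skip the extra pass until a deletion-related I/O error sets it again.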
      
      Rejected alternative 1: remove meta file last instead of first. This would in theory turn partially deleted backups into corrupted backups, but code changes would be needed to allow the missing files and consider it acceptably corrupt, rather than failing to open the BackupEngine. This might be a reasonable choice, but I mostly rejected it because it doesn't solve the legacy problem of cleaning up existing lingering files.
      
      Rejected alternative 2: use a deletion marker file. If deletion started with creating a file that marks a backup as flagged for deletion, then we could reliably detect partially deleted backups and efficiently finish removing them. In addition to not solving the legacy problem, this could be precarious if there's a disk full situation, and we try to create a new file in order to delete some files. Ugh.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6015
      
      Test Plan: Updated unit tests
      
      Differential Revision: D18401333
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 12944e372ce6809f3f5a4c416c3b321a8927d925
      aa63abf6
    • Y
      Fix DBFlushTest::FireOnFlushCompletedAfterCommittedResult hang (#6018) · 72de842a
      Yi Wu committed
      Summary:
      The test fires two flushes to let them run in parallel. Previously it waited for the first job to be scheduled before firing the second. It is possible that the first job has not started before the second job is scheduled, causing the two jobs to be combined into one. Change the test to wait for the first job to start.
      
      Fixes https://github.com/facebook/rocksdb/issues/6017
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6018
      
      Test Plan:
      ```
      while ./db_flush_test --gtest_filter=*FireOnFlushCompletedAfterCommittedResult*; do :; done
      ```
      and let it run for a while.
      Signed-off-by: Yi Wu <yiwu@pingcap.com>
      
      Differential Revision: D18405576
      
      Pulled By: riversand963
      
      fbshipit-source-id: 6ebb6262e033d5dc2ef81cb3eb410b314f2de4c9
      72de842a
  9. 08 Nov, 2019 7 commits
  10. 07 Nov, 2019 3 commits
    • S
      db_stress: improve TestGet() failure printing (#5989) · 111ebf31
      sdong committed
      Summary:
      Right now, in the TestGet case of db_stress's CF consistency test, if a failure happens, we do normal string printing rather than hex printing, so some text is not printed out, which makes debugging harder. Fix it by printing hex instead.
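      
      To show why hex output helps, here is a minimal hex encoder. `ToHex` is a hypothetical helper written for this sketch and is not the function db_stress uses.
      
      ```cpp
      #include <cassert>
      #include <string>
      
      // Encodes every byte as two uppercase hex digits so that non-printable
      // bytes (including '\0') remain visible in log output.
      std::string ToHex(const std::string& data) {
        static const char kDigits[] = "0123456789ABCDEF";
        std::string out;
        out.reserve(data.size() * 2);
        for (unsigned char byte : data) {
          out.push_back(kDigits[byte >> 4]);
          out.push_back(kDigits[byte & 0x0F]);
        }
        return out;
      }
      ```
      
      A value containing an embedded zero byte would be cut short by a `%s`-style print, while the hex form shows every byte.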
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5989
      
      Test Plan: Build db_stress and see that it passes.
      
      Differential Revision: D18363552
      
      fbshipit-source-id: 09d1b8f6fbff37441cbe7e63a1aef27551226cec
      111ebf31
    • Z
      Workload generator (Mixgraph) based on prefix hotness (#5953) · 8ea087ad
      Zhichao Cao committed
      Summary:
      In the previous PR https://github.com/facebook/rocksdb/issues/4788, users can use the db_bench mixgraph option to generate a workload modeled on the social graph. The key is generated based on the key access hotness. In this PR, users can further model the key-range hotness and fit it to a two-term exponential distribution. First, the user cuts the whole key space into small key ranges (e.g., key ranges of the same size, with the number of key ranges equal to the number of SST files). Then, the user calculates the average access count per key of each key range as the key-range hotness. Next, the user fits the key-range hotness to a two-term exponential distribution (f(x) = a*exp(b*x) + c*exp(d*x)) and obtains the values of a, b, c, and d. They are the parameters in db_bench: keyrange_dist_a, keyrange_dist_b, keyrange_dist_c, and keyrange_dist_d. Finally, the user can run db_bench by specifying the parameters.
      For example:
      `./db_bench --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=268435456 -key_dist_a=0.002312 -key_dist_b=0.3467 -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=350 -sine_b=0.0105 -sine_d=50000 --perf_level=2 -reads=1000000 -num=5000000 -key_size=48`
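      
      For reference, the two-term exponential model itself is easy to evaluate. The sketch below is independent of db_bench: `TwoTermExp` is a hypothetical helper, and how db_bench interprets x (for example, as a key-range index) is not specified here.
      
      ```cpp
      #include <cassert>
      #include <cmath>
      
      // Two-term exponential model: f(x) = a*exp(b*x) + c*exp(d*x).
      double TwoTermExp(double a, double b, double c, double d, double x) {
        return a * std::exp(b * x) + c * std::exp(d * x);
      }
      ```
      
      With the values from the example command (a=14.18, b=-2.917, c=0.0164, d=-0.08082), f(0) equals a + c and the function decreases as x grows, since a and c are positive while b and d are negative.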
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5953
      
      Test Plan: run db_bench with different parameters and checked the results.
      
      Differential Revision: D18053527
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 171f8b3142bd76462f1967c58345ad7e4f84bab7
      8ea087ad
    • M
      Enable write-conflict snapshot in stress tests (#5897) · 50804656
      Maysam Yabandeh committed
      Summary:
      DBImpl extends the public GetSnapshot() with GetSnapshotForWriteConflictBoundary() method that takes snapshots specially for write-write conflict checking. Compaction treats such snapshots differently to avoid GCing a value written after that, so that the write conflict remains visible even after the compaction. The patch extends stress tests with such snapshots.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5897
      
      Differential Revision: D17937476
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: bd8b0c578827990302194f63ae0181e15752951d
      50804656
  11. 06 Nov, 2019 3 commits
  12. 05 Nov, 2019 1 commit
    • M
      WritePrepared: Fix flaky test MaxCatchupWithNewSnapshot (#5850) · 52733b44
      Maysam Yabandeh committed
      Summary:
      MaxCatchupWithNewSnapshot tests that the snapshot sequence number will be larger than the max sequence number when the snapshot was taken. However, since the test does not have access to the max sequence number when the snapshot was taken, it uses the max sequence number after that, which could have advanced past the snapshot by then, thus making the test flaky.
      The fix is to compare with max sequence number before the snapshot was taken, which is a lower bound for the value when the snapshot was taken.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5850
      
      Test Plan: ~/gtest-parallel/gtest-parallel --repeat=12800 ./write_prepared_transaction_test --gtest_filter="*MaxCatchupWithNewSnapshot*"
      
      Differential Revision: D17608926
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: b122ae5a27f982b290bd60da852e28d3c5eb0136
      52733b44
  13. 02 Nov, 2019 3 commits
  14. 01 Nov, 2019 1 commit
    • S
      crash_test: disable periodic compaction in FIFO compaction. (#5993) · 5b656584
      sdong committed
      Summary:
      A recent commit made the periodic compaction option valid in FIFO compaction, where it means TTL. But we failed to disable it in the crash test, causing an assertion failure. Fix it by disabling it.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5993
      
      Test Plan: Restart "make crash_test" many times and make sure --periodic_compaction_seconds=0 is always the case when --compaction_style=2
      
      Differential Revision: D18263223
      
      fbshipit-source-id: c91a802017d83ae89ac43827d1b0012861933814
      5b656584