1. 27 Nov, 2019 15 commits
    • A
      Fix HISTORY.md for 6.6.0 (#6096) · 496a6ae8
      anand76 committed
      Summary:
      Some of the entries were incorrectly listed under 6.5.0.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6096
      
      Differential Revision: D18722801
      
      Pulled By: gfosco
      
      fbshipit-source-id: 18d1187deb6a9d69a8feb68b727d2f720a65f2bc
      496a6ae8
    • P
      Expose and elaborate FilterBuildingContext (#6088) · ca3b6c28
      Peter Dillinger committed
      Summary:
      This change enables custom implementations of FilterPolicy to
      wrap a variety of NewBloomFilterPolicy and select among them based on
      contextual information such as table level and compaction style.
      
      * Moves FilterBuildingContext to public API and elaborates it with more
      useful data. (It would be nice to put more general options-like data,
      but at the time this object is constructed, we are using internal APIs
      ImmutableCFOptions and MutableCFOptions and don't have easy access to
      ColumnFamilyOptions that I can tell.)
      
      * Renames BloomFilterPolicy::GetFilterBitsBuilderInternal to
      GetBuilderWithContext, because it's now public.
      
      * Plumbs through the table's "level_at_creation" for filter building
      context.
      
      * Simplified some tests by adding GetBuilder() to
      MockBlockBasedTableTester.
      
      * Adds test as DBBloomFilterTest.ContextCustomFilterPolicy, including
      sample wrapper class LevelAndStyleCustomFilterPolicy.
      
      * Fixes a cross-test bug in DBBloomFilterTest.OptimizeFiltersForHits
      where it does not reset perf context.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6088
      
      Test Plan: make check, valgrind on db_bloom_filter_test
      
      Differential Revision: D18697817
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5f987a2d7b07cc7a33670bc08ca6b4ca698c1cf4
      ca3b6c28
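
A minimal sketch of the idea behind #6088: a wrapper that selects a filter configuration from contextual information such as table level and compaction style. The types `CompactionStyleSketch` and `FilterContextSketch`, the function `ChooseBitsPerKey`, and all thresholds are hypothetical stand-ins for illustration; they are not the RocksDB API.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; these are NOT the RocksDB types.
enum class CompactionStyleSketch { kLevel, kUniversal, kFifo };

struct FilterContextSketch {
  CompactionStyleSketch compaction_style;
  int level_at_creation;  // in this sketch, -1 means "level unknown"
};

// Picks a Bloom bits-per-key setting from the context. The values 15.0 and
// 10.0 are made up for this example.
inline double ChooseBitsPerKey(const FilterContextSketch& ctx) {
  assert(ctx.level_at_creation >= -1);
  if (ctx.compaction_style != CompactionStyleSketch::kLevel) {
    return 10.0;  // one setting for every non-level compaction style
  }
  if (ctx.level_at_creation == 0) {
    return 15.0;  // a larger filter for files created at L0
  }
  return 10.0;  // everything else, including an unknown level
}
```

A real wrapper would presumably hold several policies created up front and forward to the selected one, rather than return a number.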
    • A
      Fix compilation under MSVC VS2015 (#6081) · 6d58ea90
      Adam Retter committed
      Summary:
      **NOTE**: this also needs to be back-ported to 6.4.6 and possibly older branches if further releases from them are envisaged.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6081
      
      Differential Revision: D18710107
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 03260f9316566e2bfc12c7d702d6338bb7941e01
      6d58ea90
    • P
      Add shared library for musl-libc (#3143) · 8ae149eb
      Patrick Double committed
      Summary:
      Add the jni library for musl-libc, specifically for incorporating into Alpine based docker images. The classifier is `musl64`.
      
      I have signed the CLA electronically.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3143
      
      Differential Revision: D18719372
      
      fbshipit-source-id: 6189d149310b6436d6def7d808566b0234b23313
      8ae149eb
    • L
      Refactor and clean up the code that reads a blob from a file (#6093) · d9314a92
      Levi Tamasi committed
      Summary:
      This patch factors out the logic that reads a (potentially compressed) blob
      from a file into a separate helper method `GetRawBlobFromFile`, and cleans
      up the code a bit. Also, errors during decompression are now logged/propagated
      to the user by returning a `Status` code of `Corruption`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6093
      
      Test Plan: `make check`
      
      Differential Revision: D18716673
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 44144bc064cab616862d5643f34384f2bae6eb78
      d9314a92
    • P
      Allow fractional bits/key in BloomFilterPolicy (#6092) · 57f30322
      Peter Dillinger committed
      Summary:
      There's no technological impediment to allowing the Bloom
      filter bits/key to be non-integer (fractional/decimal) values, and it
      provides finer control over the memory vs. accuracy trade-off. This is
      especially handy in using the format_version=5 Bloom filter in place
      of the old one, because bits_per_key=9.55 provides the same accuracy as
      the old bits_per_key=10.
      
      This change not only requires refining the logic for choosing the best
      num_probes for a given bits/key setting, it revealed a flaw in that logic.
      As bits/key gets higher, the best num_probes for a cache-local Bloom
      filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a
      standard Bloom filter. For example, at 16 bits per key, the best
      num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%).
      This change fixes and refines that logic (for the format_version=5
      Bloom filter only, just in case) based on empirical tests to find
      accuracy inflection points between each num_probes.
      
      Although bits_per_key is now specified as a double, the new Bloom
      filter converts/rounds this to "millibits / key" for predictable/precise
      internal computations. Just in case of unforeseen compatibility
      issues, we round to the nearest whole number bits / key for the
      legacy Bloom filter, so as not to unlock new behaviors for it.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6092
      
      Test Plan: unit tests included
      
      Differential Revision: D18711313
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 1aa73295f152a995328cb846ef9157ae8a05522a
      57f30322
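
The rounding described in #6092 can be sketched as follows. The function names are hypothetical, and the exact rounding used in RocksDB may differ; this only illustrates converting a fractional bits/key value to whole "millibits / key" and, for the legacy filter, to the nearest whole number of bits.

```cpp
#include <cassert>
#include <cmath>

// Rounds a fractional bits-per-key setting to whole "millibits per key".
inline long MillibitsPerKey(double bits_per_key) {
  assert(bits_per_key >= 0.0);
  return std::lround(bits_per_key * 1000.0);
}

// Rounds to the nearest whole number of bits per key, as the commit message
// describes for the legacy Bloom filter.
inline long WholeBitsPerKey(double bits_per_key) {
  return (MillibitsPerKey(bits_per_key) + 500) / 1000;
}
```

With these definitions, a setting of 9.55 becomes 9550 millibits per key for the new filter and rounds to 10 bits per key for the legacy one.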
    • L
      Refactor blob file creation logic (#6066) · 72daa92d
      Levi Tamasi committed
      Summary:
      The patch refactors and cleans up the logic around creating new blob files
      by moving the common code of `SelectBlobFile` and `SelectBlobFileTTL`
      to a new helper method `CreateBlobFileAndWriter`, bringing the implementation
      of `SelectBlobFile` and `SelectBlobFileTTL` into sync, and increasing encapsulation
      by adding new constructors for `BlobFile` and `BlobLogHeader`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6066
      
      Test Plan:
      Ran `make check` and used the BlobDB mode of `db_bench` to sanity test both
      the TTL and the non-TTL code paths.
      
      Differential Revision: D18646921
      
      Pulled By: ltamasi
      
      fbshipit-source-id: e5705a84807932e31dccab4f49b3e64369cea26d
      72daa92d
    • J
      Use lowercase for shlwapi.lib rpcrt4.lib (#6076) · 771e1723
      John Ericson committed
      Summary:
      This fixes MinGW cross compilation from case-sensitive file systems, at no harm to MinGW builds on Windows.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6076
      
      Differential Revision: D18710554
      
      fbshipit-source-id: a9f299ac3aa019f7dbc07ed0c4a79e19cf99b488
      771e1723
    • A
      Fix naming of library on PPC64LE (#6080) · 1bf316e5
      Adam Retter committed
      Summary:
      **NOTE**: This also needs to be back-ported to 6.4.6
      
      Fix a regression introduced in f2bf0b2d by https://github.com/facebook/rocksdb/pull/5674 whereby the compiled library would get the wrong name on PPC64LE platforms.
      
      On PPC64LE, the regression caused the library to be named `librocksdbjni-linux64.so` instead of `librocksdbjni-linux-ppc64le.so`.
      
      This PR corrects the name back to `librocksdbjni-linux-ppc64le.so` and also corrects the ordering of conditional arguments in the Makefile to match the expected order as defined in the documentation for Make.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6080
      
      Differential Revision: D18710351
      
      fbshipit-source-id: d4db87ef378263b57de7f9edce1b7d15644cf9de
      1bf316e5
    • A
      Small improvements to Docker build for RocksJava (#6079) · 7f145195
      Adam Retter committed
      Summary:
      * We can reuse downloaded 3rd-party libraries
      * We can isolate the build to a Docker volume. This is useful for investigating failed builds, as we can examine the volume by assigning it a name during the build.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6079
      
      Differential Revision: D18710263
      
      fbshipit-source-id: 93f456ba44b49e48941c43b0c4d53995ecc1f404
      7f145195
    • P
      Remove unused/undefined ImmutableCFOptions() (#6086) · 4f17d33d
      Peter Dillinger committed
      Summary:
      default constructor not used or even defined
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6086
      
      Differential Revision: D18695669
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6b6ac46029f4fb6edf1c11ee6ce1d9f172b2eaf2
      4f17d33d
    • A
      Update 3rd-party libraries used by RocksJava (#6084) · 382b154b
      Adam Retter committed
      Summary:
      * LZ4 1.8.3 -> 1.9.2
      * ZSTD 1.4.0 -> 1.4.4
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6084
      
      Differential Revision: D18710224
      
      fbshipit-source-id: a461ef19a473d3480acdc027f627ec3048730692
      382b154b
    • S
      Make default value of options.ttl to be 30 days when it is supported. (#6073) · 77eab5c8
      sdong committed
      Summary:
      By default options.ttl is disabled. We believe a better default is 30 days, which means deleted data in the database will be removed from SST files slightly after 30 days in most cases.
      
      Make the default UINT64_MAX - 1 to indicate that it is not overridden by users.
      
      Change periodic_compaction_seconds to be UINT64_MAX - 1 to UINT64_MAX  too to be consistent. Also fix a small bug in the previous periodic_compaction_seconds default code.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6073
      
      Test Plan: Add unit tests for it.
      
      Differential Revision: D18669626
      
      fbshipit-source-id: 957cd4374cafc1557d45a0ba002010552a378cc8
      77eab5c8
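
A minimal sketch of the sentinel approach described in #6073, where `UINT64_MAX - 1` marks "not overridden by the user". The names `kTtlNotOverridden` and `EffectiveTtlSeconds` and the `ttl_supported` flag are hypothetical; in particular, falling back to 0 (disabled) when TTL is unsupported is an assumption of this sketch, not something the commit message states.

```cpp
#include <cassert>  // lets the sketch be exercised with plain assert() checks
#include <cstdint>

// Sentinel meaning "the user did not override the option".
constexpr uint64_t kTtlNotOverridden = UINT64_MAX - 1;
constexpr uint64_t kThirtyDaysInSeconds = 30ULL * 24 * 60 * 60;

// Resolves the effective TTL. `ttl_supported` is a hypothetical flag standing
// in for whatever conditions decide that TTL is supported.
inline uint64_t EffectiveTtlSeconds(uint64_t configured_ttl, bool ttl_supported) {
  if (configured_ttl != kTtlNotOverridden) {
    return configured_ttl;  // an explicit user value always wins
  }
  // Assumption of this sketch: an untouched default means 30 days when
  // supported and "disabled" (0) otherwise.
  return ttl_supported ? kThirtyDaysInSeconds : 0;
}
```

Using a sentinel instead of a plain default lets the code distinguish "the user explicitly chose this value" from "the user left the option untouched".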
    • S
      Ignore value of BackupableDBOptions::max_valid_backups_to_open when B… (#6072) · fcd7e038
      Sebastiano Peluso committed
      Summary:
      This change ignores the value of BackupableDBOptions::max_valid_backups_to_open when a BackupEngine is not read-only.
      
      Issue: https://github.com/facebook/rocksdb/issues/4997
      
      Note on tests: I had to remove test case WriteOnlyEngine of BackupableDBTest because it was not consistent with the new semantics of BackupableDBOptions::max_valid_backups_to_open. Maybe we should think about adding a new interface for append-only BackupEngines. On the other hand, I changed the LimitBackupsOpened test case to use a read-only BackupEngine, and I added a new specific test case for the change.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6072
      
      Reviewed By: pdillinger
      
      Differential Revision: D18687364
      
      Pulled By: sebastianopeluso
      
      fbshipit-source-id: 77bc1f927d623964d59137a93de123bbd719da4e
      fcd7e038
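
The rule in #6072 can be sketched as a single decision. The function name is hypothetical, and using `INT_MAX` to mean "no limit" is an assumption of this sketch.

```cpp
#include <cassert>
#include <climits>

// Returns how many backups the engine would open. INT_MAX stands for
// "no limit" in this sketch.
inline int EffectiveMaxBackupsToOpen(int max_valid_backups_to_open, bool read_only) {
  assert(max_valid_backups_to_open >= 0);
  // The configured limit is honored only for a read-only engine; a writable
  // engine ignores it.
  return read_only ? max_valid_backups_to_open : INT_MAX;
}
```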
    • S
      Update HISTORY.md for forward compatibility (#6085) · 0bc87442
      sdong committed
      Summary:
      https://github.com/facebook/rocksdb/pull/6060 broke forward compatibility for releases from 3.10 to 4.2. Update HISTORY.md to mention it. Also remove it from the compatibility tests.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6085
      
      Differential Revision: D18691694
      
      fbshipit-source-id: 4ef903783dc722b8a4d3e8229abbf0f021a114c9
      0bc87442
  2. 23 Nov, 2019 4 commits
  3. 22 Nov, 2019 1 commit
  4. 21 Nov, 2019 4 commits
    • Y
      Fix a data race between GetColumnFamilyMetaData and MarkFilesBeingCompacted (#6056) · 0ce0edbe
      Yanqin Jin committed
      Summary:
      Use db mutex to protect the execution of Version::GetColumnFamilyMetaData()
      called in DBImpl::GetColumnFamilyMetaData().
      Without mutex, GetColumnFamilyMetaData() races with MarkFilesBeingCompacted()
      for access to FileMetaData::being_compacted.
      Other than mutex, there are several more alternatives.
      
      - Make FileMetaData::being_compacted an atomic variable. This will make
        FileMetaData non-copy-able.
      
      - Separate being_compacted from FileMetaData. This requires re-organizing data
        structures that are already used in many places.
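
A minimal sketch of the chosen approach, with one mutex guarding both the writer and the reader of `being_compacted`. `FileMetaSketch` and `VersionSketch` are hypothetical, simplified stand-ins and not the RocksDB classes.

```cpp
#include <cassert>  // lets the sketch be exercised with plain assert() checks
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; not the RocksDB classes.
struct FileMetaSketch {
  int number = 0;
  bool being_compacted = false;  // plain bool, so the struct stays copyable
};

class VersionSketch {
 public:
  explicit VersionSketch(std::vector<FileMetaSketch> files)
      : files_(std::move(files)) {}

  // Writer: flips the flag on every file while holding the mutex.
  void MarkFilesBeingCompacted(bool value) {
    std::lock_guard<std::mutex> guard(mutex_);
    for (auto& f : files_) f.being_compacted = value;
  }

  // Reader: copies the metadata while holding the same mutex, so it never
  // observes a flag that is being written concurrently.
  std::vector<FileMetaSketch> GetMetaData() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return files_;
  }

 private:
  mutable std::mutex mutex_;
  std::vector<FileMetaSketch> files_;
};
```

The reader returns a copy, which works because the flag is a plain `bool`; switching it to `std::atomic<bool>` would make the struct non-copyable, as the first alternative above notes.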
      
      Test Plan (dev server):
      ```
      make check
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6056
      
      Differential Revision: D18620488
      
      Pulled By: riversand963
      
      fbshipit-source-id: 87f89660b5d5e2ab4ef7962b7b2a7d00e346aa3b
      0ce0edbe
    • C
      Add asserts in transaction example (#6055) · c0983d06
      Cheng Chang committed
      Summary:
      The intention of the example for read committed is clearer with these added asserts.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6055
      
      Test Plan: `cd examples && make transaction_example && ./transaction_example`
      
      Differential Revision: D18621830
      
      Pulled By: riversand963
      
      fbshipit-source-id: a94b08c5958b589049409ee4fc4d6799e5cbef79
      c0983d06
    • S
      Add operator[] to autovector::iterator_impl. (#6047) · 3cd75736
      Stephan T. Lavavej committed
      Summary:
      This is a required operator for random-access iterators, and an upcoming update for Visual Studio 2019 will change the C++ Standard Library's heap algorithms to use this operator.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6047
      
      Differential Revision: D18618531
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 08d10bc85bf2dbc3f7ef0fa3c777e99f1e927ef5
      3cd75736
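
For illustration, the operator added in #6047 has the usual random-access meaning: `it[n]` behaves like `*(it + n)`. The class below is a deliberately tiny, hypothetical iterator over an `int` array, not `autovector::iterator_impl`.

```cpp
#include <cassert>  // lets the sketch be exercised with plain assert() checks
#include <cstddef>

// Hypothetical minimal iterator; only the pieces needed to show operator[].
class IntIter {
 public:
  explicit IntIter(int* p) : p_(p) {}
  int& operator*() const { return *p_; }
  IntIter operator+(std::ptrdiff_t n) const { return IntIter(p_ + n); }
  // Random-access iterators must support it[n], equivalent to *(it + n).
  int& operator[](std::ptrdiff_t n) const { return *(*this + n); }

 private:
  int* p_;
};
```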
    • S
      Sanitize input in DB::MultiGet() API (#6054) · 27ec3b34
      sdong committed
      Summary:
      The new DB::MultiGet() doesn't validate input for num_keys > 1 and GCC-9 complains about it. Fix it by returning directly when num_keys == 0.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6054
      
      Test Plan: Build with GCC-9 and see it passes.
      
      Differential Revision: D18608958
      
      fbshipit-source-id: 1c279aff3c7fe6e9d5a6d085ed02550ecea4fdb2
      27ec3b34
  5. 20 Nov, 2019 7 commits
    • P
      Fixes for g++ 4.9.2 compatibility (#6053) · 0306e012
      Peter Dillinger committed
      Summary:
      Taken from merryChris in https://github.com/facebook/rocksdb/issues/6043
      
      Stackoverflow ref on {{}} vs. {}:
      https://stackoverflow.com/questions/26947704/implicit-conversion-failure-from-initializer-list
      
      Note to reader: .clear() does not empty out an ostringstream, but .str("")
      suffices because we don't have to worry about clearing error flags.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6053
      
      Test Plan: make check, manual run of filter_bench
      
      Differential Revision: D18602259
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f6190f83b8eab4e80e7c107348839edabe727841
      0306e012
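
The note in #6053 about `std::ostringstream` can be checked with a small standalone sketch (the helper names are hypothetical): `clear()` only resets the stream's error state flags, while `str("")` replaces the buffered contents.

```cpp
#include <cassert>  // lets the sketch be exercised with plain assert() checks
#include <sstream>
#include <string>

// clear() resets error flags only; the buffered text is untouched.
inline std::string ContentsAfterClear() {
  std::ostringstream oss;
  oss << "abc";
  oss.clear();
  return oss.str();
}

// str("") replaces the buffer with an empty string, so later writes start
// from scratch.
inline std::string ContentsAfterStrReset() {
  std::ostringstream oss;
  oss << "abc";
  oss.str("");
  oss << "x";
  return oss.str();
}
```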
    • L
      Fix corruption with intra-L0 on ingested files (#5958) · ec3e3c3e
      Little-Wallace committed
      Summary:
      ## Problem Description
      
      Our process aborted when it called `CheckConsistency`, and `stderr` showed "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421`". Here are the causes I found.
      
      * RocksDB calls `CheckConsistency` whenever the `MANIFEST` file is updated. It checks the sequence number interval of every file, except files that were ingested.
      * When a file is ingested into RocksDB, it is assigned the global sequence number, so its minimum and maximum seqno are equal: both are the global sequence number.
      * `CheckConsistency` decides whether a file was ingested by checking whether the smallest and largest seqno of the sstable file are equal.
      * If IntraL0Compaction picks an sst that was just ingested and compacts it into another sst, the `smallest_seqno` of the new file will be smaller than its `largest_seqno`.
          * If more than one file was ingested before the memtable scheduled a flush, and they are all compacted into one new sstable file by `IntraL0Compaction`, the sequence interval of this new file will be contained in the interval of the memtable, so `CheckConsistency` will return a `Corruption`.
          * If an sstable was ingested after the memtable was scheduled to flush, it was assigned a larger seqno than the memtable. The file was then compacted with other L0 files (all flushed before the memtable) into one file. This compaction started before the memtable's flush job started, but completed after the flush job finished. So the new file produced by the compaction (call it s1) has a larger sequence number interval than the file produced by the flush (call it s2). **But some data in s1 was written into RocksDB before s2, so it is possible that some data in s2 is covered by old data in s1.** Of course, this also produces a `Corruption` because the seqno intervals overlap. The relationship between the files is:
          > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno  < s1.largest_seqno
      
      So I skip ingested sst files when picking files in the function `FindIntraL0Compaction`.
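
The two checks discussed above can be sketched as follows. `SeqnoRange`, `LooksIngested`, and `StrictlyContains` are hypothetical names, and reading each pair of numbers in the error message as one file's (smallest, largest) seqno is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of an SST file's sequence number range.
struct SeqnoRange {
  uint64_t smallest;
  uint64_t largest;
};

// The heuristic described above: a file whose smallest and largest seqno are
// equal is treated as ingested.
inline bool LooksIngested(const SeqnoRange& f) {
  return f.smallest == f.largest;
}

// True when `outer` strictly contains `inner`, i.e.
// outer.smallest < inner.smallest <= inner.largest < outer.largest.
inline bool StrictlyContains(const SeqnoRange& outer, const SeqnoRange& inner) {
  assert(outer.smallest <= outer.largest && inner.smallest <= inner.largest);
  return outer.smallest < inner.smallest && inner.largest < outer.largest;
}
```

Under that assumption, the range 3001491972..3004797440 strictly contains 3002875611..3004524421, which matches the nested relationship shown above, and neither file looks ingested because each has distinct smallest and largest seqnos.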
      
      ## Reason
      
      Here is my bug report: https://github.com/facebook/rocksdb/issues/5913
      
      There are two situations that can cause the check to fail.
      
      ### First situation:
      - First we ingest five external sst files into RocksDB, and they happen to be ingested into L0. There is already some data in the memtable, which makes the smallest sequence number of the memtable less than that of the ssts we ingest.
      
      - If there has been a compaction job that compacted ssts from L0 to L1, `LevelCompactionPicker` triggers an `IntraL0Compaction` that compacts these five ssts from L0 to L0. We call the resulting sst, merged from the five ingested ssts, A.
      
      - Then some data is put into the memtable, and the memtable is flushed to L0. We call this sst B.
      - RocksDB checks consistency, finds that the `smallest_seqno` of B is less than that of A, and crashes. Because A was merged from five ssts, its smallest sequence number is less than its largest sequence number, so RocksDB cannot tell that A was produced from ingested files.
      
      ### Second situation
      
      - First we have flushed many ssts into L0; we call them [s1, s2, s3].
      
      - An immutable memtable is waiting to be flushed, but because the flush thread is busy, it has not been picked yet. We call it m1. At this moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became an immutable memtable, it has a larger log sequence number than m1.
      
      - m1 is flushed to L0. Because it is small, this flush job finishes quickly. We call the resulting sst s5.
      
      - [s1, s2, s3, s4] are compacted into one sst in L0 by IntraL0Compaction. We call it s6.
        - compacted 4@0 files to L0
      - When s6 is added to the manifest, the corruption happens, because the largest sequence number of s6 equals that of s4, and both are larger than that of s5. But because s1 is older than m1, the smallest sequence number of s6 is smaller than that of s5.
         - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958
      
      Differential Revision: D18601316
      
      fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267
      ec3e3c3e
    • L
      Disable blob iterator test with max_sequential_skip_in_iterations==0 in LITE mode (#6052) · 019eb1f4
      Levi Tamasi committed
      Summary:
      The SetOptions API used by the test is not supported in LITE mode,
      so we should skip the new chunk in this case.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6052
      
      Test Plan: Ran the unit tests both in regular and LITE mode.
      
      Differential Revision: D18601763
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 883d6882771e0fb4aae72bb77ba4e63d9febec04
      019eb1f4
    • S
      db_stress sometimes generates keys close to SST file boundaries (#6037) · 4e0dcd36
      sdong committed
      Summary:
      Recently, a bug was found related to a seek key that is close to an SST file boundary. However, it occurs only rarely in db_stress, because the chance that a random key hits an SST file boundary is small. To boost the chance, with 1/16 probability we pick keys that are close to SST file boundaries.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6037
      
      Test Plan: Did some manual printing and hacking to check that the key generation logic is correct.
      
      Differential Revision: D18598476
      
      fbshipit-source-id: 13b76687d106c5be4e3e02a0c77fa5578105a071
      4e0dcd36
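
A minimal sketch of the key selection described in #6037. The function `PickKey` and its parameters are hypothetical: `roll` and `pick` are random numbers supplied by the caller so the sketch stays deterministic, and treating "close to a boundary" as "the boundary key or an immediate neighbour" is an assumption, not the db_stress logic.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Picks a key in [0, max_key). In the 1-in-16 case, the key lands on an SST
// boundary key or one of its immediate neighbours.
inline int64_t PickKey(uint64_t roll, uint64_t pick,
                       const std::vector<int64_t>& boundaries, int64_t max_key) {
  assert(max_key > 0);
  if (roll % 16 == 0 && !boundaries.empty()) {
    int64_t base = boundaries[pick % boundaries.size()];
    int64_t offset = static_cast<int64_t>(pick % 3) - 1;  // -1, 0 or +1
    int64_t key = base + offset;
    if (key < 0) key = 0;                    // clamp into the valid key range
    if (key >= max_key) key = max_key - 1;
    return key;
  }
  // Common case: a uniformly chosen key.
  return static_cast<int64_t>(pick % static_cast<uint64_t>(max_key));
}
```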
    • T
      Fix blob context when db_iter uses seek (#6051) · 20b48c64
      tabokie committed
      Summary:
      Fix: when `db_iter` falls back to using seek by `FindValueForCurrentKeyUsingSeek`, `is_blob_` flag is not properly set on encountering BlobIndex.
      Also patch existing test for the mentioned code path.
      Signed-off-by: Ntabokie <xy.tao@outlook.com>
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6051
      
      Differential Revision: D18596274
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 8e4714af263b99dc2c379707d50db88fe6799278
      20b48c64
    • A
      Fix test failure in LITE mode (#6050) · 38cc6112
      anand76 committed
      Summary:
      GetSupportedCompressions() is not available in LITE build, so check and use Snappy compression in db_basic_test.cc.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6050
      
      Test Plan:
      make LITE=1 check
      make check
      
      Differential Revision: D18588114
      
      Pulled By: anand1976
      
      fbshipit-source-id: a193de58c44f91bcc237107f25dbc1b9458eef3d
      38cc6112
    • P
      Remove a few unnecessary includes · ac498cdb
      Peter Dillinger committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6046
      
      Test Plan: make check, manual inspection
      
      Differential Revision: D18573044
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7a5999fc08d798ce3157b56d4b36d24027409fc3
      ac498cdb
  6. 19 Nov, 2019 3 commits
    • L
      Mark blob files not needed by any memtables/SSTs obsolete (#6032) · 279c4883
      Levi Tamasi committed
      Summary:
      The patch adds logic to mark no longer needed blob files obsolete upon database open
      and whenever a flush or compaction completes. Unneeded blob files are detected by
      iterating through live immutable non-TTL blob files starting from the lowest-numbered one,
      and stopping when a blob file used by any SSTs or potentially used by memtables is found.
      (The latter is determined by comparing the sequence number at which the blob file
      became immutable with the largest sequence number received in flush notifications.)
      
      In addition, the patch cleans up the logic around closing and obsoleting blob files and
      enforces invariants around this area (blob files are now guaranteed to go through the
      stages mutable-non-obsolete, immutable-non-obsolete, and immutable-obsolete in this
      order).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6032
      
      Test Plan: Extended unit tests and tested using the BlobDB mode of `db_bench`.
      
      Differential Revision: D18495610
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 11825b84af74f3f4abfd9bcae04e80870ae58961
      279c4883
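
The scan described in #6032 can be sketched as follows. `BlobFileSketch` and `CountObsoletePrefix` are hypothetical names, and the comparison `immutable_sequence > flush_sequence` is this sketch's reading of "potentially used by memtables"; the actual condition in the patch may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified blob file record; not the BlobDB class.
struct BlobFileSketch {
  uint64_t file_number;
  uint64_t immutable_sequence;  // seqno at which the file became immutable
  bool used_by_sst;             // referenced by at least one live SST
};

// Walks immutable blob files in ascending file-number order and returns how
// many leading files could be marked obsolete. The scan stops at the first
// file that is still used by an SST or may still be used by a memtable.
inline std::size_t CountObsoletePrefix(const std::vector<BlobFileSketch>& files,
                                       uint64_t flush_sequence) {
  std::size_t count = 0;
  for (const BlobFileSketch& f : files) {
    // Input must be sorted by file number (lowest first).
    assert(count == 0 || files[count - 1].file_number < f.file_number);
    if (f.used_by_sst) break;
    if (f.immutable_sequence > flush_sequence) break;  // not yet fully flushed
    ++count;
  }
  return count;
}
```

Stopping at the first file still in use, rather than skipping it, keeps the obsolete set a contiguous prefix of the lowest-numbered files.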
    • S
      db_stress to cover total order seek (#6039) · a150604e
      sdong committed
      Summary:
      Right now, in db_stress, as long as a prefix extractor is defined, TestIterator always uses it. There is value in covering total_order_seek = true when a prefix extractor is defined. Add a small chance that this flag is turned on.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6039
      
      Test Plan: Run the test for a while.
      
      Differential Revision: D18539689
      
      fbshipit-source-id: 568790dd7789c9986b83764b870df0423a122d99
      a150604e
    • A
      Fix a test failure on systems that don't have Snappy compression libraries (#6038) · 5b9233bf
      anand76 committed
      Summary:
      The ParallelIO/DBBasicTestWithParallelIO.MultiGet/11 test fails if Snappy compression library is not installed, since RocksDB defaults to Snappy if none is specified. So dynamically determine the supported compression types and pick the first one.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6038
      
      Differential Revision: D18532370
      
      Pulled By: anand1976
      
      fbshipit-source-id: a0a735114d1f8892ea09f7c4af8688d7bcc5b075
      5b9233bf
  7. 16 Nov, 2019 1 commit
  8. 15 Nov, 2019 3 commits
  9. 14 Nov, 2019 2 commits
    • P
      More fixes to auto-GarbageCollect in BackupEngine (#6023) · e8e7fb1d
      Peter Dillinger committed
      Summary:
      Production:
      * Fixes GarbageCollect (and auto-GC triggered by PurgeOldBackups, DeleteBackup, or CreateNewBackup) to clean up backup directory independent of current settings (except max_valid_backups_to_open; see issue https://github.com/facebook/rocksdb/issues/4997) and prior settings used with same backup directory.
      * Fixes GarbageCollect (and auto-GC) not to attempt to remove "." and ".." entries from directories.
      * Clarifies contract with users in modifying BackupEngine operations. In short, leftovers from any incomplete operation are cleaned up by any subsequent call to that same kind of operation (PurgeOldBackups and DeleteBackup considered the same kind of operation). GarbageCollect is available to clean up after all kinds. (NB: right now PurgeOldBackups and DeleteBackup will clean up after incomplete CreateNewBackup, but we aren't promising to continue that behavior.)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6023
      
      Test Plan:
      * Refactors open parameters to use an option enum, for readability, etc. (Also fixes an unused parameter bug in the redundant OpenDBAndBackupEngineShareWithChecksum.)
      * Fixes an apparent bug in ShareTableFilesWithChecksumsTransition in which old backup data was destroyed in the transition to be tested. That test is now augmented to ensure GarbageCollect (or auto-GC) does not remove shared files when BackupEngine is opened with share_table_files=false.
      * Augments DeleteTmpFiles test to ensure that CreateNewBackup does auto-GC when an incompletely created backup is detected.
      
      Differential Revision: D18453559
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5e54e7b08d711b161bc9c656181012b69a8feac4
      e8e7fb1d
    • P
      New Bloom filter implementation for full and partitioned filters (#6007) · f059c7d9
      Peter Dillinger committed
      Summary:
      Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.
      
      Speed
      
      The improved speed, at least on recent x86_64, comes from
      * Using fastrange instead of modulo (%)
      * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
      * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
      * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.
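
For readers unfamiliar with the first item, the general "fastrange" technique maps a hash onto `[0, range)` with a multiply and a shift instead of a modulo. The 32-bit helper below is a sketch of that general technique with a hypothetical name; it is not copied from the RocksDB source.

```cpp
#include <cassert>
#include <cstdint>

// Maps a 32-bit hash onto [0, range) by taking the high 32 bits of the
// 64-bit product hash * range. This avoids the division that % requires.
inline uint32_t FastRange32(uint32_t hash, uint32_t range) {
  assert(range > 0);
  return static_cast<uint32_t>((static_cast<uint64_t>(hash) * range) >> 32);
}
```

Unlike `hash % range`, the result depends mostly on the high bits of the hash, so it relies on the hash being well mixed across all bits.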
      
      Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):
      
      $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
      Build avg ns/key: 47.7135
      Mixed inside/outside queries...
        Single filter net ns/op: 26.2825
        Random filter net ns/op: 150.459
          Average FP rate %: 0.954651
      $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
      Build avg ns/key: 47.2245
      Mixed inside/outside queries...
        Single filter net ns/op: 63.2978
        Random filter net ns/op: 188.038
          Average FP rate %: 1.13823
      
      Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.
      
      The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.
      
      Accuracy
      
      The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
      within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.
      
      Accuracy data (generalizes, except old impl gets worse with millions of keys):
      Memory bits per key: FP rate percent old impl -> FP rate percent new impl
      6: 5.70953 -> 5.69888
      8: 2.45766 -> 2.29709
      10: 1.13977 -> 0.959254
      12: 0.662498 -> 0.411593
      16: 0.353023 -> 0.0873754
      24: 0.261552 -> 0.0060971
      50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)
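
As a point of comparison for the measured numbers above, the textbook estimate for a standard (non-cache-local) Bloom filter with `b` bits per key and `k` probes is `(1 - e^(-k/b))^k`. The helper below is a sketch of that general formula with a hypothetical name; it does not model either RocksDB implementation.

```cpp
#include <cassert>
#include <cmath>

// Textbook false-positive estimate for a standard Bloom filter:
// FP ~= (1 - e^(-k / b))^k for b bits per key and k probes.
inline double StandardBloomFpRate(double bits_per_key, int num_probes) {
  assert(bits_per_key > 0.0 && num_probes > 0);
  return std::pow(1.0 - std::exp(-num_probes / bits_per_key), num_probes);
}
```

At 10 bits per key and 6 probes this estimate is roughly 0.84%, somewhat below the 0.959254% measured for the new cache-local implementation, which is in line with cache locality costing some accuracy.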
      
      Fixes https://github.com/facebook/rocksdb/issues/5857
      Fixes https://github.com/facebook/rocksdb/issues/4120
      
      Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.
      
      Compatibility
      
      Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007
      
      Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).
      
      Differential Revision: D18294749
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
      f059c7d9