1. 27 Nov 2019 (6 commits)
    • Fix HISTORY.md for 6.6.0 (#6096) · 496a6ae8
      Committed by anand76
      Summary:
      Some of the entries were incorrectly listed under 6.5.0.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6096
      
      Differential Revision: D18722801
      
      Pulled By: gfosco
      
      fbshipit-source-id: 18d1187deb6a9d69a8feb68b727d2f720a65f2bc
    • Expose and elaborate FilterBuildingContext (#6088) · ca3b6c28
      Committed by Peter Dillinger
      Summary:
      This change enables custom implementations of FilterPolicy to
      wrap a variety of NewBloomFilterPolicy instances and select among them
      based on contextual information such as table level and compaction
      style (see the sketch following the list below).
      
      * Moves FilterBuildingContext to public API and elaborates it with more
      useful data. (It would be nice to put more general options-like data,
      but at the time this object is constructed, we are using internal APIs
      ImmutableCFOptions and MutableCFOptions and don't have easy access to
      ColumnFamilyOptions, as far as I can tell.)
      
      * Renames BloomFilterPolicy::GetFilterBitsBuilderInternal to
      GetBuilderWithContext, because it's now public.
      
      * Plumbs through the table's "level_at_creation" for filter building
      context.
      
      * Simplified some tests by adding GetBuilder() to
      MockBlockBasedTableTester.
      
      * Adds test as DBBloomFilterTest.ContextCustomFilterPolicy, including
      sample wrapper class LevelAndStyleCustomFilterPolicy.
      
      * Fixes a cross-test bug in DBBloomFilterTest.OptimizeFiltersForHits
      where it does not reset perf context.
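
      A minimal sketch of such a wrapper, loosely modeled on the test's
      LevelAndStyleCustomFilterPolicy (the thresholds and constructor
      parameters here are illustrative, not the test's exact values):

      ```
      #include <memory>
      #include <string>
      #include "rocksdb/advanced_options.h"
      #include "rocksdb/filter_policy.h"

      // Selects a different bits/key Bloom policy per build context.
      class LevelAndStyleCustomFilterPolicy : public rocksdb::FilterPolicy {
       public:
        LevelAndStyleCustomFilterPolicy(int bpk_fifo, int bpk_l0, int bpk_other)
            : policy_fifo_(rocksdb::NewBloomFilterPolicy(bpk_fifo)),
              policy_l0_(rocksdb::NewBloomFilterPolicy(bpk_l0)),
              policy_other_(rocksdb::NewBloomFilterPolicy(bpk_other)) {}

        const char* Name() const override {
          return "LevelAndStyleCustomFilterPolicy";
        }

        rocksdb::FilterBitsBuilder* GetBuilderWithContext(
            const rocksdb::FilterBuildingContext& context) const override {
          if (context.compaction_style == rocksdb::kCompactionStyleFIFO) {
            return policy_fifo_->GetBuilderWithContext(context);
          } else if (context.level_at_creation == 0) {
            return policy_l0_->GetBuilderWithContext(context);
          }
          return policy_other_->GetBuilderWithContext(context);
        }

        // The reader side is self-describing, so delegating to any one of
        // the wrapped policies works.
        rocksdb::FilterBitsReader* GetFilterBitsReader(
            const rocksdb::Slice& contents) const override {
          return policy_other_->GetFilterBitsReader(contents);
        }

        // Legacy block-based filter interface; unused in this sketch.
        void CreateFilter(const rocksdb::Slice*, int,
                          std::string*) const override {}
        bool KeyMayMatch(const rocksdb::Slice&,
                         const rocksdb::Slice&) const override {
          return true;
        }

       private:
        const std::unique_ptr<const rocksdb::FilterPolicy> policy_fifo_;
        const std::unique_ptr<const rocksdb::FilterPolicy> policy_l0_;
        const std::unique_ptr<const rocksdb::FilterPolicy> policy_other_;
      };
      ```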
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6088
      
      Test Plan: make check, valgrind on db_bloom_filter_test
      
      Differential Revision: D18697817
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5f987a2d7b07cc7a33670bc08ca6b4ca698c1cf4
    • Allow fractional bits/key in BloomFilterPolicy (#6092) · 57f30322
      Committed by Peter Dillinger
      Summary:
      There's no technological impediment to allowing the Bloom
      filter bits/key to be non-integer (fractional/decimal) values, and it
      provides finer control over the memory vs. accuracy trade-off. This is
      especially handy in using the format_version=5 Bloom filter in place
      of the old one, because bits_per_key=9.55 provides the same accuracy as
      the old bits_per_key=10.
      
      This change not only required refining the logic for choosing the best
      num_probes for a given bits/key setting; it also revealed a flaw in that logic.
      As bits/key gets higher, the best num_probes for a cache-local Bloom
      filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a
      standard Bloom filter. For example, at 16 bits per key, the best
      num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%).
      This change fixes and refines that logic (for the format_version=5
      Bloom filter only, just in case) based on empirical tests to find
      accuracy inflection points between each num_probes.
      
      Although bits_per_key is now specified as a double, the new Bloom
      filter converts/rounds this to "millibits / key" for predictable/precise
      internal computations. Just in case of unforeseen compatibility
      issues, we round to the nearest whole number bits / key for the
      legacy Bloom filter, so as not to unlock new behaviors for it.
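
      A usage sketch of the fractional setting via the public
      NewBloomFilterPolicy API (the 9.55 figure is from the summary above;
      the rest of the setup is illustrative):

      ```
      #include "rocksdb/filter_policy.h"
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      int main() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.format_version = 5;  // opt into the new Bloom filter
        // bits_per_key is now a double; 9.55 matches the accuracy of the
        // legacy filter at bits_per_key=10.
        table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(9.55));

        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return 0;
      }
      ```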
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6092
      
      Test Plan: unit tests included
      
      Differential Revision: D18711313
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 1aa73295f152a995328cb846ef9157ae8a05522a
    • Make default value of options.ttl to be 30 days when it is supported. (#6073) · 77eab5c8
      Committed by sdong
      Summary:
      By default, options.ttl is disabled. We believe a better default is 30 days, which means that, in most cases, data deleted in the database will be removed from SST files shortly after 30 days.
      
      Make the default UINT64_MAX - 1 to indicate that it has not been overridden by users.
      
      Change the periodic_compaction_seconds default from UINT64_MAX to UINT64_MAX - 1 as well, to be consistent. Also fix a small bug in the previous periodic_compaction_seconds default code.
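
      A sketch of the sentinel convention described above (constant names
      are illustrative, not the actual RocksDB identifiers):

      ```
      #include <cstdint>

      constexpr uint64_t kUnset = UINT64_MAX - 1;  // user did not override
      constexpr uint64_t kThirtyDaysSec = 30 * 24 * 60 * 60;

      uint64_t EffectiveTtl(uint64_t user_ttl) {
        // UINT64_MAX - 1 means "not set": apply the 30-day default.
        return user_ttl == kUnset ? kThirtyDaysSec : user_ttl;
      }
      ```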
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6073
      
      Test Plan: Add unit tests for it.
      
      Differential Revision: D18669626
      
      fbshipit-source-id: 957cd4374cafc1557d45a0ba002010552a378cc8
    • Ignore value of BackupableDBOptions::max_valid_backups_to_open when B… (#6072) · fcd7e038
      Committed by Sebastiano Peluso
      Summary:
      This change ignores the value of BackupableDBOptions::max_valid_backups_to_open when a BackupEngine is not read-only.
      
      Issue: https://github.com/facebook/rocksdb/issues/4997
      
      Note on tests: I had to remove the WriteOnlyEngine test case from BackupableDBTest because it was not consistent with the new semantics of BackupableDBOptions::max_valid_backups_to_open. Maybe we should think about adding a new interface for append-only BackupEngines. On the other hand, I changed the LimitBackupsOpened test case to use a read-only BackupEngine, and I added a new test case specific to the change.
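
      A minimal sketch of the option now only taking effect on a read-only
      engine (the backup directory path is illustrative):

      ```
      #include "rocksdb/env.h"
      #include "rocksdb/utilities/backupable_db.h"

      int main() {
        rocksdb::BackupableDBOptions backup_opts("/path/to/backups");
        backup_opts.max_valid_backups_to_open = 1;  // honored: read-only engine

        rocksdb::BackupEngineReadOnly* engine = nullptr;
        rocksdb::Status s = rocksdb::BackupEngineReadOnly::Open(
            rocksdb::Env::Default(), backup_opts, &engine);
        // For a writable BackupEngine::Open, this option is now ignored.
        delete engine;
        return s.ok() ? 0 : 1;
      }
      ```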
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6072
      
      Reviewed By: pdillinger
      
      Differential Revision: D18687364
      
      Pulled By: sebastianopeluso
      
      fbshipit-source-id: 77bc1f927d623964d59137a93de123bbd719da4e
    • Update HISTORY.md for forward compatibility (#6085) · 0bc87442
      Committed by sdong
      Summary:
      https://github.com/facebook/rocksdb/pull/6060 broke forward compatibility for releases from 3.10 to 4.2. Update HISTORY.md to mention it. Also remove it from the compatibility tests.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6085
      
      Differential Revision: D18691694
      
      fbshipit-source-id: 4ef903783dc722b8a4d3e8229abbf0f021a114c9
  2. 23 Nov 2019 (2 commits)
    • Support ttl in Universal Compaction (#6071) · 669ea77d
      Committed by Sagar Vemuri
      Summary:
      `options.ttl` is now supported in universal compaction, similar to how periodic compactions are implemented in PR https://github.com/facebook/rocksdb/issues/5970 .
      Setting `options.ttl` will simply set `options.periodic_compaction_seconds` to execute the periodic compactions code path.
      Discarded PR https://github.com/facebook/rocksdb/issues/4749 in favor of this one.
      
      This is a short term work-around/hack of falling back to periodic compactions when ttl is set.
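
      A sketch of that fallback (illustrative, not the exact RocksDB
      sanitization code): under universal compaction, a ttl setting is
      serviced by the periodic-compaction code path.

      ```
      #include "rocksdb/options.h"

      void ApplyUniversalTtlFallback(rocksdb::ColumnFamilyOptions& options) {
        if (options.compaction_style == rocksdb::kCompactionStyleUniversal &&
            options.ttl > 0) {
          // Reuse the periodic compaction machinery to honor the TTL.
          options.periodic_compaction_seconds = options.ttl;
        }
      }
      ```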
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6071
      
      Test Plan: Added a unit test.
      
      Differential Revision: D18668336
      
      Pulled By: sagar0
      
      fbshipit-source-id: e75f5b81ba949f77ef9eff05e44bb1c757f58612
    • Support options.ttl with options.max_open_files != -1 (#6060) · d8c28e69
      Committed by sdong
      Summary:
      Previously, options.ttl could not be set with options.max_open_files != -1, because it makes use of the creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information is stored in the manifest and, when available, is used instead.

      Note that this change breaks forward compatibility for releases 5.1 and older.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060
      
      Test Plan: Extend the existing test case to options.max_open_files != -1, and simulate backward compatibility in one test case by forcing the value to be 0.
      
      Differential Revision: D18631623
      
      fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830
  3. 21 Nov 2019 (1 commit)
    • Fix a data race between GetColumnFamilyMetaData and MarkFilesBeingCompacted (#6056) · 0ce0edbe
      Committed by Yanqin Jin
      Summary:
      Use db mutex to protect the execution of Version::GetColumnFamilyMetaData()
      called in DBImpl::GetColumnFamilyMetaData().
      Without mutex, GetColumnFamilyMetaData() races with MarkFilesBeingCompacted()
      for access to FileMetaData::being_compacted.
      Other than the mutex, there are several alternatives (a sketch of the chosen fix follows this list).
      
      - Make FileMetaData::being_compacted an atomic variable. This will make
        FileMetaData non-copy-able.
      
      - Separate being_compacted from FileMetaData. This requires re-organizing data
        structures that are already used in many places.
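
      A self-contained sketch of the race and the chosen fix (types are
      illustrative, not the actual RocksDB internals): reads of
      being_compacted must happen under the same mutex that
      MarkFilesBeingCompacted() holds while writing.

      ```
      #include <cstdint>
      #include <mutex>
      #include <vector>

      struct FileMetaData {
        uint64_t file_number = 0;
        bool being_compacted = false;  // racy if read without the db mutex
      };

      class VersionLike {
       public:
        void MarkFilesBeingCompacted(std::vector<FileMetaData*>& inputs) {
          std::lock_guard<std::mutex> l(db_mutex_);  // writer side
          for (auto* f : inputs) f->being_compacted = true;
        }
        std::vector<FileMetaData> GetColumnFamilyMetaData() {
          std::lock_guard<std::mutex> l(db_mutex_);  // fix: reader locks too
          return files_;  // copy out while protected
        }
       private:
        std::mutex db_mutex_;
        std::vector<FileMetaData> files_;
      };
      ```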
      
      Test Plan (dev server):
      ```
      make check
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6056
      
      Differential Revision: D18620488
      
      Pulled By: riversand963
      
      fbshipit-source-id: 87f89660b5d5e2ab4ef7962b7b2a7d00e346aa3b
  4. 20 Nov 2019 (2 commits)
    • Fix corruption with intra-L0 on ingested files (#5958) · ec3e3c3e
      Committed by Little-Wallace
      Summary:
      ## Problem Description
      
      Our process aborted when it called `CheckConsistency`, and the information in `stderr` showed "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421`". Here are the causes of the accident that I investigated.
      
      * RocksDB calls `CheckConsistency` whenever the `MANIFEST` file is updated. It checks the sequence number interval of every file, except files that were ingested.
      * When a file is ingested into RocksDB, it is assigned a global sequence number, and the minimum and maximum seqno of the file are equal, both being the global sequence number.
      * `CheckConsistency` determines whether a file was ingested by whether the smallest and largest seqno of the sstable file are equal.
      * If IntraL0Compaction picks ssts that were just ingested and compacts them into another sst, the `smallest_seqno` of the new file will be smaller than its `largest_seqno`.
          * If more than one file was ingested before the memtable was scheduled to flush, and they are all compacted into one new sstable file by `IntraL0Compaction`, the sequence interval of the new file will be included in the interval of the memtable, so `CheckConsistency` will return a `Corruption`.
          * If an sstable was ingested after the memtable was scheduled to flush, it is assigned a larger seqno than the memtable. Suppose the file is then compacted with other files (all flushed before the memtable) in L0 into one file. This compaction starts before the memtable's flush job starts but completes after the flush job finishes, so the new file produced by the compaction (call it s1) has a larger sequence number interval than the file produced by the flush (call it s2). **But some data in s1 was written into RocksDB before s2, so it is possible that some data in s2 is covered by older data in s1.** Of course, this also produces a `Corruption` because of the seqno overlap. The relationship of the files is:
          > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno < s1.largest_seqno

      So I skip picking sst files that were ingested in `FindIntraL0Compaction` (see the sketch at the end of this summary).
      
      ## Reason
      
      Here is my bug report: https://github.com/facebook/rocksdb/issues/5913
      
      There are two situations that can cause the check to fail.
      
      ### First situation:
      - First we ingest five external ssts into RocksDB, and they happen to land in L0. There is already some data in the memtable, which makes the smallest sequence number of the memtable less than that of the ssts we ingest.
      
      - If there has been a compaction job compacting ssts from L0 to L1, `LevelCompactionPicker` triggers an `IntraL0Compaction` that compacts these five ssts from L0 to L0. We call the result sst A; it was merged from the five ingested ssts.
      
      - Then some data is put into the memtable, and the memtable is flushed to L0. We call this sst B.
      - RocksDB checks consistency, finds that the `smallest_seqno` of B is less than that of A, and crashes. Because A was merged from five ssts, its smallest sequence number is less than its largest sequence number, so RocksDB cannot tell that A was produced from ingested files.
      
      ### Second situation
      
      - First we have flushed many ssts to L0; we call them [s1, s2, s3].
      
      - An immutable memtable is requested to be flushed, but because the flush threads are busy it has not been picked up yet; we call it m1. At that moment, one sst is ingested into L0; we call it s4. Because s4 is ingested after m1 became an immutable memtable, it has a larger sequence number than m1.
      
      - m1 is flushed to L0; because it is small, the flush job finishes quickly. We call the result s5.
      
      - [s1, s2, s3, s4] are compacted into one L0 sst by IntraL0Compaction ("compacted 4@0 files to L0"). We call it s6.
      - When s6 is added to the manifest, the corruption happens: the largest sequence number of s6 equals that of s4, and both are larger than that of s5. But because s1 is older than m1, the smallest sequence number of s6 is smaller than that of s5.
         - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno
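
      A self-contained sketch of the skip condition (illustrative; the real
      check lives in FindIntraL0Compaction): a file whose smallest and
      largest seqno are equal was ingested, and intra-L0 compaction must not
      merge it, or the "equal seqno" marker that CheckConsistency relies on
      is lost.

      ```
      #include <cstdint>

      struct SeqnoInterval {
        uint64_t smallest_seqno;
        uint64_t largest_seqno;
      };

      // Ingested files carry one global seqno, so the interval collapses.
      bool LooksIngested(const SeqnoInterval& f) {
        return f.smallest_seqno == f.largest_seqno;
      }
      // Intra-L0 picking stops extending its input set at such a file.
      ```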
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958
      
      Differential Revision: D18601316
      
      fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267
    • Fix blob context when db_iter uses seek (#6051) · 20b48c64
      Committed by tabokie
      Summary:
      Fix: when `db_iter` falls back to using seek via `FindValueForCurrentKeyUsingSeek`, the `is_blob_` flag is not properly set on encountering a BlobIndex.
      Also patch the existing test for the mentioned code path.
      Signed-off-by: tabokie <xy.tao@outlook.com>
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6051
      
      Differential Revision: D18596274
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 8e4714af263b99dc2c379707d50db88fe6799278
  5. 14 Nov 2019 (1 commit)
    • New Bloom filter implementation for full and partitioned filters (#6007) · f059c7d9
      Committed by Peter Dillinger
      Summary:
      Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.
      
      Speed
      
      The improved speed, at least on recent x86_64, comes from
      * Using fastrange instead of modulo (%)
      * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
      * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
      * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.
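
      A sketch of the fastrange trick named above (a well-known technique;
      this is not copied from the RocksDB source): map a 32-bit hash onto
      [0, range) with one multiply and a shift instead of a slower modulo.

      ```
      #include <cstdint>

      inline uint32_t FastRange32(uint32_t hash, uint32_t range) {
        return static_cast<uint32_t>(
            (static_cast<uint64_t>(hash) * static_cast<uint64_t>(range)) >> 32);
      }
      // e.g. choosing the cache line for a key:
      //   uint32_t line = FastRange32(h, num_cache_lines);
      ```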
      
      Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):
      
      ```
      $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
      Build avg ns/key: 47.7135
      Mixed inside/outside queries...
        Single filter net ns/op: 26.2825
        Random filter net ns/op: 150.459
          Average FP rate %: 0.954651
      $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
      Build avg ns/key: 47.2245
      Mixed inside/outside queries...
        Single filter net ns/op: 63.2978
        Random filter net ns/op: 188.038
          Average FP rate %: 1.13823
      ```
      
      Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.
      
      The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.
      
      Accuracy
      
      The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
      within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.
      
      Accuracy data (generalizes, except old impl gets worse with millions of keys):
      Memory bits per key: FP rate percent old impl -> FP rate percent new impl
      6: 5.70953 -> 5.69888
      8: 2.45766 -> 2.29709
      10: 1.13977 -> 0.959254
      12: 0.662498 -> 0.411593
      16: 0.353023 -> 0.0873754
      24: 0.261552 -> 0.0060971
      50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)
      
      Fixes https://github.com/facebook/rocksdb/issues/5857
      Fixes https://github.com/facebook/rocksdb/issues/4120
      
      Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.
      
      Compatibility
      
      Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007
      
      Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).
      
      Differential Revision: D18294749
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
  6. 13 Nov 2019 (1 commit)
    • Batched MultiGet API for multiple column families (#5816) · 6c7b1a0c
      Committed by anand76
      Summary:
      Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.
      
      As part of this change, the code in the original multi-column-family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both the batching and the non-batching versions of MultiGet.
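
      A usage sketch of the new overload (assuming the signature added by
      this PR; db, cf1 and cf2 stand for an open DB and two column family
      handles, setup omitted):

      ```
      #include <array>
      #include "rocksdb/db.h"

      void BatchedMultiGetAcrossCFs(rocksdb::DB* db,
                                    rocksdb::ColumnFamilyHandle* cf1,
                                    rocksdb::ColumnFamilyHandle* cf2) {
        std::array<rocksdb::ColumnFamilyHandle*, 2> cfs{cf1, cf2};
        std::array<rocksdb::Slice, 2> keys{"key_in_cf1", "key_in_cf2"};
        std::array<rocksdb::PinnableSlice, 2> values;
        std::array<rocksdb::Status, 2> statuses;

        db->MultiGet(rocksdb::ReadOptions(), 2, cfs.data(), keys.data(),
                     values.data(), statuses.data(), /*sorted_input=*/false);
        // statuses[i] holds the per-key result; values pin data blocks,
        // avoiding a copy for the consistent cross-CF read.
      }
      ```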
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816
      
      Test Plan:
      make check
      make asan_check
      asan_crash_test
      
      Differential Revision: D18408676
      
      Pulled By: anand1976
      
      fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
  7. 12 Nov 2019 (2 commits)
    • Fix a buffer overrun problem in BlockBasedTable::MultiGet (#6014) · 03ce7fb2
      Committed by anand76
      Summary:
      The calculation in BlockBasedTable::MultiGet for the required buffer length for reading in compressed blocks is incorrect. It needs to take the 5-byte block trailer into account.
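
      The corrected arithmetic, as a sketch (the 5-byte figure is RocksDB's
      block trailer: 1 byte of compression type plus a 4-byte checksum;
      names here are illustrative):

      ```
      #include <cstddef>
      #include <cstdint>

      constexpr size_t kBlockTrailerSize = 5;

      size_t RequiredReadLen(uint64_t compressed_block_size) {
        // Reading only compressed_block_size bytes overruns the buffer once
        // the trailer is consumed by the verification/decompression path.
        return static_cast<size_t>(compressed_block_size) + kBlockTrailerSize;
      }
      ```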
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6014
      
      Test Plan: Add a unit test DBBasicTest.MultiGetBufferOverrun that fails in asan_check before the fix, and passes after.
      
      Differential Revision: D18412753
      
      Pulled By: anand1976
      
      fbshipit-source-id: 754dfb66be1d5f161a7efdf87be872198c7e3b72
    • Cascade TTL Compactions to move expired key ranges to bottom levels faster (#5992) · c17384fe
      Committed by Sagar Vemuri
      Summary:
      When users use Level-Compaction-with-TTL by setting `cf_options.ttl`, ttl-expired data could take n*ttl time to reach the bottom level (where n is the number of levels), due to how the `creation_time` table property was calculated for files newly created during compaction. The creation time of new files was set to the max of all compaction-input-files' creation times, which essentially reset the ttl as the key range moved across levels. This behavior is now fixed by basing `creation_time` on the minimum of all compaction-input-files' creation times; this causes cascading compactions across levels for the ttl-expired data, getting rid of tombstones/deleted data faster.

      This will help start cascading compactions to move the expired key range to the bottom-most level faster.
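
      The changed rule, as a sketch (illustrative, not the exact RocksDB
      code): the output file's creation_time becomes the minimum over the
      compaction inputs, so the TTL clock keeps running as the key range
      moves down the levels.

      ```
      #include <algorithm>
      #include <cstdint>
      #include <vector>

      // Assumes a non-empty input set.
      uint64_t OutputCreationTime(const std::vector<uint64_t>& input_times) {
        return *std::min_element(input_times.begin(), input_times.end());
      }
      ```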
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5992
      
      Test Plan: `make check`
      
      Differential Revision: D18257883
      
      Pulled By: sagar0
      
      fbshipit-source-id: 00df0bb8d0b7e14d9fc239df2cba8559f3e54cbc
  8. 09 Nov 2019 (1 commit)
    • Auto-GarbageCollect on PurgeOldBackups and DeleteBackup (#6015) · aa63abf6
      Committed by Peter Dillinger
      Summary:
      Only in the case of a crash, power failure, or I/O error in
      DeleteBackup can shared or private files from the backup be left
      behind, and those are not cleaned up by PurgeOldBackups or DeleteBackup,
      only by GarbageCollect. This makes the BackupEngine API "leaky by default."
      Even if it means a modest performance hit, I think we should make
      Delete and Purge do as they say, with ongoing best effort: i.e. future
      calls will attempt to finish any incomplete work from earlier calls.
      
      This change does that by having DeleteBackup and PurgeOldBackups do a
      GarbageCollect, unless (to minimize performance hit) this BackupEngine
      has already done a GarbageCollect and there have been no
      deletion-related I/O errors in that GarbageCollect or since then.
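
      A self-contained sketch of that guard (member and method names are
      illustrative, not the actual RocksDB internals):

      ```
      class BackupEngineLike {
       public:
        bool DeleteBackup(int backup_id) {
          if (might_need_garbage_collect_) {
            // Best effort: finish incomplete deletions from earlier calls.
            might_need_garbage_collect_ = !GarbageCollect();
          }
          return DoDelete(backup_id);
        }

       private:
        // Returns true iff no deletion-related I/O errors occurred.
        bool GarbageCollect() { return true; }
        bool DoDelete(int) { return true; }
        // True until a GarbageCollect completes with no deletion errors.
        bool might_need_garbage_collect_ = true;
      };
      ```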
      
      Rejected alternative 1: remove meta file last instead of first. This would in theory turn partially deleted backups into corrupted backups, but code changes would be needed to allow the missing files and consider it acceptably corrupt, rather than failing to open the BackupEngine. This might be a reasonable choice, but I mostly rejected it because it doesn't solve the legacy problem of cleaning up existing lingering files.
      
      Rejected alternative 2: use a deletion marker file. If deletion started with creating a file that marks a backup as flagged for deletion, then we could reliably detect partially deleted backups and efficiently finish removing them. In addition to not solving the legacy problem, this could be precarious if there's a disk full situation, and we try to create a new file in order to delete some files. Ugh.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6015
      
      Test Plan: Updated unit tests
      
      Differential Revision: D18401333
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 12944e372ce6809f3f5a4c416c3b321a8927d925
  9. 08 Nov 2019 (2 commits)
  10. 01 Nov 2019 (3 commits)
    • Add new persistent 64-bit hash (#5984) · 18f57f5e
      Committed by Peter Dillinger
      Summary:
      For upcoming new SST filter implementations, we will use a new
      64-bit hash function (XXH3 preview, slightly modified). This change
      updates hash.{h,cc} for that change, adds unit tests, and out-of-lines
      the implementations to keep hash.h as clean/small as possible.
      
      In developing the unit tests, I discovered that the XXH3 preview always
      returns zero for the empty string. Zero is problematic for some
      algorithms (including an upcoming SST filter implementation) if it
      occurs more often than at the "natural" rate, so it should not be
      returned from trivial values using trivial seeds. I modified our fork
      of XXH3 to return a modest hash of the seed for the empty string.
      
      With hash function details out-of-lines in hash.h, it makes sense to
      enable XXH_INLINE_ALL, so that direct calls to XXH64/XXH32/XXH3p
      are inlined. To fix array-bounds warnings on some inline calls, I
      injected some casts to uintptr_t in xxhash.cc. (Issue reported to Yann.)
      Revised: Reverted using XXH_INLINE_ALL for now. Some Facebook
      checks are unhappy about #include'ing the xxhash.cc file. I would
      fix that by renaming it to xxhash_cc.h, but to best preserve history I want
      to do that in a separate commit (PR) from the uintptr casts.
      
      Also updated filter_bench for this change, improving the performance
      predictability of dry run hashing and adding support for 64-bit hash
      (for upcoming new SST filter implementations, minor dead code in the
      tool for now).
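
      A usage sketch (assuming the 64-bit entry point added to hash.h here
      is named Hash64 and takes data, length, and a seed; this is an
      internal header, so the include path is illustrative):

      ```
      #include "util/hash.h"

      uint64_t HashKey(const char* key, size_t len) {
        // Unlike the unmodified XXH3 preview, an empty input with a
        // trivial seed no longer hashes to zero.
        return rocksdb::Hash64(key, len, /*seed=*/0);
      }
      ```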
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5984
      
      Differential Revision: D18246567
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6162fbf6381d63c8cc611dd7ec70e1ddc883fbb8
    • Support periodic compaction in universal compaction (#5970) · aa6f7d09
      Committed by sdong
      Summary:
      Previously, periodic compaction was not supported in universal compaction. Add the support using the following approach: if any file is marked as qualified for periodic compaction, trigger a full compaction. If a full compaction is prevented by files being compacted, try to compact levels higher than the files currently being compacted. If in this way we can only compact the last sorted run and none of the files to be compacted qualifies for periodic compaction, skip the compaction. This prevents the same single-level compaction from being executed again and again (a sketch of this policy follows).
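
      A compilable sketch of that policy (all names are illustrative
      stand-ins, not the actual UniversalCompactionPicker API):

      ```
      struct CandidateCompaction {
        bool only_last_sorted_run = false;
        bool some_input_qualifies_for_periodic = false;
      };

      // full may be null when busy files block a full compaction; partial
      // covers the levels above the busy files.
      const CandidateCompaction* PickPeriodic(bool any_file_marked,
                                              const CandidateCompaction* full,
                                              const CandidateCompaction* partial) {
        if (!any_file_marked) return nullptr;
        const CandidateCompaction* c = (full != nullptr) ? full : partial;
        if (c != nullptr && c->only_last_sorted_run &&
            !c->some_input_qualifies_for_periodic) {
          return nullptr;  // would repeat the same single-level compaction
        }
        return c;
      }
      ```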
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5970
      
      Test Plan: Add several test cases.
      
      Differential Revision: D18147097
      
      fbshipit-source-id: 8ecc308154d9aca96fb192c51fbceba3947550c1
    • Make FIFO compaction take default 30 days TTL by default (#5987) · 2a9e5caf
      Committed by sdong
      Summary:
      Right now, by default, FIFO compaction has no TTL. We believe a default TTL of 30 days will be better. With this patch, the default is changed to 30 days. The default of Options.periodic_compaction_seconds will mean the same as options.ttl: if Options.ttl and Options.periodic_compaction_seconds are both left at their defaults, a default 30-day TTL will be used. If both options are set, the stricter value of the two will be used.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5987
      
      Test Plan: Add an option sanitize test to cover the case.
      
      Differential Revision: D18237935
      
      fbshipit-source-id: a6dcea1f36c3849e13c0a69e413d73ad8eab58c9
  11. 30 Oct 2019 (1 commit)
    • Auto enable Periodic Compactions if a Compaction Filter is used (#5865) · 4c9aa30a
      Committed by Sagar Vemuri
      Summary:
      - Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction.
      - The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 still works as before, i.e. it disables periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days go through periodic compactions.

      Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with custom TTL logic. But it is occasionally possible that the compactions get delayed by a considerable time before the TTL expiry, due to factors like low write activity to a key range, data reaching the bottom level, etc. The Periodic Compactions feature was originally built to help such cases. Periodic compactions are now auto-enabled by default when compaction filters or compaction filter factories are used, as collecting garbage is generally helpful in all such cases.
      
      `periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used.
      
      This is done only for the Level Compaction style (a sketch follows).
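
      A sketch of the auto-tuning rule described above (constant and
      parameter names are illustrative, not the exact RocksDB code):

      ```
      #include <cstdint>

      constexpr uint64_t kNotSet = UINT64_MAX;
      constexpr uint64_t kThirtyDaysSec = 30 * 24 * 60 * 60;

      uint64_t SanitizePeriodicCompactionSeconds(bool has_compaction_filter,
                                                 bool is_level_style,
                                                 uint64_t periodic_secs) {
        if (periodic_secs == kNotSet) {
          // Auto-enable only for level compaction when a filter (or filter
          // factory) is set; an explicit 0 still disables it entirely.
          return (has_compaction_filter && is_level_style) ? kThirtyDaysSec : 0;
        }
        return periodic_secs;
      }
      ```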
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865
      
      Test Plan:
      - Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set.
      - `COMPILE_WITH_ASAN=1 make check`
      
      Differential Revision: D17659180
      
      Pulled By: sagar0
      
      fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee
  12. 29 Oct 2019 (1 commit)
  13. 25 Oct 2019 (2 commits)
    • Update column families' log number altogether after flushing during recovery (#5856) · 2309fd63
      Committed by Yanqin Jin
      Summary:
      A bug occasionally shows up in crash tests, and https://github.com/facebook/rocksdb/issues/5851 reproduces it.
      The bug can surface in the following way.
      1. The database has multiple column families.
      2. Between two DB restarts, the last log file is corrupted in the middle (not the tail).
      3. During restart, the DB crashes between flushing two column families.

      Then the DB will fail to open again with the error "SST file is ahead of WALs".
      The solution is to update the log number associated with each column family altogether, after flushing all column families' memtables. The version edits should be written to a new MANIFEST. Only after writing all these version edits succeeds does RocksDB (atomically) point the CURRENT file to the new MANIFEST.
      
      Test plan (on devserver):
      ```
      $make all && make check
      ```
      Specifically
      ```
      $make db_test2
      $./db_test2 --gtest_filter=DBTest2.CrashInRecoveryMultipleCF
      ```
      Also checked for compatibility as follows.
      Use this branch, run DBTest2.CrashInRecoveryMultipleCF and preserve the db directory.
      Then check out 5.4, build ldb, and dump the MANIFEST.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5856
      
      Differential Revision: D17620818
      
      Pulled By: riversand963
      
      fbshipit-source-id: b52ce5969c9a8052cacec2bd805fcfb373589039
    • Propagate SST and blob file numbers through the EventListener interface (#5962) · f7e7b34e
      Committed by Levi Tamasi
      Summary:
      This patch adds a number of new information elements to the FlushJobInfo and
      CompactionJobInfo structures that are passed to EventListeners via the
      OnFlush{Begin, Completed} and OnCompaction{Begin, Completed} callbacks.
      Namely, for flushes, the file numbers of the new SST and the oldest blob file it
      references are propagated. For compactions, the new pieces of information are
      the file number, level, and the oldest blob file referenced by each compaction
      input and output file.
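
      A sketch of consuming the new fields (assuming member names matching
      this summary: the new SST's file number and the oldest referenced blob
      file on FlushJobInfo):

      ```
      #include <iostream>
      #include "rocksdb/listener.h"

      class MyListener : public rocksdb::EventListener {
       public:
        void OnFlushCompleted(rocksdb::DB* /*db*/,
                              const rocksdb::FlushJobInfo& info) override {
          std::cout << "flushed sst #" << info.file_number
                    << " oldest blob #" << info.oldest_blob_file_number
                    << "\n";
        }
      };
      ```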
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5962
      
      Test Plan:
      Extended the EventListener unit tests with logic that checks that these information
      elements are correctly propagated from the corresponding FileMetaData.
      
      Differential Revision: D18095568
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 6874359a6aadb53366b5fe87adcb2f9bd27a0a56
  14. 22 Oct 2019 (2 commits)
    • Fix memory leak on error opening PlainTable (#5951) · 27a12457
      Committed by Peter Dillinger
      Summary:
      Several error paths in opening a plain table would leak memory. PR https://github.com/facebook/rocksdb/issues/5940 opened the leak to one more error path, which happened to be (mistakenly) exercised by CuckooTableDBTest.AdaptiveTable. That test has been fixed, and plain table error cases (more than before) are now exercised as BadOptions1 and BadOptions2
      in PlainTableDBTest. This effectively moved the memory leak to plain_table_db_test.
      
      Also here is a cheap fix for the memory leak, without (yet?) changing the signature of
      ReadTableProperties. This fixes ASAN on unit tests.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5951
      
      Test Plan: make COMPILE_WITH_ASAN=1 check
      
      Differential Revision: D18051940
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e2952930c09a2b46c4f1ff09818c5090426929de
    • LevelIterator to avoid gap after prefix bloom filters out a file (#5861) · a0cd9200
      Committed by sdong
      Summary:
      Right now, when LevelIterator::Seek() is called and a file is filtered out by the prefix bloom filter, the position is put at the beginning of the next file. This is a confusing internal interface, because many keys in the level are skipped. Avoid this behavior by checking the key of the next file against the seek key, and invalidating the whole iterator if the prefix doesn't match.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5861
      
      Test Plan: Add a new unit test to validate the behavior; run all existing tests; run crash_test
      
      Differential Revision: D17918213
      
      fbshipit-source-id: f06b47d937c7cc8919001f18dcc3af5b28c9cdac
  15. 19 Oct 2019 (1 commit)
  16. 17 Oct 2019 (1 commit)
  17. 16 Oct 2019 (1 commit)
  18. 15 Oct 2019 (1 commit)
  19. 09 Oct 2019 (2 commits)
  20. 02 Oct 2019 (1 commit)
    • Revert "Merging iterator to avoid child iterator reseek for some cases (#5286)" (#5871) · 846e0500
      Committed by sdong
      Summary:
      This reverts commit 9fad3e21.
      
      Iterator verification in stress tests sometimes fail for assertion
      table/block_based/block_based_table_reader.cc:2973: void rocksdb::BlockBasedTableIterator<TBlockIter, TValue>::FindBlockForward() [with TBlockIter = rocksdb::DataBlockIter; TValue = rocksdb::Slice]: Assertion `!next_block_is_out_of_bound || user_comparator_.Compare(*read_options_.iterate_upper_bound, index_iter_->user_key()) <= 0' failed.
      
      It is likely linked to https://github.com/facebook/rocksdb/pull/5286 together with https://github.com/facebook/rocksdb/pull/5468, as the former PR makes some child iterators' seeks be avoided, so the upper bound condition fails to be updated there. Strictly speaking, the former PR was merged before the latter one, but the latter one feels like a more important improvement, so I chose to revert the former one for now.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5871
      
      Differential Revision: D17689196
      
      fbshipit-source-id: 4ded5be68f67bee2782d31a29cb72ea68f59dd8c
  21. 27 Sep 2019 (1 commit)
  22. 25 Sep 2019 (1 commit)
    • Fix a bug in format_version 3 + partition filters + prefix search (#5835) · 6652c94f
      Committed by Maysam Yabandeh
      Summary:
      Partitioned filters make use of a top-level index to find the partition in which the filter resides. The top-level index has a key per partition, guaranteed to be larger than or equal to any key in that partition. When used with format_version 3, which excludes the sequence number from index keys, the separator key in the index could be equal to the prefix of the keys in the next partition. In that case, when searching for the key, the top-level index leads us to the previous partition, which has no key with that prefix. The prefix bloom test thus returns false, although the prefix exists in the bloom of the next partition.
      The patch fixes that with a hack: it always adds the prefix of the first key of the next partition to the bloom of the current partition. In this way, in the corner cases where the index leads us to the previous partition, we can still find the prefix in the bloom filter there (see the sketch below).
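
      A self-contained sketch of the hack (types simplified; the real change
      sits in the partitioned filter builder):

      ```
      #include <string>
      #include <vector>

      // When sealing a filter partition, also add the prefix of the next
      // partition's first key, so a top-level-index miss into the previous
      // partition still finds the prefix there.
      void SealPartition(std::vector<std::string>& current_partition_prefixes,
                         const std::string& next_partition_first_key,
                         size_t prefix_len) {
        if (next_partition_first_key.size() >= prefix_len) {
          current_partition_prefixes.push_back(
              next_partition_first_key.substr(0, prefix_len));
        }
      }
      ```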
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5835
      
      Differential Revision: D17513585
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: e2d1ff26c759e6e03875c4d57f4228316ecf50e9
  23. 24 Sep 2019 (1 commit)
  24. 19 Sep 2019 (1 commit)
  25. 17 Sep 2019 (2 commits)