提交 · 669ea77d9fa1ea1b8f811cf312162ceb011476d4 · kvdb / rocksdb

23 11月, 2019 4 次提交

Support ttl in Universal Compaction (#6071) · 669ea77d

由 Sagar Vemuri 提交于 11月 22, 2019

Summary:
`options.ttl` is now supported in universal compaction, similar to how periodic compactions are implemented in PR https://github.com/facebook/rocksdb/issues/5970 .
Setting `options.ttl` will simply set `options.periodic_compaction_seconds` to execute the periodic compactions code path.
Discarded PR https://github.com/facebook/rocksdb/issues/4749 in lieu of this.

This is a short term work-around/hack of falling back to periodic compactions when ttl is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6071

Test Plan: Added a unit test.

Differential Revision: D18668336

Pulled By: sagar0

fbshipit-source-id: e75f5b81ba949f77ef9eff05e44bb1c757f58612

669ea77d

Fix the constness issues around autovector::iterator_impl's dereference operators (#6057) · 75dfc788

由 Levi Tamasi 提交于 11月 22, 2019

Summary:
As described in detail in issue https://github.com/facebook/rocksdb/issues/6048, iterators' dereference operators
(`*`, `->`, and `[]`) should return `pointer`s/`reference`s (as opposed to
`const_pointer`s/`const_reference`s) even if the iterator itself is `const`
to be in sync with the standard's iterator concept.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6057

Test Plan: make check

Differential Revision: D18623235

Pulled By: ltamasi

fbshipit-source-id: 04e82d73bc0c67fb0ded018383af8dfc332050cc

75dfc788

Support options.ttl with options.max_open_files = -1 (#6060) · d8c28e69

由 sdong 提交于 11月 22, 2019

Summary:
Previously, options.ttl cannot be set with options.max_open_files = -1, because it makes use of creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information will be stored in manifest and when it is available, will be used instead.

Note that, this change will break forward compatibility for release 5.1 and older.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060

Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatility in one test case by forcing the value to be 0.

Differential Revision: D18631623

fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830

d8c28e69

Compatible changes for cmake (#6045) · adcf920f

由 suzanwen 提交于 11月 22, 2019

Summary:
`${TESTUTILLIB}` should be linked with targets`${LIBS}`, otherwise it may not find the references. After that, we have to work fine with `${CMAKE_CURRENT_SOURCE_DIR}` in `cmake/modules/ReadVersion.cmake`, while building external projects with `add_subdirectory(/path/to/rocksdb)`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6045

Differential Revision: D18641791

Pulled By: pdillinger

fbshipit-source-id: a56b03b4dda6bae6edce1375324f51340917dddc

adcf920f

22 11月, 2019 1 次提交

fix unstable unittest caused by #5958 (#6061) · e50b64bd

由 Little-Wallace 提交于 11月 21, 2019

Summary:
Signed-off-by: NLittle-Wallace <bupt2013211450@gmail.com>

This PR is to fix unstable unit test added by (https://github.com/facebook/rocksdb/pull/5958).
I set SYNC_POINT in PickCompaction before. If IntraL0Compaction was trigger, the compact job which compact sst to base level would start instantly. If the compaction thread run faster than unittest main thread, we may observe the number of files in L0 reduce.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6061

Differential Revision: D18642301

fbshipit-source-id: 3e4da2ee963532b6e142336951ea3f47d46df148

e50b64bd

21 11月, 2019 4 次提交

Fix a data race between GetColumnFamilyMetaData and MarkFilesBeingCompacted (#6056) · 0ce0edbe

由 Yanqin Jin 提交于 11月 20, 2019

Summary:
Use db mutex to protect the execution of Version::GetColumnFamilyMetaData()
called in DBImpl::GetColumnFamilyMetaData().
Without mutex, GetColumnFamilyMetaData() races with MarkFilesBeingCompacted()
for access to FileMetaData::being_compacted.
Other than mutex, there are several more alternatives.

- Make FileMetaData::being_compacted an atomic variable. This will make
  FileMetaData non-copy-able.

- Separate being_compacted from FileMetaData. This requires re-organizing data
  structures that are already used in many places.

Test Plan (dev server):
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6056

Differential Revision: D18620488

Pulled By: riversand963

fbshipit-source-id: 87f89660b5d5e2ab4ef7962b7b2a7d00e346aa3b

0ce0edbe

Add asserts in transaction example (#6055) · c0983d06

由 Cheng Chang 提交于 11月 20, 2019

Summary:
The intention of the example for read committed is clearer with these added asserts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6055

Test Plan: `cd examples && make transaction_example && ./transaction_example`

Differential Revision: D18621830

Pulled By: riversand963

fbshipit-source-id: a94b08c5958b589049409ee4fc4d6799e5cbef79

c0983d06

Add operator[] to autovector::iterator_impl. (#6047) · 3cd75736

由 Stephan T. Lavavej 提交于 11月 20, 2019

Summary:
This is a required operator for random-access iterators, and an upcoming update for Visual Studio 2019 will change the C++ Standard Library's heap algorithms to use this operator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6047

Differential Revision: D18618531

Pulled By: ltamasi

fbshipit-source-id: 08d10bc85bf2dbc3f7ef0fa3c777e99f1e927ef5

3cd75736

Sanitize input in DB::MultiGet() API (#6054) · 27ec3b34

由 sdong 提交于 11月 20, 2019

Summary:
The new DB::MultiGet() doesn't validate input for num_keys > 1 and GCC-9 complains about it. Fix it by directly return when num_keys == 0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6054

Test Plan: Build with GCC-9 and see it passes.

Differential Revision: D18608958

fbshipit-source-id: 1c279aff3c7fe6e9d5a6d085ed02550ecea4fdb2

27ec3b34

20 11月, 2019 7 次提交

Fixes for g++ 4.9.2 compatibility (#6053) · 0306e012

由 Peter Dillinger 提交于 11月 19, 2019

Summary:
Taken from merryChris in https://github.com/facebook/rocksdb/issues/6043

Stackoverflow ref on {{}} vs. {}:
https://stackoverflow.com/questions/26947704/implicit-conversion-failure-from-initializer-list

Note to reader: .clear() does not empty out an ostringstream, but .str("")
suffices because we don't have to worry about clearing error flags.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6053

Test Plan: make check, manual run of filter_bench

Differential Revision: D18602259

Pulled By: pdillinger

fbshipit-source-id: f6190f83b8eab4e80e7c107348839edabe727841

0306e012

Fix corruption with intra-L0 on ingested files (#5958) · ec3e3c3e

由 Little-Wallace 提交于 11月 19, 2019

Summary:
## Problem Description

Our process was abort when it call `CheckConsistency`. And the information in `stderr` show that "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421` ". Here are the causes of the accident I investigated.

* RocksDB will call `CheckConsistency` whenever `MANIFEST` file is update. It will check sequence number interval of every file, except files which were ingested.
* When one file is ingested into RocksDB, it will be assigned the value of global sequence number, and the minimum and maximum seqno of this file are equal, which are both equal to global sequence number.
* `CheckConsistency` determines whether the file is ingested by whether the smallest and largest seqno of an sstable file are equal.
* If IntraL0Compaction picks one sst which was ingested just now and compacted it into another sst, the `smallest_seqno` of this new file will be smaller than his `largest_seqno`.
* If more than one ingested file was ingested before memtable schedule flush, and they all compact into one new sstable file by `IntraL0Compaction`. The sequence interval of this new file will be included in the interval of the memtable. So `CheckConsistency` will return a `Corruption`.
* If a sstable was ingested after the memtable was schedule to flush, which would assign a larger seqno to it than memtable. Then the file was compacted with other files (these files were all flushed before the memtable) in L0 into one file. This compaction start before the flush job of memtable start, but completed after the flush job finish. So this new file produced by the compaction (we call it s1) would have a larger interval of sequence number than the file produced by flush (we call it s2). **But there was still some data in s1 written into RocksDB before the s2, so it's possible that some data in s2 was cover by old data in s1.** Of course, it would also make a `Corruption` because of overlap of seqno. There is the relationship of the files:
> s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno < s1.largest_seqno

So I skip pick sst file which was ingested in function `FindIntraL0Compaction `

## Reason

Here is my bug report: https://github.com/facebook/rocksdb/issues/5913

There are two situations that can cause the check to fail.

### First situation：
- First we ingest five external sst into Rocksdb, and they happened to be ingested in L0. and there had been some data in memtable, which make the smallest sequence number of memtable is less than which of sst that we ingest.

- If there had been one compaction job which compacted sst from L0 to L1, `LevelCompactionPicker` would trigger a `IntraL0Compaction` which would compact this five sst from L0 to L0. We call this sst A, which was merged from five ingested sst.

- Then some data was put into memtable, and memtable was flushed to L0. We called this sst B.
- RocksDB check consistency , and find the `smallest_seqno` of B is less than that of A and crash. Because A was merged from five sst, the smallest sequence number of it was less than the biggest sequece number of itself, so RocksDB could not tell if A was produce by ingested.

### Secondary situaion

- First we have flushed many sst in L0, we call them [s1, s2, s3].

- There is an immutable memtable request to be flushed, but because flush thread is busy, so it has not been picked. we call it m1. And at the moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became immutable memtable, so it has a larger log sequence number than m1.

- m1 is flushed in L0. because it is small, this flush job finish quickly. we call it s5.

- [s1, s2, s3, s4] are compacted into one sst to L0, by IntraL0Compaction. We call it s6.
- compacted 4@0 files to L0
- When s6 is added into manifest, the corruption happened. because the largest sequence number of s6 is equal to s4, and they are both larger than that of s5. But because s1 is older than m1, so the smallest sequence number of s6 is smaller than that of s5.
- s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958

Differential Revision: D18601316

fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267

ec3e3c3e

Disable blob iterator test with max_sequential_skip_in_iterations==0 in LITE mode (#6052) · 019eb1f4

由 Levi Tamasi 提交于 11月 19, 2019

Summary:
The SetOptions API used by the test is not supported in LITE mode,
so we should skip the new chunk in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6052

Test Plan: Ran the unit tests both in regular and LITE mode.

Differential Revision: D18601763

Pulled By: ltamasi

fbshipit-source-id: 883d6882771e0fb4aae72bb77ba4e63d9febec04

019eb1f4

db_stress sometimes generates keys close to SST file boundaries (#6037) · 4e0dcd36

由 sdong 提交于 11月 19, 2019

Summary:
Recently, a bug was found related to a seek key that is close to SST file boundary. However, it only occurs in a very small chance in db_stress, because the chance that a random key hits SST file boundaries is small. To boost the chance, with 1/16 chance, we pick keys that are close to SST file boundaries.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6037

Test Plan: Did some manual printing out, and hack to cover the key generation logic to be correct.

Differential Revision: D18598476

fbshipit-source-id: 13b76687d106c5be4e3e02a0c77fa5578105a071

4e0dcd36

Fix blob context when db_iter uses seek (#6051) · 20b48c64

由 tabokie 提交于 11月 19, 2019

Summary:
Fix: when `db_iter` falls back to using seek by `FindValueForCurrentKeyUsingSeek`, `is_blob_` flag is not properly set on encountering BlobIndex.
Also patch existing test for the mentioned code path.
Signed-off-by: Ntabokie <xy.tao@outlook.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6051

Differential Revision: D18596274

Pulled By: ltamasi

fbshipit-source-id: 8e4714af263b99dc2c379707d50db88fe6799278

20b48c64

Fix test failure in LITE mode (#6050) · 38cc6112

由 anand76 提交于 11月 19, 2019

Summary:
GetSupportedCompressions() is not available in LITE build, so check and use Snappy compression in db_basic_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6050

Test Plan:
make LITE=1 check
make check

Differential Revision: D18588114

Pulled By: anand1976

fbshipit-source-id: a193de58c44f91bcc237107f25dbc1b9458eef3d

38cc6112

Remove a few unnecessary includes · ac498cdb

由 Peter Dillinger 提交于 11月 19, 2019

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6046

Test Plan: make check, manual inspection

Differential Revision: D18573044

Pulled By: pdillinger

fbshipit-source-id: 7a5999fc08d798ce3157b56d4b36d24027409fc3

ac498cdb

19 11月, 2019 3 次提交

Mark blob files not needed by any memtables/SSTs obsolete (#6032) · 279c4883

由 Levi Tamasi 提交于 11月 18, 2019

Summary:
The patch adds logic to mark no longer needed blob files obsolete upon database open
and whenever a flush or compaction completes. Unneeded blob files are detected by
iterating through live immutable non-TTL blob files starting from the lowest-numbered one,
and stopping when a blob file used by any SSTs or potentially used by memtables is found.
(The latter is determined by comparing the sequence number at which the blob file
became immutable with the largest sequence number received in flush notifications.)

In addition, the patch cleans up the logic around closing and obsoleting blob files and
enforces invariants around this area (blob files are now guaranteed to go through the
stages mutable-non-obsolete, immutable-non-obsolete, and immutable-obsolete in this
order).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6032

Test Plan: Extended unit tests and tested using the BlobDB mode of `db_bench`.

Differential Revision: D18495610

Pulled By: ltamasi

fbshipit-source-id: 11825b84af74f3f4abfd9bcae04e80870ae58961

279c4883

db_stress to cover total order seek (#6039) · a150604e

由 sdong 提交于 11月 18, 2019

Summary:
Right now, in db_stress, as long as prefix extractor is defined, TestIterator always uses. There is value of cover total_order_seek = true when prefix extractor is define. Add a small chance that this flag is turned on.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6039

Test Plan: Run the test for a while.

Differential Revision: D18539689

fbshipit-source-id: 568790dd7789c9986b83764b870df0423a122d99

a150604e

Fix a test failure on systems that don't have Snappy compression libraries (#6038) · 5b9233bf

由 anand76 提交于 11月 18, 2019

Summary:
The ParallelIO/DBBasicTestWithParallelIO.MultiGet/11 test fails if Snappy compression library is not installed, since RocksDB defaults to Snappy if none is specified. So dynamically determine the supported compression types and pick the first one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6038

Differential Revision: D18532370

Pulled By: anand1976

fbshipit-source-id: a0a735114d1f8892ea09f7c4af8688d7bcc5b075

5b9233bf

16 11月, 2019 1 次提交

Fix IngestExternalFile's bug with two_write_queue (#5976) · f65ec09e

由 Little-Wallace 提交于 11月 15, 2019

Summary:
When two_write_queue enable, IngestExternalFile performs EnterUnbatched on both write queues. SwitchMemtable also EnterUnbatched on 2nd write queue when this option is enabled. When the call stack includes IngestExternalFile -> FlushMemTable -> SwitchMemtable, this results into a deadlock.
The implemented solution is to pass on the existing writes_stopped argument in FlushMemTable to skip EnterUnbatched in SwitchMemtable.
Fixes https://github.com/facebook/rocksdb/issues/5974
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5976

Differential Revision: D18535943

Pulled By: maysamyabandeh

fbshipit-source-id: a4f9d4964c10d4a7ca06b1e0102ca2ec395512bc

f65ec09e

15 11月, 2019 3 次提交

Disable SmallestUnCommittedSeq in Valgrind run (#6035) · 0058daef

由 Maysam Yabandeh 提交于 11月 14, 2019

Summary:
SmallestUnCommittedSeq sometimes takes too long when run under Valgrind. The patch disables it when the tests are run under Valgrind.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6035

Differential Revision: D18509198

Pulled By: maysamyabandeh

fbshipit-source-id: 1191443b9fedb6b9c50d6b76f5c92371f5030230

0058daef

Abandon use of folly::Optional (#6036) · 00d58a37

由 Peter Dillinger 提交于 11月 14, 2019

Summary:
Had complications with LITE build and valgrind test.
Reverts/fixes small parts of PR https://github.com/facebook/rocksdb/issues/6007
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6036

Test Plan:
make LITE=1 all check
and
ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 make -j24 db_bloom_filter_test && ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 ./db_bloom_filter_test

Differential Revision: D18512238

Pulled By: pdillinger

fbshipit-source-id: 37213cf0d309edf11c483fb4b2fb6c02c2cf2b28

00d58a37

crash_test: use large max_manifest_file_size most of the time. (#6034) · 6123611c

由 sdong 提交于 11月 14, 2019

Summary:
Right now, crash_test always uses 16KB max_manifest_file_size value. It is good to cover logic of manifest file switch. However, information stored in manifest files might be useful in debugging failures. Switch to only use small manifest file size in 1/15 of the time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6034

Test Plan: Observe command generated by db_crash_test.py multiple times and see the --max_manifest_file_size value distribution.

Differential Revision: D18513824

fbshipit-source-id: 7b3ae6dbe521a0918df41064e3fa5ecbf2466e04

6123611c

14 11月, 2019 4 次提交

More fixes to auto-GarbageCollect in BackupEngine (#6023) · e8e7fb1d

由 Peter Dillinger 提交于 11月 14, 2019

Summary:
Production:
* Fixes GarbageCollect (and auto-GC triggered by PurgeOldBackups, DeleteBackup, or CreateNewBackup) to clean up backup directory independent of current settings (except max_valid_backups_to_open; see issue https://github.com/facebook/rocksdb/issues/4997) and prior settings used with same backup directory.
* Fixes GarbageCollect (and auto-GC) not to attempt to remove "." and ".." entries from directories.
* Clarifies contract with users in modifying BackupEngine operations. In short, leftovers from any incomplete operation are cleaned up by any subsequent call to that same kind of operation (PurgeOldBackups and DeleteBackup considered the same kind of operation). GarbageCollect is available to clean up after all kinds. (NB: right now PurgeOldBackups and DeleteBackup will clean up after incomplete CreateNewBackup, but we aren't promising to continue that behavior.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6023

Test Plan:
* Refactors open parameters to use an option enum, for readability, etc. (Also fixes an unused parameter bug in the redundant OpenDBAndBackupEngineShareWithChecksum.)
* Fixes an apparent bug in ShareTableFilesWithChecksumsTransition in which old backup data was destroyed in the transition to be tested. That test is now augmented to ensure GarbageCollect (or auto-GC) does not remove shared files when BackupEngine is opened with share_table_files=false.
* Augments DeleteTmpFiles test to ensure that CreateNewBackup does auto-GC when an incompletely created backup is detected.

Differential Revision: D18453559

Pulled By: pdillinger

fbshipit-source-id: 5e54e7b08d711b161bc9c656181012b69a8feac4

e8e7fb1d

New Bloom filter implementation for full and partitioned filters (#6007) · f059c7d9

由 Peter Dillinger 提交于 11月 13, 2019

Summary:
Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.

Speed

The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.

Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):

$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
Single filter net ns/op: 26.2825
Random filter net ns/op: 150.459
Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
Single filter net ns/op: 63.2978
Random filter net ns/op: 188.038
Average FP rate %: 1.13823

Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.

The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.

Accuracy

The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.

Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)

Fixes https://github.com/facebook/rocksdb/issues/5857
Fixes https://github.com/facebook/rocksdb/issues/4120

Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.

Compatibility

Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007

Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).

Differential Revision: D18294749

Pulled By: pdillinger

fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f

f059c7d9

fix typo (#6025) · f382f44e

由 Fatih Şentürk 提交于 11月 13, 2019

Summary:
fix a typo at java readme page
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6025

Differential Revision: D18481232

fbshipit-source-id: 1c70c2435bcd4b02f25e28cd7e35c42273e07be0

f382f44e

Fix a regression bug on total order seek with prefix enabled and range delete (#6028) · bb23bfe6

由 sdong 提交于 11月 13, 2019

Summary:
Recent change https://github.com/facebook/rocksdb/pull/5861 mistakely use "prefix_extractor_ != nullptr" as the condition to determine whehter prefix bloom filter isused. It fails to consider read_options.total_order_seek, so it is wrong. The result is that an optimization for non-total-order seek is mistakely applied to total order seek, and introduces a bug in following corner case:
Because of RangeDelete(), a file's largest key is extended. Seek key falls into the range deleted file, so level iterator seeks into the previous file without getting any key. The correct behavior is to place the iterator to the first key of the next file. However, an optimization is triggered and invalidates the iterator because it is out of the prefix range, causing wrong results. This behavior is reproduced in the unit test added.
Fix the bug by setting prefix_extractor to be null if total order seek is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6028

Test Plan: Add a unit test which fails without the fix.

Differential Revision: D18479063

fbshipit-source-id: ac075f013029fcf69eb3a598f14c98cce3e810b3

bb23bfe6

13 11月, 2019 2 次提交

Fix BloomFilterPolicy changes for unsigned char (ARM) (#6024) · 42b5494e

由 Peter Dillinger 提交于 11月 12, 2019

Summary:
Bug in PR https://github.com/facebook/rocksdb/issues/5941 when char is unsigned that should only affect
assertion on unused/invalid filter metadata.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6024

Test Plan: on ARM: ./bloom_test && ./db_bloom_filter_test && ./block_based_filter_block_test && ./full_filter_block_test && ./partitioned_filter_block_test

Differential Revision: D18461206

Pulled By: pdillinger

fbshipit-source-id: 68a7c813a0b5791c05265edc03cdf52c78880e9a

42b5494e

Batched MultiGet API for multiple column families (#5816) · 6c7b1a0c

由 anand76 提交于 11月 12, 2019

Summary:
Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.

As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816

Test Plan:
make check
make asan_check
asan_crash_test

Differential Revision: D18408676

Pulled By: anand1976

fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d

6c7b1a0c

12 11月, 2019 5 次提交

db_stress to cover SeekForPrev() (#6022) · a19de78d

由 sdong 提交于 11月 11, 2019

Summary:
Right now, db_stress doesn't cover SeekForPrev(). Add the coverage, which mirrors what we do for Seek().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6022

Test Plan: Run "make crash_test". Do some manual source code hack to simular iterator wrong results and see it caught.

Differential Revision: D18442193

fbshipit-source-id: 879b79000d5e33c625c7e970636de191ccd7776c

a19de78d

Fix a buffer overrun problem in BlockBasedTable::MultiGet (#6014) · 03ce7fb2

由 anand76 提交于 11月 11, 2019

Summary:
The calculation in BlockBasedTable::MultiGet for the required buffer length for reading in compressed blocks is incorrect. It needs to take the 5-byte block trailer into account.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6014

Test Plan: Add a unit test DBBasicTest.MultiGetBufferOverrun that fails in asan_check before the fix, and passes after.

Differential Revision: D18412753

Pulled By: anand1976

fbshipit-source-id: 754dfb66be1d5f161a7efdf87be872198c7e3b72

03ce7fb2

蔡

bugfix: MemTableList::RemoveOldMemTables invalid iterator after remov… (#6013) · f29e6b3b

由蔡渠棠提交于 11月 11, 2019

Summary:
Fix issue https://github.com/facebook/rocksdb/issues/6012.

I found that it may be caused by the following codes in function _RemoveOldMemTables()_ in **db/memtable_list.cc**  :
```
  for (auto it = memlist.rbegin(); it != memlist.rend(); ++it) {
    MemTable* mem = *it;
    if (mem->GetNextLogNumber() > log_number) {
      break;
    }
    current_->Remove(mem, to_delete);
```

The iterator **it** turns invalid after `current_->Remove(mem, to_delete);`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6013

Test Plan:
```
make check
```

Differential Revision: D18401107

Pulled By: riversand963

fbshipit-source-id: bf0da3b868ed70f7aff24cf7b3e2049c0c5c7a4e

f29e6b3b

Cascade TTL Compactions to move expired key ranges to bottom levels faster (#5992) · c17384fe

由 Sagar Vemuri 提交于 11月 11, 2019

Summary:
When users use Level-Compaction-with-TTL by setting `cf_options.ttl`, the ttl-expired data could take n*ttl time to reach the bottom level (where n is the number of levels) due to how the `creation_time` table property was calculated for the newly created files during compaction. The creation time of new files was set to a max of all compaction-input-files-creation-times which essentially resulted in resetting the ttl as the key range moves across levels. This behavior is now fixed by changing the `creation_time` to be based on minimum of all compaction-input-files-creation-times; this will cause cascading compactions across levels for the ttl-expired data to move to the bottom level, resulting in getting rid of tombstones/deleted-data faster.

This will help start cascading compactions to move the expired key range to the bottom-most level faster.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5992

Test Plan: `make check`

Differential Revision: D18257883

Pulled By: sagar0

fbshipit-source-id: 00df0bb8d0b7e14d9fc239df2cba8559f3e54cbc

c17384fe

BlobDB: Maintain mapping between blob files and SSTs (#6020) · 8e7aa628

由 Levi Tamasi 提交于 11月 11, 2019

Summary:
The patch adds logic to BlobDB to maintain the mapping between blob files
and SSTs for which the blob file in question is the oldest blob file referenced
by the SST file. The mapping is initialized during database open based on the
information retrieved using `GetLiveFilesMetaData`, and updated after
flushes/compactions based on the information received through the `EventListener`
interface (or, in the case of manual compactions issued through the `CompactFiles`
API, the `CompactionJobInfo` object).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6020

Test Plan: Added a unit test; also tested using the BlobDB mode of `db_bench`.

Differential Revision: D18410508

Pulled By: ltamasi

fbshipit-source-id: dd9e778af781cfdb0d7056298c54ba9cebdd54a5

8e7aa628

09 11月, 2019 2 次提交

Auto-GarbageCollect on PurgeOldBackups and DeleteBackup (#6015) · aa63abf6

由 Peter Dillinger 提交于 11月 08, 2019

Summary:
Only if there is a crash, power failure, or I/O error in
DeleteBackup, shared or private files from the backup might be left
behind that are not cleaned up by PurgeOldBackups or DeleteBackup-- only
by GarbageCollect. This makes the BackupEngine API "leaky by default."
Even if it means a modest performance hit, I think we should make
Delete and Purge do as they say, with ongoing best effort: i.e. future
calls will attempt to finish any incomplete work from earlier calls.

This change does that by having DeleteBackup and PurgeOldBackups do a
GarbageCollect, unless (to minimize performance hit) this BackupEngine
has already done a GarbageCollect and there have been no
deletion-related I/O errors in that GarbageCollect or since then.

Rejected alternative 1: remove meta file last instead of first. This would in theory turn partially deleted backups into corrupted backups, but code changes would be needed to allow the missing files and consider it acceptably corrupt, rather than failing to open the BackupEngine. This might be a reasonable choice, but I mostly rejected it because it doesn't solve the legacy problem of cleaning up existing lingering files.

Rejected alternative 2: use a deletion marker file. If deletion started with creating a file that marks a backup as flagged for deletion, then we could reliably detect partially deleted backups and efficiently finish removing them. In addition to not solving the legacy problem, this could be precarious if there's a disk full situation, and we try to create a new file in order to delete some files. Ugh.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6015

Test Plan: Updated unit tests

Differential Revision: D18401333

Pulled By: pdillinger

fbshipit-source-id: 12944e372ce6809f3f5a4c416c3b321a8927d925

aa63abf6

Fix DBFlushTest::FireOnFlushCompletedAfterCommittedResult hang (#6018) · 72de842a

由 Yi Wu 提交于 11月 08, 2019

Summary:
The test would fire two flushes to let them run in parallel. Previously it wait for the first job to be scheduled before firing the second. It is possible the job is not started before the second job being scheduled, making the two job combine into one. Change to wait for the first job being started.

Fixes https://github.com/facebook/rocksdb/issues/6017
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6018

Test Plan:
```
while ./db_flush_test --gtest_filter=*FireOnFlushCompletedAfterCommittedResult*; do :; done
```
and let it run for a while.
Signed-off-by: NYi Wu <yiwu@pingcap.com>

Differential Revision: D18405576

Pulled By: riversand963

fbshipit-source-id: 6ebb6262e033d5dc2ef81cb3eb410b314f2de4c9

72de842a

08 11月, 2019 4 次提交

Add file number/oldest referenced blob file number to {Sst,Live}FileMetaData (#6011) · f80050fa

由 Levi Tamasi 提交于 11月 07, 2019

Summary:
The patch exposes the file numbers of the SSTs as well as the oldest blob
files they contain a reference to through the GetColumnFamilyMetaData/
GetLiveFilesMetaData interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6011

Test Plan:
Fixed and extended the existing unit tests. (The earlier ColumnFamilyMetaDataTest
wasn't really testing anything because the generated memtables were never
flushed, so the metadata structure was essentially empty.)

Differential Revision: D18361697

Pulled By: ltamasi

fbshipit-source-id: d5ed1d94ac70858b84393c48711441ddfe1251e9

f80050fa

Download bzip2 packages from sourceforge (#5995) · 07a0ad3c

由 Yun Tang 提交于 11月 07, 2019

Summary:
From bzip2's official [download page](http://www.bzip.org/downloads.html), we could download it from sourceforge. This source would be more credible than previous web archive.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5995

Differential Revision: D18377662

fbshipit-source-id: e8353f83d5d6ea6067f78208b7bfb7f0d5b49c05

07a0ad3c

Fix MultiGet crash when no_block_cache is set (#5991) · 9836a1fa

由 anand76 提交于 11月 07, 2019

Summary:
This PR fixes https://github.com/facebook/rocksdb/issues/5975. In ```BlockBasedTable::RetrieveMultipleBlocks()```, we were calling ```MaybeReadBlocksAndLoadToCache()```, which is a no-op if neither uncompressed nor compressed block cache are configured.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5991

Test Plan:
1. Add unit tests that fail with the old code and pass with the new
2. make check and asan_check

Cc spetrunia

Differential Revision: D18272744

Pulled By: anand1976

fbshipit-source-id: e62fa6090d1a6adf84fcd51dfd6859b03c6aebfe

9836a1fa

Stress test to relax the iterator verification case for lower bound (#5869) · 1da1f042

由 sdong 提交于 11月 07, 2019

Summary:
In stress test, all iterator verification is turned off is lower bound is enabled. This might be stricter than needed. This PR relaxes the condition and include the case where lower bound is lower than both of seek key and upper bound. It seems to work mostly fine when I run crash test locally.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5869

Test Plan: Run crash_test

Differential Revision: D18363578

fbshipit-source-id: 23d57e11ea507949b8100f4190ddfbe8db052d5a

1da1f042

kvdb / rocksdb 10 个月 前同步成功

kvdb / rocksdb
10 个月前同步成功