- 31 Aug 2019, 1 commit
-
-
Committed by Yanqin Jin
Summary: Before this PR, the following sequence of events can cause an assertion failure, as shown below. Stack trace (partial):
```
(gdb) bt
2  0x00007f59b350ad15 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x9f8390 "mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted", file=file@entry=0x9e347c "db/compaction/compaction.cc", line=line@entry=395, function=function@entry=0xa21ec0 <rocksdb::Compaction::MarkFilesBeingCompacted(bool)::__PRETTY_FUNCTION__> "void rocksdb::Compaction::MarkFilesBeingCompacted(bool)") at assert.c:92
3  0x00007f59b350adc3 in __GI___assert_fail (assertion=assertion@entry=0x9f8390 "mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted", file=file@entry=0x9e347c "db/compaction/compaction.cc", line=line@entry=395, function=function@entry=0xa21ec0 <rocksdb::Compaction::MarkFilesBeingCompacted(bool)::__PRETTY_FUNCTION__> "void rocksdb::Compaction::MarkFilesBeingCompacted(bool)") at assert.c:101
4  0x0000000000492ccd in rocksdb::Compaction::MarkFilesBeingCompacted (this=<optimized out>, mark_as_compacted=<optimized out>) at db/compaction/compaction.cc:394
5  0x000000000049467a in rocksdb::Compaction::Compaction (this=0x7f59af013000, vstorage=0x7f581af53030, _immutable_cf_options=..., _mutable_cf_options=..., _inputs=..., _output_level=<optimized out>, _target_file_size=0, _max_compaction_bytes=0, _output_path_id=0, _compression=<incomplete type>, _compression_opts=..., _max_subcompactions=0, _grandparents=..., _manual_compaction=false, _score=4, _deletion_compaction=true, _compaction_reason=rocksdb::CompactionReason::kFIFOTtl) at db/compaction/compaction.cc:241
6  0x00000000004af9bc in rocksdb::FIFOCompactionPicker::PickTTLCompaction (this=0x7f59b31a6900, cf_name=..., mutable_cf_options=..., vstorage=0x7f581af53030, log_buffer=log_buffer@entry=0x7f59b1bfa930) at db/compaction/compaction_picker_fifo.cc:101
7  0x00000000004b0771 in rocksdb::FIFOCompactionPicker::PickCompaction (this=0x7f59b31a6900, cf_name=..., mutable_cf_options=..., vstorage=0x7f581af53030, log_buffer=0x7f59b1bfa930) at db/compaction/compaction_picker_fifo.cc:201
8  0x00000000004838cc in rocksdb::ColumnFamilyData::PickCompaction (this=this@entry=0x7f59b31b3700, mutable_options=..., log_buffer=log_buffer@entry=0x7f59b1bfa930) at db/column_family.cc:933
9  0x00000000004f3645 in rocksdb::DBImpl::BackgroundCompaction (this=this@entry=0x7f59b3176000, made_progress=made_progress@entry=0x7f59b1bfa6bf, job_context=job_context@entry=0x7f59b1bfa760, log_buffer=log_buffer@entry=0x7f59b1bfa930, prepicked_compaction=prepicked_compaction@entry=0x0, thread_pri=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2541
10 0x00000000004f5e2a in rocksdb::DBImpl::BackgroundCallCompaction (this=this@entry=0x7f59b3176000, prepicked_compaction=prepicked_compaction@entry=0x0, bg_thread_pri=bg_thread_pri@entry=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2312
11 0x00000000004f648e in rocksdb::DBImpl::BGWorkCompaction (arg=<optimized out>) at db/db_impl/db_impl_compaction_flush.cc:2087
```
This can be caused by the following sequence of events.
```
Time
 |   thr    bg_compact_thr1    bg_compact_thr2
 |   write
 |   flush
 |   mark all l0 as being compacted
 |   write
 |   flush
 |   add cf to queue again
 |   mark all l0 as being
 |   compacted, fail the
 |   assertion
 V
```
Test plan (on devserver)
Since bg_compact_thr1 and bg_compact_thr2 are two threads executing the same code, it is difficult to use sync point dependency to coordinate their execution.
Therefore, I choose to use db_stress.
```
$ TEST_TMPDIR=/dev/shm/rocksdb ./db_stress --periodic_compaction_seconds=1 --max_background_compactions=20 --format_version=2 --memtablerep=skip_list --max_write_buffer_number=3 --cache_index_and_filter_blocks=1 --reopen=20 --recycle_log_file_num=0 --acquire_snapshot_one_in=10000 --delpercent=4 --log2_keys_per_lock=22 --compaction_ttl=1 --block_size=16384 --use_multiget=1 --compact_files_one_in=1000000 --target_file_size_multiplier=2 --clear_column_family_one_in=0 --max_bytes_for_level_base=10485760 --use_full_merge_v1=1 --target_file_size_base=2097152 --checkpoint_one_in=1000000 --mmap_read=0 --compression_type=zstd --writepercent=35 --readpercent=45 --subcompactions=4 --use_merge=0 --write_buffer_size=4194304 --test_batches_snapshots=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox --use_direct_reads=0 --compact_range_one_in=1000000 --open_files=-1 --destroy_db_initially=0 --progress_reports=0 --compression_zstd_max_train_bytes=0 --snapshot_hold_ops=100000 --enable_pipelined_write=0 --nooverwritepercent=1 --compression_max_dict_bytes=0 --max_key=1000000 --prefixpercent=5 --flush_one_in=1000000 --ops_per_thread=40000 --index_block_restart_interval=7 --cache_size=1048576 --compaction_style=2 --verify_checksum=1 --delrangepercent=1 --use_direct_io_for_flush_and_compaction=0
```
This should see no assertion failure. Last but not least,
```
$ COMPILE_WITH_ASAN=1 make -j32 all
$ make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5754 Differential Revision: D17109791 Pulled By: riversand963 fbshipit-source-id: 25fc46101235add158554e096540b72c324be078
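To illustrate the kind of guard this race calls for, here is a minimal sketch, not the actual RocksDB code (`FileMeta` and `PickAllL0` are hypothetical stand-ins): a FIFO-style picker that refuses to pick level-0 files which another background compaction has already marked as being compacted.

```cpp
#include <vector>

// Hypothetical stand-in for rocksdb::FileMetaData.
struct FileMeta {
  bool being_compacted = false;
};

// Sketch of a FIFO-style picker: if any L0 file is already part of a running
// compaction, bail out instead of picking (and re-marking) the same files.
std::vector<FileMeta*> PickAllL0(const std::vector<FileMeta*>& level0_files) {
  for (FileMeta* f : level0_files) {
    if (f->being_compacted) {
      return {};  // another background thread already picked these files
    }
  }
  return level0_files;  // safe to mark all of them as being compacted
}
```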
-
- 30 Aug 2019, 5 commits
-
-
Committed by Paul O'Shannessy
Summary: In order to foster healthy open source communities, we're adopting the [Contributor Covenant](https://www.contributor-covenant.org/). It has been built by open source community members and represents a shared understanding of what is expected from a healthy community. Reviewed By: josephsavona, danobi, rdzhabarov Differential Revision: D17104640 fbshipit-source-id: d210000de686c5f0d97d602b50472d5869bc6a49
-
Committed by Pratik Dhandharia
Summary: Open-source users recently reported two occurrences of LSM-tree corruption (https://github.com/facebook/rocksdb/issues/5558 is one), which would be caught by options.force_consistency_checks = true. options.force_consistency_checks has a usability limitation because it crashes the service once inconsistency is detected. This makes the feature hard to use. Most users serve from multiple RocksDB shards per server and the impacts of crashing the service is higher than it should be. Instead, we just pass the error back to users without killing the service, and ask them to deal with the problem accordingly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5744 Differential Revision: D17096940 Pulled By: pdhandharia fbshipit-source-id: b6780039044e265f26ed2ad03c51f4abbe8b603c
-
Committed by anand76
Summary: Row cache is not supported in LITE mode. So disable the test in that mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5756 Test Plan: make LITE=1 all check Differential Revision: D17115684 Pulled By: anand1976 fbshipit-source-id: e6433c2e528674645cea76cdfc80ddc473708fc2
-
Committed by Jeremy Taylor
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5718 Differential Revision: D17096939 Pulled By: riversand963 fbshipit-source-id: 4301078d3ca3d54a1c7e841eccad95379cd1570d
-
Committed by Shafreeck Sea
Summary: Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/5738 Differential Revision: D17097075 Pulled By: riversand963 fbshipit-source-id: ed01b5f59e8eed262a49abe1f96552842d364af1
-
- 29 Aug 2019, 1 commit
-
-
Committed by anand76
Summary: This PR adds support for row cache in ```rocksdb::TableCache::MultiGet```. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5706 Test Plan: 1. Unit tests in db_basic_test 2. db_bench results with batch size of 2 (```Get``` is faster than ```MultiGet``` for single key) - Get - readrandom : 3.935 micros/op 254116 ops/sec; 28.1 MB/s (22870998 of 22870999 found) MultiGet - multireadrandom : 3.743 micros/op 267190 ops/sec; (24047998 of 24047998 found) Command used - TEST_TMPDIR=/dev/shm/multiget numactl -C 10 ./db_bench -use_existing_db=true -use_existing_keys=false -benchmarks="readtorowcache,[read|multiread]random" -write_buffer_size=16777216 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -row_cache_size=4194304000 -batch_size=2 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=131072 Differential Revision: D17086297 Pulled By: anand1976 fbshipit-source-id: 85784378da913e05f1baf31ec1b4e7c9345e7f57
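For context, the row cache is enabled through `Options::row_cache`; below is a minimal sketch of exercising it via `MultiGet` (the path and cache size are illustrative, not from the PR).

```cpp
#include <cassert>
#include <string>
#include <vector>
#include "rocksdb/cache.h"
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.row_cache = rocksdb::NewLRUCache(64 << 20);  // 64 MB row cache

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/row_cache_demo", &db);
  assert(s.ok());
  db->Put(rocksdb::WriteOptions(), "k1", "v1");
  db->Put(rocksdb::WriteOptions(), "k2", "v2");

  // Batched lookup; with this change, repeated MultiGet calls for the same
  // keys can be served from the row cache instead of the table reader.
  std::vector<rocksdb::Slice> keys{"k1", "k2"};
  std::vector<std::string> values;
  std::vector<rocksdb::Status> statuses =
      db->MultiGet(rocksdb::ReadOptions(), keys, &values);
  assert(statuses[0].ok() && statuses[1].ok());
  delete db;
  return 0;
}
```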
-
- 28 Aug 2019, 2 commits
-
-
Committed by sdong
Summary: https://github.com/facebook/rocksdb/pull/5741 added compaction TTL to the crash test, but it causes assertion failures for FIFO compaction. Disable this combination for now while we debug the assertion failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5749 Test Plan: Run crash test and observe that when compaction_style=2, compaction_ttl is always 0. Differential Revision: D17078292 fbshipit-source-id: 446821a3b9739956094d5e4f9be1251a15b57f5d
-
Committed by Pratik Dhandharia
Summary: This PR focuses on replacing some of the reinterpret_cast<DBImpl*> to static_cast_with_check<DBImpl, DB>. Files impacted: ./db/db_impl/db_impl_compaction_flush.cc ./db/write_batch.cc ./utilities/blob_db/blob_db_impl.cc ./utilities/transactions/pessimistic_transaction_db.cc ./utilities/transactions/transaction_base.cc ./utilities/transactions/write_prepared_txn_db.cc ./utilities/transactions/write_unprepared_txn_db.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/5740 Differential Revision: D17055691 Pulled By: pdhandharia fbshipit-source-id: 0f8034d1b32eade56e37d59c04b7bf236a81d8e8
-
- 27 Aug 2019, 3 commits
-
-
Committed by sdong
Summary: Covering periodic compaction and compaction TTL can help us expose potential issues. Add them to the crash test. Randomly select values for these two options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5741 Test Plan: Run crash_test and see the parameters generated. Differential Revision: D17059515 fbshipit-source-id: 8213974846a0b6a22fc13be705825c9054d1d097
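The two parameters being covered map to column family options; a small sketch of what the randomly chosen values end up exercising (the numeric values here are illustrative, not the crash test's actual random choices).

```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  // Compaction TTL: SST files older than this become eligible for compaction.
  options.ttl = 60 * 60 * 24;  // 1 day, illustrative value
  // Periodic compaction: files are rewritten at least this often, so that
  // compaction (and compaction filters) get a chance to visit old data.
  options.periodic_compaction_seconds = 60 * 60 * 24 * 7;  // 1 week
  return options;
}
```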
-
Committed by Andrew Kryczka
Summary: This condition is now a normal occurrence during write burst so there is no need to warn the user about it. Here is a scenario where it happens under completely normal conditions. * Initially we have a DB of three levels (L0, L1, and L2) that is stable, i.e., compaction scores are all less than one. * Now a write burst comes along. At first L0 blows up a bit in size as compaction hasn't had a chance to catch up. * As a result of the above, `base_bytes_min` also increases since it is based on L0 size as of https://github.com/facebook/rocksdb/issues/4338 * If `base_bytes_min` increased enough (i.e., to be larger than L1), then we are shown the warning that the DB has more levels than necessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5742 Differential Revision: D17059221 fbshipit-source-id: e4a31d6eea42089a8d273095f19653991bd91bea
-
Committed by jsteemann
Summary: Avoid reallocations for a scratch std::string on every call to Next(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5702 Differential Revision: D16867803 fbshipit-source-id: 1391220a1b172b23336bbc71dc0c79ccf3b1c701
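The general pattern is to keep the scratch buffer as a member so its capacity is reused across Next() calls; a minimal sketch, independent of the actual RocksDB iterator types (`RecordIterator` and `FillNextRecord` are made up for illustration).

```cpp
#include <string>

// The scratch string lives in the iterator, so repeated Next() calls reuse
// its allocated capacity instead of constructing (and reallocating) a fresh
// std::string every time.
class RecordIterator {
 public:
  bool Next() {
    scratch_.clear();  // keeps capacity, drops contents
    return FillNextRecord(&scratch_);
  }
  const std::string& value() const { return scratch_; }

 private:
  bool FillNextRecord(std::string* out) {
    out->assign("record");  // stand-in for real decoding work
    return true;
  }
  std::string scratch_;
};
```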
-
- 24 Aug 2019, 2 commits
-
-
Committed by Zhongyi Xie
Summary: MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory. We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and the mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one. The semantics of the old option actually work both as an upper bound and a lower bound: history trimming will start if the number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming. In order to mimic that behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage to go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022 Differential Revision: D14394062 Pulled By: miasantreble fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
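A small sketch of switching from the count-based knob to the new size-based one; the 64 MB limit mirrors the worked example above.

```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  // Old way: keep up to N memtables (mutable + immutable) for conflict checking.
  options.max_write_buffer_number_to_maintain = 0;  // disable the old knob
  // New way: bound the memory of the memtable history instead. With a 64 MB
  // limit and immutable memtables of 20/30/30 MB, nothing is trimmed because
  // dropping the oldest one would bring usage below the limit (see above).
  options.max_write_buffer_size_to_maintain = 64 << 20;
  return options;
}
```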
-
Committed by DaiZhiwei
Summary: Crc32c parallel computation coding optimization: macro unfolding removes the "for" loop and helps reduce branch misses on the arm64 microarchitecture. 1024 bytes are divided into three parts: 8 (head) + 1008 (6 * 7 * 3 * 8) + 8 (tail). Macro unfolding turns the 42 loop iterations into 6 CRC32C7X24BYTES blocks, each CRC32C7X24BYTES containing 7 CRC32C24BYTESs. 1, crc32c_test: [==========] Running 4 tests from 1 test case. [----------] Global test environment set-up. [----------] 4 tests from CRC [ RUN ] CRC.StandardResults [ OK ] CRC.StandardResults (1 ms) [ RUN ] CRC.Values [ OK ] CRC.Values (0 ms) [ RUN ] CRC.Extend [ OK ] CRC.Extend (0 ms) [ RUN ] CRC.Mask [ OK ] CRC.Mask (0 ms) [----------] 4 tests from CRC (1 ms total) [----------] Global test environment tear-down [==========] 4 tests from 1 test case ran. (1 ms total) [ PASSED ] 4 tests. 2, db_bench --benchmarks="crc32c": crc32c : 0.218 micros/op 4595390 ops/sec; 17950.7 MB/s (4096 per op) 3, repeated crc32c_test case 60000 times: perf stat -e branch-miss -- ./crc32c_test; before optimization: 739,426,504 branch-miss; after optimization: 1,128,572 branch-miss Pull Request resolved: https://github.com/facebook/rocksdb/pull/5675 Differential Revision: D16989210 fbshipit-source-id: 7204e6069bb6ed066d49c2d1b3ac385065a98557
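As a rough illustration of the branch-reduction idea only (this is not the RocksDB arm64 kernel and uses a plain 64-byte unroll rather than the 8/1008/8 split described above), assuming an ARMv8 CPU with the CRC extension and a length that is a multiple of 64 bytes:

```cpp
#include <arm_acle.h>  // __crc32cd; compile with -march=armv8-a+crc
#include <cstddef>
#include <cstdint>
#include <cstring>

// Manually unrolled CRC32C over 64-byte chunks: eight crc32cd steps per loop
// iteration instead of one, so the hot path takes far fewer branches.
uint32_t Crc32cUnrolled64(uint32_t crc, const uint8_t* p, size_t len) {
  auto load64 = [](const uint8_t* q) {
    uint64_t w;
    std::memcpy(&w, q, sizeof(w));  // unaligned-safe load
    return w;
  };
  for (size_t off = 0; off + 64 <= len; off += 64) {
    const uint8_t* q = p + off;
    crc = __crc32cd(crc, load64(q));
    crc = __crc32cd(crc, load64(q + 8));
    crc = __crc32cd(crc, load64(q + 16));
    crc = __crc32cd(crc, load64(q + 24));
    crc = __crc32cd(crc, load64(q + 32));
    crc = __crc32cd(crc, load64(q + 40));
    crc = __crc32cd(crc, load64(q + 48));
    crc = __crc32cd(crc, load64(q + 56));
  }
  return crc;
}
```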
-
- 23 Aug 2019, 3 commits
-
-
Committed by Levi Tamasi
Summary: PR https://github.com/facebook/rocksdb/issues/5584 decoupled the uncompression dictionary object from the underlying block data; however, this defeats the purpose of the digested ZSTD dictionary, since the whole point of the digest is to create it once and reuse it over and over again. This patch goes back to storing the uncompression dictionary itself in the cache (which should be now safe to do, since it no longer includes a Statistics pointer), while preserving the rest of the refactoring. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5645 Test Plan: make asan_check Differential Revision: D16551864 Pulled By: ltamasi fbshipit-source-id: 2a7e2d34bb16e70e3c816506d5afe1d842057800
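The point about reuse maps directly onto the ZSTD API: digesting a dictionary with `ZSTD_createDDict` is expensive, so the digest should be built once and then shared across decompressions. A minimal sketch of that usage with the standalone ZSTD API (not the RocksDB wrapper):

```cpp
#include <string>
#include <zstd.h>

// Reuse a pre-digested dictionary for every block decompression.
std::string DecompressWithSharedDict(const ZSTD_DDict* ddict,
                                     const std::string& compressed,
                                     size_t uncompressed_size) {
  std::string out(uncompressed_size, '\0');
  ZSTD_DCtx* dctx = ZSTD_createDCtx();
  size_t n = ZSTD_decompress_usingDDict(dctx, &out[0], out.size(),
                                        compressed.data(), compressed.size(),
                                        ddict);
  ZSTD_freeDCtx(dctx);
  out.resize(ZSTD_isError(n) ? 0 : n);
  return out;
}

// Created once (e.g. when the table is opened) and cached, not per block:
//   ZSTD_DDict* ddict = ZSTD_createDDict(dict_data, dict_size);
//   ... many DecompressWithSharedDict() calls ...
//   ZSTD_freeDDict(ddict);
```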
-
Committed by sdong
Summary: AtomicFlushStressTest is a powerful test, but right now we only run it for atomic_flush=true + disable_wal=true. We further extend it to the case where atomic_flush=false + disable_wal=false. All the workload generation and validation can stay the same. The atomic flush crash test is also changed to switch between the two test scenarios. It makes the name "atomic flush crash test" out of sync with what it really does. We leave it as it is to avoid troubles with continuous test set-up. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5729 Test Plan: Run "CRASH_TEST_KILL_ODD=188 TEST_TMPDIR=/dev/shm/ USE_CLANG=1 make whitebox_crash_test_with_atomic_flush", observe the settings used and see it passed. Differential Revision: D16969791 fbshipit-source-id: 56e37487000ae631e31b0100acd7bdc441c04163
-
Committed by Patrick Pei
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5722 Differential Revision: D16908380 fbshipit-source-id: 6a0e3cb2730b08d6012d3d7f31c937f01c399846
-
- 22 Aug 2019, 2 commits
-
-
Committed by Maysam Yabandeh
Summary: To improve code readability, since RetrieveBlock already calls MaybeReadBlockAndLoadToCache, we avoid name similarity of the functions that call RetrieveBlock with MaybeReadBlockAndLoadToCache. The patch thus renames MaybeLoadBlocksToCache to RetrieveMultipleBlock and deletes GetDataBlockFromCache, which contains only two lines. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5726 Differential Revision: D16962535 Pulled By: maysamyabandeh fbshipit-source-id: 99e8946808ce4eb7857592b9003812e3004f92d6
-
Committed by anand76
Summary: The batched MultiGet() implementation was not correctly handling bloom filter lookups when whole_key_filtering is disabled. It was incorrectly skipping keys not in the prefix_extractor domain, and not calling transform for keys in domain. This PR fixes both problems by moving the domain check and transformation to the FilterBlockReader. Tests: Unit test (confirmed failed before the fix) make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/5665 Differential Revision: D16902380 Pulled By: anand1976 fbshipit-source-id: a6be81ad68a6e37134a65246aec7a2c590eccf00
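The intended per-key logic, sketched with RocksDB's `SliceTransform` interface; the `prefix_may_match` callback is a hypothetical stand-in for the actual filter probe, not the FilterBlockReader API.

```cpp
#include <functional>
#include "rocksdb/slice.h"
#include "rocksdb/slice_transform.h"

// When whole_key_filtering is off, only keys inside the prefix extractor's
// domain can be ruled out, and the filter must be probed with the transformed
// prefix rather than the raw user key.
bool KeyMayMatch(
    const rocksdb::SliceTransform* prefix_extractor,
    const rocksdb::Slice& user_key,
    const std::function<bool(const rocksdb::Slice&)>& prefix_may_match) {
  if (prefix_extractor == nullptr || !prefix_extractor->InDomain(user_key)) {
    return true;  // cannot rule the key out; fall through to the data block
  }
  return prefix_may_match(prefix_extractor->Transform(user_key));
}
```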
-
- 21 Aug 2019, 3 commits
-
-
Committed by Maysam Yabandeh
Summary: The comments of snap_refresh_nanos advertise that the snapshot refresh feature will be disabled when the option is set to 0. This contract is however not honored in the code: https://github.com/facebook/rocksdb/pull/5278 The patch fixes that and also adds an assert to ensure that the feature is not used when the option is zero. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5724 Differential Revision: D16918185 Pulled By: maysamyabandeh fbshipit-source-id: fec167287df7d85093e087fc39c0eb243e3bbd7e
-
Committed by sdong
Summary: Recently, readahead was introduced for checksum verification. However, users cannot override the setting for the checksum verification done before external SST file ingestion. Introduce a new option for the purpose. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5721 Test Plan: Add a new unit test for it. Differential Revision: D16906896 fbshipit-source-id: 218ec37001ddcc05411cefddbe233d15ab308476
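A sketch of how such a knob would be used, assuming the new field is `IngestExternalFileOptions::verify_checksums_readahead_size` (the field name is my reading of the PR, treat it as an assumption; the 2 MB value is illustrative):

```cpp
#include <string>
#include <vector>
#include "rocksdb/db.h"

// Ingest external SST files with checksum verification, setting an explicit
// readahead size for that verification pass.
rocksdb::Status IngestWithReadahead(rocksdb::DB* db,
                                    const std::vector<std::string>& files) {
  rocksdb::IngestExternalFileOptions ifo;
  ifo.verify_checksums_before_ingest = true;
  // Assumed field name from the PR description.
  ifo.verify_checksums_readahead_size = 2 * 1024 * 1024;  // 2 MB readahead
  return db->IngestExternalFile(files, ifo);
}
```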
-
Committed by sdong
Summary: We see this TSAN warning: WARNING: ThreadSanitizer: data race (pid=282806) Write of size 8 at 0x7b6c00000e38 by thread T16 (mutexes: write M1023578822185846136): #0 operator delete(void*) <null> (libtsan.so.0+0x0000000795f8) https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) db/db_impl/db_impl_compaction_flush.cc:2202 (db_flush_test+0x00000060b462) https://github.com/facebook/rocksdb/issues/2 rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) db/db_impl/db_impl_compaction_flush.cc:2226 (db_flush_test+0x00000060cbd8) https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::BGWorkFlush(void*) db/db_impl/db_impl_compaction_flush.cc:2073 (db_flush_test+0x00000060d5ac) ...... Previous atomic write of size 4 at 0x7b6c00000e38 by main thread: #0 __tsan_atomic32_fetch_sub <null> (libtsan.so.0+0x00000006d721) https://github.com/facebook/rocksdb/issues/1 std::__atomic_base<int>::fetch_sub(int, std::memory_order) /mnt/gvfs/third-party2/libgcc/c67031f0f739ac61575a061518d6ef5038f99f90/7.x/platform007/5620abc/include/c++/7.3.0/bits/atomic_base.h:524 (db_flush_test+0x0000005f9e38) https://github.com/facebook/rocksdb/issues/2 rocksdb::ColumnFamilyData::Unref() db/column_family.h:286 (db_flush_test+0x0000005f9e38) https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&, rocksdb::FlushReason, bool) db/db_impl/db_impl_compaction_flush.cc:1624 (db_flush_test+0x0000005f9e38) https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::TEST_FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&) db/db_impl/db_impl_debug.cc:127 (db_flush_test+0x00000061ace9) https://github.com/facebook/rocksdb/issues/5 rocksdb::DBFlushTest_CFDropRaceWithWaitForFlushMemTables_Test::TestBody() db/db_flush_test.cc:320 (db_flush_test+0x0000004b44e5) https://github.com/facebook/rocksdb/issues/6 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.7.0/fused-src/gtest/gtest-all.cc:3824 (db_flush_test+0x000000be2988) ...... It's still very clear the cause of the warning is because that TSAN treats results from relaxed atomic::fetch_sub() as non-atomic with the operation itself. We can make it more explicit by bumping up the order to CS. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5723 Test Plan: Run all existing test. Differential Revision: D16908250 fbshipit-source-id: bf17d39ed19058372bdf97f6440a743f88153021
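The change being described is simply strengthening the memory order on the reference-count decrement; a minimal before/after illustration outside of RocksDB (not the actual ColumnFamilyData::Unref code):

```cpp
#include <atomic>

std::atomic<int> refs{1};

bool UnrefRelaxed() {
  // What TSAN flags: the decrement itself is atomic, but with relaxed order
  // TSAN does not treat it as ordered against the later destruction.
  return refs.fetch_sub(1, std::memory_order_relaxed) == 1;
}

bool UnrefAcqRel() {
  // Bumping the order makes the "last reference" hand-off explicit, so the
  // subsequent delete is ordered after all prior uses of the object.
  return refs.fetch_sub(1, std::memory_order_acq_rel) == 1;
}
```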
-
- 20 Aug 2019, 1 commit
-
-
Committed by sdong
Summary: Atomic white box test's kill odds are the same as the normal test's. However, in the scenario where only WritableFileWriter::Append() is blacklisted, WritableFileWriter::Flush() dominates the killing odds. Normally, most WritableFileWriter::Flush() calls come from WAL writes, where every write triggers a WAL flush. In the atomic test, WAL is disabled, so the kill happens less frequently than we anticipated. In some rare cases, the kill didn't end up happening (for reasons I still don't fully understand) and caused the stress test to time out. If WAL is disabled, make the odds 5x as likely to trigger. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5717 Test Plan: Run whitebox_crash_test_with_atomic_flush and whitebox_crash_test and observe the kill odds printed out. Differential Revision: D16897237 fbshipit-source-id: cbf5d96f6fc0e980523d0f1f94bf4e72cdb82d1c
-
- 17 Aug 2019, 11 commits
-
-
Committed by sdong
Summary: Right now VerifyChecksum() doesn't do read-ahead. In some use cases, users won't be able to achieve good performance. With this change, by default, RocksDB will do a default readahead, and users will be able to override the readahead size by passing in a ReadOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5713 Test Plan: Add a new unit test. Differential Revision: D16860874 fbshipit-source-id: 0cff0fe79ac855d3d068e6ccd770770854a68413
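With this change the readahead can be tuned through ReadOptions; a short sketch (the 2 MB figure is illustrative):

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Verify table checksums with an explicit readahead size instead of relying
// on the default picked by RocksDB.
rocksdb::Status VerifyWithReadahead(rocksdb::DB* db) {
  rocksdb::ReadOptions ro;
  ro.readahead_size = 2 * 1024 * 1024;  // 2 MB sequential readahead
  return db->VerifyChecksum(ro);
}
```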
-
Committed by Zhongyi Xie
Summary: Previous PR https://github.com/facebook/rocksdb/pull/3601 added support for making prefix_extractor dynamically mutable. However, there was a missing check for the hash index when creating a new BlockBasedTableIterator. While the check may be redundant because no other type of IndexReader makes use of the flag, it is less error-prone to add the missing check so that future index reader implementations will not worry about violating the contract. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5712 Differential Revision: D16842052 Pulled By: miasantreble fbshipit-source-id: aef11c0ff7a690ed248f5b8fe23481cac486b381
-
Committed by Adam Retter
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5674 Differential Revision: D16870338 fbshipit-source-id: c8dac644b1479fa734b491f3a8d50151772290f7
-
Committed by Kefu Chai
Summary: Fix the regression introduced by cc9fa7fc. Signed-off-by: Kefu Chai <tchaikov@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/5687 Differential Revision: D16870212 fbshipit-source-id: 78b5519e1d2b03262d102ca530491254ddffdc38
-
Committed by sdong
Summary: In valgrind_test, TransactionTest.GetWithoutSnapshot ran for 2 hours and still didn't finish. Blacklist it from valgrind_test to prevent the timeout. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5715 Test Plan: run "make valgrind_test" and see whether the test is still generated. Differential Revision: D16866009 fbshipit-source-id: 92c78049b0bc1c2b9a0dfc1b7c8a9206b36f02f0
-
Committed by Yanqin Jin
Summary: Update HISTORY.md by removing a feature from "Unreleased" to 6.4.0 after cherry-picking related commits to 6.4.fb branch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5714 Differential Revision: D16865334 Pulled By: riversand963 fbshipit-source-id: f17ede905a1dfbbcdf98806ca398c618cf54748a
-
Committed by jsteemann
Summary: This was previously broken, as the performance-context-related macro signatures in monitoring/perf_context_imp.h deviated between the case when NPERF_CONTEXT was defined and when it was not. Update the macros for the `-DNPERF_CONTEXT` case so it compiles. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5704 Differential Revision: D16867746 fbshipit-source-id: 05539724cb1f7955ecc42828365836a677759ad9
-
Committed by Eli Pozniansky
Summary: VersionSet::ApproximateSize doesn't need to create two separate index iterators and do a binary search for each in BlockBasedTable. So BlockBasedTable::ApproximateSize was added, which creates the iterator once and uses it to calculate the data size between the start and end keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5693 Differential Revision: D16774056 Pulled By: elipoz fbshipit-source-id: 53ce262e1a057788243bf30cd9b8aa6581df1a18
-
Committed by sheng qiu
Summary: add "linux/falloc.h" in env/io_posix.cc to fix compile error: ‘FALLOC_FL_KEEP_SIZE’ undeclared Signed-off-by: sheng qiu <herbert1984106@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/5708 Differential Revision: D16832922 fbshipit-source-id: 30e787c4a1b5a9724a8acfd68962ff5ec5f27d3e
-
Committed by Kefu Chai
Summary: There is no need to return void*, as std::thread::thread(Func&& f, Args&&... args) only requires `Func` to be callable. Signed-off-by: Kefu Chai <tchaikov@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/5709 Differential Revision: D16832894 fbshipit-source-id: a1e1b876fa8d55589ef5feb5b27f3a435068b747
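In other words, the pthread-style `void* (*)(void*)` shape is unnecessary; a trivial illustration of what std::thread accepts:

```cpp
#include <thread>

// std::thread only needs a callable; the return value is discarded, so there
// is no need to shape the function as void* f(void*).
void BackgroundWork(int iterations) {
  for (int i = 0; i < iterations; ++i) {
    // ... do work ...
  }
}

int main() {
  std::thread t(BackgroundWork, 100);
  t.join();
  return 0;
}
```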
-
Committed by Levi Tamasi
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5710 Test Plan: HISTORY.md-only change, no testing required. Differential Revision: D16836869 Pulled By: ltamasi fbshipit-source-id: 978148f1d14b0c46839a94d7ada8a5e8ecf73965
-
- 16 Aug 2019, 2 commits
-
-
Committed by sdong
Summary: Add a command in ldb so that users can print out tombstones in SST files. In order to test the code, change the interface of LDBCommandRunner::RunCommand() so that it doesn't exit the program, but instead returns the status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5615 Test Plan: Add a new unit test Differential Revision: D16550326 fbshipit-source-id: 88ddfe6984bdcbb3a528abdd115089df09eba52e
-
Committed by Maysam Yabandeh
Summary: Introducing write_unprepared feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5711 Differential Revision: D16838307 Pulled By: maysamyabandeh fbshipit-source-id: d9a4daf63dd0f855bea49c14ce84e6299f1401c7
-
- 15 Aug 2019, 4 commits
-
-
Committed by Jeffrey Xiao
Summary: Previously, the end key of a range deletion tombstone was considered exclusive for the purposes of deletion, but considered inclusive when checking if two SSTables overlap. For example, an SSTable with a range deletion tombstone [a, b) would be considered overlapping with an SSTable with a range deletion tombstone [b, c). This commit fixes this check. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5649 Differential Revision: D16808765 Pulled By: anand1976 fbshipit-source-id: 5c7ad1c027e4f778d35070e5dae1b8e6037e0d68
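A worked example of the corrected check: with half-open ranges, [a, b) and [b, c) share no key, so they must not be treated as overlapping. A minimal sketch of the interval test, with plain strings standing in for user keys (not the actual SSTable metadata comparison):

```cpp
#include <cassert>
#include <string>

// Half-open ranges [start, end): they overlap only if each one starts before
// the other one ends. Touching at a boundary (end1 == start2) is not overlap.
bool RangesOverlap(const std::string& start1, const std::string& end1,
                   const std::string& start2, const std::string& end2) {
  return start1 < end2 && start2 < end1;
}

int main() {
  assert(!RangesOverlap("a", "b", "b", "c"));  // adjacent, not overlapping
  assert(RangesOverlap("a", "c", "b", "d"));   // genuine overlap
  return 0;
}
```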
-
Committed by Levi Tamasi
Fix regression affecting partitioned indexes/filters when cache_index_and_filter_blocks is false (#5705) Summary: PR https://github.com/facebook/rocksdb/issues/5298 (and subsequent related patches) unintentionally changed the semantics of cache_index_and_filter_blocks: historically, this option only affected the main index/filter block; with the changes, it affects index/filter partitions as well. This can cause performance issues when cache_index_and_filter_blocks is false since in this case, partitions are neither cached nor preloaded (i.e. they are loaded on demand upon each access). The patch reverts to the earlier behavior, that is, partitions are cached similarly to data blocks regardless of the value of the above option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5705 Test Plan: make check ./db_bench -benchmarks=fillrandom --statistics --stats_interval_seconds=1 --duration=30 --num=500000000 --bloom_bits=20 --partition_index_and_filters=true --cache_index_and_filter_blocks=false ./db_bench -benchmarks=readrandom --use_existing_db --statistics --stats_interval_seconds=1 --duration=10 --num=500000000 --bloom_bits=20 --partition_index_and_filters=true --cache_index_and_filter_blocks=false --cache_size=8000000000 Relevant statistics from the readrandom benchmark with the old code: rocksdb.block.cache.index.miss COUNT : 0 rocksdb.block.cache.index.hit COUNT : 0 rocksdb.block.cache.index.add COUNT : 0 rocksdb.block.cache.index.bytes.insert COUNT : 0 rocksdb.block.cache.index.bytes.evict COUNT : 0 rocksdb.block.cache.filter.miss COUNT : 0 rocksdb.block.cache.filter.hit COUNT : 0 rocksdb.block.cache.filter.add COUNT : 0 rocksdb.block.cache.filter.bytes.insert COUNT : 0 rocksdb.block.cache.filter.bytes.evict COUNT : 0 With the new code: rocksdb.block.cache.index.miss COUNT : 2500 rocksdb.block.cache.index.hit COUNT : 42696 rocksdb.block.cache.index.add COUNT : 2500 rocksdb.block.cache.index.bytes.insert COUNT : 4050048 rocksdb.block.cache.index.bytes.evict COUNT : 0 rocksdb.block.cache.filter.miss COUNT : 2500 rocksdb.block.cache.filter.hit COUNT : 4550493 rocksdb.block.cache.filter.add COUNT : 2500 rocksdb.block.cache.filter.bytes.insert COUNT : 10331040 rocksdb.block.cache.filter.bytes.evict COUNT : 0 Differential Revision: D16817382 Pulled By: ltamasi fbshipit-source-id: 28a516b0da1f041a03313e0b70b28cf5cf205d00
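For reference, the affected configuration (partitioned index/filters with cache_index_and_filter_blocks left off) looks roughly like the db_bench flags above translated into options; a sketch, not the benchmark's exact setup:

```cpp
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakePartitionedOptions() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.index_type =
      rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;  // partitioned index
  table_options.partition_filters = true;                     // partitioned filters
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(20, false));
  // The regression: with this set to false, partitions ended up neither
  // cached nor preloaded, i.e. they were re-read on every access.
  table_options.cache_index_and_filter_blocks = false;

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```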
-
Committed by Aaryaman Sagar
Summary: TSAN was not able to correctly instrument atomic bts and btr instructions, so when TSAN is enabled implement those with std::atomic::fetch_or and std::atomic::fetch_and. Also disable tests that fail on TSAN with false negatives (we know these are false negatives because this other verifiably correct program fails with the same TSAN error <link>) ``` make clean TEST_TMPDIR=/dev/shm/rocksdb OPT=-g COMPILE_WITH_TSAN=1 make J=1 -j56 folly_synchronization_distributed_mutex_test ``` This is the code that fails with the same false-negative with TSAN ``` namespace { class ExceptionWithConstructionTrack : public std::exception { public: explicit ExceptionWithConstructionTrack(int id) : id_{folly::to<std::string>(id)}, constructionTrack_{id} {} const char* what() const noexcept override { return id_.c_str(); } private: std::string id_; TestConstruction constructionTrack_; }; template <typename Storage, typename Atomic> void transferCurrentException(Storage& storage, Atomic& produced) { assert(std::current_exception()); new (&storage) std::exception_ptr(std::current_exception()); produced->store(true, std::memory_order_release); } void concurrentExceptionPropagationStress( int numThreads, std::chrono::milliseconds milliseconds) { auto&& stop = std::atomic<bool>{false}; auto&& exceptions = std::vector<std::aligned_storage<48, 8>::type>{}; auto&& produced = std::vector<std::unique_ptr<std::atomic<bool>>>{}; auto&& consumed = std::vector<std::unique_ptr<std::atomic<bool>>>{}; auto&& consumers = std::vector<std::thread>{}; for (auto i = 0; i < numThreads; ++i) { produced.emplace_back(new std::atomic<bool>{false}); consumed.emplace_back(new std::atomic<bool>{false}); exceptions.push_back({}); } auto producer = std::thread{[&]() { auto counter = std::vector<int>(numThreads, 0); for (auto i = 0; true; i = ((i + 1) % numThreads)) { try { throw ExceptionWithConstructionTrack{counter.at(i)++}; } catch (...) { transferCurrentException(exceptions.at(i), produced.at(i)); } while (!consumed.at(i)->load(std::memory_order_acquire)) { if (stop.load(std::memory_order_acquire)) { return; } } consumed.at(i)->store(false, std::memory_order_release); } }}; for (auto i = 0; i < numThreads; ++i) { consumers.emplace_back([&, i]() { auto counter = 0; while (true) { while (!produced.at(i)->load(std::memory_order_acquire)) { if (stop.load(std::memory_order_acquire)) { return; } } produced.at(i)->store(false, std::memory_order_release); try { auto storage = &exceptions.at(i); auto exc = folly::launder( reinterpret_cast<std::exception_ptr*>(storage)); auto copy = std::move(*exc); exc->std::exception_ptr::~exception_ptr(); std::rethrow_exception(std::move(copy)); } catch (std::exception& exc) { auto value = std::stoi(exc.what()); EXPECT_EQ(value, counter++); } consumed.at(i)->store(true, std::memory_order_release); } }); } std::this_thread::sleep_for(milliseconds); stop.store(true); producer.join(); for (auto& thread : consumers) { thread.join(); } } } // namespace ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5684 Differential Revision: D16746077 Pulled By: miasantreble fbshipit-source-id: 8af88dcf9161c05daec1a76290f577918638f79d
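The replacement described is straightforward; a minimal sketch of bit test-and-set / test-and-reset built from fetch_or / fetch_and, which TSAN can instrument correctly (not the folly DistributedMutex code itself):

```cpp
#include <atomic>
#include <cstdint>

// Returns the previous value of the bit, like x86 BTS, but expressed through
// std::atomic so TSAN can see it.
bool AtomicBitTestAndSet(std::atomic<std::uint64_t>& word, unsigned bit) {
  const std::uint64_t mask = std::uint64_t{1} << bit;
  return (word.fetch_or(mask, std::memory_order_acq_rel) & mask) != 0;
}

// Returns the previous value of the bit, like x86 BTR.
bool AtomicBitTestAndReset(std::atomic<std::uint64_t>& word, unsigned bit) {
  const std::uint64_t mask = std::uint64_t{1} << bit;
  return (word.fetch_and(~mask, std::memory_order_acq_rel) & mask) != 0;
}
```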
-
Committed by Manuel Ung
Summary: Fix a bug in write unprepared savepoints. When flushing the write batch according to savepoint boundaries, we were forgetting to flush the last write batch after the last savepoint, meaning that some data was not written to DB. Also, add a small optimization where we avoid flushing empty batches. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5703 Differential Revision: D16811996 Pulled By: lth fbshipit-source-id: 600c7e0e520ad7a8fad32d77e11d932453e68e3f
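The shape of the bug is easy to see in a sketch: when writing a batch out in savepoint-sized segments, the tail segment after the last savepoint still has to be flushed. The names below are hypothetical, not the WriteUnprepared internals:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical: write batch[begin, end) to the DB.
void FlushSegment(const std::vector<int>& batch, size_t begin, size_t end);

// Flush a write batch in segments bounded by savepoints. The loop handles
// every segment up to the last savepoint; the final call handles the tail,
// which is the part the bug forgot. Empty segments are skipped.
void FlushBySavepoints(const std::vector<int>& batch,
                       const std::vector<size_t>& savepoints) {
  size_t begin = 0;
  for (size_t sp : savepoints) {
    if (sp > begin) {  // small optimization: skip empty segments
      FlushSegment(batch, begin, sp);
    }
    begin = sp;
  }
  if (batch.size() > begin) {  // the segment after the last savepoint
    FlushSegment(batch, begin, batch.size());
  }
}
```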
-