Commits · 758fa8c35920071c0dd59624d9c9740deff83083 · kvdb / rocksdb

19 Mar, 2014 2 commits

Don't Finalize in CompactionPicker · 758fa8c3

Committed by Igor Canadi on Mar 18, 2014

Summary:
Finalize re-sorts (read: mutates) the files_ in Version* and it is called by CompactionPicker during normal runtime. At the same time, this same Version* lives in the SuperVersion* and is accessed without the mutex in GetImpl() code path.

Mutating the files_ in one thread and reading the same files_ in another thread is a bad idea. It caused this issue: http://ci-builds.fb.com/job/rocksdb_crashtest/285/console

Long-term, we need to be more careful with method contracts and clearly document what state can be mutated when. Now that we are much faster because we don't lock in GetImpl(), we keep running into data races that were not a problem before when we were slower. db_stress has been very helpful in detecting those.

Short-term, I removed Finalize() from CompactionPicker.

Note: I believe this is an issue in current 2.7 version running in production.

Test Plan:
make check
Will also run db_stress to see if issue is gone

Reviewers: sdong, ljin, dhruba, haobo

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16983

758fa8c3

disable the log_number check in Recover() · 63cef900

Committed by Lei Jin on Mar 18, 2014

Summary:
There is a chance that an old MANIFEST is corrupted in 2.7 but just not noticed.
This check would fail them. Change it to log instead of returning a
Corruption status.

Test Plan: make

Reviewers: haobo, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16923

63cef900

18 Mar, 2014 4 commits

Optimize fallocation · f26cb0f0

Committed by Igor Canadi on Mar 17, 2014

Summary:
Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads.

This diff provides an option for user to not use KEEP_SIZE flag, thus optimizing his sync performance by up to 2x-3x.

At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported.

Test Plan: make check

Reviewers: dhruba, sdong, haobo, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16761

f26cb0f0

Fix race condition in manifest roll · ae25742a

Committed by Igor Canadi on Mar 17, 2014

Summary:
When the manifest is getting rolled the following happens:
1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current)
2) mutex is unlocked
3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp
4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT
5) mutex is locked

If FindObsoleteFiles happens between (3) and (4) it will:
1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_)
2) Delete old manifest (because the manifest_file_number_ already points to a new one)

I introduce the concept of prev_manifest_file_number_ that will avoid the race condition.

However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more are there. Nobody is using it in production because we don't trust the implementation. Should we even support it?

Test Plan: make check

Reviewers: ljin, dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16929

ae25742a

Correct the logic of MemTable::ShouldFlushNow(). · a5fafd4f

Committed by Yueh-Hsuan Chiang on Mar 14, 2014

Summary:
Memtable will now be forced to flush if one of the following
conditions is met:
1. Already allocated more than write_buffer_size + 60% arena block size.
   (the overflowing condition)
2. Unable to safely allocate one more arena block without hitting the
   overflowing condition AND the unused allocated memory < 25% arena
   block size.

Test Plan: make all check

Reviewers: sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16893

a5fafd4f
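
The two flush conditions in the ShouldFlushNow() commit above can be read as a small decision function. The following is a minimal sketch written from the commit summary only. `ArenaStats` and the parameter names are hypothetical, and the real method reads these values from the memtable's arena.

```cpp
#include <cstddef>

// Hypothetical, simplified inputs for illustration.
struct ArenaStats {
  std::size_t allocated_bytes;         // total memory already allocated
  std::size_t allocated_unused_bytes;  // allocated but not yet handed out
};

bool ShouldFlushNow(const ArenaStats& s, std::size_t write_buffer_size,
                    std::size_t arena_block_size) {
  // Overflow limit: write_buffer_size plus 60% of one arena block.
  const double limit = static_cast<double>(write_buffer_size) +
                       0.6 * static_cast<double>(arena_block_size);
  const double allocated = static_cast<double>(s.allocated_bytes);

  // Condition 1: already past the overflow limit.
  if (allocated > limit) {
    return true;
  }
  // One more arena block still fits under the limit: keep writing.
  if (allocated + static_cast<double>(arena_block_size) <= limit) {
    return false;
  }
  // Condition 2: another block would overflow, so flush once less than
  // 25% of a block remains unused.
  return s.allocated_unused_bytes < arena_block_size / 4;
}
```

With `write_buffer_size = 1000` and `arena_block_size = 100` the limit is 1060, so a memtable with 1000 bytes allocated flushes only once fewer than 25 unused bytes remain.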

Fix a bug that Prev() can hang. · c61c9830

Committed by sdong on Mar 14, 2014

Summary: Prev() can currently hang when a key has more than max_skipped internal entries and all of them are newer than the sequence ID being sought. Add unit tests to confirm the bug and fix it.

Test Plan: make all check

Reviewers: igor, haobo

Reviewed By: igor

CC: ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16899

c61c9830

16 Mar, 2014 1 commit

fix: use the correct edit when comparing log_number · 0cf6c8f7

Committed by Lei Jin on Mar 15, 2014

Summary:
In the last fix, I forgot to point to the writer when comparing edit,
which is apparently not correct.

Test Plan: still running make whitebox_crash_test

Reviewers: igor, haobo, igor2

Reviewed By: igor2

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16911

0cf6c8f7

15 Mar, 2014 5 commits

journal log_number correctly in MANIFEST · 453ec52c

Committed by Lei Jin on Mar 14, 2014

Summary:
Here is how the problem can arise:
There is one memtable flush and one compaction, and both call LogAndApply(). If both edits are applied in the same batch, with the flush edit first and the compaction edit second, LogAndApplyHelper() assigns the compaction edit the current VersionSet's log number (which should be smaller than the log number from the flush edit). This causes the log_numbers in MANIFEST to not be monotonically increasing, which violates the assumption Recover() makes. What is more, after committing to the MANIFEST file, log_number_ in VersionSet is updated to the log_number from the last edit, which is the compaction one. It ends up not updating the log_number.

Test Plan:
make whitebox_crash_test
got another assertion about iter->valid(), not sure if that is related
to this.

Reviewers: igor, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16875

453ec52c


Make it compile on Debian/GCC 4.7 · b9c78d2d
Committed by Caio SBA on Mar 14, 2014

b9c78d2d

Missing includes · 2bad3cb0
Committed by Igor Canadi on Mar 14, 2014

2bad3cb0

Fix HashSkipList and HashLinkedList SIGSEGV · 3c75cc15

Committed by Igor Canadi on Mar 14, 2014

Summary:
Original Summary:
Yesterday, @ljin and I were debugging various db_stress issues. We suspected one of them happens when we concurrently call NewIterator without prefix_seek on HashSkipList. This test demonstrates it.

Update:
Arena is not thread-safe!! When creating a new full iterator, we *have* to create a new arena, otherwise we're doomed.

Test Plan: SIGSEGV and assertion-throwing test now works!

Reviewers: ljin, haobo, sdong

Reviewed By: sdong

CC: leveldb, ljin

Differential Revision: https://reviews.facebook.net/D16857

3c75cc15


Fix warning on Mac OS · 6c72079d
Committed by Igor Canadi on Mar 14, 2014

6c72079d

14 Mar, 2014 1 commit

Fix extra compaction tasks scheduled after D16767 in some cases · 5aa81f04

Committed by sdong on Mar 13, 2014

Summary:
With D16767, there is a case where compaction tasks are scheduled infinitely:
(1) no flush thread is configured and there is more than one compaction thread
(2) a flush is being run by one compaction thread
(3) the SST files are in a state where versions_->current()->NeedsCompaction() generates a false positive (returns true when there is actually no work to be done)
In that case, an infinite loop is formed.

This patch would fix it.

Test Plan: make all check

Reviewers: haobo, igor, ljin

Reviewed By: igor

CC: dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16863

5aa81f04

13 Mar, 2014 3 commits

A heuristic way to check if a memtable is full · 11da8bc5

Committed by Kai Liu on Mar 12, 2014

Summary:
This is based on https://reviews.facebook.net/D15027. It's not finished, but I would like to give a prototype that avoids arena over-allocation while making better use of the already allocated memory blocks.

Instead of checking the approximate memtable size, we take a deeper look at the arena, which incorporates the essential idea that @sdong suggested: flush when the arena has allocated its last block and that block is "almost full".

Test Plan: N/A

Reviewers: haobo, sdong

Reviewed By: sdong

CC: leveldb, sdong

Differential Revision: https://reviews.facebook.net/D15051

11da8bc5


Correct version of D16821 · 45ad75db
Committed by Igor Canadi on Mar 12, 2014

45ad75db

Revert "Fix bad merge of D16791 and D16767" · 2b95dc15
Committed by Igor Canadi on Mar 12, 2014
```
This reverts commit 839c8ecf.
```
2b95dc15

12 Mar, 2014 4 commits

Fix bad merge of D16791 and D16767 · 839c8ecf

Committed by sdong on Mar 11, 2014

Summary: A bad auto-merge caused the log buffer to be flushed twice. Remove the unintended flush.

Test Plan: Should already be tested (the code looks the same as when I ran unit tests).

Reviewers: haobo, igor

Reviewed By: haobo

CC: ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16821

839c8ecf

Fix data race against logging data structure because of LogBuffer · bd45633b

Committed by sdong on Mar 11, 2014

Summary:
@igor pointed out that there is a potential data race because of the way we use the newly introduced LogBuffer. After "bg_compaction_scheduled_--" or "bg_flush_scheduled_--", both counters can become 0. As soon as the lock is released after that, DBImpl's destructor can go ahead and destroy all the state inside DB, including the info_log object held in a shared pointer of the options object it keeps. At that point it is no longer safe to keep using the info logger to write the delayed logs.

With the patch, lock is released temporarily for log buffer to be flushed before "bg_compaction_scheduled_--" or "bg_flush_scheduled_--". In order to make sure we don't miss any pending flush or compaction, a new flag bg_schedule_needed_ is added, which is set to be true if there is a pending flush or compaction but not scheduled because of the max thread limit. If the flag is set to be true, the scheduling function will be called before compaction or flush thread finishes.

Thanks @igor for this finding!

Test Plan: make all check

Reviewers: haobo, igor

Reviewed By: haobo

CC: dhruba, ljin, yhchiang, igor, leveldb

Differential Revision: https://reviews.facebook.net/D16767

bd45633b

Temp Fix of LogBuffer flushing · 6c66bc08

Committed by sdong on Mar 11, 2014

Summary: Temporary fix for log buffer flushing: flush the buffer inside the lock. This cleans up the trunk until we find an eventual fix.

Test Plan: make all check

Reviewers: haobo, igor

Reviewed By: igor

CC: ljin, leveldb, yhchiang

Differential Revision: https://reviews.facebook.net/D16791

6c66bc08

Add a comment after SignalAll() · cb980216

Committed by Igor Canadi on Mar 11, 2014

Summary: Having code after SignalAll has already caused 2 bugs. Let's make sure this doesn't happen again.

Test Plan: no test

Reviewers: sdong, dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16785

cb980216

11 Mar, 2014 7 commits

Call PurgeObsoleteFiles() only when HaveSomethingToDelete() · d5de22dc

Committed by Igor Canadi on Mar 10, 2014

Summary: as title

Test Plan: fixed the build failure http://ci-builds.fb.com/job/rocksdb_build/987/console

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16743

d5de22dc

DBTest: remove perf_context's time > 0 check · fac58c05

Committed by sdong on Mar 10, 2014

Summary: DBTest checks perf_context.seek_internal_seek_time > 0 and perf_context.find_next_user_entry_time > 0, which is not reliable. Remove these checks.

Test Plan: ./db_test

Reviewers: igor, haobo, ljin

Reviewed By: igor

CC: dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16737

fac58c05

[RocksDB] Minor cleanup of PurgeObsoleteFiles · a91aed61

Committed by Haobo Xu on Mar 09, 2014

Summary: as title. also made info log output of file deletion a bit more descriptive.

Test Plan: make check; db_bench and look at LOG output

Reviewers: igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16731

a91aed61

Consolidate SliceTransform object ownership · 8d007b4a

Committed by Lei Jin on Mar 10, 2014

Summary:
(1) Fix SanitizeOptions() to also check HashLinkList. The current
dynamic cast just happens to work because the 2 classes have the same
layout.
(2) Do not delete SliceTransform object in HashSkipListFactory and
HashLinkListFactory destructor. Reason: SanitizeOptions() enforces
prefix_extractor and SliceTransform to be the same object when
Hash**Factory is used. This makes the behavior strange: when
Hash**Factory is used, prefix_extractor will be released by RocksDB. If
other memtable factory is used, prefix_extractor should be released by
user.

Test Plan: db_bench && make asan_check

Reviewers: haobo, igor, sdong

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16587

8d007b4a

[RocksDB] make sure KSVObsolete does not get accessed as a valid pointer. · 9e0e6aa7

Committed by Haobo Xu on Mar 08, 2014

Summary: KSVObsolete is no longer nullptr and needs to be checked explicitly. Also did some minor code cleanup and added a stat counter to track superversion cleanups incurred in the foreground.

Test Plan: make check

Reviewers: ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16701

9e0e6aa7

[RocksDB] LogBuffer Cleanup · 66da4679

Committed by Haobo Xu on Mar 09, 2014

Summary: Moved LogBuffer class to an internal header. Removed some unnecessary indirection. Enabled log buffer for BackgroundCallFlush. Forced log buffer flush right after Unlock to improve time ordering of info log.

Test Plan: make check; db_bench compare LOG output

Reviewers: sdong

Reviewed By: sdong

CC: leveldb, igor

Differential Revision: https://reviews.facebook.net/D16707

66da4679

Add option verify_checksums_in_compaction · 04d2c26e

Committed by Igor Canadi on Mar 10, 2014

Summary:
If verify_checksums_in_compaction is true, compaction will verify checksums. This is default.
If it's false, compaction doesn't verify checksums. This is useful for in-memory workloads.

Test Plan: corruption_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16695

04d2c26e

08 Mar, 2014 1 commit

use CAS when returning SuperVersion to ThreadLocal · e5fa4944

Committed by Lei Jin on Mar 07, 2014

Summary:
Add a check at the end of GetImpl to release SuperVersion if it becomes
obsolete. Also do Scrape() inside InstallSuperVersion so it happens more
frequently.

Test Plan:
make all check
running asan_check now

Reviewers: igor, haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16641

e5fa4944
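
The compare-and-swap hand-back named in the title of the commit above can be sketched with `std::atomic`. This is only an illustration of the idea, not the RocksDB thread-local code. `kSVInUse`, the marker objects, and `TryReturnSuperVersion` are hypothetical names, and `SuperVersion` is reduced to a placeholder.

```cpp
#include <atomic>

struct SuperVersion {
  int refs = 1;  // placeholder field; the real struct holds much more
};

// Sentinel addresses used as slot states (hypothetical for this sketch).
static SuperVersion in_use_marker;
static SuperVersion obsolete_marker;
static SuperVersion* const kSVInUse = &in_use_marker;
static SuperVersion* const kSVObsolete = &obsolete_marker;

// Called at the end of a read. Puts sv back into the thread's slot only if
// the slot still says "in use". Returns false if the slot was marked
// obsolete in the meantime; the caller must then release sv itself.
bool TryReturnSuperVersion(std::atomic<SuperVersion*>& slot,
                           SuperVersion* sv) {
  SuperVersion* expected = kSVInUse;
  return slot.compare_exchange_strong(expected, sv);
}
```

The point of the CAS is that the check and the store happen as one atomic step, so a reader cannot overwrite an "obsolete" mark that arrived while it was still using the SuperVersion.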

07 Mar, 2014 1 commit

Delete local sv when destroying DB from stress test · eec86952

Committed by Igor Canadi on Mar 06, 2014

Summary: Not deleting the local SV caused a crash test issue: http://ci-builds.fb.com/job/rocksdb_asan_crash_test/83/console

Test Plan: ran unit tests

Reviewers: ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16635

eec86952

06 Mar, 2014 5 commits

Buffer info logs when picking compactions and write them out after releasing the mutex · ecb1ffa2

Committed by sdong on Mar 04, 2014

Summary: Currently, while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction. This introduces a chance of waiting on log writes while holding the mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.

Test Plan:
make all check
check the log lines while running some tests that trigger compactions.

Reviewers: haobo, igor, dhruba

Reviewed By: dhruba

CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D16515

ecb1ffa2
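
The pattern in the commit above (collect log lines while holding the mutex, write them out only after releasing it) can be sketched as follows. This is a minimal illustration: the `LogBuffer` here is a hypothetical stand-in, the info log is modeled as a plain vector of strings, and the log messages are made up.

```cpp
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal log buffer; the real LogBuffer class differs.
class LogBuffer {
 public:
  // Cheap in-memory append, safe to call while holding the DB mutex.
  void Add(std::string line) { lines_.push_back(std::move(line)); }

  // Moves all buffered lines to the sink and empties the buffer.
  void FlushTo(std::vector<std::string>* sink) {
    for (std::string& line : lines_) {
      sink->push_back(std::move(line));
    }
    lines_.clear();
  }

 private:
  std::vector<std::string> lines_;
};

void PickCompaction(std::mutex& db_mutex, std::vector<std::string>* info_log) {
  LogBuffer log_buffer;  // local to the calling thread
  {
    std::lock_guard<std::mutex> lock(db_mutex);
    // Only buffer messages here; no log I/O happens under the mutex.
    log_buffer.Add("picking compaction");
    log_buffer.Add("compaction picked");
  }  // mutex released here
  log_buffer.FlushTo(info_log);  // the potentially slow write happens unlocked
}
```

Because the buffer is a local object, no other thread touches it, and the only shared resource written outside the lock is the log sink itself.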

output perf_context in db_bench readrandom · 04298f8c

Committed by Lei Jin on Mar 05, 2014

Summary:
Add helper function to print perf context data in db_bench if enabled.
I didn't find any code that actually exports perf context data. Not sure
if I missed anything

Test Plan: ran db_bench

Reviewers: haobo, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16575

04298f8c

fix db_bench to use HashSkipList for real · 64138b5d

Committed by Lei Jin on Mar 05, 2014

Summary:
In the HashSkipList case, DBImpl has a sanity check to see whether
prefix_extractor in options is the same as the one in the memtable factory.
If not, it falls back to SkipList. As a result, I was experimenting with
SkipList performance. No wonder it was much worse than LinkedList.

Test Plan: ran benchmark

Reviewers: haobo, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16569

64138b5d

config max_background_flushes in db_bench · 51560ba7

Committed by Lei Jin on Mar 05, 2014

Summary: as title

Test Plan: make release

Reviewers: haobo, sdong, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16437

51560ba7

MergingIterator assertion · c0ccf436

Committed by Igor Canadi on Mar 05, 2014

Summary: I wrote a test that triggers assertion in MergingIterator. I have not touched that code ever, so I'm looking for somebody with good understanding of the MergingIterator code to fix this. The solution is probably a one-liner. Let me know if you're willing to take a look.

Test Plan: This test fails with an assertion `use_heap_ == false`

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16521

c0ccf436

05 Mar, 2014 2 commits

CleanupIteratorState() only to initialize DeletionState when super version cleanup needed · e8ecca9e

Committed by sdong on Mar 04, 2014

Summary:
Two changes:
1. DeletionState is only constructed when cleanup is needed
2. Fix the deletion state construction bug. A change was made in a previous patch: https://reviews.facebook.net/rROCKSDB774ed89c2405ee058086b099cbc8b29e243739cc#71a34e2e However, it somehow got lost when merging

Test Plan: make all check

Reviewers: kailiu, haobo, igor

Reviewed By: igor

CC: igor, dhruba, i.am.jin.lei, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16233

e8ecca9e

make key evenly distributed between 0 and FLAGS_num · a5b1d2f1

Committed by Lei Jin on Mar 04, 2014

Summary:
The issue is that when FLAGS_num is small, the leading bytes of the key
are padded with 0s. This makes all keys have the same prefix 00000000

Most of the changes are just to make lint happy

Test Plan: ran db_bench

Reviewers: sdong, haobo, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16317

a5b1d2f1

04 Mar, 2014 2 commits

Add a hash-index component for block · 906f3dca

Committed by kailiu on Feb 18, 2014

Summary:
This is the key component extracted from diff: https://reviews.facebook.net/D14271
I separated it into a dedicated patch to make the review easier.

Test Plan: added a unit test and passed it.

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16245

906f3dca

Fix a group commit bug in LogAndApply · 5142b370

Committed by Igor Canadi on Mar 03, 2014

Summary:
EncodeTo(&record) does not overwrite, it appends to it.

This means that group commit log and apply will look something like:
record1
record1record2
record1record2record3

I'm surprised this didn't show up in production, but I think the reason is that MANIFEST group commit almost never happens.

This bug turned up in column family work, where opening a database failed with "adding a same column family twice".

Test Plan: Tested the change in column family branch and observed that the problem is gone (with db_stress)

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16461

5142b370
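
The append-versus-overwrite behavior behind the group commit bug above is easy to reproduce in isolation. In the sketch below, `Edit` is a hypothetical stand-in whose `EncodeTo` appends to its destination, as the commit says the real one does. Using a fresh buffer per edit is one way to avoid the accumulation, and not necessarily how the actual fix was written.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an edit whose EncodeTo appends to dst.
struct Edit {
  std::string payload;
  void EncodeTo(std::string* dst) const { dst->append(payload); }
};

// Buggy pattern: one record string reused across the batch and never cleared,
// so each record also contains every earlier one.
std::vector<std::string> EncodeBatchBuggy(const std::vector<Edit>& edits) {
  std::vector<std::string> records;
  std::string record;
  for (const Edit& e : edits) {
    e.EncodeTo(&record);
    records.push_back(record);
  }
  return records;
}

// One possible fix: start from an empty buffer for every edit.
std::vector<std::string> EncodeBatchFixed(const std::vector<Edit>& edits) {
  std::vector<std::string> records;
  for (const Edit& e : edits) {
    std::string record;
    e.EncodeTo(&record);
    records.push_back(record);
  }
  return records;
}
```

For three edits with payloads `record1`, `record2`, `record3`, the buggy version emits `record1`, `record1record2`, `record1record2record3`, matching the sequence shown in the commit summary, while the fixed version emits each payload once.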

01 Mar, 2014 2 commits

Remove the terrible hack in for flush_block_policy_factory · bf86af51

Committed by kailiu on Feb 28, 2014

Summary:
The previous code was too convoluted, and I must have been drunk to let
such code be written without a second thought.

Thanks to the discussion with @sdong, I added the `Options` when
generating the flusher, thus avoiding the tricks.

Just FYI: I resisted adding Options in flush_block_policy.h since I
wanted to avoid cyclic dependencies: FlushBlockPolicy depends on Options
and Options also depends on FlushBlockPolicy... While I appreciate my
effort to prevent it, the old design turned out to create more trouble than
it tried to avoid.

Test Plan: ran ./table_test

Reviewers: sdong

Reviewed By: sdong

CC: sdong, leveldb

Differential Revision: https://reviews.facebook.net/D16503

bf86af51

Make Log::Reader more robust · 58ca641d

Committed by Igor Canadi on Feb 28, 2014

Summary:
This diff does two things:
(1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#
(2) Turn off mmap writes for all writes to log and manifest files

(2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing.

Test Plan:
Added unit tests from LevelDB
Actually recovered a "corrupted" MANIFEST file.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16119

58ca641d
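
Point (1) of the Log::Reader commit above, treating a truncated final record as end of file instead of corruption, can be illustrated with a deliberately simplified framing. The sketch assumes each record is a 1-byte length followed by that many payload bytes. The real log format also carries a checksum and a record type, and `ReadRecords` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified, hypothetical framing: [1-byte length][payload bytes].
// Returns every complete record; an incomplete tail is silently dropped.
std::vector<std::string> ReadRecords(const std::string& file) {
  std::vector<std::string> records;
  std::size_t pos = 0;
  while (pos < file.size()) {
    const std::size_t len = static_cast<unsigned char>(file[pos]);
    const std::size_t remaining = file.size() - pos - 1;
    if (remaining < len) {
      // The writer died in the middle of this record: treat it as EOF,
      // not as corruption.
      break;
    }
    records.emplace_back(file, pos + 1, len);
    pos += 1 + len;
  }
  return records;
}
```

A file holding one complete 3-byte record followed by a record that claims 5 bytes but has only 2 yields just the first record and no error.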

kvdb / rocksdb synced successfully 12 months ago
