提交 · 5601bc4619402676900539c4ef828cedfa998a32 · kvdb / rocksdb

16 3月, 2014 1 次提交

fix: use the correct edit when comparing log_number · 0cf6c8f7

由 Lei Jin 提交于 3月 15, 2014

Summary:
In the last fix, I forgot to point to the writer when comparing edit,
which is apparently not correct.

Test Plan: still running make whitebox_crash_test

Reviewers: igor, haobo, igor2

Reviewed By: igor2

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16911

0cf6c8f7

15 3月, 2014 1 次提交

journal log_number correctly in MANIFEST · 453ec52c

由 Lei Jin 提交于 3月 14, 2014

Summary:
Here is what it can cause probelm:
There is one memtable flush and one compaction. Both call LogAndApply(). If both edits are applied in the same batch with flush edit first and the compaction edit followed. LogAndApplyHelper() will assign compaction edit current VersionSet's log number(which should be smaller than the log number from flush edit). It cause log_numbers in MANIFEST to be not monotonic increasing, which violates the assume Recover() makes. What is more is after comitting to MANIFEST file, log_number_ in VersionSet is updated to the log_number from the last edit, which is the compaction one. It ends up not updating the log_number.

Test Plan:
make whitebox_crash_test
got another assertion about iter->valid(), not sure if that is related
to this.

Reviewers: igor, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16875

453ec52c

11 3月, 2014 1 次提交

Add option verify_checksums_in_compaction · 04d2c26e

由 Igor Canadi 提交于 3月 10, 2014

Summary:
If verify_checksums_in_compaction is true, compaction will verify checksums. This is default.
If it's false, compaction doesn't verify checksums. This is useful for in-memory workloads.

Test Plan: corruption_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16695

04d2c26e

06 3月, 2014 1 次提交

Buffer info logs when picking compactions and write them out after releasing the mutex · ecb1ffa2

由 sdong 提交于 3月 04, 2014

Summary: Now while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction, which introduces a chance of waiting log writing in mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.

Test Plan:
make all check
check the log lines while running some tests that trigger compactions.

Reviewers: haobo, igor, dhruba

Reviewed By: dhruba

CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D16515

ecb1ffa2

04 3月, 2014 1 次提交

Fix a group commit bug in LogAndApply · 5142b370

由 Igor Canadi 提交于 3月 03, 2014

Summary:
EncodeTo(&record) does not overwrite, it appends to it.

This means that group commit log and apply will look something like:
record1
record1record2
record1record2record3

I'm surprised this didn't show up in production, but I think the reason is that MANIFEST group commit almost never happens.

This bug turned up in column family work, where opening a database failed with "adding a same column family twice".

Test Plan: Tested the change in column family branch and observed that the problem is gone (with db_stress)

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16461

5142b370

01 3月, 2014 2 次提交

Remove the terrible hack in for flush_block_policy_factory · bf86af51

由 kailiu 提交于 2月 28, 2014

Summary:
Previous code is too convoluted and I must be drunk for letting
such code to be written without a second thought.

Thanks to the discussion with @sdong, I added the `Options` when
generating the flusher, thus avoiding the tricks.

Just FYI: I resisted to add Options in flush_block_policy.h since I
wanted to avoid cyclic dependencies: FlushBlockPolicy dpends on Options
and Options also depends FlushBlockPolicy... While I appreciate my
effort to prevent it, the old design turns out creating more troubles than
it tried to avoid.

Test Plan: ran ./table_test

Reviewers: sdong

Reviewed By: sdong

CC: sdong, leveldb

Differential Revision: https://reviews.facebook.net/D16503

bf86af51

Make Log::Reader more robust · 58ca641d

由 Igor Canadi 提交于 2月 28, 2014

Summary:
This diff does two things:
(1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#
(2) Turn off mmap writes for all writes to log and manifest files

(2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing.

Test Plan:
Added unit tests from LevelDB
Actually recovered a "corrupted" MANIFEST file.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16119

58ca641d

14 2月, 2014 1 次提交

Expose the table properties to application · 63690625

由 kailiu 提交于 2月 13, 2014

Summary: Provide a public API for users to access the table properties for each SSTable.

Test Plan: Added a unit tests to test the function correctness under differnet conditions.

Reviewers: haobo, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16083

63690625

13 2月, 2014 1 次提交

preload table handle on Recover() when max_open_files == -1 · 5fbf2ef4

由 Lei Jin 提交于 2月 12, 2014

Summary: This covers existing table files before DB open happens and avoids contention on table cache

Test Plan: db_test

Reviewers: haobo, sdong, igor, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16089

5fbf2ef4

04 2月, 2014 1 次提交

[Performance Branch] PlainTable to encode rows with seqID 0, value type using 1 internal byte. · d169b676

由 Siying Dong 提交于 1月 27, 2014

Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15489

d169b676

03 2月, 2014 1 次提交

First phase API clean up · 4f6cb17b

由 kailiu 提交于 1月 27, 2014

Summary:
Addressed all the issues in https://reviews.facebook.net/D15447.
Now most table-related modules are hidden from user land.

Test Plan: make check

Reviewers: sdong, haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15525

4f6cb17b

31 1月, 2014 1 次提交

The default settings enable checksum verification on every read. · abd70ecc

由 Dhruba Borthakur 提交于 1月 30, 2014

Summary: The default settings enable checksum verification on every read.

Test Plan: make check

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15591

abd70ecc

29 1月, 2014 1 次提交

Only get the manifest file size if there is no error · 5d2c6282

由 Igor Canadi 提交于 1月 28, 2014

Summary:
I came across this while working on column families. CorruptionTest::RecoverWriteError threw a SIGSEG because the descriptor_log_->file() was nullptr. I'm not sure why it doesn't happen in master, but better safe than sorry.

@kailiu, can we get this in release, too?

Test Plan: make check

Reviewers: kailiu, dhruba, haobo

Reviewed By: haobo

CC: leveldb, kailiu

Differential Revision: https://reviews.facebook.net/D15513

5d2c6282

28 1月, 2014 2 次提交

Fsync directory after we create a new file · 832158e7

由 Igor Canadi 提交于 1月 27, 2014

Summary:
@dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.

Should I also add FsyncDir() to new files that get created by a compaction?

Test Plan: Confirmed that FsyncDir is returning Status::OK()

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D14751

832158e7

Move NeedsCompaction() from VersionSet to Version · 6c2ca1d3

由 Igor Canadi 提交于 1月 27, 2014

Summary: There is no reason to have functions NeedCompaction(), MaxCompactionScore() and MaxCompactionScoreLevel() in VersionSet, since they don't access any data in VersionSet.

Test Plan: make check

Reviewers: kailiu, haobo, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15333

6c2ca1d3

25 1月, 2014 2 次提交

Fix reduce levels · 04afa321

由 Igor Canadi 提交于 1月 24, 2014

ReduceNumberOfLevels had segmentation fault in WriteSnapshot() since we
didn't change the number of levels in VersionSet (we consider them
immutable from now on). This fixes the problem.

04afa321

Make VersionSet::ReduceNumberOfLevels() static · 677fee27

由 Igor Canadi 提交于 1月 24, 2014

Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.

Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.

Test Plan: reduce_levels_test

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15303

677fee27

23 1月, 2014 1 次提交

Unfriending classes · fb01755a

由 Igor Canadi 提交于 1月 22, 2014

Summary:
In this diff I made some effort to reduce usage of friending. To do that, I had to expose Compaction::inputs_ through a method inputs(). Not sure if this is a good idea, there is a trade-off. I think it's less confusing than having lots of friends.

I also thought about other friendship relationships, but they are too much tangled at this point. Once you friend two classes, it's very hard to unfriend them :)

Test Plan: make check

Reviewers: haobo, kailiu, sdong, dhruba

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15267

fb01755a

17 1月, 2014 2 次提交

Remove compaction pointers · 6d6fb709

由 Igor Canadi 提交于 1月 16, 2014

Summary: The only thing we do with compaction pointers is set them to some values, we never actually read them. I don't know what we used them for, but it doesn't look like we use them anymore.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15225

6d6fb709

CompactionPicker · c699c84a

由 Igor Canadi 提交于 1月 16, 2014

Summary:
This is a big one. This diff moves all the code related to picking compactions from VersionSet to new class CompactionPicker. Column families' compactions will be completely separate processes, so we need to have multiple CompactionPickers.

To make this easier to review, most of the code change is just copy/paste. There is also a small change not to use VersionSet::current_, but rather to take `Version* version` as a parameter. Most of the other code is exactly the same.

In future diffs, I will also make some improvements to CompactionPickers. I think the most important part will be encapsulating it better. Currently Version, VersionSet, Compaction and CompactionPicker are all friend classes, which makes it harder to change the implementation.

This diff depends on D15171, D15183, D15189 and D15201

Test Plan: `make check`

Reviewers: kailiu, sdong, dhruba, haobo

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15207

c699c84a

16 1月, 2014 5 次提交

Remove the unnecessary use of shared_ptr · eae1804f

由 kailiu 提交于 1月 15, 2014

Summary:
shared_ptr is slower than unique_ptr (which literally comes with no performance cost compare with raw pointers).
In memtable and memtable rep, we use shared_ptr when we'd actually should use unique_ptr.

According to igor's previous work, we are likely to make quite some performance gain from this diff.

Test Plan: make check

Reviewers: dhruba, igor, sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15213

eae1804f

Move more functions from VersionSet to Version · 787f11bb

由 Igor Canadi 提交于 1月 15, 2014

Summary:
This moves functions:
* VersionSet::Finalize() -> Version::UpdateCompactionStats()
* VersionSet::UpdateFilesBySize() -> Version::UpdateFilesBySize()

The diff depends on D15189, D15183 and D15171

Test Plan: make check

Reviewers: kailiu, sdong, haobo, dhruba

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15201

787f11bb

Moving Compaction class to separate header file · 615d1ea2

由 Igor Canadi 提交于 1月 15, 2014

Summary:
I'm sure we'll all agree that version_set.cc needs simplifying. This diff moves Compaction class to a separate file.

The diff depends on D15171 and D15183

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15189

615d1ea2

Move functions from VersionSet to Version · 2f4eda78

由 Igor Canadi 提交于 1月 15, 2014

Summary:
There were some functions in VersionSet that had no reason to be there instead of Version. Moving them to Version will make column families implementation easier.

The functions moved are:
* NumLevelBytes
* LevelSummary
* LevelFileSummary
* MaxNextLevelOverlappingBytes
* AddLiveFiles (previously AddLiveFilesCurrentVersion())
* NeedSlowdownForNumLevel0Files

The diff continues on (and depends on) D15171

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong, emayanke

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15183

2f4eda78

Decrease reliance on VersionSet::NumberLevels() · 65a8a52b

由 Igor Canadi 提交于 1月 15, 2014

Summary:
With column families VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate call to VersionSet::NumberLevels()

This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet.

I have also slightly changed how VersionSet keeps track of manifest size.

This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor.

In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_

Test Plan: make check

Reviewers: haobo, dhruba, kailiu, sdong

Reviewed By: sdong

Differential Revision: https://reviews.facebook.net/D15171

65a8a52b

15 1月, 2014 3 次提交

Fix CompactRange to apply filter to every key · d9cd7a06

由 Igor Canadi 提交于 1月 14, 2014

Summary:
When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`.

This patch fixed the unit test.

Test Plan: Added a failing unit test and a fix, so it's not failing anymore.

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D14421

d9cd7a06

VersionEdit not to take NumLevels() · 055e6df4

由 Igor Canadi 提交于 1月 14, 2014

Summary:
I will submit a sequence of diffs that are preparing master branch for column families. There are a lot of implicit assumptions in the code that are making column family implementation hard. If I make the change only in column family branch, it will make merging back to master impossible.

Most of the diffs will be simple code refactorings, so I hope we can have fast turnaround time. Feel free to grab me in person to discuss any of them.

This diff removes number of level check from VersionEdit. It is used only when VersionEdit is read, not written, but has to be set when it is written. I believe it is a right thing to make VersionEdit dumb and check consistency on the caller side. This will also make it much easier to implement Column Families, since different column families can have different number of levels.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15159

055e6df4

Pre-calculate whether to slow down for too many level 0 files · fbbf0d14

由 Siying Dong 提交于 1月 14, 2014

Summary: Currently in DBImpl::MakeRoomForWrite(), we do "versions_->NumLevelFiles(0) >= options_.level0_slowdown_writes_trigger" to check whether the writer thread needs to slow down. However, versions_->NumLevelFiles(0) is slightly more expensive than we expected. By caching the result of the comparison when installing a new version, we can avoid this function call every time.

Test Plan:
make all check
Manually trigger this behavior by applying universal compaction style and make sure inserts are made slow after there are certain number of files.

Reviewers: haobo, kailiu, igor

Reviewed By: kailiu

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D15141

fbbf0d14

11 1月, 2014 1 次提交

[Performance Branch] If options.max_open_files set to be -1, cache table... · aa0ef660

由 Siying Dong 提交于 1月 06, 2014

[Performance Branch] If options.max_open_files set to be -1, cache table readers in FileMetadata for Get() and NewIterator()

Summary:
In some use cases, table readers for all live files should always be cached. In that case, there will be an opportunity to avoid the table cache look-up while Get() and NewIterator().

We define options.max_open_files = -1 to be the mode that table readers for live files will always be kept. In that mode, table readers are cached in FileMetaData (with a reference count hold in table cache). So that when executing table_cache.Get() and table_cache.newInterator(), LRU cache checking can be by-passed, to reduce latency.

Test Plan: add a test case in db_test

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D15039

aa0ef660

03 1月, 2014 1 次提交

Some minor refactoring on the code · 476416c2

由 kailiu 提交于 12月 30, 2013

Summary: I made some cleanup while reading the source code in `db`. Most changes are about style, naming or C++ 11 new features.

Test Plan: ran `make check`

Reviewers: haobo, dhruba, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15009

476416c2

27 12月, 2013 1 次提交

Avoid malloc in NotFound key status if no message is given. · 18df47b7

由 Siying Dong 提交于 12月 26, 2013

Summary:
In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case.

The motivation of it is that I found in db_bench readrandom test when all keys are not existing, about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU spent on free of them, which is not necessary.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14691

18df47b7

18 12月, 2013 1 次提交

Move level0 sorting logic from Version::SaveTo() to Version::Finalize() · 14995a8f

由 Siying Dong 提交于 12月 11, 2013

Summary: I realized that "D14409 Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()" is not done in an optimized place. SaveTo() is usually inside mutex. Move it to Finalize(), which is called out of mutex.

Test Plan: make all check

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14607

14995a8f

12 12月, 2013 4 次提交

[RocksDB][Performance Branch]Iterator Cleanup method only tries to find... · aaf9c620

由 Siying Dong 提交于 12月 11, 2013

[RocksDB][Performance Branch]Iterator Cleanup method only tries to find obsolete files if it has the last reference to a version

Summary: When deconstructing an iterator, no need to check obsolete file if it doesn't hold last reference of any version.

Test Plan: make all check

Reviewers: haobo, igor, dhruba, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14595

aaf9c620

Introduce MergeContext to Lazily Initialize merge operand list · a8029fdc

由 Siying Dong 提交于 12月 02, 2013

Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14415

Conflicts:
	db/db_impl.cc
	db/memtable.cc

a8029fdc

[RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting... · bc5dd19b

由 Siying Dong 提交于 12月 09, 2013

[RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()

Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: dhruba

CC: nkg-, igor, leveldb

Differential Revision: https://reviews.facebook.net/D14409

bc5dd19b

[RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting... · 41349d9e

由 Siying Dong 提交于 12月 09, 2013

[RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()

Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: dhruba

CC: nkg-, igor, leveldb

Differential Revision: https://reviews.facebook.net/D14409

41349d9e

07 12月, 2013 1 次提交

[RocksDB Performance Branch] Introduce MergeContext to Lazily Initialize merge operand list · ef2211a9

由 Siying Dong 提交于 12月 02, 2013

Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14415

ef2211a9

26 11月, 2013 2 次提交

During benchmarking, I see excessive use of vector.reserve(). · 8478f380

由 Dhruba Borthakur 提交于 11月 25, 2013

Summary:
This code path can potentially accumulate multiple important_files for level 0.
But for other levels, it should have only one file in the
important_files, so it is ok not to reserve excessive space, is it not?

Test Plan: make check

Reviewers: haobo

Reviewed By: haobo

CC: reconnect.grayhat, leveldb

Differential Revision: https://reviews.facebook.net/D14349

8478f380

[RocksDB] Use raw pointer instead of shared pointer when passing Statistics object internally · 5b825d69

由 Haobo Xu 提交于 11月 22, 2013

Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient.

Test Plan: make check

Reviewers: dhruba, MarkCallaghan, igor

Reviewed By: dhruba

CC: leveldb, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14289

5b825d69

22 11月, 2013 1 次提交
- S
  Revert "Allow users to profile a query and see bottleneck of the query" · 3e35aa64
  由 Siying Dong 提交于 11月 21, 2013
```
This reverts commit 3d8ac31d.
```
  3e35aa64

kvdb / rocksdb 11 个月 前同步成功

kvdb / rocksdb
11 个月前同步成功