1. 01 Jul, 2013 3 commits
    • D
      Merge branch 'performance' of github.com:facebook/rocksdb into performance · a84f5470
      Dhruba Borthakur authored
      Conflicts:
      	db/db_bench.cc
      	db/version_set.cc
      	include/leveldb/options.h
      	util/options.cc
      a84f5470
    • D
      Reduce write amplification by merging files in L0 back into L0 · 47c4191f
      Dhruba Borthakur authored
      Summary:
      There is a new option called hybrid_mode which, when switched on,
      causes HBase style compactions.  Files from L0 are
      compacted back into L0. The meat of this compaction algorithm
      is in PickCompactionHybrid().
      
      All files reside in L0. That means all files have overlapping
      keys. Each file has a time-bound, i.e. each file contains a
      range of keys that were inserted around the same time. The
      start-seqno and the end-seqno refer to the timeframe when
      these keys were inserted.  Files that have contiguous seqno
      are compacted together into a larger file. All files are
      ordered from most recent to the oldest.
      
      The current compaction algorithm starts to look for
      candidate files starting from the most recent file. It continues to
      add more files to the same compaction run as long as the
      total size of the files chosen so far is smaller than the
      size of the next candidate file. This logic needs to be debated
      and validated.
      
      The above logic should reduce write amplification to a
      large extent... will publish numbers shortly.
      
      Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
      
      Differential Revision: https://reviews.facebook.net/D11289
      47c4191f
    • D
      Reduce write amplification by merging files in L0 back into L0 · 554c06dd
      Dhruba Borthakur authored
      Summary:
      There is a new option called hybrid_mode which, when switched on,
      causes HBase style compactions.  Files from L0 are
      compacted back into L0. The meat of this compaction algorithm
      is in PickCompactionHybrid().
      
      All files reside in L0. That means all files have overlapping
      keys. Each file has a time-bound, i.e. each file contains a
      range of keys that were inserted around the same time. The
      start-seqno and the end-seqno refer to the timeframe when
      these keys were inserted.  Files that have contiguous seqno
      are compacted together into a larger file. All files are
      ordered from most recent to the oldest.
      
      The current compaction algorithm starts to look for
      candidate files starting from the most recent file. It continues to
      add more files to the same compaction run as long as the
      total size of the files chosen so far is smaller than the
      size of the next candidate file. This logic needs to be debated
      and validated.
      
      The above logic should reduce write amplification to a
      large extent... will publish numbers shortly.
      
      Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
      
      Differential Revision: https://reviews.facebook.net/D11289
      554c06dd
  2. 27 Jun, 2013 2 commits
    • H
      [RocksDB] Expose count for WriteBatch · 71e0f695
      Haobo Xu authored
      Summary: As title. Exposed a Count function that returns the number of updates in a batch. Could be handy for replication sequence number check.
      
      Test Plan: make check;
      
      Reviewers: emayanke, sheki, dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11523
      71e0f695
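The Count() accessor described above can be modeled with a tiny batch that tracks how many updates it holds (leveldb-style batches keep this count in a fixed header). This is a simplified sketch, not the real WriteBatch:

```cpp
#include <cassert>
#include <string>

// Minimal model of a write batch exposing the number of updates it contains,
// in the spirit of the Count() function this commit adds. Simplified sketch
// only; the real WriteBatch serializes operations into a byte buffer.
class MiniWriteBatch {
 public:
  void Put(const std::string& key, const std::string& value) { ++count_; }
  void Delete(const std::string& key) { ++count_; }
  int Count() const { return count_; }  // number of updates in the batch
 private:
  int count_ = 0;
};
```

A replication layer could compare this count against applied sequence numbers, as the summary suggests.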
    • D
      Added stringappend_test back into the unit tests. · 34ef8732
      Deon Nicholas authored
      Summary:
      With the Makefile now updated to correctly update all .o files, this
      should fix the issues recompiling stringappend_test. This should also fix the
      "segmentation-fault" that we were getting earlier. Now, stringappend_test should
      be clean, and I have added it back to the unit-tests. Also made some minor updates
      to the tests themselves.
      
      Test Plan:
      1. make clean; make stringappend_test -j 32	(will test it by itself)
      2. make clean; make all check -j 32		(to run all unit tests)
      3. make clean; make release			(test in release mode)
      4. valgrind ./stringappend_test 		(valgrind tests)
      
      Reviewers: haobo, jpaton, dhruba
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11505
      34ef8732
  3. 26 Jun, 2013 1 commit
    • D
      Updated "make clean" to remove all .o files · 6894a50a
      Deon Nicholas authored
      Summary:
      The old Makefile did not remove ALL .o and .d files, but rather only
      those that happened to be in the root folder and one-level deep. This was causing
      issues when recompiling files in deeper folders. This fix now causes make clean
      to find ALL .o and .d files via a unix "find" command, and then remove them.
      
      Test Plan:
      make clean;
      make all -j 32;
      
      Reviewers: haobo, jpaton, dhruba
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11493
      6894a50a
  4. 22 Jun, 2013 1 commit
    • M
      Simplify bucketing logic in ldb-ttl · b858da70
      Mayank Agarwal authored
      Summary: [start_time, end_time) is what I'm following for the buckets and the whole time-range. Also cleaned up some code in db_ttl.* Not correcting the spacing/indenting convention for util/ldb_cmd.cc in this diff.
      
      Test Plan: python ldb_test.py, make ttl_test, Run mcrocksdb-backup tool, Run the ldb tool on 2 mcrocksdb production backups from sigmafio033.prn1
      
      Reviewers: vamsi, haobo
      
      Reviewed By: vamsi
      
      Differential Revision: https://reviews.facebook.net/D11433
      b858da70
  5. 20 Jun, 2013 4 commits
    • M
      Introducing timeranged scan, timeranged dump in ldb. Also the ability to count... · 61f1baae
      Mayank Agarwal authored
      Introducing timeranged scan, timeranged dump in ldb. Also the ability to count in time-batches during Dump
      
      Summary:
      Scan and Dump commands in ldb use iterator. We need to also print timestamp for ttl databases for debugging. For this I create a TtlIterator class pointer in these functions and assign it the value of Iterator pointer which actually points to t TtlIterator object, and access the new function ValueWithTS which can return TS also. Buckets feature for dump command: gives a count of different key-values in the specified time-range distributed across the time-range partitioned according to bucket-size. start_time and end_time are specified in unixtimestamp and bucket in seconds on the user-commandline
      Have commented out 3 ines from ldb_test.py so that the test does not break right now. It breaks because timestamp is also printed now and I have to look at wildcards in python to compare properly.
      
      Test Plan: python tools/ldb_test.py
      
      Reviewers: vamsi, dhruba, haobo, sheki
      
      Reviewed By: vamsi
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11403
      61f1baae
    • H
      [RocksDB] add back --mmap_read options to crashtest · 0f78fad9
      Haobo Xu authored
      Summary: As title, now that db_stress supports --mmap_read properly
      
      Test Plan: make crash_test
      
      Reviewers: vamsi, emayanke, dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11391
      0f78fad9
    • H
      [RocksDB] Minor change to statistics.h · 4deaa0d4
      Haobo Xu authored
      Summary: as title, use an initializer list so that lines fit in 80 chars.
      
      Test Plan: make check;
      
      Reviewers: sheki, dhruba
      
      Differential Revision: https://reviews.facebook.net/D11385
      4deaa0d4
    • H
      [RocksDB] Add mmap_read option for db_stress · 96be2c4e
      Haobo Xu authored
      Summary: as title, also removed an incorrect assertion
      
      Test Plan: make check; db_stress --mmap_read=1; db_stress --mmap_read=0
      
      Reviewers: dhruba, emayanke
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11367
      96be2c4e
  6. 19 Jun, 2013 5 commits
  7. 18 Jun, 2013 4 commits
  8. 15 Jun, 2013 4 commits
  9. 14 Jun, 2013 2 commits
  10. 13 Jun, 2013 3 commits
    • H
      [RocksDB] Sync file to disk incrementally · 778e1790
      Haobo Xu authored
      Summary:
      During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic.
      This diff simply asks the OS to sync data incrementally as it is written, in the background. The hope is that, at the final sync, most of the data is already on disk and we will block less on the sync call. Thus, each compaction runs faster and we can use fewer compaction threads to saturate IO.
      In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too.
      
      Some quick tests show 10~20% improvement in per thread compaction throughput. Combined with posix advice on compaction read, just 5 threads are enough to almost saturate the udb flash bandwidth for 800 bytes write only benchmark.
      What's more promising is that, with saturated IO, iostat shows average wait time is actually smoother and much smaller.
      For the write-only 800-byte test:
      Before the change: await oscillates between 3ms and 10ms
      After the change: await ranges from 1ms to 3ms
      
      Will test against a read-modify-write workload too, to see if the high read P99 latency can be resolved.
      
      Will introduce a parameter to control the sync interval in a follow up diff after cleaning up EnvOptions.
      
      Test Plan: make check; db_bench; db_stress
      
      Reviewers: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11115
      778e1790
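The decision the change describes, asking the OS to write back accumulated bytes in the background so the final sync has little left to flush, can be modeled as a small tracker. The 1MB interval in the usage below is an illustrative value, and the real change issues an OS range-sync call (e.g. sync_file_range on Linux) where this sketch just returns true:

```cpp
#include <cassert>
#include <cstdint>

// Track bytes appended to a compaction output file and decide when a
// background range sync should be issued for the pending bytes.
// Illustrative sketch of the policy only, not the actual file code.
class IncrementalSyncTracker {
 public:
  explicit IncrementalSyncTracker(uint64_t sync_interval)
      : sync_interval_(sync_interval) {}

  // Called after each append; returns true when the accumulated bytes
  // should be handed to the OS for background writeback.
  bool OnAppend(uint64_t bytes_written) {
    pending_ += bytes_written;
    if (pending_ >= sync_interval_) {
      pending_ = 0;  // range handed off; counting starts over
      return true;
    }
    return false;
  }

 private:
  uint64_t sync_interval_;
  uint64_t pending_ = 0;
};
```

Spreading writeback this way is what smooths the burst at the final sync and, per the numbers above, the IO wait times.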
    • D
      [Rocksdb] [Multiget] Introduced multiget into db_bench · 4985a9f7
      Deon Nicholas authored
      Summary:
      Preliminary! Introduced the --use_multiget=1 and --keys_per_multiget=n
      flags for db_bench. Also updated and tested the ReadRandom() method
      to include an option to use multiget. By default,
      keys_per_multiget=100.
      
      Preliminary tests imply that multiget is at least 1.25x faster per
      key than regular get.
      
      Will continue adding Multiget for ReadMissing, ReadHot,
      RandomWithVerify, ReadRandomWriteRandom; soon. Will also think
      about ways to better verify benchmarks.
      
      Test Plan:
      1. make db_bench
      2. ./db_bench --benchmarks=fillrandom
      3. ./db_bench --benchmarks=readrandom --use_existing_db=1
      	      --use_multiget=1 --threads=4 --keys_per_multiget=100
      4. ./db_bench --benchmarks=readrandom --use_existing_db=1
      	      --threads=4
      5. Verify ops/sec (and 1000000 of 1000000 keys found)
      
      Reviewers: haobo, MarkCallaghan, dhruba
      
      Reviewed By: MarkCallaghan
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11127
      4985a9f7
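The benchmark change above amounts to issuing reads in groups of keys_per_multiget through one batched call instead of one Get() per key. A sketch of that batching loop, with a callback standing in for the database's batched-read entry point (the callback signature is an assumption, not the db_bench code):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Split `keys` into groups of `keys_per_multiget` and hand each group to one
// batched read call. Returns the number of batched calls issued.
// Illustrative sketch; `multiget` stands in for the real batched-read API.
size_t ReadInMultigetBatches(
    const std::vector<std::string>& keys, size_t keys_per_multiget,
    const std::function<void(const std::vector<std::string>&)>& multiget) {
  size_t calls = 0;
  for (size_t i = 0; i < keys.size(); i += keys_per_multiget) {
    size_t end = std::min(i + keys_per_multiget, keys.size());
    multiget(std::vector<std::string>(keys.begin() + i, keys.begin() + end));
    ++calls;
  }
  return calls;
}
```

With the default keys_per_multiget=100, a million-key read phase becomes ten thousand batched calls, which is where the reported per-key speedup comes from.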
    • H
      [RocksDB] cleanup EnvOptions · bdf10859
      Haobo Xu authored
      Summary:
      This diff simplifies EnvOptions by treating it as POD, similar to Options.
      - virtual functions are removed and member fields are accessed directly.
      - StorageOptions is removed.
      - Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
      - Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite
      
      Test Plan: make check; db_stress
      
      Reviewers: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11175
      bdf10859
  11. 12 Jun, 2013 1 commit
    • D
      Completed the implementation and test cases for Redis API. · 5679107b
      Deon Nicholas authored
      Summary:
      Completed the implementation for the Redis API for Lists.
      The Redis API uses rocksdb as a backend to persistently
      store maps from key->list. It supports basic operations
      for appending, inserting, pushing, popping, and accessing
      a list, given its key.
      
      Test Plan:
        - Compile with: make redis_test
        - Test with: ./redis_test
        - Run all unit tests (for all rocksdb) with: make all check
        - To use an interactive REDIS client use: ./redis_test -m
        - To clean the database before use:       ./redis_test -m -d
      
      Reviewers: haobo, dhruba, zshao
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D10833
      5679107b
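The key-to-list mapping described above can be modeled with a toy class where an in-memory map stands in for the rocksdb backend. Method names mirror Redis commands; this is an illustrative sketch, not the actual API exercised by redis_test:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model of the persistent key -> list mapping the Redis API provides,
// with an in-memory map standing in for the rocksdb backend.
class ToyRedisLists {
 public:
  void RPush(const std::string& key, const std::string& value) {
    store_[key].push_back(value);  // append to the right of the list
  }
  void LPush(const std::string& key, const std::string& value) {
    auto& list = store_[key];
    list.insert(list.begin(), value);  // push onto the left of the list
  }
  std::string Index(const std::string& key, size_t i) const {
    return store_.at(key).at(i);  // random access into the list
  }
  size_t Length(const std::string& key) const {
    auto it = store_.find(key);
    return it == store_.end() ? 0 : it->second.size();
  }

 private:
  std::map<std::string, std::vector<std::string>> store_;
};
```

The real implementation must additionally serialize each list into a rocksdb value so it survives restarts, which the toy model omits.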
  12. 11 Jun, 2013 6 commits
    • D
      Do not submit multiple simultaneous seek-compaction requests. · e673d5d2
      Dhruba Borthakur authored
      Summary:
      The code was such that if multi-threaded-compactions as well
      as seek compaction are enabled then it submits multiple
      compaction request for the same range of keys. This causes
      extraneous sst-files to accumulate at various levels.
      
      Test Plan:
      I am not able to write a very good unit test for this one
      but can easily reproduce this bug with 'dbstress' with the
      following options.
      
      batch=1;maxk=100000000;ops=100000000;ro=0;fm=2;bpl=10485760;of=500000; wbn=3; mbc=20; mb=2097152; wbs=4194304; dds=1; sync=0;  t=32; bs=16384; cs=1048576; of=500000; ./db_stress --disable_seek_compaction=0 --mmap_read=0 --threads=$t --block_size=$bs --cache_size=$cs --open_files=$of --verify_checksum=1 --db=/data/mysql/leveldb/dbstress.dir --sync=$sync --disable_wal=1 --disable_data_sync=$dds --write_buffer_size=$wbs --target_file_size_base=$mb --target_file_size_multiplier=$fm --max_write_buffer_number=$wbn --max_background_compactions=$mbc --max_bytes_for_level_base=$bpl --reopen=$ro --ops_per_thread=$ops --max_key=$maxk --test_batches_snapshots=$batch
      
      Reviewers: leveldb, emayanke
      
      Reviewed By: emayanke
      
      Differential Revision: https://reviews.facebook.net/D11055
      e673d5d2
    • M
      Make Write API work for TTL databases · 3c35eda9
      Mayank Agarwal authored
      Summary: Added logic to make another WriteBatch with Timestamps during Write function execution in the TTL class. Also expanded ttl_test to test for it. Have done nothing for Merge for now.
      
      Test Plan: make ttl_test;./ttl_test
      
      Reviewers: haobo, vamsi, dhruba
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D10827
      3c35eda9
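The TTL write path described above rewrites each value so it carries a creation timestamp that reads can later check for freshness. A sketch of that idea using a 4-byte little-endian suffix; the actual on-disk encoding used by the TTL class is an assumption here:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append a 4-byte little-endian creation timestamp to a value before writing.
// Illustrative model of the TTL value rewriting, not the actual encoding.
std::string AppendTimestamp(const std::string& value, uint32_t ts) {
  std::string out = value;
  for (int i = 0; i < 4; ++i) {
    out.push_back(static_cast<char>((ts >> (8 * i)) & 0xff));
  }
  return out;
}

// Recover the timestamp from the last 4 bytes of a stored value.
uint32_t ExtractTimestamp(const std::string& stored) {
  uint32_t ts = 0;
  size_t base = stored.size() - 4;
  for (int i = 0; i < 4; ++i) {
    ts |= static_cast<uint32_t>(static_cast<unsigned char>(stored[base + i]))
          << (8 * i);
  }
  return ts;
}

// Strip the suffix to return the user-visible value on reads.
std::string StripTimestamp(const std::string& stored) {
  return stored.substr(0, stored.size() - 4);
}
```

Rewriting the batch entry-by-entry with AppendTimestamp is what lets a plain Write() go through the TTL layer unchanged from the caller's point of view.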
    • D
      Fix referring to freed memory in earlier commit · 1b69f1e5
      Dhruba Borthakur authored
      Summary: Fix referring to freed memory in the earlier commit https://reviews.facebook.net/D11181
      
      Test Plan: make check
      
      Reviewers: haobo, sheki
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11193
      1b69f1e5
    • A
      [Rocksdb] fix wrong assert · 4a8554d5
      Abhishek Kona authored
      Summary: The assert added in D11145 was wrong and broke the build.
      
      Test Plan: make db_bench; run it
      
      Reviewers: dhruba, haobo, emayanke
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11187
      4a8554d5
    • D
      Print name of user comparator in LOG. · c5de1b93
      Dhruba Borthakur authored
      Summary:
      The current code prints the name of the InternalKeyComparator
      in the log file. We would also like to print the name of the
      user-specified comparator for easier debugging.
      
      Test Plan: make check
      
      Reviewers: sheki
      
      Reviewed By: sheki
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11181
      c5de1b93
    • A
      [rocksdb] names for all metrics provided in statistics.h · a4913c51
      Abhishek Kona authored
      Summary: Provide a map of histograms and tickers to strings. Fb303 libraries can use this to provide the mapping, so we will not have to duplicate the code during release.
      
      Test Plan: db_bench with statistics=1
      
      Reviewers: dhruba, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11145
      a4913c51
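The shared mapping described above can be sketched as a single table that both the stats code and external reporting (e.g. fb303) consume. The enum values and names below are examples for illustration, not the real list in statistics.h:

```cpp
#include <cassert>
#include <map>
#include <string>

// Example ticker enum; the real statistics.h defines many more.
enum Tickers { BLOCK_CACHE_MISS = 0, BLOCK_CACHE_HIT = 1 };

// One shared ticker -> name table, so reporting layers look names up here
// instead of duplicating the strings. Illustrative sketch only.
const std::map<Tickers, std::string> TickersNameMap = {
    {BLOCK_CACHE_MISS, "rocksdb.block.cache.miss"},
    {BLOCK_CACHE_HIT, "rocksdb.block.cache.hit"},
};
```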
  13. 10 Jun, 2013 2 commits
    • M
      Max_mem_compaction_level can have maximum value of num_levels-1 · 184343a0
      Mayank Agarwal authored
      Summary:
      Without this, files could be written out to a level greater than the maximum level possible, and this was the source of the segfaults that wormhole was getting. The sequence of steps that was followed:
      1. WriteLevel0Table was called when memtable was to be flushed for a file.
      2. PickLevelForMemTableOutput was called to determine the level to which this file should be pushed.
      3. PickLevelForMemTableOutput returned a wrong result because max_mem_compaction_level was equal to 2 even when num_levels was equal to 0.
      The fix to re-initialize max_mem_compaction_level based on num_levels passed seems correct.
      
      Test Plan: make all check; Also made a dummy file to mimic the wormhole-file behaviour which was causing the segfaults, and found that the same segfault occurs without this change and not with it.
      
      Reviewers: dhruba, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11157
      184343a0
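The fix described above amounts to re-clamping the option against the actual number of levels, since a memtable flush may never target a level beyond num_levels - 1. A one-line sketch of that clamp (the function name is illustrative, not the actual initialization code):

```cpp
#include <algorithm>
#include <cassert>

// A memtable flush may never target a level index beyond num_levels - 1,
// so clamp the configured option against it. Illustrative sketch only.
int ClampMaxMemCompactionLevel(int max_mem_compaction_level, int num_levels) {
  return std::min(max_mem_compaction_level, num_levels - 1);
}
```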
    • M
      Modifying options to db_stress when it is run with db_crashtest · 7a6bd8e9
      Mayank Agarwal authored
      Summary: These extra options caught some bugs. They will now be run via Jenkins with the crash_test.
      
      Test Plan: make crash_test
      
      Reviewers: dhruba, vamsi
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11151
      7a6bd8e9
  14. 08 Jun, 2013 2 commits