1. 22 Dec 2014 (3 commits)
    • Speed up FindObsoleteFiles() · 0acc7388
      Committed by Igor Canadi
      Summary:
      There are two versions of FindObsoleteFiles():
      * full scan, which is executed every 6 hours (and it's terribly slow)
      * no full scan, which is executed every time a background process finishes and iterator is deleted
      
      This diff optimizes the second case (no full scan). Here's what we do before the diff:
      * Get the list of obsolete files (files with ref==0). Some files in the obsolete_files set might actually be live.
      * Get the list of live files to avoid deleting files that are live.
      * Delete files that are in obsolete_files and not in live_files.
      
      After this diff (see the sketch below):
      * The only files with ref==0 that are still live are files that have been part of move compaction. Don't include moved files in obsolete_files.
      * Get the list of obsolete files (which exclude moved files).
      * No need to get the list of live files, since all files in obsolete_files need to be deleted.
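
      A minimal sketch of the new bookkeeping, under assumed member names (files_, obsolete_files_); the real ~Version carries much more state:

        #include <vector>

        // One entry per SST file; refs counts the Versions referencing it.
        struct FileMetaData {
          int refs = 0;
          bool moved = false;  // set when the file was part of a move compaction
        };

        class Version {
         public:
          ~Version() {
            for (FileMetaData* f : files_) {
              if (--f->refs == 0 && !f->moved) {
                // ref==0 and never moved: truly obsolete, so the file can be
                // deleted without cross-checking against a live-files list.
                obsolete_files_->push_back(f);
              }
            }
          }

         private:
          std::vector<FileMetaData*> files_;
          std::vector<FileMetaData*>* obsolete_files_;  // owned by the VersionSet
        };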
      
      I'll post the benchmark results, but you can get the feel of it here: https://reviews.facebook.net/D30123
      
      This depends on D30123.
      
      P.S. We should do the full scan only in failure scenarios, not every 6 hours. I'll do this in a follow-up diff.
      
      Test Plan:
      One new unit test. Made sure that the unit test fails if we don't have an `if (!f->moved)` safeguard in ~Version.
      
      make check
      
      Big number of compactions and flushes:
      
        ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
      
      Reviewers: yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30249
    • Merge pull request #442 from alabid/alabid/fix-example-typo · d8c4ce6b
      Committed by Igor Canadi
      fix really trivial typo in column families example
    • fix really trivial typo · 949bd71f
      Committed by alabid
  2. 21 Dec 2014 (1 commit)
    • Fix a SIGSEGV in BackgroundFlush · f8999fcf
      Committed by Igor Canadi
      Summary:
      This one wasn't easy to find :)
      
      What happens is we go through all cfds on flush_queue_ and find no cfds to flush, *but* cfd is left set to the last CF we looped through, and the following code assumes we want it flushed.
      
      BTW @sdong do you think we should also make BackgroundFlush() only check a single cfd for flushing instead of doing this `while (!flush_queue_.empty())`?
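
      A self-contained sketch of the fixed loop, using stand-in names (flush_queue_, flush_pending) rather than the real DBImpl members:

        #include <deque>

        struct ColumnFamilyData {
          bool flush_pending = false;  // stand-in for imm()->IsFlushPending()
        };

        std::deque<ColumnFamilyData*> flush_queue_;

        // Returns a CF that actually needs flushing, or nullptr if none does.
        // The bug: the original loop left cfd pointing at the last CF popped
        // even when no CF needed a flush, and the code after the loop then
        // treated that stale cfd as a valid flush candidate.
        ColumnFamilyData* PickColumnFamilyToFlush() {
          ColumnFamilyData* cfd = nullptr;
          while (!flush_queue_.empty()) {
            ColumnFamilyData* first = flush_queue_.front();
            flush_queue_.pop_front();
            if (first->flush_pending) {
              cfd = first;  // found real work
              break;
            }
            // fixed: do NOT keep `first` in cfd when it needs no flush
          }
          return cfd;  // caller must check for nullptr before flushing
        }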
      
      Test Plan: regression test no longer fails
      
      Reviewers: sdong, rven, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: sdong, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30591
  3. 20 Dec 2014 (3 commits)
    • MultiGet for DBWithTTL · ade4034a
      Committed by Igor Canadi
      Summary: This is a feature request from a RocksDB user. I didn't even realize we didn't support MultiGet on a TTL DB :)
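
      A usage sketch; the header path and exact signatures are assumptions based on the public RocksDB API of the time:

        #include <string>
        #include <vector>

        #include "rocksdb/utilities/db_ttl.h"

        int main() {
          rocksdb::DBWithTTL* db;
          rocksdb::Options options;
          options.create_if_missing = true;
          // Entries older than ttl seconds are lazily purged during compaction.
          rocksdb::Status s =
              rocksdb::DBWithTTL::Open(options, "/tmp/ttl_db", &db, /*ttl=*/3600);
          if (!s.ok()) return 1;

          std::vector<rocksdb::Slice> keys = {"key1", "key2"};
          std::vector<std::string> values;
          // Like Get(), MultiGet on a TTL DB strips the internal timestamp
          // suffix that TTL appends to each stored value.
          std::vector<rocksdb::Status> statuses =
              db->MultiGet(rocksdb::ReadOptions(), keys, &values);
          delete db;
          return statuses[0].ok() ? 0 : 1;
        }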
      
      Test Plan: added a unit test
      
      Reviewers: yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30561
    • Rewritten system for scheduling background work · fdb6be4e
      Committed by Igor Canadi
      Summary:
      When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
      
      The approach is similar to our earlier efforts: instead of a pull model, where we do something for every column family, we use a push model; when we detect that a column family is ready to be flushed or compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction (see the sketch below).
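
      A rough, self-contained sketch of the push model; names like flush_queue_ mirror the patch, but the types are simplified stand-ins:

        #include <deque>

        struct ColumnFamilyData {
          bool pending_flush = false;  // true while this CF sits on flush_queue_
          bool needs_flush = false;    // stand-in for imm()->IsFlushPending()
        };

        std::deque<ColumnFamilyData*> flush_queue_;

        // Called when an event makes this one CF ready: enqueue it in O(1)
        // instead of rescanning all column families under the mutex.
        void SchedulePendingFlush(ColumnFamilyData* cfd) {
          if (!cfd->pending_flush && cfd->needs_flush) {
            cfd->pending_flush = true;  // guard against double-queueing
            flush_queue_.push_back(cfd);
          }
        }

        // The background thread pops one ready CF instead of scanning all.
        ColumnFamilyData* PopFirstFromFlushQueue() {
          ColumnFamilyData* cfd = flush_queue_.front();
          flush_queue_.pop_front();
          cfd->pending_flush = false;
          return cfd;
        }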
      
      Here are the performance results:
      
      Command:
      
          ./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000  --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
      
      Before the patch:
      
        fillrandom   :      26.950 micros/op 37105 ops/sec;    4.1 MB/s

      After the patch:

        fillrandom   :      17.404 micros/op 57456 ops/sec;    6.4 MB/s

      The next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. A fix is coming in the next patch, but when I removed that code, here's what I got:

        fillrandom   :       7.590 micros/op 131758 ops/sec;   14.6 MB/s
      
      Test Plan:
      make check
      
      two stress tests:
      
      Big number of compactions and flushes:
      
          ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
      
      max_background_flushes=0, to verify that this case also works correctly
      
          ./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
      
      Reviewers: ljin, rven, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30123
    • Remove -mtune=native because it's redundant · a3001b1d
      Committed by Igor Canadi
  4. 19 Dec 2014 (8 commits)
  5. 18 Dec 2014 (3 commits)
  6. 17 Dec 2014 (3 commits)
  7. 16 Dec 2014 (6 commits)
  8. 15 Dec 2014 (1 commit)
    • Optimize default compile to compilation platform by default · 06eed650
      Committed by Igor Canadi
      Summary:
      This diff changes the compile to optimize for the native platform by default. This automatically turns on crc32 optimizations for modern processors, which greatly improves RocksDB's performance.
      
      I also made some more changes to the compilation documentation.
      
      Test Plan:
      compile with `make`, observe -march=native
      compile with `PORTABLE=1 make`, observe no -march=native
      
      Reviewers: sdong, rven, yhchiang, MarkCallaghan
      
      Reviewed By: MarkCallaghan
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30225
  9. 13 Dec 2014 (2 commits)
    • Added 'dump_live_files' command to ldb tool. · cef6f843
      Committed by Qiao Yang
      Summary:
      Preliminary diff to solicit comments.
      Given a DB path, dump all SST files (key/value and properties), the WAL file, and the
      manifest files. What command options do we need to support for this command? Maybe
      output_hex for keys?
      
      Test Plan: Create additional ldb unit tests.
      
      Reviewers: sdong, rven
      
      Reviewed By: rven
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D29547
    • Add an assert and avoid std::sort(autovector) to investigate an ASAN issue · 7ab1526c
      Committed by sdong
      Summary:
      The ASAN build failed once with this error:
      
      14:04:52 ==== Test DBTest.CompactFilesOnLevelCompaction
      14:04:52 db_test: db/version_set.cc:1062: void rocksdb::VersionStorageInfo::AddFile(int, rocksdb::FileMetaData*): Assertion `level <= 0 || level_files->empty() || internal_comparator_->Compare( (*level_files)[level_files->size() - 1]->largest, f->smallest) < 0' failed.
      
      Not able to figure out the reason. Use std::vector for the sort to be safe, and add one more assert to help figure out whether the sorting is the problem.
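
      A hypothetical sketch of the workaround: sort a std::vector copy instead of calling std::sort on the autovector directly, and assert that the result is strictly ordered:

        #include <algorithm>
        #include <cassert>
        #include <cstddef>
        #include <vector>

        // Copy the elements out of the custom autovector into a plain
        // std::vector before sorting, so std::sort operates on a container
        // we fully trust, then double-check the resulting order.
        template <typename AutoVec, typename Cmp>
        std::vector<typename AutoVec::value_type> SortedCopy(const AutoVec& av,
                                                             Cmp cmp) {
          std::vector<typename AutoVec::value_type> v(av.begin(), av.end());
          std::sort(v.begin(), v.end(), cmp);
          for (size_t i = 1; i < v.size(); ++i) {
            // Extra assert in the spirit of the diff: neighbors must be
            // strictly increasing, mirroring the check in AddFile.
            assert(cmp(v[i - 1], v[i]));
          }
          return v;
        }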
      
      Test Plan: make all check
      
      Reviewers: yhchiang, rven, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D30117
  10. 12 Dec 2014 (2 commits)
  11. 11 Dec 2014 (3 commits)
    • Modified the LRU cache eviction code so that it doesn't evict blocks which have external references · ee95cae9
      Committed by Alexey Maykov
      Summary:
      Currently, blocks which have more than one reference (i.e. referenced by something other than the cache itself) are evicted from the cache. This doesn't make much sense:
      - blocks are still in RAM, so the RAM usage reported by the cache is incorrect
      - if the same block is needed by another iterator, it will be loaded and decompressed again
      
      This diff changes the reference counting scheme a bit. Previously, if the cache contained the block, this was accounted for in its refcount. After this change, the refcount only tracks external references. There is a boolean flag which indicates whether or not the block is contained in the cache.
      This diff also changes how the LRU list is used. Previously, both the hashtable and the LRU list contained all blocks. After this change, the LRU list contains only blocks with refcount==0, i.e. those which can be evicted from the cache.
      
      Note that this change still allows the cache to grow beyond its capacity. This happens when all blocks are pinned (i.e. refcount>0), which is consistent with the current behavior. The cache's insert function never fails. I spent lots of time trying to make table_reader and other places work with an insert that might fail; it turned out to be pretty hard, and it might really destabilize some customers, so I finally decided against doing it.
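
      A minimal sketch of the new invariant, with simplified handle fields; the real LRUHandle and eviction path have more machinery:

        #include <cstddef>
        #include <list>

        struct LRUHandle {
          int refs = 0;           // external references only; the cache itself
                                  // is no longer counted here
          bool in_cache = false;  // true while the hash table owns the entry
          size_t charge = 0;
        };

        std::list<LRUHandle*> lru_;  // only handles with refs == 0 live here
        size_t usage_ = 0;
        size_t capacity_ = 0;

        // Drop an external reference; once the last one is gone the block
        // becomes evictable and joins the LRU list.
        void Release(LRUHandle* e) {
          if (--e->refs == 0 && e->in_cache) {
            lru_.push_back(e);
          }
        }

        // Evict only unreferenced blocks. Pinned blocks (refs > 0) are never
        // on lru_, so usage_ can legitimately exceed capacity_.
        void EvictIfNeeded() {
          while (usage_ > capacity_ && !lru_.empty()) {
            LRUHandle* old = lru_.front();
            lru_.pop_front();
            old->in_cache = false;
            usage_ -= old->charge;
            delete old;
          }
        }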
      
      The table_cache_remove_scan_count_limit option will be unneeded after this change; I will remove it in a following diff if this one gets approved.
      
      Test Plan: Ran tests, made sure they pass
      
      Reviewers: sdong, ljin
      
      Differential Revision: https://reviews.facebook.net/D25503
    • VersionBuilder to use unordered set and map to store added and deleted files · 0ab0242f
      Committed by sdong
      Summary: Set operations in VersionBuilder show up as a performance bottleneck when restarting a DB with lots of files. Make both added_files and deleted_files use an unordered set or map, and sort the added files only when they are applied to the new version (see the sketch below).
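
      A rough sketch under assumed names; the real sort orders by the internal key comparator, so the file-number comparison below is only a stand-in:

        #include <algorithm>
        #include <cstdint>
        #include <unordered_map>
        #include <unordered_set>
        #include <vector>

        struct FileMetaData {
          uint64_t number;  // stand-in sort key; the real code compares keys
        };

        // Edit replay now costs O(1) per add/delete instead of keeping an
        // ordered container balanced on every operation.
        struct LevelState {
          std::unordered_set<uint64_t> deleted_files;
          std::unordered_map<uint64_t, FileMetaData*> added_files;
        };

        // The sort is paid once, when the new Version is materialized.
        std::vector<FileMetaData*> SortedAddedFiles(const LevelState& s) {
          std::vector<FileMetaData*> files;
          files.reserve(s.added_files.size());
          for (const auto& kv : s.added_files) {
            files.push_back(kv.second);
          }
          std::sort(files.begin(), files.end(),
                    [](const FileMetaData* a, const FileMetaData* b) {
                      return a->number < b->number;
                    });
          return files;
        }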
      
      Test Plan: make all check
      
      Reviewers: yhchiang, rven, igor
      
      Reviewed By: igor
      
      Subscribers: hermanlee4, leveldb, dhruba, ljin
      
      Differential Revision: https://reviews.facebook.net/D30051
    • add range scan test to benchmark script · e93f044d
      Committed by Lei Jin
      Summary: as title
      
      Test Plan: ran it
      
      Reviewers: yhchiang, igor, sdong, MarkCallaghan
      
      Reviewed By: MarkCallaghan
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D25563
  12. 10 Dec 2014 (1 commit)
    • Fix #434 · cb82d7b0
      Committed by Igor Canadi
      Summary: Why do we assert here? This doesn't seem like a user-friendly thing to do :)
      
      Test Plan: none
      
      Reviewers: sdong, yhchiang, rven
      
      Reviewed By: rven
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30027
  13. 09 Dec 2014 (3 commits)
  14. 06 Dec 2014 (1 commit)