1. 26 December 2015, 2 commits
    • support for concurrent adds to memtable · 7d87f027
      Committed by Nathan Bronson
      Summary:
      This diff adds support for concurrent adds to the skiplist memtable
      implementations.  Memory allocation is made thread-safe by the addition of
      a spinlock, with small per-core buffers to avoid contention.  Concurrent
      memtable writes are made via an additional method and don't impose a
      performance overhead on the non-concurrent case, so parallelism can be
      selected on a per-batch basis.
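
      The allocation scheme can be sketched roughly as below. This is purely illustrative (the class and all of its members are hypothetical, not RocksDB's actual concurrent arena): a spinlock-guarded per-core shard that refills from a shared, mutex-protected arena only when it runs dry.

      ```cpp
      // Illustrative sketch only: thread-safe allocation via a spinlock plus
      // small per-core buffers, so most allocations hit a lightly contended shard.
      #include <atomic>
      #include <cstddef>
      #include <functional>
      #include <memory>
      #include <mutex>
      #include <thread>
      #include <vector>

      class ShardedArenaSketch {
       public:
        char* Allocate(size_t bytes) {
          Shard& s = shards_[ShardIndex()];
          SpinGuard guard(s.locked);
          if (s.avail < bytes) {
            Refill(s, bytes);  // slow path: take the shared mutex, grab a block
          }
          char* result = s.cur;
          s.cur += bytes;
          s.avail -= bytes;
          return result;
        }

       private:
        struct Shard {
          std::atomic<bool> locked{false};
          char* cur = nullptr;
          size_t avail = 0;
        };
        struct SpinGuard {
          explicit SpinGuard(std::atomic<bool>& f) : flag(f) {
            while (flag.exchange(true, std::memory_order_acquire)) {
            }
          }
          ~SpinGuard() { flag.store(false, std::memory_order_release); }
          std::atomic<bool>& flag;
        };

        size_t ShardIndex() const {
          return std::hash<std::thread::id>()(std::this_thread::get_id()) % kShards;
        }
        void Refill(Shard& s, size_t bytes) {
          std::lock_guard<std::mutex> lock(shared_mutex_);
          const size_t block = bytes > kBlockSize ? bytes : kBlockSize;
          blocks_.emplace_back(new char[block]);
          s.cur = blocks_.back().get();
          s.avail = block;
        }

        static constexpr size_t kShards = 16;
        static constexpr size_t kBlockSize = 4096;
        Shard shards_[kShards];
        std::mutex shared_mutex_;
        std::vector<std::unique_ptr<char[]>> blocks_;
      };
      ```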
      
      Write thread synchronization is an increasing bottleneck for higher levels
      of concurrency, so this diff adds --enable_write_thread_adaptive_yield
      (default off).  This feature causes threads joining a write batch
      group to spin for a short time (default 100 usec) using sched_yield,
      rather than going to sleep on a mutex.  If the timing of the yield calls
      indicates that another thread has actually run during the yield then
      spinning is avoided.  This option improves performance for concurrent
      situations even without parallel adds, although it has the potential to
      increase CPU usage (and the heuristic adaptation is not yet mature).
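
      A minimal sketch of the spin-then-block idea follows; the constants and the "did another thread actually run?" test are simplified stand-ins for the real heuristic.

      ```cpp
      // Minimal sketch, not the actual RocksDB yield logic: spin with
      // sched_yield() for a bounded window, but stop early if a yield clearly
      // let another thread run.
      #include <sched.h>

      #include <chrono>
      #include <functional>

      bool SpinThenGiveUp(const std::function<bool()>& done,
                          std::chrono::microseconds max_spin =
                              std::chrono::microseconds(100)) {
        using Clock = std::chrono::steady_clock;
        const auto deadline = Clock::now() + max_spin;
        while (Clock::now() < deadline) {
          if (done()) {
            return true;  // the batch-group leader finished our write
          }
          const auto before = Clock::now();
          sched_yield();
          if (Clock::now() - before > std::chrono::microseconds(3)) {
            break;  // another runnable thread got the CPU; stop spinning
          }
        }
        return done();  // caller falls back to sleeping on a mutex/condvar
      }
      ```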
      
      Parallel writes are not currently compatible with
      inplace updates, update callbacks, or delete filtering.
      Enable it with --allow_concurrent_memtable_write (and
      --enable_write_thread_adaptive_yield).  Parallel memtable writes
      are performance neutral when there is no actual parallelism, and in
      my experiments (SSD server-class Linux and varying contention and key
      sizes for fillrandom) they are always a performance win when there is
      more than one thread.
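
      From the C++ API, enabling the feature might look roughly like the sketch below, assuming the DBOptions fields mirror the db_bench flags named above.

      ```cpp
      #include <string>

      #include "rocksdb/db.h"
      #include "rocksdb/memtablerep.h"
      #include "rocksdb/options.h"

      rocksdb::DB* OpenWithConcurrentMemtableWrites(const std::string& path) {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.allow_concurrent_memtable_write = true;     // opt in per DB
        options.enable_write_thread_adaptive_yield = true;  // spin before sleeping
        // Concurrent inserts are only supported by the skiplist memtable.
        options.memtable_factory.reset(new rocksdb::SkipListFactory());

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, path, &db);
        return s.ok() ? db : nullptr;
      }
      ```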
      
      Statistics are updated earlier in the write path, dropping the number
      of DB mutex acquisitions from 2 to 1 for almost all cases.
      
      This diff was motivated and inspired by Yahoo's cLSM work.  It is more
      conservative than cLSM: RocksDB's write batch group leader role is
      preserved (along with all of the existing flush and write throttling
      logic) and concurrent writers are blocked until all memtable insertions
      have completed and the sequence number has been advanced, to preserve
      linearizability.
      
      My test config is "db_bench -benchmarks=fillrandom -threads=$T
      -batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
      -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
      -disable_auto_compactions --max_write_buffer_number=8
      -max_background_flushes=8 --disable_wal --write_buffer_size=160000000
      --block_size=16384 --allow_concurrent_memtable_write" on a two-socket
      Xeon E5-2660 @ 2.2 GHz with lots of memory and an SSD.  With 1
      thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
      -N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
      across both sockets happens at 30 threads, and is ~900Kops/sec, although
      with fewer threads there is less performance loss when the system has
      background work.
      
      Test Plan:
      1. concurrent stress tests for InlineSkipList and DynamicBloom
      2. make clean; make check
      3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
      4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
      5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
      6. make clean; OPT=-DROCKSDB_LITE make check
      7. verify no perf regressions when disabled
      
      Reviewers: igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
      
      Differential Revision: https://reviews.facebook.net/D50589
    • DBTest.HardLimit use special memtable · 5b2587b5
      Committed by sdong
      Summary: DBTest.HardLimit fails in the AppVeyor build. Use a special memtable to make the test behavior depend less on the platform.
      
      Test Plan: Run the test with JEMALLOC both on and off.
      
      Reviewers: yhchiang, kradhakrishnan, rven, anthony, IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D52317
  2. 24 December 2015, 13 commits
  3. 23 December 2015, 5 commits
  4. 22 December 2015, 6 commits
    • Fix computation of size of last sub-compaction · 728f944f
      Committed by Zhipeng Jia
    • Merge pull request #863 from zhangyybuaa/fix_hdfs_error · 8ac7fb83
      Committed by Igor Canadi
      Fix build error with HDFS
    • Merge pull request #894 from zhipeng-jia/develop · e53e8219
      Committed by Igor Canadi
      Sorting std::vector instead of using std::set
    • Sorting std::vector instead of using std::set · e0abec15
      Committed by Zhipeng Jia
    • Add call to install superversion and schedule work in EnableAutoCompactions · 33e09c0e
      Committed by Alex Yang
      Summary:
      This patch fixes https://github.com/facebook/mysql-5.6/issues/121
      
      There is a recent change in RocksDB to disable auto compactions on startup: https://reviews.facebook.net/D51147. However, there is a small timing window where a column family needs to be compacted and schedules a compaction, but the scheduled compaction fails when it checks the disable_auto_compactions setting. The expectation is that once the application is ready, it will call EnableAutoCompactions() to allow new compactions to go through. However, if the column family is stalled because L0 is full and no writes can go through, it is possible the column family will never get a new compaction request scheduled. EnableAutoCompaction() should probably schedule a new flush and compaction event when it resets disable_auto_compactions.
      
      Using InstallSuperVersionAndScheduleWork, we call SchedulePendingFlush,
      SchedulePendingCompaction, as well as MaybeScheduleFlushOrCompaction on all the
      column families to avoid the situation above.
      
      This is still a first pass for feedback.
      We could also just call SchedulePendingFlush and SchedulePendingCompaction directly; a rough sketch of the resulting call path is below.
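
      A hedged sketch of the call path from the public API; the helper function is mine, and the routing described in the comments paraphrases the summary above rather than quoting the patch.

      ```cpp
      #include <vector>

      #include "rocksdb/db.h"

      // Re-enable auto compactions once the application has finished startup.
      // With this fix the call also schedules pending flush/compaction work, so
      // a column family stalled on a full L0 does not stay stuck waiting for a
      // write to arrive.
      rocksdb::Status ReenableAutoCompactions(
          rocksdb::DB* db,
          const std::vector<rocksdb::ColumnFamilyHandle*>& handles) {
        // Roughly equivalent to SetOptions(handle, {{"disable_auto_compactions",
        // "false"}}) per column family; the patch routes that through
        // InstallSuperVersionAndScheduleWork, which calls SchedulePendingFlush,
        // SchedulePendingCompaction, and MaybeScheduleFlushOrCompaction.
        return db->EnableAutoCompaction(handles);
      }
      ```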
      
      Test Plan:
      Run on Asan build
      cd _build-5.6-ASan/ && ./mysql-test/mtr --mem --big --testcase-timeout=36000 --suite-timeout=12000 --parallel=16 --suite=rocksdb,rocksdb_rpl,rocksdb_sys_vars --mysqld=--default-storage-engine=rocksdb --mysqld=--skip-innodb --mysqld=--default-tmp-storage-engine=MyISAM --mysqld=--rocksdb rocksdb_rpl.rpl_rocksdb_stress_crash --repeat=1000
      
      Ensure that it no longer hangs during the test.
      
      Reviewers: hermanlee4, yhchiang, anthony
      
      Reviewed By: anthony
      
      Subscribers: leveldb, yhchiang, dhruba
      
      Differential Revision: https://reviews.facebook.net/D51747
    • Merge pull request #893 from zhipeng-jia/develop · 22c6b50e
      Committed by Siying Dong
      Fix clang warning regarding implicit conversion
  5. 21 December 2015, 1 commit
  6. 19 December 2015, 3 commits
  7. 18 December 2015, 7 commits
    • Fix use-after-free in db_bench · a4838239
      Committed by Nathan Bronson
      Test Plan: valgrind db_bench
      
      Reviewers: igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D52101
    • Merge pull request #890 from zhipeng-jia/develop · bf8ffc1d
      Committed by Igor Canadi
      Fix typo: sr to picking_sr
    • Fix typo: sr to picking_sr · 131f7ddf
      Committed by Zhipeng Jia
    • db_bench: --soft_pending_compaction_bytes_limit should set options.soft_pending_compaction_bytes_limit · c37729a6
      Committed by sdong
      
      Summary: Fix a bug where options.soft_pending_compaction_bytes_limit was not actually set by --soft_pending_compaction_bytes_limit.
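
      A hedged sketch of the kind of wiring the fix restores in db_bench; the helper function and its placement are illustrative assumptions, and only the flag and the option field come from the summary.

      ```cpp
      #include <gflags/gflags.h>

      #include "rocksdb/options.h"

      // The gflag is defined in db_bench; its exact type here is an assumption.
      DECLARE_uint64(soft_pending_compaction_bytes_limit);

      rocksdb::Options ApplyCompactionPressureFlags(rocksdb::Options options) {
        // Previously the flag was parsed but never copied into Options, so the
        // limit silently stayed at its default value.
        options.soft_pending_compaction_bytes_limit =
            FLAGS_soft_pending_compaction_bytes_limit;
        return options;
      }
      ```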
      
      Test Plan: Run db_bench with this parameter and make sure the parameter is set correctly.
      
      Reviewers: anthony, kradhakrishnan, yhchiang, IslamAbdelRahman, igor, rven
      
      Reviewed By: rven
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D52125
    • Add SignalAll after removing item from manual_compaction deque · 7b12ae97
      Committed by Venkatesh Radhakrishnan
      Summary:
      When there are waiting manual compactions, we need to signal
      them after removing the current manual compaction from the deque.
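
      A generic, hedged illustration of the pattern using standard-library stand-ins rather than RocksDB's actual members:

      ```cpp
      #include <condition_variable>
      #include <deque>
      #include <mutex>

      // Once a finished request leaves the deque, wake all waiters so queued
      // manual compactions re-check whether they can now run.
      struct ManualCompactionQueue {
        std::mutex mu;
        std::condition_variable cv;
        std::deque<int> pending;  // placeholder for ManualCompaction entries

        void FinishFront() {
          std::lock_guard<std::mutex> lock(mu);
          if (!pending.empty()) {
            pending.pop_front();  // the compaction that just completed
          }
          cv.notify_all();  // without this signal, waiting requests could hang
        }
      };
      ```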
      
      Test Plan: ColumnFamilyTest.SameCFManualManualCompaction
      
      Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, yoshinorim
      
      Differential Revision: https://reviews.facebook.net/D52119
    • Slow down when writing to the last write buffer · d72b3177
      Committed by sdong
      Summary: Today, if inserting into the memtable is much faster than writing to files, there is no mechanism users can rely on to avoid a write stop once options.max_write_buffer_number is reached. With this commit, if more than four write buffers are configured, we slow writes down to options.delayed_write_rate once we reach the last one.
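
      A hedged sketch of the trigger condition; the threshold and names are illustrative paraphrases of the summary, not the real write-controller code.

      ```cpp
      // Returns true when writes should be delayed to options.delayed_write_rate
      // instead of being stopped outright. Thresholds are illustrative.
      bool ShouldSlowDownForMemtables(int num_unflushed_memtables,
                                      int max_write_buffer_number) {
        const bool enough_buffers_configured = max_write_buffer_number > 4;
        const bool filling_last_buffer =
            num_unflushed_memtables >= max_write_buffer_number - 1;
        return enough_buffers_configured && filling_last_buffer;
      }
      ```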
      
      Test Plan:
      1. Add a new unit test.
      2. Run db_bench with
      
      ./db_bench --benchmarks=fillrandom --num=10000000 --max_background_flushes=6 --batch_size=32 -max_write_buffer_number=4 --delayed_write_rate=500000 --statistics
      
      based on hard drive and see stopping is avoided with the commit.
      
      Reviewers: yhchiang, IslamAbdelRahman, anthony, rven, kradhakrishnan, igor
      
      Reviewed By: igor
      
      Subscribers: MarkCallaghan, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D52047
    • Add documentation for unschedFunction · 6b2a3ac9
      Committed by Venkatesh Radhakrishnan
      Summary:
      Documenting the unschedFunction parameter to Schedule as
      requested by Michael Kolupaev.
      
      Test Plan: build, unit test
      
      Reviewers: sdong, IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: kolmike, dhruba
      
      Differential Revision: https://reviews.facebook.net/D52089
  8. 17 December 2015, 3 commits
    • ZSTD to use CompressionOptions.level · 167fb919
      Committed by sdong
      Summary: ZSTD currently hard-codes compression level 1. Change it to use the level from CompressionOptions.
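
      A hedged sketch of the shape of the change, not RocksDB's actual compression helper; it assumes only the public ZSTD API and CompressionOptions::level.

      ```cpp
      #include <zstd.h>

      #include <string>

      #include "rocksdb/options.h"

      // Compress `input` with ZSTD at the configured level instead of a literal 1.
      bool ZstdCompressSketch(const char* input, size_t length,
                              const rocksdb::CompressionOptions& opts,
                              std::string* output) {
        const size_t bound = ZSTD_compressBound(length);
        output->resize(bound);
        const size_t outlen =
            ZSTD_compress(&(*output)[0], bound, input, length,
                          opts.level /* was hard-coded to 1 */);
        if (ZSTD_isError(outlen)) {
          return false;
        }
        output->resize(outlen);
        return true;
      }
      ```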
      
      Test Plan: Run it with a hacked sst_dump and show ZSTD compressed sizes at different levels.
      
      Reviewers: rven, anthony, yhchiang, kradhakrishnan, igor, IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: yoshinorim, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D52041
    • Bump version to 4.4 · 32ff05e9
      Committed by Islam AbdelRahman
      Summary: Bump version to 4.4
      
      Test Plan: none
      
      Reviewers: sdong, rven, yhchiang, anthony, kradhakrishnan
      
      Reviewed By: kradhakrishnan
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D52035
    • Introduce ReadOptions::pin_data (support zero copy for keys) · aececc20
      Committed by Islam AbdelRahman
      Summary:
      This patch updates the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted.

      ReadOptions::pin_data : If true, keep loaded blocks in memory as long as the iterator is not deleted.
      Iterator::IsKeyPinned() : If true, the Slice returned by key() is valid as long as the iterator is not deleted.

      Also add a new option, BlockBasedTableOptions::use_delta_encoding, to allow users to disable delta encoding if needed.
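
      A hedged usage sketch; the helper and its prefix logic are mine, while ReadOptions::pin_data and Iterator::IsKeyPinned() are the APIs introduced above.

      ```cpp
      #include <memory>
      #include <vector>

      #include "rocksdb/db.h"
      #include "rocksdb/slice.h"

      // Iterate a key range while holding zero-copy references to pinned keys.
      size_t CountKeysWithPrefix(rocksdb::DB* db, const rocksdb::Slice& prefix) {
        rocksdb::ReadOptions ro;
        ro.pin_data = true;  // keep data blocks pinned while the iterator lives

        std::unique_ptr<rocksdb::Iterator> iter(db->NewIterator(ro));
        std::vector<rocksdb::Slice> pinned_keys;  // references into pinned blocks
        size_t count = 0;
        for (iter->Seek(prefix); iter->Valid() && iter->key().starts_with(prefix);
             iter->Next()) {
          ++count;
          if (iter->IsKeyPinned()) {
            pinned_keys.push_back(iter->key());  // valid until `iter` is destroyed
          }
        }
        // `pinned_keys` may be inspected here, but must not outlive `iter`.
        return count;
      }
      ```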
      
      Benchmark results (using https://phabricator.fb.com/P20083553)
      
      ```
      // $ du -h /home/tec/local/normal.4K.Snappy/db10077
      // 6.1G    /home/tec/local/normal.4K.Snappy/db10077
      
      // $ du -h /home/tec/local/zero.8K.LZ4/db10077
      // 6.4G    /home/tec/local/zero.8K.LZ4/db10077
      
      // Benchmarks for shard db10077
      // _build/opt/rocks/benchmark/rocks_copy_benchmark \
      //      --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
      //      --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"
      
      // First run
      // ============================================================================
      // rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
      // ============================================================================
      // BM_StringCopy                                                 1.73s  576.97m
      // BM_StringPiece                                   103.74%      1.67s  598.55m
      // ============================================================================
      // Match rate : 1000000 / 1000000
      
      // Second run
      // ============================================================================
      // rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
      // ============================================================================
      // BM_StringCopy                                              611.99ms     1.63
      // BM_StringPiece                                   203.76%   300.35ms     3.33
      // ============================================================================
      // Match rate : 1000000 / 1000000
      ```
      
      Test Plan: Unit tests
      
      Reviewers: sdong, igor, anthony, yhchiang, rven
      
      Reviewed By: rven
      
      Subscribers: dhruba, lovro, adsharma
      
      Differential Revision: https://reviews.facebook.net/D48999