Commits · fce5994603c7e32021ef614b4afb47e057b8a248 · kvdb / rocksdb

08 Nov, 2018 1 commit

Add more sync point to fix flaky test GroupCommitTest · fce59946

Committed by Zhongyi Xie on Nov 07, 2018

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4637

Differential Revision: D12963727

Pulled By: miasantreble

fbshipit-source-id: 76053501afbecc6ef388ddc56542fa0185243e3f

fce59946

10 Oct, 2018 1 commit

Handle mixed slowdown/no_slowdown writer properly (#4475) · 854a4be0

Committed by Anand Ananthabhotla on Oct 09, 2018

Summary:
There is a bug when the write queue leader is blocked on a write
delay/stop, and the queue has writers with WriteOptions::no_slowdown set
to true. They are not woken up until the write stall is cleared.

The fix introduces a dummy writer inserted at the tail to indicate a
write stall and prevent further inserts into the queue, and a condition
variable that writers who can tolerate slowdown wait on before adding
themselves to the queue. The leader calls WriteThread::BeginWriteStall()
to add the dummy writer and then walk the queue to fail any writers with
no_slowdown set. Once the stall clears, the leader calls
WriteThread::EndWriteStall() to remove the dummy writer and signal the
condition variable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4475

Differential Revision: D10285827

Pulled By: anand1976

fbshipit-source-id: 747465e5e7f07a829b1fb0bc1afcd7b93f4ab1a9

854a4be0
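
The stall handling described in 854a4be0 can be sketched roughly as follows. This is a simplified, single-threaded model and not RocksDB's `WriteThread`: the `StallQueue` class, its members, and the `failed_incomplete` flag are hypothetical names, a boolean stands in for the dummy tail writer, and a "parked" list stands in for writers blocked on the condition variable.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical writer record; failed_incomplete stands in for a failed status.
struct Writer {
  std::string name;
  bool no_slowdown;
  bool failed_incomplete;
  explicit Writer(std::string n, bool ns = false)
      : name(n), no_slowdown(ns), failed_incomplete(false) {}
};

class StallQueue {
 public:
  // Returns false if the queue is stalled and the writer refuses to wait.
  bool Enqueue(Writer* w) {
    if (stalled_) {
      if (w->no_slowdown) {
        w->failed_incomplete = true;
        return false;
      }
      // A real writer would block on a condition variable here.
      parked_.push_back(w);
      return true;
    }
    queue_.push_back(w);
    return true;
  }

  // Leader hits a stall: block new inserts, fail queued no_slowdown writers.
  void BeginWriteStall() {
    stalled_ = true;
    std::deque<Writer*> kept;
    for (Writer* w : queue_) {
      if (w->no_slowdown) {
        w->failed_incomplete = true;
      } else {
        kept.push_back(w);
      }
    }
    queue_.swap(kept);
  }

  // Stall cleared: admit the parked writers (the "signal" step).
  void EndWriteStall() {
    stalled_ = false;
    for (Writer* w : parked_) queue_.push_back(w);
    parked_.clear();
  }

  std::size_t size() const { return queue_.size(); }

 private:
  bool stalled_ = false;  // plays the role of the dummy tail writer
  std::deque<Writer*> queue_;
  std::deque<Writer*> parked_;
};
```

In this model a writer with `no_slowdown` set fails right away, whether it was already queued when the stall began or arrives during the stall, while tolerant writers are only admitted once `EndWriteStall()` runs.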

18 Jul, 2018 1 commit

Fix write get stuck when pipelined write is enabled (#4143) · d538ebdf

Committed by Yi Wu on Jul 17, 2018

Summary:
Fix an issue where, with pipelined write enabled, writers can get stuck indefinitely and never finish the write. The following example shows it. Assume there are 4 writers W1, W2, W3, W4 (W1 is the first, W4 is the last).

T1: All writers are pending in the WAL writer queue:
WAL writer queue: W1, W2, W3, W4
memtable writer queue: empty

T2: W1 finishes its WAL write and moves to the memtable writer queue:
WAL writer queue: W2, W3, W4
memtable writer queue: W1

T3: W2 and W3 finish their WAL write as a batch group. W2 enters ExitAsBatchGroupLeader and moves the group to the memtable writer queue, but has not yet woken up the next leader.
WAL writer queue: W4
memtable writer queue: W1, W2, W3

T4: W1, W2, W3 finish the memtable write as a batch group. Note that W2 is still in the previous ExitAsBatchGroupLeader, although W1 has done the memtable write for W2.
WAL writer queue: W4
memtable writer queue: empty

T5: The thread corresponding to W3 creates another writer W3' with the same address as W3.
WAL writer queue: W4, W3'
memtable writer queue: empty

T6: W2 continues with ExitAsBatchGroupLeader. Because the address of W3' is the same as that of W3, the last writer in its group, it thinks there are no pending writers, so it resets newest_writer_ to null, emptying the queue. W4 and W3' are removed from the queue and will never be woken up.

The issue has existed since pipelined write was introduced in 5.5.0.

Closes #3704
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4143

Differential Revision: D8871599

Pulled By: yiwu-arbug

fbshipit-source-id: 3502674e51066a954a0660257e24ac588f815e2a

d538ebdf
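
The address-reuse hazard in steps T5 and T6 of d538ebdf is an instance of the ABA problem. The sketch below reproduces it in isolation: a writer is destroyed and a new one is constructed in the same storage, so an address-only comparison cannot tell them apart. The `Writer` type, its `id` field, and the helper names are hypothetical, and the identity check shown here is only one way to detect the reuse, not necessarily how the commit resolves it.

```cpp
#include <cassert>
#include <cstdint>
#include <new>

// Hypothetical writer type; the id field exists only for this sketch.
struct Writer {
  std::uint64_t id;
  explicit Writer(std::uint64_t i) : id(i) {}
};

// Address-only comparison, as in the scenario from the commit message.
inline bool SameByAddress(const Writer* newest, const Writer* last_in_group) {
  return newest == last_in_group;
}

// Also comparing a remembered identity tells the two writers apart.
inline bool SameByAddressAndId(const Writer* newest,
                               const Writer* last_in_group,
                               std::uint64_t last_id) {
  return newest == last_in_group && newest->id == last_id;
}

// Builds W3 and then W3' in the same storage slot.
// Returns true if the address check is fooled and the id check is not.
inline bool DemonstrateAddressReuse() {
  alignas(Writer) unsigned char slot[sizeof(Writer)];
  Writer* w3 = new (slot) Writer(3);
  const Writer* remembered = w3;
  const std::uint64_t remembered_id = w3->id;
  w3->~Writer();
  Writer* w3_prime = new (slot) Writer(30);  // W3', same address as W3
  const bool fooled = SameByAddress(w3_prime, remembered);
  const bool detected =
      !SameByAddressAndId(w3_prime, remembered, remembered_id);
  w3_prime->~Writer();
  return fooled && detected;
}
```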

23 May, 2018 1 commit

Avoid sleep in DBTest.GroupCommitTest to fix flakiness · 7db721b9

Committed by Andrew Kryczka on May 22, 2018

Summary:
DBTest.GroupCommitTest would often fail when run under valgrind because its sleeps were insufficient to guarantee a group commit had multiple entries. Instead we can use sync point to force a leader to wait until a non-leader thread has enqueued its work, thus guaranteeing a leader can do group commit work for multiple threads.
Closes https://github.com/facebook/rocksdb/pull/3883

Differential Revision: D8079429

Pulled By: ajkr

fbshipit-source-id: 61dc50fad29d2c85547842f681288de60fa29049

7db721b9
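
The idea in 7db721b9 of replacing sleeps with sync points can be illustrated with a minimal stand-in. `MiniSyncPoint` below is a hypothetical class written for this sketch (RocksDB's own sync point utility is richer and has a different interface), and the point names are made up. The leader blocks until the follower has enqueued its work, so the group is guaranteed to contain two entries regardless of scheduling.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Hypothetical miniature sync point: named events that threads can wait on.
class MiniSyncPoint {
 public:
  void Reach(const std::string& name) {
    std::lock_guard<std::mutex> lock(mu_);
    reached_.insert(name);
    cv_.notify_all();
  }
  void WaitFor(const std::string& name) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [&] { return reached_.count(name) > 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::set<std::string> reached_;
};

// Runs one leader and one follower; returns the group size the leader sees.
inline std::size_t RunGroupCommitOnce() {
  MiniSyncPoint sync;
  std::mutex queue_mu;
  std::vector<int> queue;
  std::size_t group_size = 0;

  std::thread leader([&] {
    {
      std::lock_guard<std::mutex> l(queue_mu);
      queue.push_back(1);
    }
    sync.Reach("leader:enqueued");
    sync.WaitFor("follower:enqueued");  // no sleep: wait for the event
    std::lock_guard<std::mutex> l(queue_mu);
    group_size = queue.size();
  });
  std::thread follower([&] {
    sync.WaitFor("leader:enqueued");
    {
      std::lock_guard<std::mutex> l(queue_mu);
      queue.push_back(2);
    }
    sync.Reach("follower:enqueued");
  });
  leader.join();
  follower.join();
  return group_size;
}
```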

24 Apr, 2018 1 commit

Improve write time breakdown stats · affe01b0

Committed by Mike Kolupaev on Apr 23, 2018

Summary:
There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes.

These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing.
Closes https://github.com/facebook/rocksdb/pull/3602

Differential Revision: D7251562

Pulled By: al13n321

fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d

affe01b0

06 Mar, 2018 1 commit

Comment out unused variables · 5d68243e

Committed by Andrew Kryczka on Mar 05, 2018

Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e

5d68243e

23 Feb, 2018 2 commits

Back out "[codemod] - comment out unused parameters" · aba34097

Committed by Igor Sugak on Feb 22, 2018

Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37

aba34097

- comment out unused parameters · f4a030ce

Committed by David Lai on Feb 22, 2018

Reviewed By: everiq, igorsugak

Differential Revision: D7046710

fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461

f4a030ce

29 Nov, 2017 1 commit

Fix IOError on WAL write doesn't propagate to write group follower · 3cf562be

Committed by Yi Wu on Nov 28, 2017

Summary:
This is a simpler version of #3097 by removing all unrelated changes.

This fixes a bug where concurrent writes may return Status::OK() even though the WAL write actually failed with an IOError. This happens when multiple writes form a write batch group and the leader gets an IOError while writing to the WAL. The leader failed to pass the error to the followers in the group, so the followers ended up returning Status::OK() while actually writing nothing. The bug only affects writes in a batch group. Future writes after the batch group will correctly return immediately with the IOError.
Closes https://github.com/facebook/rocksdb/pull/3201

Differential Revision: D6421644

Pulled By: yiwu-arbug

fbshipit-source-id: 1c2a455c5b73f6842423785eb8a9dbfbb191dc0e

3cf562be
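
The propagation requirement from 3cf562be can be stated in a few lines. This is a sketch with hypothetical stand-ins (`Code`, `Writer`, `FinishGroup`) rather than RocksDB's `Status` and `WriteThread::Writer`: the leader performs the WAL write once for the whole group, so its result has to be copied to every member, followers included.

```cpp
#include <cassert>
#include <vector>

// Hypothetical status codes standing in for a real status type.
enum class Code { kOk, kIOError };

// Hypothetical writer; each one reports its own status to its caller.
struct Writer {
  Code status;
  Writer() : status(Code::kOk) {}
};

// Copies the leader's WAL result to every writer in the group.
// group[0] is assumed to be the leader.
inline void FinishGroup(std::vector<Writer*>& group, Code wal_result) {
  assert(!group.empty());
  for (Writer* w : group) {
    w->status = wal_result;
  }
}
```

Without the loop over the followers, each follower would keep its initial `kOk` value, which corresponds to the bug described above.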

29 Sep, 2017 1 commit

WritePrepared Txn: Recovery · 385049ba

Committed by Maysam Yabandeh on Sep 28, 2017

Summary:
Recover txns from the WAL. Also added some unit tests.
Closes https://github.com/facebook/rocksdb/pull/2901

Differential Revision: D5859596

Pulled By: maysamyabandeh

fbshipit-source-id: 6424967b231388093b4effffe0a3b1b7ec8caeb0

385049ba

04 Aug, 2017 1 commit

Fix the overflow bug in AwaitState · 58410aee

Committed by Maysam Yabandeh on Aug 03, 2017

Summary:
https://github.com/facebook/rocksdb/issues/2559 reports an overflow in AwaitState. nbronson has debugged the issue and presented the fix, which is applied to this patch. Moreover this patch adds more comments to clarify the logic in AwaitState.

I tried with both 16 and 64 threads on update benchmark. The fix lowers cpu usage by 1.6 but also lowers the throughput by 1.6 and 2% respectively. Apparently the bug had favored using the spinning more often.

Benchmarks:
TEST_TMPDIR=/dev/shm/tmpdb time ./db_bench --benchmarks="fillrandom" --threads=16 --num=2000000
TEST_TMPDIR=/dev/shm/tmpdb time ./db_bench --use_existing_db=1 --benchmarks="updaterandom[X3]" --threads=16 --num=2000000
TEST_TMPDIR=/dev/shm/tmpdb time ./db_bench --use_existing_db=1 --benchmarks="updaterandom[X3]" --threads=64 --num=200000

Results
$ cat update-16t-bug.txt | tail -4
updaterandom [AVG    3 runs] : 234117 ops/sec;   51.8 MB/sec
updaterandom [MEDIAN 3 runs] : 233581 ops/sec;   51.7 MB/sec
3896.42user 1539.12system 6:50.61elapsed 1323%CPU (0avgtext+0avgdata 331308maxresident)k
0inputs+0outputs (0major+1281001minor)pagefaults 0swaps
$ cat update-16t-fixed.txt | tail -4
updaterandom [AVG    3 runs] : 230364 ops/sec;   51.0 MB/sec
updaterandom [MEDIAN 3 runs] : 226169 ops/sec;   50.0 MB/sec
3865.46user 1568.32system 6:57.63elapsed 1301%CPU (0avgtext+0avgdata 315012maxresident)k
0inputs+0outputs (0major+1342568minor)pagefaults 0swaps

$ cat update-64t-bug.txt | tail -4
updaterandom [AVG    3 runs] : 261878 ops/sec;   57.9 MB/sec
updaterandom [MEDIAN 3 runs] : 262859 ops/sec;   58.2 MB/sec
926.27user 578.06system 2:27.46elapsed 1020%CPU (0avgtext+0avgdata 475480maxresident)k
0inputs+0outputs (0major+1058728minor)pagefaults 0swaps
$ cat update-64t-fixed.txt | tail -4
updaterandom [AVG    3 runs] : 256699 ops/sec;   56.8 MB/sec
updaterandom [MEDIAN 3 runs] : 256380 ops/sec;   56.7 MB/sec
933.47user 575.37system 2:30.41elapsed 1003%CPU (0avgtext+0avgdata 482340maxresident)k
0inputs+0outputs (0major+1078557minor)pagefaults 0swaps
Closes https://github.com/facebook/rocksdb/pull/2679

Differential Revision: D5553732

Pulled By: maysamyabandeh

fbshipit-source-id: 98b72dc3a8e0f22ea29d4f7c7790af10c369c5bb

58410aee

26 Jul, 2017 1 commit

Fix flaky write_callback_test · fe1a5559

Committed by Yi Wu on Jul 25, 2017

Summary:
The test fails occasionally on the assert `ASSERT_TRUE(writer->state == WriteThread::State::STATE_INIT)`. This is because the test doesn't make the leader wait long enough before updating the state of its followers. The patch moves the update to `threads_waiting` to the end of the `WriteThread::JoinBatchGroup:Wait` callback to avoid this.

Also adds `WriteThread::JoinBatchGroup:Start` and has each thread wait there while another thread is linking itself into the linked list. This makes the check of `is_leader` more deterministic.

Also changes two while-loops of `compare_exchange_strong` to plain `fetch_add`, to make the code cleaner.
Closes https://github.com/facebook/rocksdb/pull/2640

Differential Revision: D5491525

Pulled By: yiwu-arbug

fbshipit-source-id: 6e897f122082bd6f98e6d51b31a25e5fd0a3fb82

fe1a5559
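
The last change in fe1a5559, replacing `compare_exchange_strong` loops with `fetch_add`, relies on the two forms being equivalent for a plain increment. A minimal sketch of both, with hypothetical function names:

```cpp
#include <atomic>
#include <cassert>

// Increment via a compare-exchange retry loop; returns the previous value.
inline int IncrementWithCas(std::atomic<int>& counter) {
  int expected = counter.load();
  while (!counter.compare_exchange_strong(expected, expected + 1)) {
    // On failure, expected now holds the current value; retry with it.
  }
  return expected;
}

// Same effect with a single atomic read-modify-write call.
inline int IncrementWithFetchAdd(std::atomic<int>& counter) {
  return counter.fetch_add(1);
}
```

Both functions return the value seen before the increment, so callers that use the result do not need to change.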

22 Jul, 2017 2 commits

Revert "comment out unused parameters" · 72502cf2

Committed by Sagar Vemuri on Jul 21, 2017

Summary:
This reverts the previous commit 1d7048c5, which broke the build.

Did a `git revert 1d7048c5`.
Closes https://github.com/facebook/rocksdb/pull/2627

Differential Revision: D5476473

Pulled By: sagar0

fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06

72502cf2

comment out unused parameters · 1d7048c5

Committed by Victor Gao on Jul 21, 2017

Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.

Reviewed By: igorsugak

Differential Revision: D5454343

fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2

1d7048c5

19 Jul, 2017 1 commit

Moving static AdaptationContext to outside function · 36651d14

Committed by Maysam Yabandeh on Jul 18, 2017

Summary:
Moving static AdaptationContext to outside function to bypass tsan's false report with static initializers.

This is because, with optimization enabled, std::atomic is simplified to a simple read with no locks. The existing lock produced by the static initializer is __cxa_guard_acquire, which is apparently not understood by tsan as it is different from normal locks (__gthrw_pthread_mutex_lock).

This is a known problem with tsan:
https://stackoverflow.com/questions/27464190/gccs-tsan-reports-a-data-race-with-a-thread-safe-static-local
https://stackoverflow.com/questions/42062557/c-multithreading-is-initialization-of-a-local-static-lambda-thread-safe

A workaround that I tried was to move the static variable outside the function. It is not a good coding practice since it gives global visibility to variable but it is a hackish workaround until g++ tsan is improved.
Closes https://github.com/facebook/rocksdb/pull/2598

Differential Revision: D5445281

Pulled By: yiwu-arbug

fbshipit-source-id: 6142bd934eb5852d8fd7ce027af593ba697ed41d

36651d14

16 Jul, 2017 1 commit

Change RocksDB License · 3c327ac2

Committed by Siying Dong on Jul 15, 2017

Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75

3c327ac2

20 May, 2017 1 commit

New WriteImpl to pipeline WAL/memtable write · 07bdcb91

Committed by Yi Wu on May 19, 2017

Summary:
PipelineWriteImpl is an alternative approach to WriteImpl. In WriteImpl, only one thread is allowed to write at a time. This thread does both the WAL and memtable writes for all write threads in the write group. Pending writers wait in the queue until the current writer finishes. In the pipelined write approach, two queues are maintained: one WAL writer queue and one memtable writer queue. All writers (regardless of whether they need to write the WAL) still need to join the WAL writer queue first, and after the housekeeping work and WAL writing, they need to join the memtable writer queue if needed. The benefits of this approach are:
1. Writers without memtable writes (e.g. the prepare phase of two-phase commit) can exit the write thread once the WAL write is finished. They don't need to wait for memtable writes in case of group commit.
2. Pending writers only need to wait for the previous WAL writer to finish before they can join the write thread, instead of also waiting for previous memtable writes.

Merging #2056 and #2058 into this PR.
Closes https://github.com/facebook/rocksdb/pull/2286

Differential Revision: D5054606

Pulled By: yiwu-arbug

fbshipit-source-id: ee5b11efd19d3e39d6b7210937b11cefdd4d1c8d

07bdcb91

17 May, 2017 1 commit

fixed typo · f720796e

Committed by hyunwoo on May 16, 2017

Summary:
fixed exisitng -> existing
Closes https://github.com/facebook/rocksdb/pull/2305

Differential Revision: D5070169

Pulled By: yiwu-arbug

fbshipit-source-id: 8c8450acf50757b767cf78b78314018395738d96

f720796e

28 Apr, 2017 1 commit

Add GPLv2 as an alternative license. · d616ebea

Committed by Siying Dong on Apr 27, 2017

Summary: Closes https://github.com/facebook/rocksdb/pull/2226

Differential Revision: D4967547

Pulled By: siying

fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4

d616ebea

14 Apr, 2017 1 commit

Simplify write thread logic · e9e6e532

Committed by Yi Wu on Apr 13, 2017

Summary:
The concept of early exit in the write thread implementation is a confusing one. It means that if early exit is allowed, the batch group leader is not responsible for exiting the batch group; the last finished writer is. In case we need to mark the log synced, or encounter a memtable insert error, early exit is disallowed.

This patch removes the concept:
* In all cases, the last finished writer (not necessarily the leader) is responsible for exiting the batch group.
* In case of parallel memtable write, the leader also marks the log synced after its memtable insert and before signaling finish (calling `CompleteParallelWorker()`). The purpose is to allow marking the log synced (which requires locking the mutex) to run in parallel with memtable inserts in other writers.
* The last finished writer should handle a memtable insert error (update bg_error_) before exiting the batch group.
Closes https://github.com/facebook/rocksdb/pull/2134

Differential Revision: D4869667

Pulled By: yiwu-arbug

fbshipit-source-id: aec170847c85b90f4179d6a4608a4fe1361544e3

e9e6e532
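
The rule from e9e6e532 that the last finished writer exits the batch group can be sketched with an atomic countdown. `ParallelGroup` and `CompleteWorker` are hypothetical names for this illustration, and the real group state holds more than a counter.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical group state: the number of writers still inserting.
struct ParallelGroup {
  std::atomic<int> running;
  explicit ParallelGroup(int n) : running(n) {}
};

// Called by each writer when its memtable insert is done.
// Returns true only for the writer that finishes last; that writer is the
// one responsible for exiting the batch group.
inline bool CompleteWorker(ParallelGroup& group) {
  const int before = group.running.fetch_sub(1);
  assert(before > 0);
  return before == 1;
}
```

Because `fetch_sub` returns the value before the decrement, exactly one caller observes `1`, whichever writer that happens to be, so the leader needs no special treatment.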

11 Apr, 2017 1 commit

Adding comments to the write path · 20778f2f

Committed by Maysam Yabandeh on Apr 10, 2017

Summary:
also did minor refactoring
Closes https://github.com/facebook/rocksdb/pull/2115

Differential Revision: D4855818

Pulled By: maysamyabandeh

fbshipit-source-id: fbca6ac57e5c6677fffe8354f7291e596a50cb77

20778f2f

07 Apr, 2017 1 commit

Move memtable related files into memtable directory · df6f5a37

Committed by Yi Wu on Apr 06, 2017

Summary:
Move memtable related files into memtable directory.
Closes https://github.com/facebook/rocksdb/pull/2087

Differential Revision: D4829242

Pulled By: yiwu-arbug

fbshipit-source-id: ca70ab6

df6f5a37

05 Apr, 2017 1 commit

Refactor WriteImpl (pipeline write part 1) · 9e445318

Committed by Yi Wu on Apr 04, 2017

Summary:
Refactor WriteImpl() so that when I plug in the pipeline write code (which is
an alternative approach for WriteThread), some of the logic can be
reused. I split out the following methods from WriteImpl():

* PreprocessWrite()
* HandleWALFull() (previously MaybeFlushColumnFamilies())
* HandleWriteBufferFull()
* WriteToWAL()

Also adding a constructor to WriteThread::Writer, and move WriteContext into db_impl.h.
No real logic change in this patch.
Closes https://github.com/facebook/rocksdb/pull/2042

Differential Revision: D4781014

Pulled By: yiwu-arbug

fbshipit-source-id: d45ca18

9e445318

18 Jan, 2017 1 commit

Fix 2PC with concurrent memtable insert · 77b48066

Committed by Siying Dong on Jan 17, 2017

Summary:
If concurrent memtable insert is enabled, and one prepare command and a normal command are grouped into a commit group, the sequence ID will be calculated incorrectly.
Closes https://github.com/facebook/rocksdb/pull/1730

Differential Revision: D4371081

Pulled By: siying

fbshipit-source-id: cd40c6d

77b48066

02 Dec, 2016 1 commit

Bug: paralle_group status updated in WriteThread::CompleteParallelWorker · b77007df

Committed by fangchenliaohui on Dec 01, 2016

Summary:
Multiple write threads may update the status of the parallel_group in
WriteThread::CompleteParallelWorker if the status of a Writer is not ok.
When copying the write status to the parallel_group, the write thread only
holds the mutex of the writer it processes itself, which is useless. The thread
should hold the mutex of the leader of the parallel_group instead.
Closes https://github.com/facebook/rocksdb/pull/1598

Differential Revision: D4252335

Pulled By: siying

fbshipit-source-id: 3864cf7

b77007df

22 Nov, 2016 1 commit

Add WriteOptions.no_slowdown · 182b940e

Committed by Maysam Yabandeh on Nov 21, 2016

Summary:
If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for
the write request, then fail immediately with Status::Incomplete().
Closes https://github.com/facebook/rocksdb/pull/1527

Differential Revision: D4191405

Pulled By: maysamyabandeh

fbshipit-source-id: 7f3ce3f

182b940e

16 Aug, 2016 1 commit

WriteBatch support for range deletion · 3771e379

Committed by Andrew Kryczka on Aug 16, 2016

Summary:
Add API to WriteBatch to store range deletions in its buffer
which are later added to memtable. In the WriteBatch buffer, a range
deletion is encoded as "<optype><CF ID (optional)><begin key><end key>".

With this diff, the range tombstones are stored inline with the data in
the memtable. It's useful for now because the test cases rely on the
data being accessible via memtable. My next step is to store range
tombstones in a separate area in the memtable.

Test Plan: unit tests

Reviewers: IslamAbdelRahman, sdong, wanning

Reviewed By: wanning

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61401

3771e379
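
The record layout quoted in 3771e379, "<optype><CF ID (optional)><begin key><end key>", can be made concrete with a small encoder. Only the field order comes from the commit message. The tag values, the varint encoding of the CF ID, the length prefixes on the keys, and the rule that the CF ID is omitted for column family 0 are all assumptions made for this sketch, and every name in it is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Placeholder tag values for this sketch; not taken from the source.
const char kTagRangeDeletion = 0x01;
const char kTagColumnFamilyRangeDeletion = 0x02;

// Appends v as a base-128 varint (7 payload bits per byte).
inline void PutVarint32(std::string* dst, std::uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Appends the length of s as a varint, followed by the bytes of s.
inline void PutLengthPrefixed(std::string* dst, const std::string& s) {
  PutVarint32(dst, static_cast<std::uint32_t>(s.size()));
  dst->append(s);
}

// Encodes <optype><CF ID (optional)><begin key><end key>.
// Assumption: the CF ID is omitted for column family 0.
inline std::string EncodeRangeDeletion(std::uint32_t cf_id,
                                       const std::string& begin_key,
                                       const std::string& end_key) {
  std::string rec;
  if (cf_id == 0) {
    rec.push_back(kTagRangeDeletion);
  } else {
    rec.push_back(kTagColumnFamilyRangeDeletion);
    PutVarint32(&rec, cf_id);
  }
  PutLengthPrefixed(&rec, begin_key);
  PutLengthPrefixed(&rec, end_key);
  return rec;
}
```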

10 Feb, 2016 1 commit

Updated all copyright headers to the new format. · 21e95811

Committed by Baraa Hamodi on Feb 09, 2016

21e95811

06 Feb, 2016 1 commit

Improve perf of Pessimistic Transaction expirations (and optimistic transactions) · 6f71d3b6

Committed by reid horuff on Feb 05, 2016

Summary:
copy from task 8196669:

1) Optimistic transactions do not support batching writes from different threads.
2) Pessimistic transactions do not support batching writes if an expiration time is set.

In these 2 cases, we currently do not do any write batching in DBImpl::WriteImpl() because there is a WriteCallback that could decide at the last minute to abort the write. But we could support batching write operations with callbacks if we make sure to process the callbacks correctly.

To do this, we would first need to modify write_thread.cc to stop preventing writes with callbacks from being batched together. Then we would need to change DBImpl::WriteImpl() to call all WriteCallback's in a batch, only write the batches that succeed, and correctly set the state of each batch's WriteThread::Writer.

Test Plan: Added test WriteWithCallbackTest to write_callback_test.cc which creates multiple client threads and verifies that writes are batched and executed properly.

Reviewers: hermanlee4, anthony, ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52863

6f71d3b6

31 Dec, 2015 1 commit

use -Werror=missing-field-initializers, to closer match MyRocks build · ac16663b

Committed by Nathan Bronson on Dec 30, 2015

Summary:
myrocks seems to build rocksdb using
-Wmissing-field-initializers (and treats warnings as errors).  This diff
adds that flag to the rocksdb build, and fixes the compilation failures
that result.  I have not checked for any other differences in the build
flags for rocksdb build as part of myrocks.

Test Plan: make check

Reviewers: sdong, rven

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D52443

ac16663b

29 Dec, 2015 1 commit

Fix CLANG errors introduced by · 11672df1

Committed by sdong on Dec 28, 2015

Summary: Fix some CLANG errors introduced in 7d87f027

Test Plan: Build with both of CLANG and gcc

Reviewers: rven, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52329

11672df1

26 Dec, 2015 1 commit

support for concurrent adds to memtable · 7d87f027

Committed by Nathan Bronson on Aug 14, 2015

Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.

Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).

Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.

Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.

This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.

My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.

Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba

Differential Revision: https://reviews.facebook.net/D50589

7d87f027

09 Dec, 2015 1 commit

Resubmit the fix for a race condition in persisting options · 774b80e9

Committed by Yueh-Hsuan Chiang on Dec 08, 2015

Summary:
This patch fixes a race condition in persisting options which causes a crash when:

* Thread A obtains the cf options and starts to persist options based on those cf options.
* Thread B kicks in, finishes DropColumnFamily, and deletes the cf_handle.
* Thread A wakes up, tries to finish persisting the options, and crashes.

Test Plan: Add a test in column_family_test that can reproduce the crash

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51717

774b80e9

08 Dec, 2015 2 commits

Revert "Fix a race condition in persisting options" · f307036b

Committed by sdong on Dec 07, 2015

This reverts commit 2fa3ed51. It breaks RocksDB lite build

f307036b

Fix a race condition in persisting options · 2fa3ed51

Committed by Yueh-Hsuan Chiang on Dec 07, 2015

Summary:
This patch fixes a race condition in persisting options which causes a crash when:

* Thread A obtains the cf options and starts to persist options based on those cf options.
* Thread B kicks in, finishes DropColumnFamily, and deletes the cf_handle.
* Thread A wakes up, tries to finish persisting the options, and crashes.

Test Plan: Add a test in column_family_test that can reproduce the crash

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51609

2fa3ed51

07 Nov, 2015 1 commit

incorrect batch group size computation for write throttling · 2b42000f

Committed by Nathan Bronson on Nov 02, 2015

Summary:
When a write batch can't join a batch group due to the total
size of the contained batches, the write controller's GetDelay is passed
a size value that includes the rejected batch.

Test Plan: make check

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50343

2b42000f
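
The accounting issue in 2b42000f comes down to whether the size check happens before or after the rejected batch is added to the running total. The sketch below shows the corrected order under simplifying assumptions: `GroupSize` is a hypothetical helper, batches are represented only by their byte sizes, and the first batch (the leader's) is always admitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the total bytes of the batches admitted to the group.
// batch_sizes[0] is the leader's batch and is always admitted. A later batch
// that would push the total past max_size is rejected and not counted.
inline std::size_t GroupSize(const std::vector<std::size_t>& batch_sizes,
                             std::size_t max_size) {
  assert(!batch_sizes.empty());
  std::size_t total = batch_sizes[0];
  for (std::size_t i = 1; i < batch_sizes.size(); ++i) {
    if (total + batch_sizes[i] > max_size) {
      break;  // check before adding, so the rejected batch is excluded
    }
    total += batch_sizes[i];
  }
  return total;
}
```

With sizes 100, 200, 900 and a limit of 1000, the third batch is rejected. Adding before checking would report 1200 bytes to the throttling logic instead of the 300 bytes actually written.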

15 Aug, 2015 1 commit

reduce db mutex contention for write batch groups · b7198c3a

Committed by Nathan Bronson on Aug 05, 2015

Summary:
This diff allows a Writer to join the next write batch group
without acquiring any locks. Waiting is performed via a per-Writer mutex,
so all of the non-leader writers never need to acquire the db mutex.
It is now possible to join a write batch group after the leader has been
chosen but before the batch has been constructed. This diff doesn't
increase parallelism, but reduces synchronization overheads.

For some CPU-bound workloads (no WAL, RAM-sized working set) this can
substantially reduce contention on the db mutex in a multi-threaded
environment. With T=8 N=500000 in a CPU-bound scenario (see the test
plan) this is good for a 33% perf win. Not all scenarios see such a
win, but none show a loss. This code is slightly faster even for the
single-threaded case (about 2% for the CPU-bound scenario below).

Test Plan:
1. unit tests
2. COMPILE_WITH_TSAN=1 make check
3. stress high-contention scenarios with db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=0 --num=$N -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000

Reviewers: sdong, igor, rven, ljin, yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43887

b7198c3a
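
Joining a write batch group without acquiring a lock, as described in b7198c3a, can be done by pushing the writer onto an atomic singly linked list. The sketch below is a generic lock-free push with hypothetical names (`Writer`, `link_older`, `LinkOne`), offered as an illustration of the technique rather than the code from the commit. The writer that finds the list empty becomes the leader.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical writer node; link_older points at the writer queued before it.
struct Writer {
  Writer* link_older;
  Writer() : link_older(nullptr) {}
};

// Pushes w onto the list headed by newest without taking a lock.
// Returns true if the list was empty, i.e. w becomes the leader.
inline bool LinkOne(Writer* w, std::atomic<Writer*>& newest) {
  Writer* head = newest.load(std::memory_order_relaxed);
  while (true) {
    w->link_older = head;
    if (newest.compare_exchange_weak(head, w)) {
      return head == nullptr;
    }
    // The failed exchange refreshed head with the current value; retry.
  }
}
```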

14 Jul, 2015 1 commit

Deprecate WriteOptions::timeout_hint_us · 5aea98dd

Committed by Igor Canadi on Jul 14, 2015

Summary:
In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments:
* Less code == better icache hit rate, smaller builds, simpler code
* The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future.
* Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users.

Test Plan: make check

Reviewers: anthony, kradhakrishnan, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: sdong, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41937

5aea98dd

12 Jun, 2015 1 commit

Slow down writes by bytes written · 7842920b

Committed by sdong on May 15, 2015

Summary:
With this patch, we slow down writes into the database to the rate of options.delayed_write_rate (a new option).

The thread synchronization approach I take is to still synchronize the write controller by the DB mutex, and GetDelay() is called inside the DB mutex. It tries to minimize the frequency of getting the time in GetDelay(). I verified it through db_bench and it seems to work.

hard_rate_limit is deprecated.

options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up.

Test Plan: Add new unit tests in db_test

Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor

Reviewed By: igor

Subscribers: ikabiljo, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D36351

7842920b
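
Throttling by bytes written, as introduced in 7842920b, rests on a simple relation between the number of bytes and the allowed rate. The helper below is a hypothetical sketch of that arithmetic only. It ignores the time bookkeeping that a real `GetDelay()` would need, and treating a zero rate as "no delay" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Microseconds a writer would need to wait so that num_bytes fits within a
// budget of bytes_per_second. A zero rate is treated as "no delay".
inline std::uint64_t DelayMicros(std::uint64_t num_bytes,
                                 std::uint64_t bytes_per_second) {
  if (bytes_per_second == 0) {
    return 0;
  }
  return num_bytes * 1000000ULL / bytes_per_second;
}
```

For example, writing 2 MiB against a budget of 1 MiB per second yields a delay of two seconds.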

30 May, 2015 1 commit

Optimistic Transactions · dc9d70de

Committed by agiardullo on May 29, 2015

Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented is a way of ensuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported.

Test Plan: Added a new test, but still need to look into stress testing.

Reviewers: yhchiang, igor, rven, sdong

Reviewed By: sdong

Subscribers: adamretter, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D33435

dc9d70de

kvdb / rocksdb synced successfully 11 months ago