提交 · 18df47b79aaee1bab0442a45caa9d73db8d6fa6f · kvdb / rocksdb

27 12月, 2013 6 次提交

Avoid malloc in NotFound key status if no message is given. · 18df47b7

由 Siying Dong 提交于 12月 26, 2013

Summary:
In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case.

The motivation of it is that I found in db_bench readrandom test when all keys are not existing, about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU spent on free of them, which is not necessary.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14691

18df47b7

K

Fix all the comparison issue in fb dev servers · b40c052b
由 Kai Liu 提交于 12月 26, 2013

b40c052b
K

Fix [-Werror=sign-compare] in autovector_test · 113a08c9
由 kailiu 提交于 12月 26, 2013

113a08c9
K

Fix the unused variable warning message in mac os · 079a21ba
由 kailiu 提交于 12月 26, 2013

079a21ba

Implement autovector · c01676e4

由 kailiu 提交于 12月 12, 2013

Summary:
A vector that leverages pre-allocated stack-based array to achieve better
performance for array with small amount of items.

Test Plan:
Added tests for both correctness and performance

Here is the performance benchmark between vector and autovector

Please note that in the test "Creation and Insertion Test", the test case were designed with the motivation described below:

* no element inserted: internal array of std::vector may not really get
  initialize.
* one element inserted: internal array of std::vector must have
  initialized.
* kSize elements inserted. This shows the most time we'll spend if we
  keep everything in stack.
* 2 * kSize elements inserted. The internal vector of
  autovector must have been initialized.

Note: kSize is the capacity of autovector

  =====================================================
  Creation and Insertion Test
  =====================================================
  created 100000 vectors:
  	each was inserted with 0 elements
  	total time elapsed: 128000 (ns)
  created 100000 autovectors:
  	each was inserted with 0 elements
  	total time elapsed: 3641000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 0 elements
  	total time elapsed: 9896000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 1 elements
  	total time elapsed: 11089000 (ns)
  created 100000 autovectors:
  	each was inserted with 1 elements
  	total time elapsed: 5008000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 1 elements
  	total time elapsed: 24271000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 4 elements
  	total time elapsed: 39369000 (ns)
  created 100000 autovectors:
  	each was inserted with 4 elements
  	total time elapsed: 10121000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 4 elements
  	total time elapsed: 28473000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 8 elements
  	total time elapsed: 75013000 (ns)
  created 100000 autovectors:
  	each was inserted with 8 elements
  	total time elapsed: 18237000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 8 elements
  	total time elapsed: 42464000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 16 elements
  	total time elapsed: 102319000 (ns)
  created 100000 autovectors:
  	each was inserted with 16 elements
  	total time elapsed: 76724000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 16 elements
  	total time elapsed: 68285000 (ns)
  -----------------------------------
  =====================================================
  Sequence Access Test
  =====================================================
  performed 100000 sequence access against vector
  	size: 4
  	total time elapsed: 198000 (ns)
  performed 100000 sequence access against autovector
  	size: 4
  	total time elapsed: 306000 (ns)
  -----------------------------------
  performed 100000 sequence access against vector
  	size: 8
  	total time elapsed: 565000 (ns)
  performed 100000 sequence access against autovector
  	size: 8
  	total time elapsed: 512000 (ns)
  -----------------------------------
  performed 100000 sequence access against vector
  	size: 16
  	total time elapsed: 1076000 (ns)
  performed 100000 sequence access against autovector
  	size: 16
  	total time elapsed: 1070000 (ns)
  -----------------------------------

Reviewers: dhruba, haobo, sdong, chip

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14655

c01676e4

K
Merge pull request #32 from jamesgolick/master · 5643ae1a
由 Kai Liu 提交于 12月 26, 2013
```
Only try to use fallocate if it's actually present on the system.
```
5643ae1a

24 12月, 2013 1 次提交

Add a pointer to the engineering design discussion forum. · 71ddb117

由 Dhruba Borthakur 提交于 12月 23, 2013

Summary:
Add a pointer to the engineering design discussion forum.

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:

71ddb117

21 12月, 2013 2 次提交

I

Initialize sequence number in BatchResult - issue #39 · b26dc956
由 Igor Canadi 提交于 12月 20, 2013

b26dc956

[RocksDB] Optimize locking for Get · 1fdb3f7d

由 Igor Canadi 提交于 12月 20, 2013

Summary:
Instead of locking and saving a DB state, we can cache a DB state and update it only when it changes. This change reduces lock contention and speeds up read operations on the DB.

Performance improvements are substantial, although there is some cost in no-read workloads. I ran the regression tests on my devserver and here are the numbers:

  overwrite                    56345  ->   63001
  fillseq                      193730 ->  185296
  readrandom                   771301 -> 1219803 (58% improvement!)
  readrandom_smallblockcache   677609 ->  862850
  readrandom_memtable_sst      710440 -> 1109223
  readrandom_fillunique_random 221589 ->  247869
  memtablefillrandom           105286 ->   92643
  memtablereadrandom           763033 -> 1288862

Test Plan:
make asan_check
I am also running db_stress

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14679

1fdb3f7d

20 12月, 2013 1 次提交
- I
  Merge pull request #28 from bartman/master · 540a2894
  由 Igor Canadi 提交于 12月 19, 2013
```
fix missing gflags library
```
  540a2894
19 12月, 2013 4 次提交

Add 'readtocache' test · ca92068b

由 Mark Callaghan 提交于 12月 18, 2013

Summary:
For some tests I want to cache the database prior to running other tests on the same invocation
of db_bench. The readtocache test ignores --threads and --reads so those can be used by other tests
and it will still do a full read of --num rows with one thread. It might be invoked like:
  db_bench --benchmarks=readtocache,readrandom --reads 100 --num 10000 --threads 8

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14739

ca92068b

Reorder tests · e914b649

由 Igor Canadi 提交于 12月 18, 2013

Summary:
db_test should be the first to execute because it finds the most bugs.

Also, when third parties report issues, we don't want ldb error message, we prefer to have db_test error message. For example, see thread: https://github.com/facebook/rocksdb/issues/25

Test Plan: make check

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14715

e914b649

I
Merge pull request #35 from zizkovrb/rm-ds_store · cbb8da6f
由 Igor Canadi 提交于 12月 18, 2013
```
Remove utilities/.DS_Store file.
```
cbb8da6f
I
Merge pull request #37 from mlin/more-c-bindings · 3b50b621
由 Igor Canadi 提交于 12月 18, 2013
```
C bindings: add a bunch of the newer options
```
3b50b621

18 12月, 2013 2 次提交

Move level0 sorting logic from Version::SaveTo() to Version::Finalize() · 14995a8f

由 Siying Dong 提交于 12月 11, 2013

Summary: I realized that "D14409 Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()" is not done in an optimized place. SaveTo() is usually inside mutex. Move it to Finalize(), which is called out of mutex.

Test Plan: make all check

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14607

14995a8f

Get() Does Not Reserve space for to_delete memtables · a8b8b11d

由 Siying Dong 提交于 12月 17, 2013

Summary: It seems to be a decision tradeoff in current codes: we make a malloc for every Get() to reduce one malloc for a flush inside mutex. It takes about 5% of CPU time in readrandom tests. We might consider the tradeoff to be the other way around.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14697

a8b8b11d

17 12月, 2013 1 次提交
- J
  
  Remove .DS_Store files. · 8c34189f
  由 Josef Šimánek 提交于 12月 14, 2013
  
  8c34189f
16 12月, 2013 1 次提交
- M
  
  C bindings: add a bunch of the newer options · 2a2506b6
  由 Mike Lin 提交于 12月 13, 2013
  
  2a2506b6
13 12月, 2013 3 次提交

[backupable db] Delete db_dir children when restoring backup · 417b453f

由 Igor Canadi 提交于 12月 12, 2013

Summary:
I realized that manifest will get deleted by PurgeObsoleteFiles in DBImpl, but it is sill cleaner to delete
files before we restore the backup

Test Plan: backupable_db_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14619

417b453f

Add monitoring for universal compaction and add counters for compaction IO · e9e6b00d

由 Mark Callaghan 提交于 12月 09, 2013

Summary:
Adds these counters
{ WAL_FILE_SYNCED, "rocksdb.wal.synced" }
  number of writes that request a WAL sync
{ WAL_FILE_BYTES, "rocksdb.wal.bytes" },
  number of bytes written to the WAL
{ WRITE_DONE_BY_SELF, "rocksdb.write.self" },
  number of writes processed by the calling thread
{ WRITE_DONE_BY_OTHER, "rocksdb.write.other" },
  number of writes not processed by the calling thread. Instead these were
  processed by the current holder of the write lock
{ WRITE_WITH_WAL, "rocksdb.write.wal" },
  number of writes that request WAL logging
{ COMPACT_READ_BYTES, "rocksdb.compact.read.bytes" },
  number of bytes read during compaction
{ COMPACT_WRITE_BYTES, "rocksdb.compact.write.bytes" },
  number of bytes written during compaction

Per-interval stats output was updated with WAL stats and correct stats for universal compaction
including a correct value for write-amplification. It now looks like:
                               Compactions
Level  Files Size(MB) Score Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count  Ln-stall Stall-cnt
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0        7      464  46.4       281      3411      3875      3411         0      3875        2.1      12.1        13.8      621        0      240      240      628       0.0         0
Uptime(secs): 310.8 total, 2.0 interval
Writes cumulative: 9999999 total, 9999999 batches, 1.0 per batch, 1.22 ingest GB
WAL cumulative: 9999999 WAL writes, 9999999 WAL syncs, 1.00 writes per sync, 1.22 GB written
Compaction IO cumulative (GB): 1.22 new, 3.33 read, 3.78 write, 7.12 read+write
Compaction IO cumulative (MB/sec): 4.0 new, 11.0 read, 12.5 write, 23.4 read+write
Amplification cumulative: 4.1 write, 6.8 compaction
Writes interval: 100000 total, 100000 batches, 1.0 per batch, 12.5 ingest MB
WAL interval: 100000 WAL writes, 100000 WAL syncs, 1.00 writes per sync, 0.01 MB written
Compaction IO interval (MB): 12.49 new, 14.98 read, 21.50 write, 36.48 read+write
Compaction IO interval (MB/sec): 6.4 new, 7.6 read, 11.0 write, 18.6 read+write
Amplification interval: 101.7 write, 102.9 compaction
Stalls(secs): 142.924 level0_slowdown, 0.000 level0_numfiles, 0.805 memtable_compaction, 0.000 leveln_slowdown
Stalls(count): 132461 level0_slowdown, 0 level0_numfiles, 3 memtable_compaction, 0 leveln_slowdown

Task ID: #3329644, #3301695

Blame Rev:

Test Plan:
Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14583

e9e6b00d

I

portable %lu printing · 249e736b
由 Igor Canadi 提交于 12月 12, 2013

249e736b

12 12月, 2013 6 次提交

Add readrandom with both memtable and sst regression test · f5f5c645

由 Igor Canadi 提交于 12月 11, 2013

Summary: @MarkCallaghan's tests indicate that performance with 8k rows in memtable is much worse than empty memtable. I wanted to add a regression tests that measures this effect, so we could optimize it. However, current config shows 634461 QPS on my devbox. Mark, any idea why this is so much faster than your measurements?

Test Plan: Ran the regression test.

Reviewers: MarkCallaghan, dhruba, haobo

Reviewed By: MarkCallaghan

CC: leveldb, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D14511

f5f5c645

Introduce MergeContext to Lazily Initialize merge operand list · a8029fdc

由 Siying Dong 提交于 12月 02, 2013

Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14415

Conflicts:
	db/db_impl.cc
	db/memtable.cc

a8029fdc

J

oops - missed a spot · c28dd2a8
由 James Golick 提交于 12月 11, 2013

c28dd2a8

[RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting... · bc5dd19b

由 Siying Dong 提交于 12月 09, 2013

[RocksDB Performance Branch] Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()

Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: dhruba

CC: nkg-, igor, leveldb

Differential Revision: https://reviews.facebook.net/D14409

bc5dd19b

When flushing mem tables, create iterators out of mutex · 0304e3d2

由 Siying Dong 提交于 12月 10, 2013

Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.

Test Plan: make all check

Reviewers: haobo, dhruba, kailiu

Reviewed By: dhruba

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D14577

Conflicts:
	db/db_impl.cc

0304e3d2

[RocksDB perf] Cache speedup · e8d40c31

由 Igor Canadi 提交于 12月 11, 2013

Summary:
I have ran a get benchmark where all the data is in the cache and observed that most of the time is spent on waiting for lock in LRUCache.

This is an effort to optimize LRUCache.

Test Plan:
The data was loaded with fillseq. Then, I ran a benchmark:

    /db_bench --db=/tmp/rocksdb_stat_bench --num=1000000 --benchmarks=readrandom --statistics=1 --use_existing_db=1 --threads=16 --disable_seek_compaction=1 --cache_size=20000000000 --cache_numshardbits=8 --table_cache_numshardbits=8

I ran the benchmark three times. Here are the results:
AFTER THE PATCH: 798072, 803998, 811807
BEFORE THE PATCH: 782008, 815593, 763017

Reviewers: dhruba, haobo, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14571

e8d40c31

11 12月, 2013 9 次提交
- J
  
  only try to use fallocate if it's actually present on the system · 43c386b7
  由 James Golick 提交于 12月 10, 2013
  
  43c386b7
- I
  BackupableDB delete backups with newer seq number · 5e4ab767
  由 Igor Canadi 提交于 12月 10, 2013
```
Summary: We now delete backups with newer sequence number, so the clients don't have to handle confusing situations when they restore from backup.

Test Plan: added a unit test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14547
```
  5e4ab767
- I
  
  Get rid of LogFlush() in InternalIterator · 204bb9cf
  由 Igor Canadi 提交于 12月 10, 2013
  
  204bb9cf
- I
  Don't LogFlush() in foreground threads · 19f5463d
  由 Igor Canadi 提交于 12月 10, 2013
```
Summary: So fflush() takes a lock which is heavyweight. I added flush_pending_, but more importantly, I removed LogFlush() from foreground threads.

Test Plan: ./db_test

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14535
```
  19f5463d
- S
  
  Fix another sign and unsign comparison in test · 4815468b
  由 Siying Dong 提交于 12月 10, 2013
  
  4815468b
- I
  
  fix comparison between signed and unsigned · cbe7ffef
  由 Igor Canadi 提交于 12月 10, 2013
  
  cbe7ffef
- I
  Cleaning up BackupableDB + fix valgrind errors · 7cf57284
  由 Igor Canadi 提交于 12月 10, 2013
```
Summary: Valgrind complained about BackupableDB. This fixes valgrind errors. Also, I cleaned up some code.

Test Plan: valgrind does not complain anymore

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14529
```
  7cf57284
- I
  Merge pull request #29 from sepeth/fix-shared-lib-build · 0e2c966f
  由 Igor Canadi 提交于 12月 10, 2013
```
Build shared lib and put linker flags to the end
```
  0e2c966f
- I
  Merge pull request #31 from sepeth/c-api · a204dabb
  由 Igor Canadi 提交于 12月 10, 2013
```
Rename leveldb to rocksdb in C api
```
  a204dabb
10 12月, 2013 4 次提交

D

Rename leveldb to rocksdb in C api · 6c4e110c
由 Doğan Çeçen 提交于 12月 10, 2013

6c4e110c
D

Fix shared lib build · f6012ab8
由 Doğan Çeçen 提交于 12月 08, 2013

f6012ab8
I

Fix unused variable warning · 784e62f9
由 Igor Canadi 提交于 12月 09, 2013

784e62f9

[RocksDB] BackupableDB · fb9fce4f

由 Igor Canadi 提交于 12月 09, 2013

Summary:
In this diff I present you BackupableDB v1. You can easily use it to backup your DB and it will do incremental snapshots for you.
Let's first describe how you would use BackupableDB. It's inheriting StackableDB interface so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backuping every X minutes.

There are multiple things you can configure:
1. backup_env and db_env can be different, which is awesome because then you can easily backup to HDFS or wherever you feel like.
2. sync - if true, it *guarantees* backup consistency on machine reboot
3. number of snapshots to keep - this will keep last N snapshots around if you want, for some reason, be able to restore from an earlier snapshot. All the backuping is done in incremental fashion - if we already have 00010.sst, we will not copy it again. *IMPORTANT* -- This is based on assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files.
4. You can decide if you want to flush the memtables before you backup, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at a time of backup.
5. More things you can find in BackupableDBOptions

Here is the directory structure I use:

backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot
0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files
files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to backup it from the DB, it will just reference the same file
files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files

All the files are ref counted and deleted immediatelly when they get out of scope.

Some other stuff in this diff:
1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems right thing to do.
2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB.

Test Plan:
I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff.

Also, `make asan_check`

Reviewers: dhruba, haobo, emayanke

Reviewed By: dhruba

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D14295

fb9fce4f

kvdb / rocksdb 12 个月 前同步成功

kvdb / rocksdb
12 个月前同步成功