1. 29 Mar 2013 (1 commit)
    • Use non-mmapd files for Write-Ahead Files · 7fdd5f5b
      Committed by Abhishek Kona
      Summary:
      Use non-mmap'd files for the write-ahead log.
      Previously, the use of mmap'd files made the log iterator read ahead and miss records.
      Now the reader and writer point to the same physical location.
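      
      As a minimal illustration (not the actual patch), a WAL writer built on plain write(2) keeps the file offset, and therefore the reader's view, exactly at the last appended byte, whereas an mmap'd file can expose zero-filled pages past the writer's logical end:
      
      ```cpp
      // Hypothetical sketch of an append-only WAL file using write(2).
      // With mmap, the mapped region can extend past the writer's logical
      // end, so a read-ahead iterator may see zero-filled pages and miss
      // records; with plain appends, reader and writer agree on the end.
      #include <fcntl.h>
      #include <unistd.h>
      #include <string>
      
      class PosixAppendFile {
       public:
        explicit PosixAppendFile(const std::string& fname)
            : fd_(::open(fname.c_str(), O_CREAT | O_WRONLY | O_APPEND, 0644)) {}
        ~PosixAppendFile() { if (fd_ >= 0) ::close(fd_); }
      
        bool Append(const std::string& data) {
          return ::write(fd_, data.data(), data.size()) ==
                 static_cast<ssize_t>(data.size());
        }
        bool Sync() { return ::fdatasync(fd_) == 0; }
      
       private:
        int fd_;
      };
      ```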
      
      There is no perf regression:
      ./db_bench --benchmarks=fillseq --db=/dev/shm/mmap_test --num=$(million 20) --use_existing_db=0 --threads=2
      with this diff:
      fillseq      :      10.756 micros/op 185281 ops/sec;   20.5 MB/s
      without this diff:
      fillseq      :      11.085 micros/op 179676 ops/sec;   19.9 MB/s
      
      Test Plan: unit test included
      
      Reviewers: dhruba, heyongqiang
      
      Reviewed By: heyongqiang
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D9741
  2. 22 Mar 2013 (2 commits)
  3. 21 Mar 2013 (1 commit)
    • Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database. · ad96563b
      Committed by Dhruba Borthakur
      Summary:
      This patch allows an application to specify, per database, whether to use
      buffered IO, reads via mmap, and writes via mmap. Previously, a global
      static variable configured this functionality.
      
      The default settings remain the same (and are backward compatible):
       1. use buffered IO
       2. do not use mmap for reads
       3. use mmap for writes
       4. use readahead for reads needed for compaction
      
      I also added a parameter to db_bench to be able to explicitly specify
      whether to do readaheads for compactions or not.
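      
      A sketch of what the per-database knobs amount to; the struct below is a hypothetical mirror of the new fields (the real ones live in the options header after this change), with the backward-compatible defaults listed above:
      
      ```cpp
      // Hypothetical mirror of the per-database IO options this patch adds;
      // field names are illustrative, defaults match the summary.
      struct DbIoOptions {
        bool use_buffered_io = true;        // default 1: buffered IO
        bool allow_mmap_reads = false;      // default 2: no mmap for reads
        bool allow_mmap_writes = true;      // default 3: mmap for writes
        bool compaction_readahead = true;   // default 4: readahead for compaction
      };
      
      // Each DB instance now carries its own copy instead of consulting a
      // global static variable, so two databases in one process can differ.
      ```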
      
      Test Plan: make check
      
      Reviewers: sheki, heyongqiang, MarkCallaghan
      
      Reviewed By: sheki
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D9429
  4. 20 Mar 2013 (2 commits)
  5. 07 Mar 2013 (1 commit)
    • Do not allow Transaction Log Iterator to fall ahead when writer is writing the same file · d68880a1
      Committed by Abhishek Kona
      Summary:
      Store the last flushed sequence number in db_impl and check against it in
      the transaction log iterator. Do not attempt to read ahead if we do not
      know whether the data has been flushed completely.
      This does not work if flush is disabled. Any ideas on fixing that?
      * Minor change: iter->Next is now called automatically the first time.
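      
      A minimal sketch of the guard described above, assuming the DB publishes the last flushed sequence number and the iterator checks it before advancing:
      
      ```cpp
      // Hypothetical sketch: the transaction log iterator stops at the last
      // sequence number known to be durably flushed, rather than chasing the
      // live WAL's tail.
      #include <cstdint>
      
      struct LogIteratorState {
        uint64_t current_seq;       // seq no of the record just returned
        uint64_t last_flushed_seq;  // published by db_impl after each flush
      };
      
      // Reading past last_flushed_seq risks observing a partially written
      // record while the writer is still appending to the same file.
      bool CanAdvance(const LogIteratorState& s) {
        return s.current_seq < s.last_flushed_seq;
      }
      ```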
      
      Test Plan:
      Existing tests pass.
      More ideas on testing this?
      Planning to run some stress tests.
      
      Reviewers: dhruba, heyongqiang
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D9087
  6. 04 Mar 2013 (2 commits)
    • Add rate_delay_limit_milliseconds · 993543d1
      Committed by Mark Callaghan
      Summary:
      This adds the rate_delay_limit_milliseconds option to make the delay
      configurable in MakeRoomForWrite when the max compaction score is too high.
      This delay is called the Ln slowdown. This change also counts the Ln slowdown
      per level to make it possible to see where the stalls occur.
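      
      A sketch of the idea under stated assumptions: only the configurable cap comes from this change, and the scaling of the sleep with the score is illustrative.
      
      ```cpp
      // Hypothetical sketch of the configurable Ln slowdown in
      // MakeRoomForWrite: sleep when the max compaction score is too high,
      // capped by rate_delay_limit_milliseconds instead of a hard constant.
      #include <algorithm>
      #include <chrono>
      #include <thread>
      
      void MaybeDelayWrite(double max_compaction_score,
                           int rate_delay_limit_milliseconds) {
        if (max_compaction_score <= 1.0) return;  // no backlog, no stall
        int delay_ms = std::min(rate_delay_limit_milliseconds,
                                static_cast<int>(max_compaction_score * 10));
        std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
        // A per-level stall counter would be incremented here.
      }
      ```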
      
      From IO-bound performance testing, the Level N stalls occur:
      * with compression -> at the largest uncompressed level. This makes sense
        because compaction for compressed levels is much slower. When Lx is
        uncompressed and Lx+1 is compressed, files pile up at Lx because the
        (Lx,Lx+1)->Lx+1 compaction process is the first to be slowed by
        compression.
      * without compression -> at level 1
      
      Task ID: #1832108
      
      Test Plan:
      run with real data, added test
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D9045
    • Ability for rocksdb to compact when flushing the in-memory memtable to a file in L0. · 806e2643
      Committed by Dhruba Borthakur
      Summary:
      Rocks accumulates recent writes and deletes in the in-memory memtable.
      When the memtable is full, it writes the contents of the memtable to
      a file in L0.
      
      This patch removes redundant records at the time of the flush. If there
      are multiple versions of the same key in the memtable, then only the
      most recent one is dumped into the output file. The purging of
      redundant records occurs only if the most recent snapshot is earlier
      than the earliest record in the memtable.
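      
      The gating condition reduces to a single comparison; a sketch, with names assumed for illustration:
      
      ```cpp
      // Hypothetical sketch of the flush-time purge rule: redundant versions
      // in the memtable may be dropped only when no snapshot can see them.
      #include <cstdint>
      
      bool FlushCanPurge(uint64_t newest_snapshot_seq,
                         uint64_t earliest_memtable_seq) {
        // Every record in the memtable is newer than every open snapshot,
        // so only the most recent version of each key needs to survive.
        return newest_snapshot_seq < earliest_memtable_seq;
      }
      ```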
      
      Should we switch on this feature by default or should we keep this feature
      turned off in the default settings?
      
      Test Plan: Added test case to db_test.cc
      
      Reviewers: sheki, vamsi, emayanke, heyongqiang
      
      Reviewed By: sheki
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D8991
  7. 01 Mar 2013 (1 commit)
  8. 19 Feb 2013 (1 commit)
  9. 26 Jan 2013 (1 commit)
    • Fix poor error on num_levels mismatch and a few other minor improvements · 0b83a831
      Committed by Chip Turner
      Summary:
      Previously, if you opened a db with num_levels set lower than the
      number of levels the database actually has, you received the unhelpful
      message "Corruption: VersionEdit: new-file entry."  Now you get a more
      verbose message describing the issue.
      
      Also, fix handling of compression_levels (both the run-over-the-end
      issue and its memory management).
      
      Lastly, unique_ptr'ify a couple of minor calls.
      
      Test Plan: make check
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D8151
  10. 25 Jan 2013 (1 commit)
    • Use fallocate to prevent excessive allocation of sst files and logs · 3dafdfb2
      Committed by Chip Turner
      Summary:
      On some filesystems, pre-allocation can claim a considerable amount of
      space. xfs in our production environment pre-allocates by 1GB, for
      instance. By using fallocate to inform the kernel of our expected file
      sizes, we eliminate this wastage (which isn't recovered until the file
      is closed and which, in the case of LOG files, can persist for a
      considerable amount of time).
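      
      For reference, the system call in question (Linux-specific; the wrapper below is a sketch, not the patch's code):
      
      ```cpp
      // Hypothetical sketch: hint the expected file size to the kernel so
      // filesystems like xfs do not hold a large speculative preallocation
      // (e.g. allocsize=4M or 1GB) until the file is closed.
      #define _GNU_SOURCE 1
      #include <fcntl.h>  // fallocate, FALLOC_FL_KEEP_SIZE (Linux)
      
      bool HintExpectedSize(int fd, off_t expected_size) {
        // Reserve blocks now but leave the visible file size unchanged.
        return ::fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, expected_size) == 0;
      }
      ```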
      
      Test Plan:
      created an xfs loopback filesystem, mounted with allocsize=4M, and ran
      db_stress. The LOG file without this change was 4M; with it, it was
      128k and then grew to normal size.
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: adsharma, leveldb
      
      Differential Revision: https://reviews.facebook.net/D7953
  11. 24 Jan 2013 (1 commit)
    • Fix a number of object lifetime/ownership issues · 2fdf91a4
      Committed by Chip Turner
      Summary:
      Replace manual memory management with std::unique_ptr in a number of
      places; not exhaustive, but this fixes a few leaks with file handles
      and clarifies the ownership semantics of file handles in the log
      classes.
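      
      The shape of the change, as a sketch with a hypothetical log class:
      
      ```cpp
      // Hypothetical sketch: the log writer owns its file via unique_ptr, so
      // the handle is released on every exit path without a manual delete.
      #include <memory>
      
      class WritableFile {
       public:
        virtual ~WritableFile() {}  // closing happens in the destructor
      };
      
      class LogWriter {
       public:
        explicit LogWriter(std::unique_ptr<WritableFile> file)
            : file_(std::move(file)) {}
        // No hand-written destructor: ownership is explicit in the
        // constructor signature, and the file cannot leak.
       private:
        std::unique_ptr<WritableFile> file_;
      };
      ```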
      
      Test Plan: db_stress, make check
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: zshao, leveldb, heyongqiang
      
      Differential Revision: https://reviews.facebook.net/D8043
  12. 17 Jan 2013 (1 commit)
    • rollover manifest file. · 7d5a4383
      Committed by Abhishek Kona
      Summary:
      In LogAndApply, check whether the manifest file size exceeds the limit
      set in Options.
      Things to consider: will this be expensive?
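      
      On the cost question, the check is one size comparison per LogAndApply, as this sketch suggests (names are illustrative):
      
      ```cpp
      // Hypothetical sketch of the rollover check inside LogAndApply.
      #include <cstdint>
      
      bool ShouldRollManifest(uint64_t current_manifest_size,
                              uint64_t max_manifest_file_size) {
        // One branch per version edit; a new manifest file is created only
        // on the rare occasions the limit is actually exceeded.
        return current_manifest_size > max_manifest_file_size;
      }
      ```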
      
      Test Plan: make all check. Inputs on a new unit test?
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D7701
  13. 11 Jan 2013 (1 commit)
  14. 18 Dec 2012 (1 commit)
    • Added meta-database support. · 62d48571
      Committed by Kosie van der Merwe
      Summary:
      Added kMetaDatabase for meta-databases in db/filename.h along with
      supporting functions.
      Fixed the switch in DBImpl so that it also handles kMetaDatabase.
      Fixed DestroyDB() so that it can destroy meta-databases.
      
      Test Plan: make check
      
      Reviewers: sheki, emayanke, vamsi, dhruba
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D7245
  15. 13 Dec 2012 (1 commit)
    • GetSequence API in write batch. · 2ba866e0
      Committed by Abhishek Kona
      Summary:
      WriteBatch is now used by the GetUpdatesSince API. This API is external
      and will be used by the rocks server. The rocks server and others will
      need to know the sequence number in the WriteBatch; this public method
      allows for that.
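      
      A usage sketch of the new accessor with a stand-in WriteBatch type; the real method lives on leveldb's WriteBatch, so treat the shape here as an assumption:
      
      ```cpp
      // Hypothetical stand-in showing how a replication consumer would use
      // the new public sequence accessor on a batch.
      #include <cstdint>
      
      struct WriteBatch {
        uint64_t sequence = 0;
        uint64_t GetSequence() const { return sequence; }  // the new method
      };
      
      uint64_t ReplicationCursor(const WriteBatch& batch) {
        // The rocks server persists this to know where to resume streaming.
        return batch.GetSequence();
      }
      ```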
      
      Test Plan: make all check.
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D7293
  16. 12 Dec 2012 (1 commit)
  17. 11 Dec 2012 (2 commits)
  18. 08 Dec 2012 (1 commit)
    • GetUpdatesSince API to enable replication. · 80550089
      Committed by Abhishek Kona
      Summary:
      How it works:
      * GetUpdatesSince takes a SequenceNumber.
      * The LogFile whose first SequenceNumber is nearest to, and less than, the requested sequence number is found.
      * Seek in the LogFile until the requested sequence number is found.
      * Return an iterator which contains the logic to return records one by one (usage sketch below).
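      
      A hedged sketch of what consuming the iterator looks like; the type and method names below are assumptions modeled on the description:
      
      ```cpp
      // Hypothetical consumer of the GetUpdatesSince iterator.
      #include <cstdint>
      
      struct BatchResult {
        uint64_t sequence;  // first seq no in the batch
        // ... WriteBatch payload elided ...
      };
      
      struct TransactionLogIterator {
        virtual ~TransactionLogIterator() {}
        virtual bool Valid() = 0;
        virtual void Next() = 0;
        virtual BatchResult GetBatch() = 0;
      };
      
      // Replay everything written at or after the requested sequence number.
      void Replay(TransactionLogIterator* iter, uint64_t* cursor) {
        for (; iter->Valid(); iter->Next()) {
          BatchResult res = iter->GetBatch();
          *cursor = res.sequence;  // apply res to the replica, then advance
        }
      }
      ```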
      
      Test Plan:
      * Test case included to check the good code path.
      * Will update with more test-cases.
      * Feedback required on test-cases.
      
      Reviewers: dhruba, emayanke
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D7119
  19. 29 Nov 2012 (3 commits)
    • Move WAL files to archive directory, instead of deleting. · d4627e6d
      Committed by sheki
      Summary:
      Create a directory "archive" in the DB directory.
      During DeleteObsoleteFiles, move the WAL files (*.log) to the archive
      directory instead of deleting them.
      
      Test Plan: Created a DB using db_bench. Reopened it. Checked that the files move.
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D6975
    • Fix all the lint errors. · d29f1819
      Committed by Abhishek Kona
      Summary:
      Scripted removal of all trailing spaces and converted all tabs to
      spaces.
      Also fixed other lint errors.
      All lint errors from this point on should be taken seriously.
      
      Test Plan: make all check
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D7059
    • Delete non-visible keys during a compaction even in the presence of snapshots. · 9a357847
      Committed by Dhruba Borthakur
      Summary:
      LevelDB should delete almost-new keys when a long-open snapshot exists.
      The previous behavior is to keep all versions that were created after the
      oldest open snapshot. This can lead to database size bloat for
      high-update workloads when there are long-open snapshots, for example a
      snapshot used for a logical backup. By "almost new" I mean that the
      key was updated more than once after the oldest snapshot.
      
      If there are two snapshots with sequence numbers s1 and s2 (s1 < s2), and
      we find two instances of the same key k1 that lie entirely between s1 and
      s2 (i.e. s1 < k1 < s2), then the earlier version of k1 can be safely
      deleted because that version is not visible in any snapshot.
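      
      The rule can be stated as a predicate over sequence numbers; a sketch, assuming s1 and s2 are adjacent open snapshots:
      
      ```cpp
      // Hypothetical sketch of the visibility argument above: if two versions
      // of one key fall between the same pair of adjacent snapshots, the
      // older version is invisible to every snapshot and can be dropped.
      #include <cstdint>
      
      bool HiddenByNewerVersion(uint64_t older_seq, uint64_t newer_seq,
                                uint64_t s1, uint64_t s2) {
        // s1 < older_seq < newer_seq <= s2: snapshot s1 predates both, and
        // s2 (and anything later) resolves the key to newer_seq instead.
        return s1 < older_seq && older_seq < newer_seq && newer_seq <= s2;
      }
      ```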
      
      Test Plan:
      unit test attached
      make clean check
      
      Differential Revision: https://reviews.facebook.net/D6999
  20. 20 Nov 2012 (2 commits)
  21. 17 Nov 2012 (1 commit)
    • Fix a coding error in db_test.cc · de278a6d
      Committed by amayank
      Summary: The new function MinLevelToCompress in db_test.cc was incomplete. It needs to tell the calling TEST function whether the test has to be skipped or not.
      
      Test Plan: make all; ./db_test
      
      Reviewers: dhruba, heyongqiang
      
      Reviewed By: dhruba
      
      CC: sheki
      
      Differential Revision: https://reviews.facebook.net/D6771
  22. 14 Nov 2012 (1 commit)
  23. 07 Nov 2012 (2 commits)
  24. 06 Nov 2012 (1 commit)
    • Ability to invoke application hook for every key during compaction. · 5273c814
      Committed by Dhruba Borthakur
      Summary:
      There are certain use cases where the application intends to
      delete older keys after they have expired past a certain time period.
      One option for those applications is to periodically scan the
      entire database and delete the appropriate keys.
      
      A better way is to allow the application to hook into the
      compaction process. This patch allows the application to set
      a callback method for every key that is being compacted. If
      this method returns true, then the key is not preserved in
      the output of the compaction.
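      
      Roughly what the proposed hook could look like; the callback signature and option name here are assumptions for illustration, not the final API:
      
      ```cpp
      // Hypothetical per-key compaction hook: return true to drop the key
      // from the compaction output (e.g. entries past their TTL).
      #include <string>
      
      using CompactionKeyHook = bool (*)(int level, const std::string& key,
                                         const std::string& existing_value);
      
      bool DropIfExpired(int /*level*/, const std::string& /*key*/,
                         const std::string& value) {
        // Placeholder policy: suppose the value's first byte marks expiry.
        return !value.empty() && value[0] == 'X';
      }
      
      // At open time (hypothetical field name):
      //   options.compaction_key_hook = &DropIfExpired;
      ```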
      
      Test Plan:
      This is mostly to preview the proposed new public API.
      Since it is a public API, please do due diligence in reviewing it.
      
      I will be writing test cases for this API in my next version of
      this patch.
      
      Reviewers: MarkCallaghan, heyongqiang
      
      Reviewed By: heyongqiang
      
      CC: sheki, adsharma
      
      Differential Revision: https://reviews.facebook.net/D6285
  25. 05 Nov 2012 (1 commit)
  26. 03 Nov 2012 (1 commit)
  27. 30 Oct 2012 (4 commits)
    • fix test failure · fb8d4373
      Committed by heyongqiang
      Summary: as subject
      
      Test Plan: db_test
      
      Reviewers: dhruba, MarkCallaghan
      
      Reviewed By: MarkCallaghan
      
      Differential Revision: https://reviews.facebook.net/D6309
    • add a test case to make sure changing num_levels will fail · 925f60d3
      Committed by heyongqiang
      Summary: as subject
      
      Test Plan: db_test
      
      Reviewers: dhruba, MarkCallaghan
      
      Reviewed By: MarkCallaghan
      
      Differential Revision: https://reviews.facebook.net/D6303
    • Allow having different compression algorithms on different levels. · 321dfdc3
      Committed by Dhruba Borthakur
      Summary:
      The leveldb API is enhanced to support different compression algorithms at
      different levels.
      
      This adds the option min_level_to_compress to db_bench that specifies
      the minimum level for which compression should be done when
      compression is enabled. This can be used to disable compression for levels
      0 and 1 which are likely to suffer from stalls because of the CPU load
      for memtable flushes and (L0,L1) compaction.  Level 0 is special as it
      gets frequent memtable flushes. Level 1 is special as it frequently
      gets all:all file compactions between it and level 0. But all other levels
      could be the same. For any level N where N > 1, the rate of sequential
      IO for that level should be the same. The last level is the
      exception because it might not be full and because files from it are
      not read to compact with the next larger level.
      
      The same amount of time will be spent doing compaction at any
      level N, excluding N=0, 1, or the last level. By this standard all
      of those levels should use the same compression. The difference is that
      the loss (using more disk space) from a faster compression algorithm
      is less significant for N=2 than for N=3. So we might be willing to
      trade disk space for faster write rates with no compression
      for L0 and L1, snappy for L2, and zlib for L3. Using a faster compression
      algorithm for the mid levels also allows us to reclaim some CPU
      without giving up much in disk space overhead.
      
      Also note that little is to be gained by compressing levels 0 and 1. For
      a 4-level tree they account for 10% of the data. For a 5-level tree they
      account for 1% of the data.
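      
      Putting the recommendation into code, a sketch of a per-level assignment (the vector-valued option is what this patch describes; the option name in the comment is an assumption):
      
      ```cpp
      // Hypothetical helper building the per-level compression recommended
      // above: no compression for L0/L1, a fast codec for the mid level,
      // and a denser codec below that.
      #include <vector>
      
      enum CompressionType { kNoCompression, kSnappyCompression,
                             kZlibCompression };
      
      std::vector<CompressionType> RecommendedPerLevel(int num_levels) {
        std::vector<CompressionType> per_level(num_levels, kZlibCompression);
        if (num_levels > 0) per_level[0] = kNoCompression;      // hot flushes
        if (num_levels > 1) per_level[1] = kNoCompression;      // all:all with L0
        if (num_levels > 2) per_level[2] = kSnappyCompression;  // cheap mid level
        return per_level;  // e.g. assigned to options.compression_per_level
      }
      ```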
      
      With compression enabled:
      * memtable flush rate is ~18MB/second
      * (L0,L1) compaction rate is ~30MB/second
      
      With compression enabled but min_level_to_compress=2
      * memtable flush rate is ~320MB/second
      * (L0,L1) compaction rate is ~560MB/second
      
      This practically takes the same code from https://reviews.facebook.net/D6225
      but makes the leveldb API more general purpose with a few additional
      lines of code.
      
      Test Plan: make check
      
      Differential Revision: https://reviews.facebook.net/D6261
    • Fix unit test failure caused by delaying deleting obsolete files. · de7689b1
      Committed by Dhruba Borthakur
      Summary:
      A previous commit 4c107587 introduced
      the idea that some version updates might not delete obsolete files.
      This means that if a unit test blindly counts the number of files
      in the db directory it might not represent the true state of the database.
      
      Use GetLiveFiles() instead to count the number of live files in the database.
      
      Test Plan:
      make check
  28. 26 Oct 2012 (1 commit)
  29. 29 Sep 2012 (1 commit)
    • Trigger read compaction only if seeks to storage are incurred. · c1bb32e1
      Committed by Dhruba Borthakur
      Summary:
      In the current code, a Get() call can trigger compaction if it has to look at more than one file. This causes unnecessary compaction because looking at more than one file is a penalty only if the file is not yet in the cache. Also, the current code counts these files before the bloom filter check is applied.
      
      This patch counts a 'seek' only if the file fails the bloom filter check, i.e. the filter cannot rule the key out, and data block(s) have to be read from storage.
      
      This patch also counts a 'seek' if a file is not present in the file cache, because opening a file means that its index blocks need to be read into the cache.
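      
      The new accounting condenses to a small predicate; a sketch with assumed field names:
      
      ```cpp
      // Hypothetical sketch of when a Get() charges a file with a 'seek'.
      struct FileLookup {
        bool in_table_cache;    // table reader already open?
        bool bloom_ruled_out;   // bloom filter definitively excluded the key
        bool read_data_block;   // data block(s) were fetched from storage
      };
      
      bool ChargeSeek(const FileLookup& f) {
        // Opening a file reads its index blocks into cache: that is a seek.
        if (!f.in_table_cache) return true;
        // Otherwise charge only when the bloom filter could not rule the
        // key out and real data-block IO happened.
        return !f.bloom_ruled_out && f.read_data_block;
      }
      ```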
      
      Test Plan: unit test attached. I will probably add one more unit test.
      
      Reviewers: heyongqiang
      
      Reviewed By: heyongqiang
      
      CC: MarkCallaghan
      
      Differential Revision: https://reviews.facebook.net/D5709