Commits · d59549298fcd50e4aaa59af3dfc039d9a4db5623 · kvdb / rocksdb

04 May, 2018 1 commit

Skip deleted WALs during recovery · d5954929

Committed by Siying Dong on May 03, 2018

Summary:
This patch records the minimum log number to keep in the manifest while flushing SST files, so that any WAL older than that number is ignored during recovery. This avoids scenarios in which there is a gap between the WAL files fed to the recovery procedure. Such a gap could arise, for example, from out-of-order WAL deletion. It could cause problems in 2PC recovery, where the prepare and commit entries are placed in two separate WALs: a gap in the WALs could result in the WAL with the commit entry not being processed, breaking the 2PC recovery logic.

Before this commit, for the 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in the memtable. With this commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for the newly flushed SST file. This precomputed value is also remembered in memory, and is later used to determine whether a log file can be deleted. This value is unlikely to change until the next flush because the commit entry will stay in the memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. That is not done yet. Even if we did it, the only thing we lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway because the obsolete file clean-up function is only executed after a flush or compaction.)
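
The effect on recovery can be pictured with a minimal sketch. It is not RocksDB's actual code: it assumes the manifest yields a single minimum log number, and the function and parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: given the WAL numbers found on disk and the minimum
// log number recorded in the manifest, return the WALs to replay, in order.
std::vector<uint64_t> WalsToRecover(std::vector<uint64_t> wal_numbers,
                                    uint64_t min_log_number_to_keep) {
  std::sort(wal_numbers.begin(), wal_numbers.end());
  std::vector<uint64_t> result;
  for (uint64_t number : wal_numbers) {
    // WALs older than the recorded minimum are skipped, even if they are
    // still present on disk.
    if (number >= min_log_number_to_keep) {
      result.push_back(number);
    }
  }
  return result;
}
```

With WALs 7, 9, 10 and 12 on disk (8 already deleted out of order) and a recorded minimum of 10, only 10 and 12 are replayed, so the gap at 8 never reaches the recovery logic.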

This min log number to keep is stored in the manifest using the safely-ignorable custom field of the AddFile entry, in order to guarantee that a DB generated by a newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765

Differential Revision: D7747618

Pulled By: siying

fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729

d5954929

02 May, 2018 1 commit

avoid double delete on dummy record insertion failure · 6cab3184

Committed by Zhongyi Xie on May 01, 2018

Summary:
When the dummy record insertion fails, there is no need to explicitly delete the block as it will be registered for cleanup regardless.
Closes https://github.com/facebook/rocksdb/pull/3688

Differential Revision: D7537741

Pulled By: miasantreble

fbshipit-source-id: fcd3a3d3d382ee8e2c7ced0a4980e683d93a16d6

6cab3184

28 April, 2018 3 commits

expose WAL iterator in the C API · c9ace1d8

Committed by Victor Grishchenko on April 27, 2018

Summary:
A minor change: I wrapped TransactionLogIterator for the C API.
I needed that for the golang binding.
Closes https://github.com/facebook/rocksdb/pull/3304

Differential Revision: D6628736

Pulled By: miasantreble

fbshipit-source-id: 3374f3c64b1d7b225696b8767090917761e2f30a

c9ace1d8

Add max_subcompactions as a compaction option · ed7a95b2

Committed by Huachao Huang on April 27, 2018

Summary:
Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default.
I added a `max_subcompactions` option to `CompactionOptions` so that we can choose a proper concurrency dynamically.
Closes https://github.com/facebook/rocksdb/pull/3775
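
A minimal sketch of how a per-compaction value might override the DB-wide default. The structs below are hypothetical stand-ins, and treating 0 as "not set" is an assumption rather than something stated in this commit message.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the relevant fields of the two option structs.
struct DbOptionsSketch {
  uint32_t max_subcompactions = 1;
};
struct CompactionOptionsSketch {
  uint32_t max_subcompactions = 0;  // assumption: 0 means "not set"
};

// Pick the concurrency for one compaction: the per-call value if set,
// otherwise the DB-wide default.
uint32_t EffectiveSubcompactions(const DbOptionsSketch& db,
                                 const CompactionOptionsSketch& call) {
  return call.max_subcompactions > 0 ? call.max_subcompactions
                                     : db.max_subcompactions;
}
```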

Differential Revision: D7792357

Pulled By: ajkr

fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c

ed7a95b2

Rename pending_compaction_ to queued_for_compaction_. · 7dfbe335

Committed by Yanqin Jin on April 27, 2018

Summary:
We use `queued_for_flush_` to indicate a column family has been added to the
flush queue. Similarly and to be consistent in our naming, we need to use `queued_for_compaction_` to indicate a column family has been added to the compaction queue. In the past we used
`pending_compaction_` which can also be ambiguous.
Closes https://github.com/facebook/rocksdb/pull/3781

Differential Revision: D7790063

Pulled By: riversand963

fbshipit-source-id: 6786b11a4fcaea36dc9b4672233dbe042f921804

7dfbe335

27 April, 2018 2 commits

Rename pending_flush_ to queued_for_flush_. · 513b5ce6

Committed by Yanqin Jin on April 26, 2018

Summary:
With ColumnFamilyData::pending_flush_, we have the following code snippet in DBImpl::SchedulePendingFlush

```
if (!cfd->pending_flush() && cfd->imm()->IsFlushPending()) {
...
}
```

`Pending` is ambiguous, and I feel `queued_for_flush` is a better name,
especially for the sake of readability.
Closes https://github.com/facebook/rocksdb/pull/3777

Differential Revision: D7783066

Pulled By: riversand963

fbshipit-source-id: f1bd8c8bfe5eafd2c94da0d8566c9b2b6bb57229

513b5ce6

Sync parent directory after deleting a file in delete scheduler · 63c965cd

Committed by Siying Dong on April 26, 2018

Summary:
Sync the parent directory after deleting a file in the delete scheduler. Otherwise, trim speed may not be as smooth as we want.
Closes https://github.com/facebook/rocksdb/pull/3767

Differential Revision: D7760136

Pulled By: siying

fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f

63c965cd

26 April, 2018 1 commit

Rate limiter should be allowed to share between different rocksdb instances in C API · 7c9f23e6

Committed by Vincent Lee on April 25, 2018

Summary:
Currently, `rocksdb_options_set_ratelimiter` in `c.cc` changes the input to nil, which makes it impossible to use a shared rate limiter created by `rocksdb_ratelimiter_create` in different rocksdb options.

In this PR, I changed it to a shared pointer.
Closes https://github.com/facebook/rocksdb/pull/3758
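
The idea of sharing one limiter through a C-style handle can be sketched as follows. All type and function names here are hypothetical stand-ins and do not reproduce the code in `c.cc`.

```cpp
#include <memory>

// Hypothetical stand-ins; the real types live in RocksDB and its C binding.
struct RateLimiterSketch {
  long bytes_per_second = 0;
};
struct OptionsSketch {
  std::shared_ptr<RateLimiterSketch> rate_limiter;
};
// C-style handle that owns one shared reference to the limiter.
struct ratelimiter_handle_t {
  std::shared_ptr<RateLimiterSketch> rep;
};

ratelimiter_handle_t* ratelimiter_create(long bytes_per_second) {
  ratelimiter_handle_t* handle = new ratelimiter_handle_t;
  handle->rep = std::make_shared<RateLimiterSketch>();
  handle->rep->bytes_per_second = bytes_per_second;
  return handle;
}

// Copy the shared_ptr instead of taking ownership, so the handle stays
// valid and can be passed to other options objects.
void options_set_ratelimiter(OptionsSketch* opt, ratelimiter_handle_t* handle) {
  opt->rate_limiter = handle->rep;
}

void ratelimiter_destroy(ratelimiter_handle_t* handle) { delete handle; }
```

Because each options object holds its own reference, destroying the handle afterwards does not invalidate the limiter that the options already share.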

Differential Revision: D7749740

Pulled By: ajkr

fbshipit-source-id: c6121f8ca75402afdb4b295ce63c2338d253a1b5

7c9f23e6

24 April, 2018 2 commits

Improve write time breakdown stats · affe01b0

Committed by Mike Kolupaev on April 23, 2018

Summary:
There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes.

These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing.
Closes https://github.com/facebook/rocksdb/pull/3602

Differential Revision: D7251562

Pulled By: al13n321

fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d

affe01b0

Revert "Skip deleted WALs during recovery" · d5afa737

Committed by Siying Dong on April 23, 2018

Summary:
This reverts commit 73f21a7b.

It breaks compatibility. When a DB is created using a build with this new change, opening the DB and reading the data will fail with this error:

"Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory"

This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file.
Closes https://github.com/facebook/rocksdb/pull/3762

Differential Revision: D7730035

Pulled By: siying

fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77

d5afa737

21 April, 2018 3 commits

Add a stat for MultiGet keys found, update memtable hit/miss stats · dbdaa466

Committed by Anand Ananthabhotla on April 20, 2018

Summary:
1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the
number of keys successfully read
2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(). It was being done in
DBImpl::GetImpl(), but not MultiGet
Closes https://github.com/facebook/rocksdb/pull/3730

Differential Revision: D7677364

Pulled By: anand1976

fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5

dbdaa466

WritePrepared Txn: enable TryAgain for duplicates at the end of the batch · c3d1e36c

Committed by Maysam Yabandeh on April 20, 2018

Summary:
WriteBatch::Iterate will retry with a larger sequence number if the memtable reports a duplicate. This is signalled with the TryAgain status. So far the assumption was that the last entry in the batch will never return TryAgain, which is correct when the WAL is created via WritePrepared, since it always appends a batch separator if a natural one does not exist. However, when reading a WAL generated by WriteCommitted, this batch separator might not exist. Although WritePrepared is not supposed to be able to read a WAL generated by WriteCommitted, we should avoid confusing scenarios in which the behavior becomes unpredictable. The patch fixes that by allowing TryAgain even for the last entry of the write batch.
Closes https://github.com/facebook/rocksdb/pull/3747

Differential Revision: D7708391

Pulled By: maysamyabandeh

fbshipit-source-id: bfaddaa9b14a4cdaff6977f6f63c789a6ab1ee0d

c3d1e36c

Fix GitHub issue #3716: gcc-8 warnings · dee95a1a

Committed by przemyslaw.skibinski@percona.com on April 20, 2018

Summary:
Fix the following gcc-8 warnings:
- conflicting C language linkage declaration [-Werror]
- writing to an object with no trivial copy-assignment [-Werror=class-memaccess]
- array subscript -1 is below array bounds [-Werror=array-bounds]

Solves https://github.com/facebook/rocksdb/issues/3716
Closes https://github.com/facebook/rocksdb/pull/3736

Differential Revision: D7684161

Pulled By: yiwu-arbug

fbshipit-source-id: 47c0423d26b74add251f1d3595211eee1e41e54a

dee95a1a

20 April, 2018 1 commit

check return status for Sync() and Append() calls to avoid corruption · e1e826b9

Committed by Zhongyi Xie on April 19, 2018

Summary:
Right now in `SyncClosedLogs`, `CopyFile`, and `AddRecord`, where `Sync` and `Append` are invoked in a loop, the error statuses are not checked. This could lead to potential corruption as later calls will overwrite the error status.
Closes https://github.com/facebook/rocksdb/pull/3740
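
The pattern being fixed can be illustrated with a small sketch. `StatusSketch` and `RunAll` are hypothetical and only model the control flow; they are not RocksDB's `Status` class or any of the functions named above.

```cpp
#include <functional>
#include <string>
#include <vector>

// Minimal stand-in for a status object; not RocksDB's Status class.
struct StatusSketch {
  bool ok = true;
  std::string message;
};

// Run each step in order and stop at the first failure, so a later
// successful call cannot overwrite an earlier error.
StatusSketch RunAll(const std::vector<std::function<StatusSketch()>>& steps) {
  StatusSketch s;
  for (const auto& step : steps) {
    s = step();
    if (!s.ok) {
      break;
    }
  }
  return s;
}
```

Without the `break`, a failing call followed by a successful one would leave `s` reporting success, which is the overwrite described in the summary.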

Differential Revision: D7678848

Pulled By: miasantreble

fbshipit-source-id: 4b0b412975989dfe80348f73217b9c4122a4bd77

e1e826b9

19 April, 2018 1 commit

Add block cache related DB properties · ad511684

Committed by Yi Wu on April 18, 2018

Summary:
Add DB properties "rocksdb.block-cache-capacity", "rocksdb.block-cache-usage", "rocksdb.block-cache-pinned-usage" to show block cache usage.
Closes https://github.com/facebook/rocksdb/pull/3734

Differential Revision: D7657180

Pulled By: yiwu-arbug

fbshipit-source-id: dd34a019d5878dab539c51ee82669e97b2b745fd

ad511684

17 April, 2018 1 commit

Initialize a boolean member variable of a struct. · 5e488118

Committed by Yanqin Jin on April 16, 2018

Summary:
The reason for this initialization is that LLVM UBSAN check will fail due to
uninitialized bool. [StackOverflow post](https://stackoverflow.com/questions/31420154/runtime-error-load-of-value-127-which-is-not-a-valid-value-for-type-bool).

UBSAN log:
> ===== Running external_sst_file_basic_test
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from ExternalSSTFileBasicTest
[ RUN      ] ExternalSSTFileBasicTest.Basic
[       OK ] ExternalSSTFileBasicTest.Basic (6 ms)
[ RUN      ] ExternalSSTFileBasicTest.NoCopy
db/external_sst_file_ingestion_job.h:23:8: runtime error: load of value 253, which is not a valid value for type 'bool'

miasantreble  I've tested this locally using the following command.
```
TEST_TMPDIR=/dev/shm/rocksdb COMPILE_WITH_UBSAN=1 OPT=-g make J=1 -j8 ubsan_check
```

ajkr This PR is related to your review comment in [PR](https://github.com/facebook/rocksdb/pull/3713/). It turns out that, with UBSAN enabled, we must provide a default value for boolean member variables.
Closes https://github.com/facebook/rocksdb/pull/3728
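
A minimal illustration of the kind of fix described, using a hypothetical struct and member name rather than the ones in `external_sst_file_ingestion_job.h`:

```cpp
// Hypothetical struct: the default member initializer gives the flag a
// defined value even if no code path assigns it before it is read, which
// avoids the "not a valid value for type 'bool'" UBSAN report.
struct IngestedFileSketch {
  bool copy_performed = false;
};
```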

Differential Revision: D7642476

Pulled By: riversand963

fbshipit-source-id: 4c09a4b8d271151cb99ae7393db9e4ad9f29762e

5e488118

16 April, 2018 1 commit

fix memory leak in two_level_iterator · 954b496b

Committed by Zhongyi Xie on April 15, 2018

Summary:
This PR fixes a few contbuild failures:
1. ASAN memory leak in Block::NewIterator (table/block.cc:429). The proper destruction of first_level_iter_ and second_level_iter_ in two_level_iterator.cc has been missing since the refactoring in https://github.com/facebook/rocksdb/pull/3406
2. Various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662
3. Updated the comment for `ForceReleaseCachedEntry` to emphasize the use of the `force_erase` flag.
Closes https://github.com/facebook/rocksdb/pull/3718

Reviewed By: maysamyabandeh

Differential Revision: D7621192

Pulled By: miasantreble

fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146

954b496b

14 April, 2018 2 commits

add kEntryRangeDeletion · 31ee4bf2

Committed by zhangjinpeng1987 on April 13, 2018

Summary:
When there are many range deletions in a range, we want to trigger manual compaction on this range to reclaim disk space as soon as possible and speed up read.
After this change, we can collect information about range deletions and store it in user properties, which can guide our manual compaction.
Closes https://github.com/facebook/rocksdb/pull/3695

Differential Revision: D7570322

Pulled By: ajkr

fbshipit-source-id: c358fa43b0aac6cc954d2eadc7d3bd8015373369

31ee4bf2

Improve accuracy of I/O stats collection of external SST ingestion. · c81b0abe

Committed by Yanqin Jin on April 13, 2018

Summary:
RocksDB supports ingestion of external SSTs. If ingestion_options.move_files is true, RocksDB first tries to link the external SSTs when performing ingestion. If an external SST file resides on a different FS, or the underlying FS does not support hard links, RocksDB performs an actual file copy. However, no matter which choice is made, the current code increases bytes-written when updating compaction stats, which is inaccurate when RocksDB does NOT copy the file.

Rename a sync point.
Closes https://github.com/facebook/rocksdb/pull/3713

Differential Revision: D7604151

Pulled By: riversand963

fbshipit-source-id: dd0c0d9b9a69c7d9ffceafc3d9c23371aa413586

c81b0abe

13 April, 2018 1 commit

comment unused parameters to turn on -Wunused-parameter flag · 3be9b364

Committed by David Lai on April 12, 2018

Summary:
This PR comments out the rest of the unused arguments, which allows us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662

Differential Revision: D7426121

Pulled By: Dayvedde

fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352

3be9b364

12 April, 2018 1 commit

Improve visibility into the reasons for compaction. · d42bd041

Committed by Yanqin Jin on April 11, 2018

Summary:
Add `compaction_reason` as part of event log for event `compaction started`.
Add counters for each `CompactionReason`.
Closes https://github.com/facebook/rocksdb/pull/3679

Differential Revision: D7550348

Pulled By: riversand963

fbshipit-source-id: a19cff3a678c785aa5ef41aac78b9a5968fcc34d

d42bd041

11 April, 2018 3 commits

fix calling SetOptions on deprecated options · 019d7894

Committed by Andrew Kryczka on April 10, 2018

Summary:
In `cf_options_type_info`, the deprecated options are all considered to have offset zero in the `MutableCFOptions` struct. Previously we weren't checking in `GetMutableOptionsFromStrings` whether the provided option was deprecated or not and simply writing the provided value to the offset specified by `cf_options_type_info`. That meant setting any deprecated option would overwrite the first element in the struct, which is `write_buffer_size`. `db_stress` hit this often since it calls `SetOptions` with `soft_rate_limit=0` and `hard_rate_limit=0`, which are both deprecated so cause `write_buffer_size` to be set to zero, which causes it to crash on the following assertion:

```
db_stress: db/memtable.cc:106: rocksdb::MemTable::MemTable(const rocksdb::InternalKeyComparator&, const rocksdb::ImmutableCFOptions&, const rocksdb::MutableCFOptions&, rocksdb::WriteBufferManager*, rocksdb::SequenceNumber, uint32_t): Assertion `!ShouldScheduleFlush()' failed.
```

We fix it by skipping deprecated options (and logging a warning) when users provide them to `SetOptions`. I didn't want to fail the call for compatibility reasons.
Closes https://github.com/facebook/rocksdb/pull/3700
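
The failure mode and the fix can be modeled with a heavily simplified sketch. The struct, the table layout, and the function below are hypothetical; they only mirror the idea that deprecated entries carry a placeholder offset of zero and must be skipped rather than written through.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical, heavily simplified version of a mutable-options struct.
struct MutableOptionsSketch {
  uint64_t write_buffer_size = 64 << 20;
  uint64_t max_bytes_for_level_base = 256 << 20;
};

struct OptionInfoSketch {
  size_t offset;    // byte offset of the field; 0 for deprecated options
  bool deprecated;
};

// Returns false for unknown names. Deprecated names are accepted but
// skipped, so they never write through their placeholder offset of zero.
bool SetOptionSketch(const std::map<std::string, OptionInfoSketch>& table,
                     const std::string& name, uint64_t value,
                     MutableOptionsSketch* opts) {
  auto it = table.find(name);
  if (it == table.end()) return false;
  if (it->second.deprecated) return true;  // a real version would log a warning
  char* base = reinterpret_cast<char*>(opts);
  *reinterpret_cast<uint64_t*>(base + it->second.offset) = value;
  return true;
}
```

In this model, setting a deprecated name such as `soft_rate_limit` to 0 succeeds but leaves `write_buffer_size` untouched, whereas writing through offset zero would have overwritten it.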

Differential Revision: D7572596

Pulled By: ajkr

fbshipit-source-id: bd5d84e14c0c39f30c5d4c6df7c1503d2c28ecf1

019d7894

fix some text in comments. · d95014b9

Committed by Yanqin Jin on April 10, 2018

Summary:
1. Remove redundant text.
2. Make terminology consistent across all comments and doc of RocksDB. Also do
   our best to conform to conventions. Specifically, use 'callback' instead of
   'call-back' [wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming)).
Closes https://github.com/facebook/rocksdb/pull/3693

Differential Revision: D7560396

Pulled By: riversand963

fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8

d95014b9

make MockTimeEnv::current_time_ atomic to fix data race · 2770a94c

Committed by Zhongyi Xie on April 10, 2018

Summary:
fix a new TSAN failure
https://gist.github.com/miasantreble/7599c33f4e17da1024c67d4540dbe397
Closes https://github.com/facebook/rocksdb/pull/3694
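
A minimal sketch of the technique named in the title, using a hypothetical mock clock class rather than the actual `MockTimeEnv`:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical mock clock; the member is atomic so one thread can advance
// the time while another reads it without a data race.
class MockClockSketch {
 public:
  void set_current_time(int64_t t) { current_time_.store(t); }
  int64_t current_time() const { return current_time_.load(); }

 private:
  std::atomic<int64_t> current_time_{0};
};
```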

Differential Revision: D7565310

Pulled By: miasantreble

fbshipit-source-id: f672c96e925797b34dec6e20b59527e8eebaa825

2770a94c

10 April, 2018 3 commits

Change a comment · 65fe8d6c

Committed by Gihwan Oh on April 09, 2018

Summary:
In this case, we add input files of compaction, not outputs.
Closes https://github.com/facebook/rocksdb/pull/3686

Differential Revision: D7556781

Pulled By: ajkr

fbshipit-source-id: ae135bb6eda60db8f275a9ba2d21c18aaadef5b7

65fe8d6c

fix intra-L0 FIFO for uncompressed use case · 1c27cbfb

Committed by Andrew Kryczka on April 09, 2018

Summary:
- inflate the argument passed as `max_compact_bytes_per_del_file` by a bit (10%). The intent of this argument is to prevent L0 files from being intra-L0 compacted multiple times. Without compression, some intra-L0 compactions exceed this limit (and thus aren't executed), even though none of their files have gone through intra-L0 before.
- fix `FindIntraL0Compaction` as it was rejecting some valid intra-L0 compactions. In particular, `compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len), whereas `new_compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len+1). The former is more correct for checking whether we've found an eligible span.
Closes https://github.com/facebook/rocksdb/pull/3684
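
The span selection described in the second bullet can be sketched over file sizes alone. This is a simplified model, not the real `FindIntraL0Compaction`: the function name and its return convention are hypothetical, and details such as skipping files that are already being compacted are omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Returns the span length chosen for an intra-L0 compaction over files
// [0, span_len), or 0 if no eligible span exists.
size_t PickIntraL0Span(const std::vector<uint64_t>& file_sizes,
                       size_t min_files_to_compact,
                       uint64_t max_compact_bytes_per_del_file) {
  if (file_sizes.empty()) return 0;
  uint64_t compact_bytes = file_sizes[0];
  uint64_t compact_bytes_per_del_file = std::numeric_limits<uint64_t>::max();
  size_t span_len;
  // Grow the span while the work per deleted file does not increase.
  for (span_len = 1; span_len < file_sizes.size(); ++span_len) {
    compact_bytes += file_sizes[span_len];
    // Work per deleted file for the span [0, span_len + 1).
    uint64_t new_compact_bytes_per_del_file = compact_bytes / span_len;
    if (new_compact_bytes_per_del_file > compact_bytes_per_del_file) break;
    compact_bytes_per_del_file = new_compact_bytes_per_del_file;
  }
  // Check eligibility with the value for the span actually chosen,
  // [0, span_len), not the one that caused the loop to stop.
  if (span_len >= min_files_to_compact &&
      compact_bytes_per_del_file < max_compact_bytes_per_del_file) {
    return span_len;
  }
  return 0;
}
```

For sizes 100, 100, 100, 1000 the loop stops at a span of three files with 150 bytes of work per deleted file. A limit of exactly 150 rejects that span, while a limit inflated by 10% to 165 accepts it, which is the situation the first bullet addresses.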

Differential Revision: D7530396

Pulled By: ajkr

fbshipit-source-id: cad4f50902bdc428ac9ff6fffb13eb288648d85e

1c27cbfb

fix data race · f3a1d9e0

Committed by Zhongyi Xie on April 09, 2018

Summary:
Fix a TSAN failure in `DBRangeDelTest.ValidLevelSubcompactionBoundaries`:
https://gist.github.com/miasantreble/712e04b4de2ff7f193c98b1acf07e899
Closes https://github.com/facebook/rocksdb/pull/3691

Differential Revision: D7541400

Pulled By: miasantreble

fbshipit-source-id: b0b4538980bce7febd0385e61d6e046580bcaefb

f3a1d9e0

08 April, 2018 1 commit

WritePrepared Txn: add stats · bde1c1a7

Committed by Maysam Yabandeh on April 07, 2018

Summary:
Add some stats that help monitor whether the DB has entered unlikely states that would hurt performance. These are mostly cases where we end up needing to acquire a mutex.
Closes https://github.com/facebook/rocksdb/pull/3683

Differential Revision: D7529393

Pulled By: maysamyabandeh

fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9

bde1c1a7

07 April, 2018 1 commit

Fix typo · 74767dee

Committed by Gihwan Oh on April 06, 2018

Summary:
regrad -> regard
Closes https://github.com/facebook/rocksdb/pull/3685

Differential Revision: D7540952

Pulled By: miasantreble

fbshipit-source-id: e08c9389f7fccf401c962a4441b62cd5e73a33ad

74767dee

06 April, 2018 3 commits

Support for Column family specific paths. · 446b32cf

Committed by Phani Shekhar Mantripragada on April 05, 2018

Summary:
In this change, an option to set different paths for different column families is added.
This option is set via the cf_paths setting of ColumnFamilyOptions. It works in a similar fashion to the db_paths setting. cf_paths is a vector of DbPath values, each of which contains a pair of the absolute path and target size. Multiple levels in a column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
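
The fallback order described above can be sketched as follows. The struct and function are hypothetical, and the target size given to the default path is an illustrative assumption.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirror of a (path, target size) pair.
struct PathSketch {
  std::string path;
  uint64_t target_size;
};

// Resolve the paths for one column family using the fallback order
// cf_paths, then db_paths, then the DB name itself.
std::vector<PathSketch> EffectivePaths(const std::vector<PathSketch>& cf_paths,
                                       const std::vector<PathSketch>& db_paths,
                                       const std::string& db_name) {
  if (!cf_paths.empty()) return cf_paths;
  if (!db_paths.empty()) return db_paths;
  // Assumption: the target size for this single default path is
  // illustrative only.
  return {PathSketch{db_name, UINT64_MAX}};
}
```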

Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102

Differential Revision: D6951697

Pulled By: ajkr

fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d

446b32cf

Fix pre_release callback argument list. · 147dfc7b

Committed by Dmitri Smirnov on April 05, 2018

Summary:
Constness of primitive-type parameters does not affect the signature of the
  method and has no influence on whether the overriding method would
  actually have that const bool instead of just bool. In addition,
  it is rarely useful but does produce compatibility warnings
  in the VS 2015 compiler.
Closes https://github.com/facebook/rocksdb/pull/3663
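
The language rule behind this change can be shown with a small sketch. The class and parameter names are hypothetical and are not the callback interface touched by this commit.

```cpp
// Hypothetical callback interface; the parameter is declared with a
// top-level const here.
class CallbackSketch {
 public:
  virtual ~CallbackSketch() {}
  virtual int Invoke(const bool flag) = 0;
};

// The override drops the const. Both declarations have the same function
// type, int(bool), so this compiles and overrides the base method.
class CallbackImplSketch : public CallbackSketch {
 public:
  int Invoke(bool flag) override { return flag ? 1 : 0; }
};
```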

Differential Revision: D7475739

Pulled By: ajkr

fbshipit-source-id: fb275378b5acc397399420ae6abb4b6bfe5bd32f

147dfc7b

fix build for rocksdb lite · c827b2dc

Committed by Zhongyi Xie on April 05, 2018

Summary:
currently rocksdb lite build fails due to the following errors:
> db/db_sst_test.cc:29:51: error: ‘FlushJobInfo’ does not name a type
   virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
                                                   ^
db/db_sst_test.cc:29:16: error: ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’ marked ‘override’, but does not override
   virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
                ^
db/db_sst_test.cc:24:7: error: ‘class rocksdb::FlushedFileCollector’ has virtual functions and accessible non-virtual destructor [-Werror=non-virtual-dtor]
 class FlushedFileCollector : public EventListener {
       ^
db/db_sst_test.cc: In member function ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’:
db/db_sst_test.cc:31:35: error: request for member ‘file_path’ in ‘info’, which is of non-class type ‘const int’
     flushed_files_.push_back(info.file_path);
                                   ^
cc1plus: all warnings being treated as errors
make: *** [db/db_sst_test.o] Error 1
Closes https://github.com/facebook/rocksdb/pull/3676

Differential Revision: D7493006

Pulled By: miasantreble

fbshipit-source-id: 77dff0a5b23e27db51be9b9798e3744e6fdec64f

c827b2dc

05 April, 2018 1 commit

Ttl-triggered and snapshot-release-triggered compactions should not be manual compactions · 7d906799

Committed by Sagar Vemuri on April 05, 2018

Summary:
Ttl-triggered and snapshot-release-triggered compactions should not be considered as manual compactions. This is a bug.
Closes https://github.com/facebook/rocksdb/pull/3678

Differential Revision: D7498151

Pulled By: sagar0

fbshipit-source-id: a2d5bed05268a4dc93d54ea97a9ae44b366df15d

7d906799

03 April, 2018 4 commits

Level Compaction with TTL · 04c11b86

Committed by Sagar Vemuri on April 02, 2018

Summary:
Level Compaction with TTL.

As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.

Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.
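
The selection rule can be sketched as a simple age check. The struct and function are hypothetical, and treating a TTL of 0 as "disabled" is an assumption; the real picker also has to respect levels and other background work.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical file record: only the creation time matters here.
struct FileSketch {
  uint64_t number;
  uint64_t creation_time;  // seconds since epoch
};

// Return the numbers of files older than ttl_seconds at time `now`.
// Assumption: ttl_seconds == 0 disables the feature.
std::vector<uint64_t> FilesExpiredByTtl(const std::vector<FileSketch>& files,
                                        uint64_t now, uint64_t ttl_seconds) {
  std::vector<uint64_t> expired;
  if (ttl_seconds == 0 || now < ttl_seconds) return expired;
  const uint64_t cutoff = now - ttl_seconds;
  for (const FileSketch& f : files) {
    if (f.creation_time < cutoff) expired.push_back(f.number);
  }
  return expired;
}
```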

This functionality can be controlled by the newly introduced column family option -- ttl.

TODO for later:
- Make ttl mutable
- Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
- Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
Closes https://github.com/facebook/rocksdb/pull/3591

Differential Revision: D7275442

Pulled By: sagar0

fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267

04c11b86

WritePrepared Txn: smallest_prepare optimization · b225de7e

Committed by Maysam Yabandeh on April 02, 2018

Summary:
This is an optimization to reduce lookups in the CommitCache when querying IsInSnapshot. The optimization takes the smallest uncommitted data at the time the snapshot was taken, and if the sequence number of the read data is lower than that number, it treats the data as committed.
To implement this optimization two changes are required: i) The AddPrepared function must be called sequentially to avoid out of order insertion in the PrepareHeap (otherwise the top of the heap does not indicate the smallest prepare in future too), ii) non-2PC transactions also call AddPrepared if they do not commit in one step.
Closes https://github.com/facebook/rocksdb/pull/3649
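
The fast path can be pictured with a simplified sketch. It is not the real `IsInSnapshot`: the snapshot struct, the map standing in for the commit cache, and the lookup counter are all hypothetical.

```cpp
#include <cstdint>
#include <map>

// Hypothetical snapshot record: its sequence number plus the smallest
// uncommitted (prepared) sequence number at the time it was taken.
struct SnapshotSketch {
  uint64_t seq;
  uint64_t min_uncommitted;
};

// Decide whether data written at prepare_seq is visible to the snapshot.
// commit_map stands in for the commit cache: prepare seq -> commit seq.
bool IsInSnapshotSketch(uint64_t prepare_seq, const SnapshotSketch& snap,
                        const std::map<uint64_t, uint64_t>& commit_map,
                        int* lookups) {
  if (prepare_seq > snap.seq) return false;  // written after the snapshot
  // Fast path: everything below min_uncommitted had already committed
  // when the snapshot was taken, so no lookup is needed.
  if (prepare_seq < snap.min_uncommitted) return true;
  ++*lookups;
  auto it = commit_map.find(prepare_seq);
  return it != commit_map.end() && it->second <= snap.seq;
}
```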

Differential Revision: D7388630

Pulled By: maysamyabandeh

fbshipit-source-id: b79506238c17467d590763582960d4d90181c600

b225de7e

Enable cancelling manual compactions if they hit the sfm size limit · 1579626d

Committed by Amy Tai on April 02, 2018

Summary:
Manual compactions should be cancelled, just like scheduled compactions are cancelled, if sfm->EnoughRoomForCompaction is not true.
Closes https://github.com/facebook/rocksdb/pull/3670

Differential Revision: D7457683

Pulled By: amytai

fbshipit-source-id: 669b02fdb707f75db576d03d2c818fb98d1876f5

1579626d

Revert "Avoid adding tombstones of the same file to RangeDelAggregato… · 44653c7b

Committed by Zhongyi Xie on April 02, 2018

Summary:
…r multiple times"

This reverts commit e80709a3.

lingbin PR https://github.com/facebook/rocksdb/pull/3635 is causing some performance regression for seekrandom workloads
I'm reverting the commit for now but feel free to submit new patches 😃

To reproduce the regression, you can run the following db_bench command
> ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none

write stats printed by db_bench:

Build | P50 | P75 | P99 | P99.9 | P99.99
--- | --- | --- | --- | --- | ---
revert commit | 80.77 | 102.94 | 1786.44 | 1892.39 | 2645.10
keep commit | 221.72 | 686.62 | 1842.57 | 1899.70 | 2814.29
Closes https://github.com/facebook/rocksdb/pull/3672

Differential Revision: D7463315

Pulled By: miasantreble

fbshipit-source-id: 8e779c87591127f2c3694b91a56d9b459011959d

44653c7b

31 March, 2018 2 commits

Throw NoSpace instead of IOError when out of space. · d12112d0

Committed by Fosco Marotto on March 30, 2018

Summary:
Replaces #1702 and is updated from feedback.
Closes https://github.com/facebook/rocksdb/pull/3531

Differential Revision: D7457395

Pulled By: gfosco

fbshipit-source-id: 25a21dd8cfa5a6e42e024208b444d9379d920c82

d12112d0

Skip deleted WALs during recovery · 73f21a7b

Committed by Maysam Yabandeh on March 30, 2018

Summary:
This patch records the deleted WAL numbers in the manifest so that they, and any WAL older than them, are ignored during recovery. This avoids scenarios in which there is a gap between the WAL files fed to the recovery procedure. Such a gap could arise, for example, from out-of-order WAL deletion. It could cause problems in 2PC recovery, where the prepare and commit entries are placed in two separate WALs: a gap in the WALs could result in the WAL with the commit entry not being processed, breaking the 2PC recovery logic.
Closes https://github.com/facebook/rocksdb/pull/3488

Differential Revision: D6967893

Pulled By: maysamyabandeh

fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1

73f21a7b

30 March, 2018 1 commit

WritePrepared Txn: fix a bug in publishing recoverable state seq · 89d989ed

Committed by Maysam Yabandeh on March 29, 2018

Summary:
When using two_write_queue, the published seq and the last allocated sequence could be ahead of the LastSequence, even if both write queues are stopped as in WriteRecoverableState. The patch fixes a bug in WriteRecoverableState in which LastSequence was used as a reference but the result was applied to last fetched sequence and last published seq.
Closes https://github.com/facebook/rocksdb/pull/3665

Differential Revision: D7446099

Pulled By: maysamyabandeh

fbshipit-source-id: 1449bed9aed8e9db6af85946efd347cb8efd3c0b

89d989ed
