提交 · f6082d194439c0fc3992ab343d63e2389f867ee5 · kvdb / rocksdb

01 11月, 2017 3 次提交

Blob DB: cleanup unused options · f6082d19

由 Yi Wu 提交于 10月 31, 2017

Summary:
* cleanup num_concurrent_simple_blobs. We don't do concurrent writes (by taking write_mutex_) so it doesn't make sense to have multiple non TTL files open. We can revisit later when we want to improve writes.
* cleanup eviction callback. we don't have plan to use it now.
* rename s/open_simple_blob_files_/open_non_ttl_file_/ and s/open_blob_files_/open_ttl_files_/ to avoid confusion.
Closes https://github.com/facebook/rocksdb/pull/3088

Differential Revision: D6182598

Pulled By: yiwu-arbug

fbshipit-source-id: 99e6f5e01fa66d31309cdb06ce48502464bac6ad

f6082d19

Blob DB: Initialize all fields in Blob Header, Footer and Record structs · f5078dde

由 Sagar Vemuri 提交于 10月 31, 2017

Summary:
Fixing un-itializations caught by valgrind.
Closes https://github.com/facebook/rocksdb/pull/3103

Differential Revision: D6200195

Pulled By: sagar0

fbshipit-source-id: bf35a3fb03eb1d308e4c5ce30dee1e345d7b03b3

f5078dde

Make writable_file_max_buffer_size dynamic · 33c7d4cc

由 Shaohua Li 提交于 10月 31, 2017

Summary:
The DBOptions::writable_file_max_buffer_size can be changed dynamically.
Closes https://github.com/facebook/rocksdb/pull/3053

Differential Revision: D6152720

Pulled By: shligit

fbshipit-source-id: aa0c0cfcfae6a54eb17faadb148d904797c68681

33c7d4cc

31 10月, 2017 2 次提交

Fix removed structurally dead return statement · c1be8d86

由 Prashant D 提交于 10月 31, 2017

Summary:
There seems to be a typo mistake in env ReuseWritableFile func
where status is being returned twice.
Closes https://github.com/facebook/rocksdb/pull/3099

Differential Revision: D6196204

Pulled By: ajkr

fbshipit-source-id: abb6e3e1c1e772dd485fc39e7f1b9d502fa188fe

c1be8d86

db_stress snapshot compatibility with reopens · 4d43c6a6

由 Andrew Kryczka 提交于 10月 31, 2017

Summary:
- Release all snapshots before crashing and reopening the DB. Without this, we may attempt to release snapshots from an old DB using a new DB. That tripped an assertion.
- Release multiple snapshots in the same operation if needed. Without this, we would sometimes leak snapshots.
Closes https://github.com/facebook/rocksdb/pull/3098

Differential Revision: D6194923

Pulled By: ajkr

fbshipit-source-id: b9c89bcca7ebcbb6c7802c616f9d1175a005aadf

4d43c6a6

30 10月, 2017 1 次提交

fix tracking oldest snapshot for bottom-level compaction · b7bc9cc0

由 Andrew Kryczka 提交于 10月 30, 2017

Summary:
The assertion was caught by `MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/5` when run in a loop. The caller doesn't track whether the released snapshot is oldest, so let this function handle that case.
Closes https://github.com/facebook/rocksdb/pull/3080

Differential Revision: D6185257

Pulled By: ajkr

fbshipit-source-id: 4b3015c11db5d31e46521a00af568546ef4558cd

b7bc9cc0

29 10月, 2017 1 次提交

Return Status::InvalidArgument if user request sync write while disabling WAL · 792ef10c

由 Yi Wu 提交于 10月 28, 2017

Summary:
write_options.sync = true and write_options.disableWAL is incompatible. When WAL is disabled, we are not able to persist the write immediately. Return an error in this case to avoid misuse of the options.
Closes https://github.com/facebook/rocksdb/pull/3086

Differential Revision: D6176822

Pulled By: yiwu-arbug

fbshipit-source-id: 1eb10028c14fe7d7c13c8bc12c0ef659f75aa071

792ef10c

28 10月, 2017 7 次提交

always drop tombstones compacted to bottommost level · 6a9335db

由 Andrew Kryczka 提交于 10月 27, 2017

Summary:
Problem was in bottommost compaction, when an L0->L0 compaction happened and L0 was bottommost. Then we'd preserve tombstones according to `Compaction::KeyNotExistsBeyondOutputLevel`, while zeroing seqnum according to `CompactionIterator::PrepareOutput`, thus triggering the assertion in `PrepareOutput`. To fix, we can just drop tombstones in L0->L0 when the output is "bottommost", i.e., the compaction includes the oldest L0 file and there's nothing at lower levels.
Closes https://github.com/facebook/rocksdb/pull/3085

Differential Revision: D6175742

Pulled By: ajkr

fbshipit-source-id: 8ab19a2e001496f362e9eb0a71757e2f6ecfdb3b

6a9335db

TableProperty::oldest_key_time defaults to 0 · 84a04af9

由 Yi Wu 提交于 10月 27, 2017

Summary:
We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0.

Also revert db_sst_test back to before #2842.
Closes https://github.com/facebook/rocksdb/pull/3079

Differential Revision: D6165702

Pulled By: yiwu-arbug

fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0

84a04af9

Mark files as trash by using .trash extension · 05993155

由 Islam AbdelRahman 提交于 10月 27, 2017

Summary:
SstFileManager move files that need to be deleted into a trash directory.
Deprecate this behaviour and instead add ".trash" extension to files that need to be deleted
Closes https://github.com/facebook/rocksdb/pull/2970

Differential Revision: D5976805

Pulled By: IslamAbdelRahman

fbshipit-source-id: 27374ece4315610b2792c30ffcd50232d4c9a343

05993155

Blob DB: update blob file format · 3ebb7ba7

由 Yi Wu 提交于 10月 27, 2017

Summary:
Changing blob file format and some code cleanup around the change. The change with blob log format are:
* Remove timestamp field in blob file header, blob file footer and blob records. The field is not being use and often confuse with expiration field.
* Blob file header now come with column family id, which always equal to default column family id. It leaves room for future support of column family.
* Compression field in blob file header now is a standalone byte (instead of compact encode with flags field)
* Blob file footer now come with its own crc.
* Key length now being uint64_t instead of uint32_t
* Blob CRC now checksum both key and value (instead of value only).
* Some reordering of the fields.

The list of cleanups:
* Better inline comments in blob_log_format.h
* rename ttlrange_t and snrange_t to ExpirationRange and SequenceRange respectively.
* simplify blob_db::Reader
* Move crc checking logic to inside blob_log_format.cc
Closes https://github.com/facebook/rocksdb/pull/3081

Differential Revision: D6171304

Pulled By: yiwu-arbug

fbshipit-source-id: e4373e0d39264441b7e2fbd0caba93ddd99ea2af

3ebb7ba7

Enable cacheline_aligned_alloc() to allocate from jemalloc if enabled. · 682db813

由 Dmitri Smirnov 提交于 10月 27, 2017

Summary:
Reuse WITH_JEMALLOC option in preparation for module search unification.
  Move jemalloc overrides into a separate .cc
  Remote obsolete JEMALLOC_NOINIT option.
Closes https://github.com/facebook/rocksdb/pull/3078

Differential Revision: D6174826

Pulled By: yiwu-arbug

fbshipit-source-id: 9970a0289b4490272d15853920d9d7531af91140

682db813

Fix coverity uninitialized fields warnings in lru_cache · d9240b54

由 Prashant D 提交于 10月 27, 2017

Summary:
Coverity uninitialized member variable warnings in lru_cache
Closes https://github.com/facebook/rocksdb/pull/3082

Differential Revision: D6173062

Pulled By: sagar0

fbshipit-source-id: 7bcfc653457bd362d46045d06527838c9a6adad6

d9240b54

Fix coverity issues column_family, compaction_db/iterator · 50e95a63

由 Prashant D 提交于 10月 27, 2017

Summary:
db/column_family.h :
79 ColumnFamilyHandleInternal()

CID 1322806 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member internal_cfd_ is not initialized in this constructor nor in any functions that it calls.
80 : ColumnFamilyHandleImpl(nullptr, nullptr, nullptr) {}

db/compacted_db_impl.cc:
18CompactedDBImpl::CompactedDBImpl(
19 const DBOptions& options, const std::string& dbname)
20 : DBImpl(options, dbname) {
2. uninit_member: Non-static class member cfd_ is not initialized in this constructor nor in any functions that it calls.
4. uninit_member: Non-static class member version_ is not initialized in this constructor nor in any functions that it calls.

CID 1396120 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
6. uninit_member: Non-static class member user_comparator_ is not initialized in this constructor nor in any functions that it calls.
21}

db/compaction_iterator.cc:
9. uninit_member: Non-static class member current_user_key_sequence_ is not initialized in this constructor nor in any functions that it calls.
11. uninit_member: Non-static class member current_user_key_snapshot_ is not initialized in this constructor nor in any functions that it calls.

CID 1419855 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
13. uninit_member: Non-static class member current_key_committed_ is not initialized in this constructor nor in any functions that it calls.
Closes https://github.com/facebook/rocksdb/pull/3084

Differential Revision: D6172999

Pulled By: sagar0

fbshipit-source-id: 084d73393faf8022c01359cfb445807b6a782460

50e95a63

27 10月, 2017 4 次提交

Fix coverity uninitialized fields warnings · 47166bae

由 Prashant D 提交于 10月 26, 2017

Pulled By: ajkr

Differential Revision: D6170448

fbshipit-source-id: 5fd6d1608fc0df27c94d9f5059315ce7f79b8f5c

47166bae

Fix coverity issue for MutableDBOptions default constructor · 67b29e26

由 Prashant D 提交于 10月 26, 2017

Summary:
228MutableDBOptions::MutableDBOptions()
229    : max_background_jobs(2),
230      base_background_compactions(-1),
231      max_background_compactions(-1),
232      avoid_flush_during_shutdown(false),
233      delayed_write_rate(2 * 1024U * 1024U),
234      max_total_wal_size(0),
235      delete_obsolete_files_period_micros(6ULL * 60 * 60 * 1000000),
236      stats_dump_period_sec(600),
   	2. uninit_member: Non-static class member bytes_per_sync is not initialized in this constructor nor in any functions that it calls.

CID 1419857 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
4. uninit_member: Non-static class member wal_bytes_per_sync is not initialized in this constructor nor in any functions that it calls.
237      max_open_files(-1) {}
Closes https://github.com/facebook/rocksdb/pull/3069

Differential Revision: D6170424

Pulled By: ajkr

fbshipit-source-id: 1f94e86b87611ad2330b8b1707911150978d68b8

67b29e26

implement lower bound for iterators · 95667383

由 Andrew Kryczka 提交于 10月 26, 2017

Summary:
- for `SeekToFirst()`, just convert it to a regular `Seek()` if lower bound is specified
- for operations that iterate backwards over user keys (`SeekForPrev`, `SeekToLast`, `Prev`), change `PrevInternal` to check whether user key went below lower bound every time the user key changes -- same approach we use to ensure we stay within a prefix when `prefix_same_as_start=true`.
Closes https://github.com/facebook/rocksdb/pull/3074

Differential Revision: D6158654

Pulled By: ajkr

fbshipit-source-id: cb0e3a922e2650d2cd4d1c6e1c0f1e8b729ff518

95667383

Blob DB: Inline small values in base DB · 5a2a6483

由 Yi Wu 提交于 10月 26, 2017

Summary:
Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values.

Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches.

See blob_index.h for the new blob index format. There are 4 cases when writing a new key:
* small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue)
* small value w/ TTL: put (type, expiration, value) to base db.
* large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db.
* large value w/TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db.
Closes https://github.com/facebook/rocksdb/pull/3066

Differential Revision: D6142115

Pulled By: yiwu-arbug

fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0

5a2a6483

26 10月, 2017 3 次提交

single-file bottom-level compaction when snapshot released · 9b18cc23

由 Andrew Kryczka 提交于 10月 25, 2017

Summary:
When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.

- Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
- Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
- Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
- Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
Closes https://github.com/facebook/rocksdb/pull/3009

Differential Revision: D6062044

Pulled By: ajkr

fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21

9b18cc23

Return write error on reaching blob dir size limit · 96e3a600

由 Sagar Vemuri 提交于 10月 25, 2017

Summary:
I found that we continue accepting writes even when the blob db goes beyond the configured blob directory size limit. Now, we return an error for writes on reaching `blob_dir_size` limit and if `is_fifo` is set to false. (We cannot just drop any file when `is_fifo` is true.)

Deleting the oldest file when `is_fifo` is true will be handled in a later PR.
Closes https://github.com/facebook/rocksdb/pull/3060

Differential Revision: D6136156

Pulled By: sagar0

fbshipit-source-id: 2f11cb3f2eedfa94524fbfa2613dd64bfad7a23c

96e3a600

Fix tombstone scans in SeekForPrev outside prefix · addfe1ef

由 Islam AbdelRahman 提交于 10月 25, 2017

Summary:
When doing a Seek() or SeekForPrev() we should stop the moment we see a key with a different prefix as start if ReadOptions:: prefix_same_as_start was set to true

Right now we don't stop if we encounter a tombstone outside the prefix while executing SeekForPrev()
Closes https://github.com/facebook/rocksdb/pull/3067

Differential Revision: D6149638

Pulled By: IslamAbdelRahman

fbshipit-source-id: 7f659862d2bf552d3c9104a360c79439ceba2f18

addfe1ef

25 10月, 2017 1 次提交

Fix build on OpenBSD · 386a57e6

由 zach shipko 提交于 10月 24, 2017

Summary:
A few simple changes to allow RocksDB to be built on OpenBSD. Let me know if any further changes are needed.
Closes https://github.com/facebook/rocksdb/pull/3061

Differential Revision: D6138800

Pulled By: ajkr

fbshipit-source-id: a13a17b5dc051e6518bd56a8c5efd1d24dd81b0c

386a57e6

24 10月, 2017 4 次提交

added missing subcodes and improved error message for missing enum values · 57fcdc26

由 zawlazaw 提交于 10月 23, 2017

Summary:
Java's `Status.SubCode` was out of sync with `include/rocksdb/status.h:SubCode`.

When running out of disc space this led to an `IllegalArgumentException` because of an invalid status code, rather than just returning the corresponding status code without an exception.

I added the missing status codes.

By this, we keep the behaviour of throwing an `IllegalArgumentException` in case of newly added status codes that are defined in C but not in Java.

We could think of an alternative strategy: add in Java another code "UnknownCode" which acts as a catch-all for all those status codes that are not yet mirrored from C to Java. This approach would never throw an exception but simply return a non-OK status-code.

I think the current approach of throwing an Exception in case of a C/Java inconsistency is fine, but if you have some opinion on the alternative strategy, then feel free to comment here.
Closes https://github.com/facebook/rocksdb/pull/3050

Differential Revision: D6129682

Pulled By: sagar0

fbshipit-source-id: f2bf44caad650837cffdcb1f93eb793b43580c66

57fcdc26

Add DB::Properties::kEstimateOldestKeyTime · 66a2c44e

由 Yi Wu 提交于 10月 23, 2017

Summary:
With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.

My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
Closes https://github.com/facebook/rocksdb/pull/2842

Differential Revision: D5770600

Pulled By: yiwu-arbug

fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a

66a2c44e

Fix unused var warnings in Release mode · d2a65c59

由 Dmitri Smirnov 提交于 10月 23, 2017

Summary:
MSVC does not support unused attribute at this time. A separate assignment line fixes the issue probably by being counted as usage for MSVC and it no longer complains about unused var.
Closes https://github.com/facebook/rocksdb/pull/3048

Differential Revision: D6126272

Pulled By: maysamyabandeh

fbshipit-source-id: 4907865db45fd75a39a15725c0695aaa17509c1f

d2a65c59

Enable two write queues for transactions · 63822eb7

由 Maysam Yabandeh 提交于 10月 23, 2017

Summary:
Enable concurrent_prepare flag for WritePrepared transactions and extend the existing transaction tests with this config.
Closes https://github.com/facebook/rocksdb/pull/3046

Differential Revision: D6106534

Pulled By: maysamyabandeh

fbshipit-source-id: 88c8d21d45bc492beb0a131caea84a2ac5e7d38c

63822eb7

21 10月, 2017 6 次提交

Exclude DBTest.DynamicFIFOCompactionOptions test under RocksDB Lite · a02ed126

由 Sagar Vemuri 提交于 10月 20, 2017

Summary:
This test shouldn't be enabled under the lite version; and this fixes the failing contrun test due to #3006.
Closes https://github.com/facebook/rocksdb/pull/3056

Differential Revision: D6114681

Pulled By: sagar0

fbshipit-source-id: dc5243549ae6b1353cec7edb820c771d95f66dda

a02ed126

Split CompactionFilterWithValueChange · 3ef55d2c

由 Maysam Yabandeh 提交于 10月 20, 2017

Summary:
The test currently times out when it is run under tsan. This patch split it into 4 tests.
Closes https://github.com/facebook/rocksdb/pull/3047

Differential Revision: D6106515

Pulled By: maysamyabandeh

fbshipit-source-id: 03a28cdf8b1c097be2361b1b0cc3dc1acf2b5d63

3ef55d2c

db_stress support long-held snapshots · d75793d6

由 Andrew Kryczka 提交于 10月 20, 2017

Summary:
Add options to `db_stress` (correctness testing tool) to randomly acquire snapshot and release it after some period of time. It's useful for correctness testing of #3009, as well as other parts of compaction that behave differently depending on which snapshots are held.
Closes https://github.com/facebook/rocksdb/pull/3038

Differential Revision: D6086501

Pulled By: ajkr

fbshipit-source-id: 3ec0d8666c78ac507f1f808887c4ff759ba9b865

d75793d6

remove unused code · f8b5bb2f

由 Andrew Kryczka 提交于 10月 20, 2017

Summary:
fixup 6a541afc. This code didn't do anything because (1) `bytes_per_sync` is assigned in `EnvOptions`'s constructor; and (2) `OptimizeForCompactionTableWrite`'s return value was ignored, even though its only purpose is to return something.
Closes https://github.com/facebook/rocksdb/pull/3055

Differential Revision: D6114132

Pulled By: ajkr

fbshipit-source-id: ea4831770930e9cf83518e13eb2e1934d1f5487c

f8b5bb2f

update HISTORY with recent changes · 57fd4a82

由 Andrew Kryczka 提交于 10月 20, 2017

Summary:
We should mention these:

- `EventListener::OnStallConditionsChanged()` in 01542400
- `DeleteRange()` fix in 966b32b5
Closes https://github.com/facebook/rocksdb/pull/3054

Differential Revision: D6113989

Pulled By: ajkr

fbshipit-source-id: d5e058e1ab07570df22936e8d5939fb30fb4d381

57fd4a82

Fix unstable floating point exception · ee2b1ec1

由 raistlin 提交于 10月 20, 2017

Summary:
Fix unstable floating point exception, tested on Windows, 64-bit build.
The problem appeared in `SetCapacity()` method at line

`high_pri_pool_capacity_ = capacity_ * high_pri_pool_ratio_;`

`high_pri_pool_ratio_` was not initialized at that moment, because
`SetHighPriorityPoolRatio()` is called after `SetCapacity()`. So,
`high_pri_pool_ratio_` contained garbage, which caused "Floating point
exception" sometimes.
Closes https://github.com/facebook/rocksdb/pull/3052

Differential Revision: D6111161

Pulled By: yiwu-arbug

fbshipit-source-id: d170329111ad12b4bf9bbcf37bcb6411523438ae

ee2b1ec1

20 10月, 2017 2 次提交

Make FIFO compaction options dynamically configurable · f0804db7

由 Sagar Vemuri 提交于 10月 19, 2017

Summary:
ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now.

Some of the ways in which the fifo compaction options can be set are:
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})`
- `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})`

Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes.
Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`.
The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now.
Closes https://github.com/facebook/rocksdb/pull/3006

Differential Revision: D6058619

Pulled By: sagar0

fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92

f0804db7

Enable MSVC W4 with a few exceptions. Fix warnings and bugs · ebab2e2d

由 Dmitri Smirnov 提交于 10月 19, 2017

Summary: Closes https://github.com/facebook/rocksdb/pull/3018

Differential Revision: D6079011

Pulled By: yiwu-arbug

fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57

ebab2e2d

19 10月, 2017 3 次提交

Update RocksDB Authors File · b7499945

由 Sagar Vemuri 提交于 10月 18, 2017

Summary: Update RocksDB Authors File.

Reviewed By: yiwu-arbug

Differential Revision: D6075453

fbshipit-source-id: dff52f483aab33c41de391f145a8273acfd6cbde

b7499945

Fix a typo in a comment · 7deed2b4

由 Gihwan Oh 提交于 10月 18, 2017

Summary:
instad of for specific level -> instead of a specific level
Closes https://github.com/facebook/rocksdb/pull/3040

Differential Revision: D6090811

Pulled By: sagar0

fbshipit-source-id: 499edef0a6f596c448f61791e6aca8f5cce08e9c

7deed2b4

WritePrepared Txn: Disable GC during recovery · 7e382389

由 Maysam Yabandeh 提交于 10月 18, 2017

Summary:
Disables GC during recovery of a WritePrepared txn db to avoid GCing uncommitted key values.
Closes https://github.com/facebook/rocksdb/pull/2980

Differential Revision: D6000191

Pulled By: maysamyabandeh

fbshipit-source-id: fc4d522c643d24ebf043f811fe4ecd0dd0294675

7e382389

18 10月, 2017 3 次提交

expose a hook to skip tables during iteration · 7891af8b

由 Nikhil Benesch 提交于 10月 17, 2017

Summary:
As discussed on the mailing list (["Skipping entire SSTs while iterating"](https://groups.google.com/forum/#!topic/rocksdb/ujHCJVLrHlU)), this patch adds a `table_filter` to `ReadOptions` that allows specifying a callback to be executed during iteration before each table in the database is scanned. The callback is passed the table's properties; the table is scanned iff the callback returns true.

This can be used in conjunction with a `TablePropertiesCollector` to dramatically speed up scans by skipping tables that are known to contain irrelevant data for the scan at hand.

We're using this [downstream in CockroachDB](https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/engine/db.cc#L2009-L2022) already. With this feature, under ideal conditions, we can reduce the time of an incremental backup in  from hours to seconds.

FYI, the first commit in this PR fixes a segfault that I unfortunately have not figured out how to reproduce outside of CockroachDB. I'm hoping you accept it on the grounds that it is not correct to return 8-byte aligned memory from a call to `malloc` on some 64-bit platforms; one correct approach is to infer the necessary alignment from `std::max_align_t`, as done here. As noted in the first commit message, the bug is tickled by having a`std::function` in `struct ReadOptions`. That is, the following patch alone is enough to cause RocksDB to segfault when run from CockroachDB on Darwin.

```diff
 --- a/include/rocksdb/options.h
+++ b/include/rocksdb/options.h
@@ -1546,6 +1546,13 @@ struct ReadOptions {
   // Default: false
   bool ignore_range_deletions;

+  // A callback to determine whether relevant keys for this scan exist in a
+  // given table based on the table's properties. The callback is passed the
+  // properties of each table during iteration. If the callback returns false,
+  // the table will not be scanned.
+  // Default: empty (every table will be scanned)
+  std::function<bool(const TableProperties&)> table_filter;
+
   ReadOptions();
   ReadOptions(bool cksum, bool cache);
 };
```

/cc danhhz
Closes https://github.com/facebook/rocksdb/pull/2265

Differential Revision: D5054262

Pulled By: yiwu-arbug

fbshipit-source-id: dd6b28f2bba6cb8466250d8c5c542d3c92785476

7891af8b

Blob DB: Store blob index as kTypeBlobIndex in base db · eaaef911

由 Yi Wu 提交于 10月 17, 2017

Summary:
Blob db insert blob index to base db as kTypeBlobIndex type, to tell apart values written by plain rocksdb or blob db. This is to make it possible to migrate from existing rocksdb to blob db.

Also with the patch blob db garbage collection get away from OptimisticTransaction. Instead it use a custom write callback to achieve similar behavior as OptimisticTransaction. This is because we need to pass the is_blob_index flag to DBImpl::Get but OptimisticTransaction don't support it.
Closes https://github.com/facebook/rocksdb/pull/3000

Differential Revision: D6050044

Pulled By: yiwu-arbug

fbshipit-source-id: 61dc72ab9977625e75f78cd968e7d8a3976e3632

eaaef911

Blob DB: not writing sequence number as blob record footer · 0552029b

由 Yi Wu 提交于 10月 17, 2017

Summary:
Previously each time we write a blob we write blog_record_header + key + value + blob_record_footer to blob log. The footer only contains a sequence and a crc for the sequence number. The sequence number was used in garbage collection to verify the value is recent. After #2703 we moved to use optimistic transaction and no longer use sequence number from the footer. Remove the footer altogether.

There's another usage of sequence number and we are keeping it: Each blob log file keep track of sequence number range of keys in it, and use it to check if it is reference by a snapshot, before being deleted.
Closes https://github.com/facebook/rocksdb/pull/3005

Differential Revision: D6057585

Pulled By: yiwu-arbug

fbshipit-source-id: d6da53c457a316e9723f359a1b47facfc3ffe090

0552029b

kvdb / rocksdb 11 个月 前同步成功

kvdb / rocksdb
11 个月前同步成功