1. 13 Jul 2017 (1 commit)
  2. 08 Jul 2017 (1 commit)
  3. 06 Jul 2017 (1 commit)
    • A
      Fix GetCurrentTime() initialization for valgrind · 33042573
      Committed by Andrew Kryczka
      Summary:
      Valgrind had false-positive complaints about the initialization pattern for `GetCurrentTime()`'s argument in #2480. We can instead have the client initialize the time variable before calling `GetCurrentTime()`, and have `GetCurrentTime()` promise to overwrite it only in the success case.
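
      A minimal sketch of the pattern, with illustrative names rather than the actual RocksDB internals:
      ```
      #include <cstdint>
      #include <ctime>

      // Sketch: the callee overwrites *unix_time only on success, so the
      // caller's own initialization guarantees the variable is always defined.
      static bool GetCurrentTimeSketch(int64_t* unix_time) {
        time_t t = time(nullptr);
        if (t == (time_t)-1) {
          return false;  // failure: *unix_time left untouched
        }
        *unix_time = static_cast<int64_t>(t);  // success: overwrite
        return true;
      }

      int main() {
        int64_t current_time = 0;  // initialized by the caller, so valgrind
                                   // never sees an uninitialized read
        GetCurrentTimeSketch(&current_time);
        return 0;
      }
      ```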
      Closes https://github.com/facebook/rocksdb/pull/2526
      
      Differential Revision: D5358689
      
      Pulled By: ajkr
      
      fbshipit-source-id: 857b189f24c19196f6bb299216f3e23e7bc4be42
      33042573
  4. 01 Jul 2017 (2 commits)
  5. 30 Jun 2017 (3 commits)
    • A
      Regression test for empty dedicated range deletion file · d310e0f3
      Committed by Andrew Kryczka
      Summary:
      Issue: #2478
      Fix: #2503
      
      The bug happened when all of these conditions were satisfied:
      
      - A subcompaction generates no keys
      - `RangeDelAggregator::ShouldAddTombstones()` returns true because there's at least one non-obsoleted range deletion in its map
      - None of the non-obsolete tombstones overlap with the subcompaction key-range
      
      Under those conditions, we were creating a dedicated file for range deletions which was left empty, thus causing an error in VersionEdit.
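
      A hedged sketch of how such a scenario can be set up through the public API (key choices are illustrative, and whether these exact calls trigger the bug depends on compaction internals):
      ```
      #include "rocksdb/db.h"

      // Sketch: leave a live range tombstone that does not overlap the
      // compacted key-range, while the compaction itself outputs no keys.
      void ReproSketch(rocksdb::DB* db) {
        db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                        "x1", "x9");
        db->Flush(rocksdb::FlushOptions());
        // Compact a disjoint range; before the #2503 fix this could emit an
        // empty dedicated range-deletion file and fail in VersionEdit.
        rocksdb::Slice begin("a"), end("b");
        db->CompactRange(rocksdb::CompactRangeOptions(), &begin, &end);
      }
      ```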
      
      I verified this test case fails before the #2503 fix and passes after.
      Closes https://github.com/facebook/rocksdb/pull/2521
      
      Differential Revision: D5352568
      
      Pulled By: ajkr
      
      fbshipit-source-id: f619cae39984ce9bb9b7a4e7a9ac0f2bb2ce43e9
      d310e0f3
    • M
      Add a fetch_add variation to AddDBStats · e9f91a51
      Committed by Maysam Yabandeh
      Summary:
      AddDBStats is implemented as two steps, a load and a store, which is more efficient than fetch_add but not thread-safe. Currently we have to protect concurrent access to AddDBStats with a mutex, which is less efficient than fetch_add.

      This patch adds the option to use fetch_add in AddDBStats (a sketch of both variants follows the numbers below). The results for my 2PC benchmark on sysbench are:
      - vanilla: 68618 tps
      - removing mutex on AddDBStats (unsafe): 69767 tps
      - fetch_add for all AddDBStats: 69200 tps
      - fetch_add only for concurrently accessed AddDBStats (this patch): 69579 tps
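
      A sketch of the two variants on a `std::atomic` counter (illustrative; not the actual InternalStats code):
      ```
      #include <atomic>
      #include <cstdint>

      std::atomic<uint64_t> db_stat{0};

      // Load + store: cheaper, but only safe when writers are already
      // serialized (e.g. under the DB mutex).
      void AddDBStatsUnsafe(uint64_t value) {
        db_stat.store(db_stat.load(std::memory_order_relaxed) + value,
                      std::memory_order_relaxed);
      }

      // fetch_add: safe under concurrent writers at slightly higher cost;
      // the patch uses this path only for concurrently-updated stats.
      void AddDBStatsConcurrent(uint64_t value) {
        db_stat.fetch_add(value, std::memory_order_relaxed);
      }
      ```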
      Closes https://github.com/facebook/rocksdb/pull/2505
      
      Differential Revision: D5330656
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: af64d7bee135b0e86b4fac323a4f9d9113eaa383
      e9f91a51
    • Z
      skip generating empty sst · c1b375e9
      Committed by zhangjinpeng1987
      Summary:
      When a compaction job outputs nothing, there is no need to generate an empty SST file, which would cause `VersionEdit::EncodeTo` to fail.
      ref https://github.com/facebook/rocksdb/issues/2478
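
      The shape of the fix, as a hedged sketch (names are illustrative; the real change lives in the compaction job's output-finishing path):
      ```
      #include <cstdint>
      #include <vector>

      // Sketch: drop compaction outputs that contain no entries instead of
      // installing them, so VersionEdit::EncodeTo never sees an empty SST.
      struct OutputFile {
        uint64_t file_number = 0;
        uint64_t num_entries = 0;
      };

      void PruneEmptyOutputs(std::vector<OutputFile>* outputs) {
        for (auto it = outputs->begin(); it != outputs->end();) {
          if (it->num_entries == 0) {
            // delete the physical file here, then drop its metadata
            it = outputs->erase(it);
          } else {
            ++it;
          }
        }
        // only the remaining outputs are recorded in the VersionEdit
      }
      ```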
      Closes https://github.com/facebook/rocksdb/pull/2503
      
      Differential Revision: D5350799
      
      Pulled By: ajkr
      
      fbshipit-source-id: df0b4fcf3507fe1c3c435208b762e75478e00143
      c1b375e9
  6. 29 Jun 2017 (3 commits)
    • M
      Improve Status message for block checksum mismatches · 397ab111
      Committed by Mike Kolupaev
      Summary:
      We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages.
      
      It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild.
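
      The idea, sketched with the public `Status` API (the exact message format in the PR may differ):
      ```
      #include <cstdint>
      #include <string>

      #include "rocksdb/status.h"

      // Attach file name and offset to a corruption error, so logs point at
      // the offending SST instead of a bare "block checksum mismatch".
      rocksdb::Status ChecksumMismatch(const std::string& file_name,
                                       uint64_t offset, uint32_t expected,
                                       uint32_t actual) {
        std::string msg = "block checksum mismatch: expected " +
                          std::to_string(expected) + ", got " +
                          std::to_string(actual) + " in " + file_name +
                          " offset " + std::to_string(offset);
        return rocksdb::Status::Corruption(msg);
      }
      ```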
      Closes https://github.com/facebook/rocksdb/pull/2507
      
      Differential Revision: D5345702
      
      Pulled By: al13n321
      
      fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339
      397ab111
    • S
      Make "make analyze" happy · 18c63af6
      Committed by Siying Dong
      Summary:
      "make analyze" is reporting some errors. It's complicated to look but it seems to me that they are all false positive. Anyway, I think cleaning them up is a good idea. Some of the changes are hacky but I don't know a better way.
      Closes https://github.com/facebook/rocksdb/pull/2508
      
      Differential Revision: D5341710
      
      Pulled By: siying
      
      fbshipit-source-id: 6070e430e0e41a080ef441e05e8ec827d45efab6
      18c63af6
    • M
      Fix the reported asan issues · 01534db2
      Committed by Maysam Yabandeh
      Summary:
      This is to resolve the ASAN complaints. In the meanwhile I am working on clarifying/revisiting the sync rules.
      Closes https://github.com/facebook/rocksdb/pull/2510
      
      Differential Revision: D5338660
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: ce6f6e0826d43a2c0bfa4328a00c78f73cd6498a
      01534db2
  7. 28 Jun 2017 (1 commit)
    • S
      FIFO Compaction with TTL · 1cd45cd1
      Committed by Sagar Vemuri
      Summary:
      Introducing FIFO compactions with TTL.
      
      FIFO compaction is based on size only, which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size.
      
      To address that request:
      - Added a new TTL option to FIFO compaction options.
      - Updated FIFO compaction score to take TTL into consideration.
      - Added a new table property, creation_time, to keep track of when the SST file is created.
      - Creation_time is set as below:
        - On Flush: Set to the time of flush.
        - On Compaction: Set to the max creation_time of all the files involved in the compaction.
        - On Repair and Recovery: Set to the time of repair/recovery.
        - Old files created prior to this code change will have a creation_time of 0.
      - FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction, i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)` (see the sketch after this list). This will enable cases where you might want to delete all files older than, say, 1 day.
      - FIFO compaction will fall back to the prior way of deleting files based on size if:
        - the creation_time of all files involved in compaction is 0.
        - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.
      
      This feature is not supported if max_open_files != -1 or with table formats other than Block-based.
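
      A hedged sketch of the TTL selection (illustrative; the real logic lives in the FIFO compaction picker):
      ```
      #include <cstdint>
      #include <vector>

      struct FileMeta {
        uint64_t creation_time;  // 0 for files created before this change
        uint64_t file_size;
      };

      // Pick TTL-expired files for deletion. Callers fall back to the old
      // size-based deletion if no file has a usable creation_time, or if the
      // total size still exceeds compaction_options_fifo.max_table_files_size.
      std::vector<const FileMeta*> PickExpiredFiles(
          const std::vector<FileMeta>& files, uint64_t current_time,
          uint64_t ttl) {
        std::vector<const FileMeta*> expired;
        for (const auto& f : files) {
          if (ttl > 0 && f.creation_time > 0 &&
              f.creation_time < current_time - ttl) {
            expired.push_back(&f);
          }
        }
        return expired;
      }
      ```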
      
      **Test Plan:**
      Added tests.
      
      **Benchmark results:**
      Base: FIFO with max size: 100MB ::
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100
      
      readwhilewriting :       1.924 micros/op 519858 ops/sec;   13.6 MB/s (1176277 of 5000000 found)
      ```
      
      With TTL (a low one for testing) ::
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20
      
      readwhilewriting :       1.902 micros/op 525817 ops/sec;   13.7 MB/s (1185057 of 5000000 found)
      ```
      Example Log lines:
      ```
      2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
      2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
      ...
      2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
      2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
      ```
      
      SST Files remaining in the dbbench dir, after db_bench execution completed:
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction)  $ ls -l /dev/shm//dbbench/*.sst
      -rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
      -rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
      -rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
      ```
      Closes https://github.com/facebook/rocksdb/pull/2480
      
      Differential Revision: D5305116
      
      Pulled By: sagar0
      
      fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
      1cd45cd1
  8. 27 Jun 2017 (4 commits)
    • S
      Fix Windows build broken by 5c97a7c0 · 89468c01
      Committed by Siying Dong
      Summary:
      A typo in a conversion breaks the Windows build. Fix it.
      Closes https://github.com/facebook/rocksdb/pull/2500
      
      Differential Revision: D5325962
      
      Pulled By: siying
      
      fbshipit-source-id: 2cefdafc9afbc85f856f403af7c876b622400630
      89468c01
    • E
      Encryption at rest support · 51778612
      Committed by Ewout Prangsma
      Summary:
      This PR adds support for encrypting data stored by RocksDB when written to disk.
      
      It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files.
      The encryption itself is done through a configurable `EncryptionProvider`. This class is asked to create a `BlockAccessCipherStream` for a file; this is where the actual encryption/decryption is done.
      Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!).
      
      The Counter operation mode uses an initial counter & random initialization vector (IV).
      Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (neither in the data nor in the file size).
      The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there.
      
      To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will set up an encrypted instance of the `Env` class to use for all tests.
      Typically you would run it like this:
      
      ```
      ENCRYPTED_ENV=1 make check_some
      ```
      
      There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain-text strings; with `ENCRYPTED_ENV` unset, it must find them.
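
      Usage would look roughly like this (a sketch; the class and function names follow this PR's description of `env_encryption.h`, and ROT13 remains demo-only):
      ```
      #include <memory>

      #include "rocksdb/db.h"
      #include "rocksdb/env_encryption.h"

      int main() {
        // Demo-only cipher; swap in a real block cipher for production use.
        rocksdb::ROT13BlockCipher cipher(/*blockSize=*/32);
        rocksdb::CTREncryptionProvider provider(cipher);
        std::unique_ptr<rocksdb::Env> env(
            rocksdb::NewEncryptedEnv(rocksdb::Env::Default(), &provider));

        rocksdb::Options options;
        options.create_if_missing = true;
        options.env = env.get();  // all file I/O now goes through encryption

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/encrypted_db", &db);
        delete db;
        return s.ok() ? 0 : 1;
      }
      ```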
      Closes https://github.com/facebook/rocksdb/pull/2424
      
      Differential Revision: D5322178
      
      Pulled By: sdwilsh
      
      fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f
      51778612
    • S
      Unit Tests for sync, range sync and file close failures · 5c97a7c0
      Committed by Siying Dong
      Summary: Closes https://github.com/facebook/rocksdb/pull/2454
      
      Differential Revision: D5255320
      
      Pulled By: siying
      
      fbshipit-source-id: 0080830fa8eb5da6de25e17ba68aee91018c7913
      5c97a7c0
    • S
      Fix bug that flush doesn't respond to fsync result · d757355c
      Committed by Siying Dong
      Summary:
      Due to a regression bug introduced two years ago by https://github.com/facebook/rocksdb/commit/6e9fbeb27c38329f33ae541302c44c8db8374f8c , we fail to check the return status of the fsync call. This means we can miss information from the file system and can potentially end up with corrupted data that we could otherwise have detected.
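
      The essence of the fix, sketched with POSIX fsync (the actual change checks the status returned through RocksDB's file wrappers):
      ```
      #include <unistd.h>

      #include <cerrno>
      #include <cstring>
      #include <string>

      #include "rocksdb/status.h"

      // Propagate fsync failure instead of dropping it; otherwise a flush can
      // report success for data the file system never actually persisted.
      rocksdb::Status SyncFile(int fd, const std::string& fname) {
        if (fsync(fd) < 0) {
          return rocksdb::Status::IOError(fname, strerror(errno));
        }
        return rocksdb::Status::OK();
      }
      ```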
      Closes https://github.com/facebook/rocksdb/pull/2495
      
      Reviewed By: ajkr
      
      Differential Revision: D5321949
      
      Pulled By: siying
      
      fbshipit-source-id: c68117914bb40700198fc37d0e4c63163a8a1031
      d757355c
  9. 25 Jun 2017 (2 commits)
    • M
      Update rename of ParanoidCheck · 8e6345d2
      Committed by Maysam Yabandeh
      Summary: Closes https://github.com/facebook/rocksdb/pull/2494
      
      Differential Revision: D5317902
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 097330292180816b3d0c9f4cbbdb6f68f0180200
      8e6345d2
    • M
      Optimize for serial commits in 2PC · 499ebb3a
      Committed by Maysam Yabandeh
      Summary:
      Throughput: 46k tps in our sysbench settings (filling the details later)
      
      The idea is to have the simplest change that gives us a reasonable boost
      in 2PC throughput.
      
      Major design changes:
      1. The WAL file internal buffer is not flushed after each write. Instead
      it is flushed before critical operations (WAL copy via fs) or when
      FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
      via mutex_.
      2. Use two sequence numbers: last seq, and last seq for write. Last seq
      is the last visible sequence number for reads. Last seq for write is the
      next sequence number that should be used to write to WAL/memtable. This
      allows a memtable write to proceed in parallel with WAL writes (a sketch
      of the two counters follows this list).
      3. BatchGroup is not used for writes. This means that we can have
      parallel writers, which changes a major assumption in the code base. To
      accommodate that, i) allow only one WriteImpl that intends to write to
      the memtable via mem_mutex_, which is fine since in 2PC almost all
      memtable writes come via the group-commit phase, which is serial anyway;
      ii) make all the parts of the code base that assumed they were the only
      writer (via EnterUnbatched) also acquire mem_mutex_; iii) protect stat
      updates via a stat_mutex_.
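
      A sketch of design change 2, the split sequence counters (illustrative names, not the actual VersionSet members):
      ```
      #include <atomic>
      #include <cstdint>

      // last_seq_: last sequence visible to readers.
      // last_seq_for_write_: next sequence to allocate for WAL/memtable
      // writes. Allocating write sequences ahead of visibility lets memtable
      // writes proceed in parallel with WAL writes; last_seq_ only advances
      // once the corresponding write has completed.
      std::atomic<uint64_t> last_seq_{0};
      std::atomic<uint64_t> last_seq_for_write_{0};

      uint64_t AllocateSeqForWrite(uint64_t count) {
        // returns the first of `count` reserved sequence numbers
        return last_seq_for_write_.fetch_add(count) + 1;
      }

      void PublishSeq(uint64_t seq) {
        last_seq_.store(seq, std::memory_order_release);  // visible to reads
      }
      ```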
      
      Note: the first commit has the approach figured out but is not clean.
      Submitting the PR anyway to get early feedback on the approach. If
      we are ok with the approach I will go ahead with these updates:
      0) Rebase with Yi's pipelining changes
      1) Currently batching is disabled by default to make sure that it will be
      consistent with all unit tests. Will make this optional via a config.
      2) A couple of unit tests are disabled. They need to be updated with the
      serial commit of 2PC taken into account.
      3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
      releasing mutex_ beforehand (the same way EnterUnbatched does). This
      needs to be cleaned up.
      Closes https://github.com/facebook/rocksdb/pull/2345
      
      Differential Revision: D5210732
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
      499ebb3a
  10. 23 Jun 2017 (2 commits)
    • A
      Introduce OnBackgroundError callback · 71f5bcb7
      Committed by Andrew Kryczka
      Summary:
      Some users want to prevent rocksdb from entering read-only mode in certain error cases. This diff gives them a callback, `OnBackgroundError`, that they can use to achieve it (a sketch follows the list below).
      
      - call `OnBackgroundError` every time we consider setting `bg_error_`. Use its result to assign `bg_error_` but not to change the function's return status.
      - classified calls using `BackgroundErrorReason` to give the callback some info about where the error happened
      - renamed `ParanoidCheck` to something more specific so we can provide a clear `BackgroundErrorReason`
      - unit tests for the most common cases: flush or compaction errors
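
      A sketch of a listener that keeps the DB writable after a compaction error (whether ignoring a given error is safe is entirely the user's call):
      ```
      #include "rocksdb/listener.h"

      class KeepWritableListener : public rocksdb::EventListener {
       public:
        void OnBackgroundError(rocksdb::BackgroundErrorReason reason,
                               rocksdb::Status* bg_error) override {
          if (reason == rocksdb::BackgroundErrorReason::kCompaction &&
              !bg_error->IsCorruption()) {
            // Clear the error so rocksdb does not enter read-only mode.
            *bg_error = rocksdb::Status::OK();
          }
        }
      };
      // Install via:
      //   options.listeners.push_back(std::make_shared<KeepWritableListener>());
      ```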
      Closes https://github.com/facebook/rocksdb/pull/2477
      
      Differential Revision: D5300190
      
      Pulled By: ajkr
      
      fbshipit-source-id: a0ea4564249719b83428e3f4c6ca2c49e366e9b3
      71f5bcb7
    • S
      Fix Data Race Between CreateColumnFamily() and GetAggregatedIntProperty() · 6837a176
      Committed by Siying Dong
      Summary:
      CreateColumnFamily() releases the DB mutex between adding the column family to the set and installing the super version (in order to write the option file), so if users call GetAggregatedIntProperty() in the middle, the super version will be null and the process will crash. Fix it by skipping those column families without a super version installed.

      Maybe we should also fix the problem of releasing the lock when reading the option file, but that is more risky, so I'm doing a quick and safer fix and we can investigate it later.
      Closes https://github.com/facebook/rocksdb/pull/2475
      
      Differential Revision: D5298053
      
      Pulled By: siying
      
      fbshipit-source-id: 4b3c8f91c60400b163fcc6cda8a0c77723be0ef6
      6837a176
  11. 21 Jun 2017 (2 commits)
  12. 20 Jun 2017 (1 commit)
  13. 14 Jun 2017 (2 commits)
  14. 13 Jun 2017 (2 commits)
  15. 12 Jun 2017 (2 commits)
  16. 09 Jun 2017 (1 commit)
  17. 06 Jun 2017 (3 commits)
  18. 03 Jun 2017 (5 commits)
    • A
      using ThreadLocalPtr to hide ROCKSDB_SUPPORT_THREAD_LOCAL from public… · 7f6c02dd
      Committed by Aaron Gao
      Summary:
      … headers
      
      https://github.com/facebook/rocksdb/pull/2199 should not have exposed RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) in public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users would have to provide these compiler flags when building their binaries with RocksDB.

      We should hide the thread-local global variable inside our implementation and just expose a function API to retrieve these variables. It may break some users for now, but it is good for the long term.
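
      The function-API shape, as a sketch (assuming the accessor is named `get_perf_context()`, per this change's direction):
      ```
      #include <cstdint>

      #include "rocksdb/perf_context.h"
      #include "rocksdb/perf_level.h"

      void Example() {
        rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableTime);
        // Function accessor instead of a thread-local global, so
        // ROCKSDB_SUPPORT_THREAD_LOCAL stays out of public headers.
        rocksdb::PerfContext* ctx = rocksdb::get_perf_context();
        ctx->Reset();
        // ... run some reads/writes against the DB here ...
        uint64_t comparisons = ctx->user_key_comparison_count;
        (void)comparisons;
      }
      ```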
      
      make check -j64
      Closes https://github.com/facebook/rocksdb/pull/2380
      
      Differential Revision: D5177896
      
      Pulled By: lightmark
      
      fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390
      7f6c02dd
    • M
      Fix interaction between CompactionFilter::Decision::kRemoveAndSkipUnt… · 138b87ea
      Committed by Mike Kolupaev
      Summary:
      Fixes the following scenario:
       1. Set prefix extractor. Enable bloom filters, with `whole_key_filtering = false`. Use compaction filter that sometimes returns `kRemoveAndSkipUntil`.
       2. Do a compaction.
       3. Compaction creates an iterator with `total_order_seek = false`, calls `SeekToFirst()` on it, then repeatedly calls `Next()`.
       4. At some point compaction filter returns `kRemoveAndSkipUntil`.
       5. Compaction calls `Seek(skip_until)` on the iterator. The key that it seeks to happens to have prefix that doesn't match the bloom filter. Since `total_order_seek = false`, iterator becomes invalid, and compaction thinks that it has reached the end. The rest of the compaction input is silently discarded.
      
      The fix is to make compaction iterator use `total_order_seek = true`.
      
      The implementation for PlainTable is quite awkward. I've made `kRemoveAndSkipUntil` officially incompatible with PlainTable. If you try to use them together, compaction will fail, and DB will enter read-only mode (`bg_error_`). That's not a very graceful way to communicate a misconfiguration, but the alternatives don't seem worth the implementation time and complexity. To be able to check in advance that `kRemoveAndSkipUntil` is not going to be used with PlainTable, we'd need to extend the interface of either `CompactionFilter` or `InternalIterator`. It seems unlikely that anyone will ever want to use `kRemoveAndSkipUntil` with PlainTable: PlainTable probably has very few users, and `kRemoveAndSkipUntil` has only one user so far: us (logdevice).
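
      A sketch of a compaction filter that exercises `kRemoveAndSkipUntil` (the key choices are illustrative):
      ```
      #include <string>

      #include "rocksdb/compaction_filter.h"

      class SkipRangeFilter : public rocksdb::CompactionFilter {
       public:
        Decision FilterV2(int /*level*/, const rocksdb::Slice& key,
                          ValueType /*value_type*/,
                          const rocksdb::Slice& /*existing_value*/,
                          std::string* /*new_value*/,
                          std::string* skip_until) const override {
          if (key.starts_with("expired_")) {
            // Drop this key and everything before *skip_until. After this
            // fix, the compaction iterator seeks in total order, so the
            // target is found even with prefix bloom filters enabled.
            *skip_until = "f";
            return Decision::kRemoveAndSkipUntil;
          }
          return Decision::kKeep;
        }
        const char* Name() const override { return "SkipRangeFilter"; }
      };
      ```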
      Closes https://github.com/facebook/rocksdb/pull/2349
      
      Differential Revision: D5110388
      
      Pulled By: lightmark
      
      fbshipit-source-id: ec29101a99d9dcd97db33923b87f72bce56cc17a
      138b87ea
    • S
      Improve write buffer manager (and allow the size to be tracked in block cache) · 95b0e89b
      Committed by Siying Dong
      Summary:
      Improve write buffer manager in several ways:
      1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
      2. We start to trigger memtable flush when 7/8 of the memory cap is reached, instead of 100%, and make 100% much harder to hit.
      3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
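
      Point 3 in code, as a sketch (sizes are illustrative):
      ```
      #include <memory>

      #include "rocksdb/cache.h"
      #include "rocksdb/options.h"
      #include "rocksdb/write_buffer_manager.h"

      void ConfigureSharedMemoryCap(rocksdb::Options* options,
                                    std::shared_ptr<rocksdb::Cache> cache) {
        // Memtable allocations are costed against the passed-in cache, giving
        // one memory cap across block cache and memtables.
        options->write_buffer_manager =
            std::make_shared<rocksdb::WriteBufferManager>(
                512 * 1024 * 1024 /* memtable cap */, cache);
        // the same cache should also be set as the block cache in
        // BlockBasedTableOptions
      }
      ```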
      Closes https://github.com/facebook/rocksdb/pull/2350
      
      Differential Revision: D5110648
      
      Pulled By: siying
      
      fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
      95b0e89b
    • A
      Pass CF ID to MemTableRepFactory · a4d9c025
      Committed by Andrew Kryczka
      Summary:
      Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff:
      
      - adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload.
      - updates MemTable to create MemTableRep's using the new overload.
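
      A sketch of a factory that consumes the new overload by wrapping the built-in skip list (the logging is illustrative):
      ```
      #include <cstdint>
      #include <cstdio>

      #include "rocksdb/memtablerep.h"

      class CfAwareFactory : public rocksdb::MemTableRepFactory {
       public:
        // Old overload (still required): delegate to the built-in skip list.
        rocksdb::MemTableRep* CreateMemTableRep(
            const rocksdb::MemTableRep::KeyComparator& cmp,
            rocksdb::Allocator* allocator,
            const rocksdb::SliceTransform* transform,
            rocksdb::Logger* logger) override {
          return skiplist_.CreateMemTableRep(cmp, allocator, transform, logger);
        }
        // New overload from this diff: same, but the CF ID is now visible.
        rocksdb::MemTableRep* CreateMemTableRep(
            const rocksdb::MemTableRep::KeyComparator& cmp,
            rocksdb::Allocator* allocator,
            const rocksdb::SliceTransform* transform, rocksdb::Logger* logger,
            uint32_t column_family_id) override {
          fprintf(stderr, "creating memtable for CF %u\n", column_family_id);
          return CreateMemTableRep(cmp, allocator, transform, logger);
        }
        const char* Name() const override { return "CfAwareFactory"; }

       private:
        rocksdb::SkipListFactory skiplist_;
      };
      ```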
      Closes https://github.com/facebook/rocksdb/pull/2346
      
      Differential Revision: D5108061
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616
      a4d9c025
    • Y
      Fix DBWriteTest::ReturnSequenceNumberMultiThreaded data race · f68d88be
      Committed by Yi Wu
      Summary:
      rocksdb::Random is not thread-safe. Have one Random for each thread instead.
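
      The fix pattern in general form (a sketch using the standard library; the test applies the same idea with rocksdb's internal Random):
      ```
      #include <random>
      #include <thread>
      #include <vector>

      void RunThreads(int n) {
        std::vector<std::thread> threads;
        for (int i = 0; i < n; ++i) {
          threads.emplace_back([i] {
            // One RNG per thread: engines like std::mt19937 (and
            // rocksdb::Random) are not safe to share across threads.
            std::mt19937 rnd(1000 + i);
            volatile unsigned x = rnd();  // use the per-thread generator
            (void)x;
          });
        }
        for (auto& t : threads) t.join();
      }
      ```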
      Closes https://github.com/facebook/rocksdb/pull/2400
      
      Differential Revision: D5173919
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 1a99c7b877f3893eb22355af49e321bcad4e53e6
      f68d88be
  19. 02 Jun 2017 (2 commits)
    • A
      Fix TSAN: avoid arena mode with range deletions · 215076ef
      Committed by Andrew Kryczka
      Summary:
      The range deletion meta-block iterators weren't getting cleaned up properly since they don't support arena allocation. I didn't implement arena support since, in the general case, each iterator is used only once and separately from all other iterators, so there should be no benefit to data locality.
      
      Anyway, this diff fixes #2370 by treating range deletion iterators as non-arena-allocated.
      Closes https://github.com/facebook/rocksdb/pull/2399
      
      Differential Revision: D5171119
      
      Pulled By: ajkr
      
      fbshipit-source-id: bef6f5c4c5905a124f4993945aed4bd86e2807d8
      215076ef
    • A
      account for L0 size in estimated compaction bytes · 3a8a848a
      Committed by Andrew Kryczka
      Summary:
      Also changed the `>` in the comparison against `level0_file_num_compaction_trigger` into a `>=`, since exactly `level0_file_num_compaction_trigger` L0 files can trigger a compaction from L0.
      Closes https://github.com/facebook/rocksdb/pull/2179
      
      Differential Revision: D4915772
      
      Pulled By: ajkr
      
      fbshipit-source-id: e38fec6253de6f9a40e61734615c6670d84038aa
      3a8a848a