1. 13 Jul 2017 (1 commit)
  2. 08 Jul 2017 (1 commit)
  3. 06 Jul 2017 (1 commit)
    • A
      Fix GetCurrentTime() initialization for valgrind · 33042573
      Committed by Andrew Kryczka
      Summary:
      Valgrind had false-positive complaints about the initialization pattern for `GetCurrentTime()`'s argument in #2480. We can instead have the client initialize the time variable before calling `GetCurrentTime()`, and have `GetCurrentTime()` promise to overwrite it only in the success case.
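
      A minimal sketch of the pattern, with illustrative names rather than the actual RocksDB internals:
      ```
      #include <cstdint>
      #include <ctime>

      // Sketch: the callee overwrites *unix_time only on success, so the
      // caller's own initialization guarantees the variable is always defined.
      static bool GetCurrentTimeSketch(int64_t* unix_time) {
        time_t t = time(nullptr);
        if (t == (time_t)-1) {
          return false;  // failure: *unix_time left untouched
        }
        *unix_time = static_cast<int64_t>(t);  // success: overwrite
        return true;
      }

      int main() {
        int64_t current_time = 0;  // initialized by the caller, so valgrind
                                   // never sees an uninitialized read
        GetCurrentTimeSketch(&current_time);
        return 0;
      }
      ```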
      Closes https://github.com/facebook/rocksdb/pull/2526
      
      Differential Revision: D5358689
      
      Pulled By: ajkr
      
      fbshipit-source-id: 857b189f24c19196f6bb299216f3e23e7bc4be42
      33042573
  4. 01 Jul 2017 (2 commits)
  5. 30 Jun 2017 (3 commits)
    • A
      Regression test for empty dedicated range deletion file · d310e0f3
      Committed by Andrew Kryczka
      Summary:
      Issue: #2478
      Fix: #2503
      
      The bug happened when all of these conditions were satisfied:
      
      - A subcompaction generates no keys
      - `RangeDelAggregator::ShouldAddTombstones()` returns true because there's at least one non-obsoleted range deletion in its map
      - None of the non-obsolete tombstones overlap with the subcompaction key-range
      
      Under those conditions, we were creating a dedicated file for range deletions which was left empty, thus causing an error in VersionEdit.
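
      A hedged sketch of how such a scenario can be set up through the public API (key choices are illustrative, and whether these exact calls trigger the bug depends on compaction internals):
      ```
      #include "rocksdb/db.h"

      // Sketch: leave a live range tombstone that does not overlap the
      // compacted key-range, while the compaction itself outputs no keys.
      void ReproSketch(rocksdb::DB* db) {
        db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                        "x1", "x9");
        db->Flush(rocksdb::FlushOptions());
        // Compact a disjoint range; before the #2503 fix this could emit an
        // empty dedicated range-deletion file and fail in VersionEdit.
        rocksdb::Slice begin("a"), end("b");
        db->CompactRange(rocksdb::CompactRangeOptions(), &begin, &end);
      }
      ```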
      
      I verified this test case fails before the #2503 fix and passes after.
      Closes https://github.com/facebook/rocksdb/pull/2521
      
      Differential Revision: D5352568
      
      Pulled By: ajkr
      
      fbshipit-source-id: f619cae39984ce9bb9b7a4e7a9ac0f2bb2ce43e9
      d310e0f3
    • M
      Add a fetch_add variation to AddDBStats · e9f91a51
      Committed by Maysam Yabandeh
      Summary:
      AddDBStats is implemented as two steps, a load and a store, which is more efficient than fetch_add but not thread-safe. Currently we have to protect concurrent access to AddDBStats with a mutex, which is less efficient than fetch_add.

      This patch adds the option to use fetch_add in AddDBStats (a sketch of both variants follows the numbers below). The results for my 2PC benchmark on sysbench are:
      - vanilla: 68618 tps
      - removing mutex on AddDBStats (unsafe): 69767 tps
      - fetch_add for all AddDBStats: 69200 tps
      - fetch_add only for concurrently accessed AddDBStats (this patch): 69579 tps
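
      A sketch of the two variants on a `std::atomic` counter (illustrative; not the actual InternalStats code):
      ```
      #include <atomic>
      #include <cstdint>

      std::atomic<uint64_t> db_stat{0};

      // Load + store: cheaper, but only safe when writers are already
      // serialized (e.g. under the DB mutex).
      void AddDBStatsUnsafe(uint64_t value) {
        db_stat.store(db_stat.load(std::memory_order_relaxed) + value,
                      std::memory_order_relaxed);
      }

      // fetch_add: safe under concurrent writers at slightly higher cost;
      // the patch uses this path only for concurrently-updated stats.
      void AddDBStatsConcurrent(uint64_t value) {
        db_stat.fetch_add(value, std::memory_order_relaxed);
      }
      ```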
      Closes https://github.com/facebook/rocksdb/pull/2505
      
      Differential Revision: D5330656
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: af64d7bee135b0e86b4fac323a4f9d9113eaa383
      e9f91a51
    • Z
      skip generating empty sst · c1b375e9
      Committed by zhangjinpeng1987
      Summary:
      When a compaction job outputs nothing, there is no need to generate an empty SST file, which would cause `VersionEdit::EncodeTo` to fail.
      ref https://github.com/facebook/rocksdb/issues/2478
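
      The shape of the fix, as a hedged sketch (names are illustrative; the real change lives in the compaction job's output-finishing path):
      ```
      #include <cstdint>
      #include <vector>

      // Sketch: drop compaction outputs that contain no entries instead of
      // installing them, so VersionEdit::EncodeTo never sees an empty SST.
      struct OutputFile {
        uint64_t file_number = 0;
        uint64_t num_entries = 0;
      };

      void PruneEmptyOutputs(std::vector<OutputFile>* outputs) {
        for (auto it = outputs->begin(); it != outputs->end();) {
          if (it->num_entries == 0) {
            // delete the physical file here, then drop its metadata
            it = outputs->erase(it);
          } else {
            ++it;
          }
        }
        // only the remaining outputs are recorded in the VersionEdit
      }
      ```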
      Closes https://github.com/facebook/rocksdb/pull/2503
      
      Differential Revision: D5350799
      
      Pulled By: ajkr
      
      fbshipit-source-id: df0b4fcf3507fe1c3c435208b762e75478e00143
      c1b375e9
  6. 29 Jun 2017 (3 commits)
    • M
      Improve Status message for block checksum mismatches · 397ab111
      Committed by Mike Kolupaev
      Summary:
      We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages.
      
      It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild.
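
      The idea, sketched with the public `Status` API (the exact message format in the PR may differ):
      ```
      #include <cstdint>
      #include <string>

      #include "rocksdb/status.h"

      // Attach file name and offset to a corruption error, so logs point at
      // the offending SST instead of a bare "block checksum mismatch".
      rocksdb::Status ChecksumMismatch(const std::string& file_name,
                                       uint64_t offset, uint32_t expected,
                                       uint32_t actual) {
        std::string msg = "block checksum mismatch: expected " +
                          std::to_string(expected) + ", got " +
                          std::to_string(actual) + " in " + file_name +
                          " offset " + std::to_string(offset);
        return rocksdb::Status::Corruption(msg);
      }
      ```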
      Closes https://github.com/facebook/rocksdb/pull/2507
      
      Differential Revision: D5345702
      
      Pulled By: al13n321
      
      fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339
      397ab111
    • S
      Make "make analyze" happy · 18c63af6
      Committed by Siying Dong
      Summary:
      "make analyze" is reporting some errors. It's complicated to look but it seems to me that they are all false positive. Anyway, I think cleaning them up is a good idea. Some of the changes are hacky but I don't know a better way.
      Closes https://github.com/facebook/rocksdb/pull/2508
      
      Differential Revision: D5341710
      
      Pulled By: siying
      
      fbshipit-source-id: 6070e430e0e41a080ef441e05e8ec827d45efab6
      18c63af6
    • M
      Fix the reported asan issues · 01534db2
      Committed by Maysam Yabandeh
      Summary:
      This is to resolve the ASAN complaints. In the meanwhile I am working on clarifying/revisiting the sync rules.
      Closes https://github.com/facebook/rocksdb/pull/2510
      
      Differential Revision: D5338660
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: ce6f6e0826d43a2c0bfa4328a00c78f73cd6498a
      01534db2
  7. 28 Jun 2017 (1 commit)
    • S
      FIFO Compaction with TTL · 1cd45cd1
      Committed by Sagar Vemuri
      Summary:
      Introducing FIFO compactions with TTL.
      
      FIFO compaction is based on size only, which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size.
      
      To address that request:
      - Added a new TTL option to FIFO compaction options.
      - Updated FIFO compaction score to take TTL into consideration.
      - Added a new table property, creation_time, to keep track of when the SST file is created.
      - Creation_time is set as below:
        - On Flush: Set to the time of flush.
        - On Compaction: Set to the max creation_time of all the files involved in the compaction.
        - On Repair and Recovery: Set to the time of repair/recovery.
        - Old files created prior to this code change will have a creation_time of 0.
      - FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction, i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)` (see the sketch after this list). This will enable cases where you might want to delete all files older than, say, 1 day.
      - FIFO compaction will fall back to the prior way of deleting files based on size if:
        - the creation_time of all files involved in compaction is 0.
        - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.
      
      This feature is not supported if max_open_files != -1 or with table formats other than Block-based.
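
      A hedged sketch of the TTL selection (illustrative; the real logic lives in the FIFO compaction picker):
      ```
      #include <cstdint>
      #include <vector>

      struct FileMeta {
        uint64_t creation_time;  // 0 for files created before this change
        uint64_t file_size;
      };

      // Pick TTL-expired files for deletion. Callers fall back to the old
      // size-based deletion if no file has a usable creation_time, or if the
      // total size still exceeds compaction_options_fifo.max_table_files_size.
      std::vector<const FileMeta*> PickExpiredFiles(
          const std::vector<FileMeta>& files, uint64_t current_time,
          uint64_t ttl) {
        std::vector<const FileMeta*> expired;
        for (const auto& f : files) {
          if (ttl > 0 && f.creation_time > 0 &&
              f.creation_time < current_time - ttl) {
            expired.push_back(&f);
          }
        }
        return expired;
      }
      ```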
      
      **Test Plan:**
      Added tests.
      
      **Benchmark results:**
      Base: FIFO with max size: 100MB ::
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100
      
      readwhilewriting :       1.924 micros/op 519858 ops/sec;   13.6 MB/s (1176277 of 5000000 found)
      ```
      
      With TTL (a low one for testing) ::
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20
      
      readwhilewriting :       1.902 micros/op 525817 ops/sec;   13.7 MB/s (1185057 of 5000000 found)
      ```
      Example Log lines:
      ```
      2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
      2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
      ...
      2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
      2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
      ```
      
      SST Files remaining in the dbbench dir, after db_bench execution completed:
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction)  $ ls -l /dev/shm//dbbench/*.sst
      -rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
      -rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
      -rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
      ```
      Closes https://github.com/facebook/rocksdb/pull/2480
      
      Differential Revision: D5305116
      
      Pulled By: sagar0
      
      fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
      1cd45cd1
  8. 27 Jun 2017 (4 commits)
    • S
      Fix Windows build broken by 5c97a7c0 · 89468c01
      Committed by Siying Dong
      Summary:
      A typo in a conversion breaks the Windows build. Fix it.
      Closes https://github.com/facebook/rocksdb/pull/2500
      
      Differential Revision: D5325962
      
      Pulled By: siying
      
      fbshipit-source-id: 2cefdafc9afbc85f856f403af7c876b622400630
      89468c01
    • E
      Encryption at rest support · 51778612
      Committed by Ewout Prangsma
      Summary:
      This PR adds support for encrypting data stored by RocksDB when written to disk.
      
      It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files.
      The encryption itself is done through a configurable `EncryptionProvider`. This class is asked to create a `BlockAccessCipherStream` for a file; this is where the actual encryption/decryption is done.
      Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!).
      
      The Counter operation mode uses an initial counter & random initialization vector (IV).
      Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (neither in the data nor in the file size).
      The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there.
      
      To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will set up an encrypted instance of the `Env` class to use for all tests.
      Typically you would run it like this:
      
      ```
      ENCRYPTED_ENV=1 make check_some
      ```
      
      There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain-text strings; with `ENCRYPTED_ENV` unset, it must find them.
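
      Usage would look roughly like this (a sketch; the class and function names follow this PR's description of `env_encryption.h`, and ROT13 remains demo-only):
      ```
      #include <memory>

      #include "rocksdb/db.h"
      #include "rocksdb/env_encryption.h"

      int main() {
        // Demo-only cipher; swap in a real block cipher for production use.
        rocksdb::ROT13BlockCipher cipher(/*blockSize=*/32);
        rocksdb::CTREncryptionProvider provider(cipher);
        std::unique_ptr<rocksdb::Env> env(
            rocksdb::NewEncryptedEnv(rocksdb::Env::Default(), &provider));

        rocksdb::Options options;
        options.create_if_missing = true;
        options.env = env.get();  // all file I/O now goes through encryption

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/encrypted_db", &db);
        delete db;
        return s.ok() ? 0 : 1;
      }
      ```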
      Closes https://github.com/facebook/rocksdb/pull/2424
      
      Differential Revision: D5322178
      
      Pulled By: sdwilsh
      
      fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f
      51778612
    • S
      Unit Tests for sync, range sync and file close failures · 5c97a7c0
      Committed by Siying Dong
      Summary: Closes https://github.com/facebook/rocksdb/pull/2454
      
      Differential Revision: D5255320
      
      Pulled By: siying
      
      fbshipit-source-id: 0080830fa8eb5da6de25e17ba68aee91018c7913
      5c97a7c0
    • S
      Fix bug that flush doesn't respond to fsync result · d757355c
      Committed by Siying Dong
      Summary:
      Due to a regression bug introduced two years ago by https://github.com/facebook/rocksdb/commit/6e9fbeb27c38329f33ae541302c44c8db8374f8c , we fail to check the return status of the fsync call. This means we can miss information from the file system and can potentially end up with corrupted data that we could otherwise have detected.
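
      The essence of the fix, sketched with POSIX fsync (the actual change checks the status returned through RocksDB's file wrappers):
      ```
      #include <unistd.h>

      #include <cerrno>
      #include <cstring>
      #include <string>

      #include "rocksdb/status.h"

      // Propagate fsync failure instead of dropping it; otherwise a flush can
      // report success for data the file system never actually persisted.
      rocksdb::Status SyncFile(int fd, const std::string& fname) {
        if (fsync(fd) < 0) {
          return rocksdb::Status::IOError(fname, strerror(errno));
        }
        return rocksdb::Status::OK();
      }
      ```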
      Closes https://github.com/facebook/rocksdb/pull/2495
      
      Reviewed By: ajkr
      
      Differential Revision: D5321949
      
      Pulled By: siying
      
      fbshipit-source-id: c68117914bb40700198fc37d0e4c63163a8a1031
      d757355c
  9. 25 Jun 2017 (2 commits)
    • M
      Update rename of ParanoidCheck · 8e6345d2
      Committed by Maysam Yabandeh
      Summary: Closes https://github.com/facebook/rocksdb/pull/2494
      
      Differential Revision: D5317902
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 097330292180816b3d0c9f4cbbdb6f68f0180200
      8e6345d2
    • M
      Optimize for serial commits in 2PC · 499ebb3a
      Committed by Maysam Yabandeh
      Summary:
      Throughput: 46k tps in our sysbench settings (filling the details later)
      
      The idea is to have the simplest change that gives us a reasonable boost
      in 2PC throughput.
      
      Major design changes:
      1. The WAL file internal buffer is not flushed after each write. Instead
      it is flushed before critical operations (WAL copy via fs) or when
      FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
      via mutex_.
      2. Use two sequence numbers: last seq, and last seq for write. Last seq
      is the last visible sequence number for reads. Last seq for write is the
      next sequence number that should be used to write to WAL/memtable. This
      allows a memtable write to proceed in parallel with WAL writes (a sketch
      of the two counters follows this list).
      3. BatchGroup is not used for writes. This means that we can have
      parallel writers, which changes a major assumption in the code base. To
      accommodate that, i) allow only one WriteImpl that intends to write to
      the memtable via mem_mutex_, which is fine since in 2PC almost all
      memtable writes come via the group-commit phase, which is serial anyway;
      ii) make all the parts of the code base that assumed they were the only
      writer (via EnterUnbatched) also acquire mem_mutex_; iii) protect stat
      updates via a stat_mutex_.
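
      A sketch of design change 2, the split sequence counters (illustrative names, not the actual VersionSet members):
      ```
      #include <atomic>
      #include <cstdint>

      // last_seq_: last sequence visible to readers.
      // last_seq_for_write_: next sequence to allocate for WAL/memtable
      // writes. Allocating write sequences ahead of visibility lets memtable
      // writes proceed in parallel with WAL writes; last_seq_ only advances
      // once the corresponding write has completed.
      std::atomic<uint64_t> last_seq_{0};
      std::atomic<uint64_t> last_seq_for_write_{0};

      uint64_t AllocateSeqForWrite(uint64_t count) {
        // returns the first of `count` reserved sequence numbers
        return last_seq_for_write_.fetch_add(count) + 1;
      }

      void PublishSeq(uint64_t seq) {
        last_seq_.store(seq, std::memory_order_release);  // visible to reads
      }
      ```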
      
      Note: the first commit has the approach figured out but is not clean.
      Submitting the PR anyway to get early feedback on the approach. If
      we are ok with the approach I will go ahead with these updates:
      0) Rebase with Yi's pipelining changes
      1) Currently batching is disabled by default to make sure that it will be
      consistent with all unit tests. Will make this optional via a config.
      2) A couple of unit tests are disabled. They need to be updated with the
      serial commit of 2PC taken into account.
      3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
      releasing mutex_ beforehand (the same way EnterUnbatched does). This
      needs to be cleaned up.
      Closes https://github.com/facebook/rocksdb/pull/2345
      
      Differential Revision: D5210732
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
      499ebb3a
  10. 23 Jun 2017 (2 commits)
    • A
      Introduce OnBackgroundError callback · 71f5bcb7
      Committed by Andrew Kryczka
      Summary:
      Some users want to prevent rocksdb from entering read-only mode in certain error cases. This diff gives them a callback, `OnBackgroundError`, that they can use to achieve it (a sketch follows the list below).
      
      - call `OnBackgroundError` every time we consider setting `bg_error_`. Use its result to assign `bg_error_` but not to change the function's return status.
      - classified calls using `BackgroundErrorReason` to give the callback some info about where the error happened
      - renamed `ParanoidCheck` to something more specific so we can provide a clear `BackgroundErrorReason`
      - unit tests for the most common cases: flush or compaction errors
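
      A sketch of a listener that keeps the DB writable after a compaction error (whether ignoring a given error is safe is entirely the user's call):
      ```
      #include "rocksdb/listener.h"

      class KeepWritableListener : public rocksdb::EventListener {
       public:
        void OnBackgroundError(rocksdb::BackgroundErrorReason reason,
                               rocksdb::Status* bg_error) override {
          if (reason == rocksdb::BackgroundErrorReason::kCompaction &&
              !bg_error->IsCorruption()) {
            // Clear the error so rocksdb does not enter read-only mode.
            *bg_error = rocksdb::Status::OK();
          }
        }
      };
      // Install via:
      //   options.listeners.push_back(std::make_shared<KeepWritableListener>());
      ```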
      Closes https://github.com/facebook/rocksdb/pull/2477
      
      Differential Revision: D5300190
      
      Pulled By: ajkr
      
      fbshipit-source-id: a0ea4564249719b83428e3f4c6ca2c49e366e9b3
      71f5bcb7
    • S
      Fix Data Race Between CreateColumnFamily() and GetAggregatedIntProperty() · 6837a176
      Committed by Siying Dong
      Summary:
      CreateColumnFamily() releases the DB mutex between adding the column family to the set and installing the super version (in order to write the option file), so if users call GetAggregatedIntProperty() in the middle, the super version will be null and the process will crash. Fix it by skipping those column families without a super version installed.

      Maybe we should also fix the problem of releasing the lock when reading the option file, but that is more risky, so I'm doing a quick and safer fix and we can investigate it later.
      Closes https://github.com/facebook/rocksdb/pull/2475
      
      Differential Revision: D5298053
      
      Pulled By: siying
      
      fbshipit-source-id: 4b3c8f91c60400b163fcc6cda8a0c77723be0ef6
      6837a176
  11. 21 Jun 2017 (2 commits)
  12. 20 Jun 2017 (1 commit)
  13. 14 Jun 2017 (2 commits)
  14. 13 Jun 2017 (2 commits)
  15. 12 Jun 2017 (2 commits)
  16. 09 Jun 2017 (1 commit)
  17. 06 Jun 2017 (3 commits)
  18. 03 Jun 2017 (5 commits)
    • A
      using ThreadLocalPtr to hide ROCKSDB_SUPPORT_THREAD_LOCAL from public… · 7f6c02dd
      Committed by Aaron Gao
      Summary:
      … headers
      
      https://github.com/facebook/rocksdb/pull/2199 should not have exposed RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) in public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users would have to provide these compiler flags when building their binaries with RocksDB.

      We should hide the thread-local global variable inside our implementation and just expose a function API to retrieve these variables. It may break some users for now, but it is good for the long term.
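
      The function-API shape, as a sketch (assuming the accessor is named `get_perf_context()`, per this change's direction):
      ```
      #include <cstdint>

      #include "rocksdb/perf_context.h"
      #include "rocksdb/perf_level.h"

      void Example() {
        rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableTime);
        // Function accessor instead of a thread-local global, so
        // ROCKSDB_SUPPORT_THREAD_LOCAL stays out of public headers.
        rocksdb::PerfContext* ctx = rocksdb::get_perf_context();
        ctx->Reset();
        // ... run some reads/writes against the DB here ...
        uint64_t comparisons = ctx->user_key_comparison_count;
        (void)comparisons;
      }
      ```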
      
      make check -j64
      Closes https://github.com/facebook/rocksdb/pull/2380
      
      Differential Revision: D5177896
      
      Pulled By: lightmark
      
      fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390
      7f6c02dd
    • M
      Fix interaction between CompactionFilter::Decision::kRemoveAndSkipUnt… · 138b87ea
      Committed by Mike Kolupaev
      Summary:
      Fixes the following scenario:
       1. Set prefix extractor. Enable bloom filters, with `whole_key_filtering = false`. Use compaction filter that sometimes returns `kRemoveAndSkipUntil`.
       2. Do a compaction.
       3. Compaction creates an iterator with `total_order_seek = false`, calls `SeekToFirst()` on it, then repeatedly calls `Next()`.
       4. At some point compaction filter returns `kRemoveAndSkipUntil`.
       5. Compaction calls `Seek(skip_until)` on the iterator. The key that it seeks to happens to have prefix that doesn't match the bloom filter. Since `total_order_seek = false`, iterator becomes invalid, and compaction thinks that it has reached the end. The rest of the compaction input is silently discarded.
      
      The fix is to make compaction iterator use `total_order_seek = true`.
      
      The implementation for PlainTable is quite awkward. I've made `kRemoveAndSkipUntil` officially incompatible with PlainTable. If you try to use them together, compaction will fail, and DB will enter read-only mode (`bg_error_`). That's not a very graceful way to communicate a misconfiguration, but the alternatives don't seem worth the implementation time and complexity. To be able to check in advance that `kRemoveAndSkipUntil` is not going to be used with PlainTable, we'd need to extend the interface of either `CompactionFilter` or `InternalIterator`. It seems unlikely that anyone will ever want to use `kRemoveAndSkipUntil` with PlainTable: PlainTable probably has very few users, and `kRemoveAndSkipUntil` has only one user so far: us (logdevice).
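
      A sketch of a compaction filter that exercises `kRemoveAndSkipUntil` (the key choices are illustrative):
      ```
      #include <string>

      #include "rocksdb/compaction_filter.h"

      class SkipRangeFilter : public rocksdb::CompactionFilter {
       public:
        Decision FilterV2(int /*level*/, const rocksdb::Slice& key,
                          ValueType /*value_type*/,
                          const rocksdb::Slice& /*existing_value*/,
                          std::string* /*new_value*/,
                          std::string* skip_until) const override {
          if (key.starts_with("expired_")) {
            // Drop this key and everything before *skip_until. After this
            // fix, the compaction iterator seeks in total order, so the
            // target is found even with prefix bloom filters enabled.
            *skip_until = "f";
            return Decision::kRemoveAndSkipUntil;
          }
          return Decision::kKeep;
        }
        const char* Name() const override { return "SkipRangeFilter"; }
      };
      ```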
      Closes https://github.com/facebook/rocksdb/pull/2349
      
      Differential Revision: D5110388
      
      Pulled By: lightmark
      
      fbshipit-source-id: ec29101a99d9dcd97db33923b87f72bce56cc17a
      138b87ea
    • S
      Improve write buffer manager (and allow the size to be tracked in block cache) · 95b0e89b
      Committed by Siying Dong
      Summary:
      Improve write buffer manager in several ways:
      1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
      2. We start to trigger memtable flush when 7/8 of the memory cap is reached, instead of 100%, and make 100% much harder to hit.
      3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
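
      Point 3 in code, as a sketch (sizes are illustrative):
      ```
      #include <memory>

      #include "rocksdb/cache.h"
      #include "rocksdb/options.h"
      #include "rocksdb/write_buffer_manager.h"

      void ConfigureSharedMemoryCap(rocksdb::Options* options,
                                    std::shared_ptr<rocksdb::Cache> cache) {
        // Memtable allocations are costed against the passed-in cache, giving
        // one memory cap across block cache and memtables.
        options->write_buffer_manager =
            std::make_shared<rocksdb::WriteBufferManager>(
                512 * 1024 * 1024 /* memtable cap */, cache);
        // the same cache should also be set as the block cache in
        // BlockBasedTableOptions
      }
      ```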
      Closes https://github.com/facebook/rocksdb/pull/2350
      
      Differential Revision: D5110648
      
      Pulled By: siying
      
      fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
      95b0e89b
    • A
      Pass CF ID to MemTableRepFactory · a4d9c025
      Committed by Andrew Kryczka
      Summary:
      Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff:
      
      - adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload.
      - updates MemTable to create MemTableRep's using the new overload.
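
      A sketch of a factory that consumes the new overload by wrapping the built-in skip list (the logging is illustrative):
      ```
      #include <cstdint>
      #include <cstdio>

      #include "rocksdb/memtablerep.h"

      class CfAwareFactory : public rocksdb::MemTableRepFactory {
       public:
        // Old overload (still required): delegate to the built-in skip list.
        rocksdb::MemTableRep* CreateMemTableRep(
            const rocksdb::MemTableRep::KeyComparator& cmp,
            rocksdb::Allocator* allocator,
            const rocksdb::SliceTransform* transform,
            rocksdb::Logger* logger) override {
          return skiplist_.CreateMemTableRep(cmp, allocator, transform, logger);
        }
        // New overload from this diff: same, but the CF ID is now visible.
        rocksdb::MemTableRep* CreateMemTableRep(
            const rocksdb::MemTableRep::KeyComparator& cmp,
            rocksdb::Allocator* allocator,
            const rocksdb::SliceTransform* transform, rocksdb::Logger* logger,
            uint32_t column_family_id) override {
          fprintf(stderr, "creating memtable for CF %u\n", column_family_id);
          return CreateMemTableRep(cmp, allocator, transform, logger);
        }
        const char* Name() const override { return "CfAwareFactory"; }

       private:
        rocksdb::SkipListFactory skiplist_;
      };
      ```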
      Closes https://github.com/facebook/rocksdb/pull/2346
      
      Differential Revision: D5108061
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616
      a4d9c025
    • Y
      Fix DBWriteTest::ReturnSequenceNumberMultiThreaded data race · f68d88be
      Committed by Yi Wu
      Summary:
      rocksdb::Random is not thread-safe. Have one Random for each thread instead.
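
      The fix pattern in general form (a sketch using the standard library; the test applies the same idea with rocksdb's internal Random):
      ```
      #include <random>
      #include <thread>
      #include <vector>

      void RunThreads(int n) {
        std::vector<std::thread> threads;
        for (int i = 0; i < n; ++i) {
          threads.emplace_back([i] {
            // One RNG per thread: engines like std::mt19937 (and
            // rocksdb::Random) are not safe to share across threads.
            std::mt19937 rnd(1000 + i);
            volatile unsigned x = rnd();  // use the per-thread generator
            (void)x;
          });
        }
        for (auto& t : threads) t.join();
      }
      ```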
      Closes https://github.com/facebook/rocksdb/pull/2400
      
      Differential Revision: D5173919
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 1a99c7b877f3893eb22355af49e321bcad4e53e6
      f68d88be
  19. 02 Jun 2017 (2 commits)
    • A
      Fix TSAN: avoid arena mode with range deletions · 215076ef
      Committed by Andrew Kryczka
      Summary:
      The range deletion meta-block iterators weren't getting cleaned up properly since they don't support arena allocation. I didn't implement arena support since, in the general case, each iterator is used only once and separately from all other iterators, so there should be no benefit to data locality.
      
      Anyway, this diff fixes #2370 by treating range deletion iterators as non-arena-allocated.
      Closes https://github.com/facebook/rocksdb/pull/2399
      
      Differential Revision: D5171119
      
      Pulled By: ajkr
      
      fbshipit-source-id: bef6f5c4c5905a124f4993945aed4bd86e2807d8
      215076ef
    • A
      account for L0 size in estimated compaction bytes · 3a8a848a
      Committed by Andrew Kryczka
      Summary:
      Also changed the `>` in the comparison against `level0_file_num_compaction_trigger` into a `>=`, since exactly `level0_file_num_compaction_trigger` L0 files can trigger a compaction from L0.
      Closes https://github.com/facebook/rocksdb/pull/2179
      
      Differential Revision: D4915772
      
      Pulled By: ajkr
      
      fbshipit-source-id: e38fec6253de6f9a40e61734615c6670d84038aa
      3a8a848a