1. 05 May 2018, 2 commits
  2. 04 May 2018, 4 commits
    • MaxFileSizeForLevel: adjust max_file_size for dynamic level compaction · a7034328
      Committed by Zhongyi Xie
      Summary:
      `MutableCFOptions::RefreshDerivedOptions` always assumes the base level is L1, which is not true when `level_compaction_dynamic_level_bytes=true` and level-based compaction is used.
      This PR fixes this by recomputing `max_file_size` at query time (in `MaxFileSizeForLevel`).
      Fixes https://github.com/facebook/rocksdb/issues/3229
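      The recomputation can be sketched as follows (a minimal illustration, not the actual RocksDB internals; the function signature and parameter names here just mirror the options involved):

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Derive the target file size for `level` on demand from the (possibly
      // dynamic) base_level, instead of precomputing it assuming base level L1.
      uint64_t MaxFileSizeForLevel(int level, int base_level,
                                   uint64_t target_file_size_base,
                                   int target_file_size_multiplier) {
        assert(level >= base_level);
        uint64_t size = target_file_size_base;
        for (int l = base_level; l < level; ++l) {
          size *= target_file_size_multiplier;
        }
        return size;
      }

      int main() {
        // With dynamic leveling the base level may be 5, not 1. 2MB base, x2.
        assert(MaxFileSizeForLevel(5, 5, 2097152, 2) == 2097152);
        assert(MaxFileSizeForLevel(6, 5, 2097152, 2) == 4194304);
        return 0;
      }
      ```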
      
      In master:
      
      ```
      Level Files Size(MB)
      --------------------
        0       14      846
        1        0        0
        2        0        0
        3        0        0
        4        0        0
        5       15      366
        6       11      481
      Cumulative compaction: 3.83 GB write, 2.27 GB read
      ```
      In branch:
      ```
      Level Files Size(MB)
      --------------------
        0        9      544
        1        0        0
        2        0        0
        3        0        0
        4        0        0
        5        0        0
        6      445      935
      Cumulative compaction: 2.91 GB write, 1.46 GB read
      ```
      
      db_bench command used:
      ```
      ./db_bench --benchmarks="fillrandom,deleterandom,fillrandom,levelstats,stats" --statistics -deletes=5000 -db=tmp -compression_type=none --num=20000 -value_size=100000 -level_compaction_dynamic_level_bytes=true -target_file_size_base=2097152 -target_file_size_multiplier=2
      ```
      Closes https://github.com/facebook/rocksdb/pull/3755
      
      Differential Revision: D7721381
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 39afb8503190bac3b466adf9bbf2a9b3655789f8
    • Better destroydb · 934f96de
      Committed by Dmitri Smirnov
      Summary:
      Delete the archive directory before the WAL folder, since the archive may be contained as a subfolder. Also improve loop readability.
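      The deletion order can be sketched like this (illustrative only, using std::filesystem rather than RocksDB's Env; the directory layout is an assumption):

      ```cpp
      #include <cassert>
      #include <filesystem>
      #include <fstream>

      namespace fs = std::filesystem;

      // Because the archive directory may live inside the WAL directory,
      // remove it first, then the WAL directory itself.
      void DestroyWalDir(const fs::path& wal_dir) {
        fs::path archive = wal_dir / "archive";
        if (fs::exists(archive)) {
          fs::remove_all(archive);  // delete archive before its parent
        }
        fs::remove_all(wal_dir);
      }

      int main() {
        fs::path wal = fs::temp_directory_path() / "wal_demo";
        fs::create_directories(wal / "archive");
        std::ofstream(wal / "archive" / "000001.log") << "x";
        DestroyWalDir(wal);
        assert(!fs::exists(wal));
        return 0;
      }
      ```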
      Closes https://github.com/facebook/rocksdb/pull/3797
      
      Differential Revision: D7866378
      
      Pulled By: riversand963
      
      fbshipit-source-id: 0c45d97677ce6fbefa3f8d602ef5e2a2a925e6f5
    • Speedup ManualCompactionTest.Test · a8d77ca3
      Committed by Maysam Yabandeh
      Summary:
      ManualCompactionTest.Test occasionally times out in the tsan flavor of our test infra. The patch reduces the number of keys to make the test run faster. The change does not seem to negatively impact the coverage of the test.
      Closes https://github.com/facebook/rocksdb/pull/3802
      
      Differential Revision: D7865596
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: b4f60e32c3ae1677e25506f71c766e33fa985785
    • Skip deleted WALs during recovery · d5954929
      Committed by Siying Dong
      Summary:
      This patch records the min log number to keep in the manifest while flushing SST files, so that those files and any WAL older than that number can be ignored during recovery. This avoids scenarios where there is a gap in the WAL files fed to the recovery procedure. Such a gap could happen, for example, through out-of-order WAL deletion, and could cause problems in 2PC recovery, where the prepare and commit entries are placed into two separate WALs: the gap could result in not processing the WAL with the commit entry and hence break the 2PC recovery logic.
      
      Before this commit, for the 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest log with outstanding prepare entries, or with prepare entries whose respective commit or abort are in the memtable. With this commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for the newly flushed SST file. This precomputed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until the next flush because the commit entry will stay in the memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. That's not yet done anyway. Even if we do it, the only thing we lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway because the obsolete-file cleanup function is only executed after a flush or compaction.)
      
      This min log number to keep is stored in the manifest using the safely-ignorable custom field of the AddFile entry, in order to guarantee that a DB generated by a newer release can be opened by previous releases no older than 4.2.
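      The bookkeeping can be sketched as follows (a simplified model; the function and its inputs are hypothetical stand-ins for the real flush-install path):

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstdint>
      #include <vector>

      // Just before installing a flush result, compute the earliest WAL still
      // referenced by memtables that were NOT part of this flush; WALs below
      // that number are safe to delete, and the value is persisted to the
      // manifest alongside the new SST file.
      uint64_t MinLogNumberToKeep(const std::vector<uint64_t>& live_memtable_logs,
                                  uint64_t current_log_number) {
        if (live_memtable_logs.empty()) {
          return current_log_number;  // nothing older is needed
        }
        return *std::min_element(live_memtable_logs.begin(),
                                 live_memtable_logs.end());
      }

      int main() {
        // Memtables skipped by the flush still reference logs 7 and 9.
        assert(MinLogNumberToKeep({9, 7}, 12) == 7);
        // All memtables flushed: only the current log must be kept.
        assert(MinLogNumberToKeep({}, 12) == 12);
        return 0;
      }
      ```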
      Closes https://github.com/facebook/rocksdb/pull/3765
      
      Differential Revision: D7747618
      
      Pulled By: siying
      
      fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
  3. 03 May 2018, 2 commits
    • WritePrepared Txn: enable rollback in stress test · cfb86659
      Committed by Maysam Yabandeh
      Summary:
      Rollback was disabled in stress test since there was a concurrency issue in WritePrepared rollback algorithm. The issue is fixed by caching the column family handles in WritePrepared to skip getting them from the db when needed for rollback.
      
      Tested by running transaction stress test under tsan.
      Closes https://github.com/facebook/rocksdb/pull/3785
      
      Differential Revision: D7793727
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: d81ab6fda0e53186ca69944cfe0712ce4869451e
    • WritePrepared Txn: split SeqAdvanceConcurrentTest · 5bed8a00
      Committed by Maysam Yabandeh
      Summary:
      The tsan flavor of SeqAdvanceConcurrentTest times out in our test infra. The patch splits it into 10 tests.
      On my VM, before:
      ```
      [       OK ] WritePreparedTransactionTest/WritePreparedTransactionTest.SeqAdvanceConcurrentTest/0 (5194 ms)
      ```
      after:
      ```
      [       OK ] OneWriteQueue/SeqAdvanceConcurrentTest.SeqAdvanceConcurrentTest/0 (1906 ms)
      ```
      Closes https://github.com/facebook/rocksdb/pull/3799
      
      Differential Revision: D7854515
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 4fbac42a1f974326cbc237f8cb9d6232d379c431
  4. 02 May 2018, 3 commits
  5. 01 May 2018, 2 commits
    • Second attempt at db_stress crash-recovery verification · 46152d53
      Committed by Andrew Kryczka
      Summary:
      - Original commit: a4fb1f8c
      - Revert commit (we reverted as a quick fix to get crash tests passing): 6afe22db
      
      This PR includes the contents of the original commit plus two bug fixes, which are:
      
      - In the whitebox crash test, only set `--expected_values_path` for `db_stress` runs in the first half of the crash test's duration. In the second half, a fresh DB is created for each `db_stress` run, so we cannot maintain expected state across `db_stress` runs.
      - Made `Exists()` return true for `UNKNOWN_SENTINEL` values. I previously had an assert in `Exists()` that the value was not `UNKNOWN_SENTINEL`. But it is possible for post-crash-recovery expected values to be `UNKNOWN_SENTINEL` (i.e., if the crash happens in the middle of an update), in which case this assertion would be tripped. The effect of returning true in this case is that a `SingleDelete` may delete no data. But if we had returned false, the effect would be calling `SingleDelete` on a key with multiple older versions, which is not supported.
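      A minimal sketch of the described `Exists()` behavior (the sentinel constants here are illustrative, not db_stress's actual values):

      ```cpp
      #include <cassert>
      #include <cstdint>

      // A value of UNKNOWN_SENTINEL means a crash interrupted an update, so
      // the key's state is unknown; treating it as "exists" keeps SingleDelete
      // safe: at worst it deletes nothing, which is benign, whereas reporting
      // false could lead to SingleDelete on a key with multiple older
      // versions, which is unsupported.
      constexpr uint32_t UNKNOWN_SENTINEL = 0xFFFFFFFF;
      constexpr uint32_t DELETED = 0xFFFFFFFE;  // illustrative tombstone

      bool Exists(uint32_t expected_value) {
        if (expected_value == UNKNOWN_SENTINEL) {
          return true;  // unknown post-crash state: report true
        }
        return expected_value != DELETED;
      }

      int main() {
        assert(Exists(UNKNOWN_SENTINEL));
        assert(Exists(42));
        assert(!Exists(DELETED));
        return 0;
      }
      ```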
      Closes https://github.com/facebook/rocksdb/pull/3793
      
      Differential Revision: D7811671
      
      Pulled By: ajkr
      
      fbshipit-source-id: 67e0295bfb1695ff9674837f2e05bb29c50efc30
    • fix missing perfcontext destroy declare in C API · 282099fc
      Committed by Vincent Lee
      Summary:
      The `rocksdb_perfcontext_destroy` declaration is missing from the C API.
      Closes https://github.com/facebook/rocksdb/pull/3787
      
      Differential Revision: D7816490
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3a488607bfc897c7ce846a1b3c2b7af693134d0d
  6. 28 April 2018, 5 commits
  7. 27 April 2018, 7 commits
    • Rename pending_flush_ to queued_for_flush_. · 513b5ce6
      Committed by Yanqin Jin
      Summary:
      With ColumnFamilyData::pending_flush_, we have the following code snippet in DBImpl::SchedulePendingFlush
      
      ```
      if (!cfd->pending_flush() && cfd->imm()->IsFlushPending()) {
      ...
      }
      ```
      
      `Pending` is ambiguous, and I feel `queued_for_flush` is a better name,
      especially for the sake of readability.
      Closes https://github.com/facebook/rocksdb/pull/3777
      
      Differential Revision: D7783066
      
      Pulled By: riversand963
      
      fbshipit-source-id: f1bd8c8bfe5eafd2c94da0d8566c9b2b6bb57229
    • Add virtual Truncate method to Env · 37cd617b
      Committed by Nathan VanBenschoten
      Summary:
      This change adds a virtual `Truncate` method to `Env`, which truncates
      the named file to the specified size. At the moment, this is only
      supported for `MockEnv`, but other `Env`s could be extended to override
      the method too. This is the same approach that methods like `LinkFile` and
      `AreSameFile` have taken.
      
      This is useful for any user of the in-memory `Env`. The implementation's
      header is not exported, so before this change, it was impossible to
      access its existing `Truncate` method.
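      The pattern can be sketched as follows (types heavily simplified; these are not the real `Env`/`Status` declarations):

      ```cpp
      #include <cassert>
      #include <string>

      // The base class grows a virtual Truncate that returns NotSupported by
      // default, and an Env that can implement it overrides it -- the same
      // approach taken by LinkFile and AreSameFile.
      struct Status {
        bool ok_;
        static Status OK() { return {true}; }
        static Status NotSupported() { return {false}; }
        bool ok() const { return ok_; }
      };

      struct Env {
        virtual ~Env() = default;
        virtual Status Truncate(const std::string& /*fname*/, size_t /*size*/) {
          return Status::NotSupported();  // most Envs don't support it
        }
      };

      struct MockEnv : Env {
        size_t last_size = 0;
        Status Truncate(const std::string& /*fname*/, size_t size) override {
          last_size = size;  // the in-memory env can honor the request
          return Status::OK();
        }
      };

      int main() {
        Env base;
        MockEnv mock;
        assert(!base.Truncate("f", 10).ok());
        assert(mock.Truncate("f", 10).ok() && mock.last_size == 10);
        return 0;
      }
      ```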
      Closes https://github.com/facebook/rocksdb/pull/3779
      
      Differential Revision: D7785789
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3bcdaeea7b7180529f7d9b496dc67b791a00bbf0
    • Allow options file in db_stress and db_crashtest · db36f222
      Committed by Andrew Kryczka
      Summary:
      - When options file is provided to db_stress, take supported options from the file instead of from flags
      - Call `BuildOptionsTable` after `Open` so it can use `options_` once it has been populated either from flags or from file
      - Allow options filename to be passed via `db_crashtest.py`
      Closes https://github.com/facebook/rocksdb/pull/3768
      
      Differential Revision: D7755331
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5205cc5deb0d74d677b9832174153812bab9a60a
    • Remove block-based table assertion for non-empty filter block · 7004e454
      Committed by Andrew Kryczka
      Summary:
      7a6353bd prevents empty filter blocks from being written for SST files containing range deletions only. However, the assertion this PR removes is still a problem, as we could be reading from a DB generated by a RocksDB build without the 7a6353bd patch. So remove the assertion. We already skip this check when `cache_index_and_filter_blocks=false`, so it should be safe.
      Closes https://github.com/facebook/rocksdb/pull/3773
      
      Differential Revision: D7769964
      
      Pulled By: ajkr
      
      fbshipit-source-id: 7285762446f2cd2ccf16efd7a988a106fbb0d8d3
    • Sync parent directory after deleting a file in delete scheduler · 63c965cd
      Committed by Siying Dong
      Summary:
      Sync the parent directory after deleting a file in the delete scheduler. Otherwise, trim speed may not be as smooth as we want.
      Closes https://github.com/facebook/rocksdb/pull/3767
      
      Differential Revision: D7760136
      
      Pulled By: siying
      
      fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f
    • Fix the bloom filter skipping empty prefixes · 7e4e3814
      Committed by Maysam Yabandeh
      Summary:
      bc0da4b5 optimized bloom filters by skipping duplicate entries when the whole key and prefixes are both added to the bloom. However, it used the empty string as the initial value of the last entry added to the bloom. This is incorrect since an empty key/prefix is a valid entry by itself. This patch fixes that.
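      The fix can be sketched with an explicit presence flag instead of the empty-string sentinel (a simplified model, not the actual filter-builder code):

      ```cpp
      #include <cassert>
      #include <string>
      #include <vector>

      // Track whether a previous entry exists with an explicit flag, so an
      // empty key or prefix is treated as a real entry rather than as
      // "nothing added yet".
      struct DedupAdder {
        bool has_last = false;
        std::string last;
        std::vector<std::string> added;  // stands in for the bloom builder

        void Add(const std::string& entry) {
          if (has_last && entry == last) return;  // genuine duplicate: skip
          has_last = true;
          last = entry;
          added.push_back(entry);
        }
      };

      int main() {
        DedupAdder d;
        d.Add("");     // empty entry is valid and must be added
        d.Add("");     // duplicate empty entry: skipped
        d.Add("abc");
        d.Add("abc");  // duplicate: skipped
        assert(d.added.size() == 2);
        return 0;
      }
      ```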
      Closes https://github.com/facebook/rocksdb/pull/3776
      
      Differential Revision: D7778803
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: d5a065daebee17f9403cac51e9d5626aac87bfbc
    • WritePrepared Txn: disable rollback in stress test · e5a4dacf
      Committed by Maysam Yabandeh
      Summary:
      The WritePrepared rollback implementation is not ready to be invoked in the middle of a workload. This is due to the lack of synchronization needed to obtain the cf handle from the db. Temporarily disabling this until the problem with rollback is fixed.
      Closes https://github.com/facebook/rocksdb/pull/3772
      
      Differential Revision: D7769041
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 0e3b0ce679bc2afba82e653a40afa3f045722754
  8. 26 April 2018, 5 commits
  9. 25 April 2018, 2 commits
    • Add crash-recovery correctness check to db_stress · a4fb1f8c
      Committed by Andrew Kryczka
      Summary:
      Previously, our `db_stress` tool held the expected state of the DB in-memory, so after crash-recovery, there was no way to verify data correctness. This PR adds an option, `--expected_values_file`, which specifies a file holding the expected values.
      
      In black-box testing, the `db_stress` process can be killed arbitrarily, so updates to the `--expected_values_file` must be atomic. We achieve this by `mmap`ing the file and relying on `std::atomic<uint32_t>` for atomicity. Actually this doesn't provide a total guarantee on what we want, as `std::atomic<uint32_t>` could, in theory, be translated into multiple stores surrounded by a mutex. We can verify our assumption by checking `std::atomic::is_always_lock_free`.
      
      For the `mmap`'d file, we didn't have an existing way to expose its contents as a raw memory buffer. This PR adds it in the `Env::NewMemoryMappedFileBuffer` function, and `MemoryMappedFileBuffer` class.
      
      `db_crashtest.py` is updated to use an expected values file for black-box testing. On the first iteration (when the DB is created), an empty file is provided as `db_stress` will populate it when it runs. On subsequent iterations, that same filename is provided so `db_stress` can check the data is as expected on startup.
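      The atomicity assumption can be sketched as follows (a self-contained demo using an anonymous mapping; db_stress maps the expected-values file instead):

      ```cpp
      #include <atomic>
      #include <cassert>
      #include <cstdint>
      #include <new>
      #include <sys/mman.h>

      // If the atomic were implemented with a hidden mutex, a crash could
      // leave a slot mid-update; the static_assert checks the lock-freedom
      // assumption at compile time.
      static_assert(std::atomic<uint32_t>::is_always_lock_free,
                    "plain atomic stores are required for crash safety");

      // Store and reload a value through a std::atomic<uint32_t> placed in an
      // mmap'd region.
      uint32_t WriteAndReadSlot(uint32_t v) {
        void* mem = mmap(nullptr, sizeof(std::atomic<uint32_t>),
                         PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
                         -1, 0);
        assert(mem != MAP_FAILED);
        auto* slot = new (mem) std::atomic<uint32_t>(0);
        slot->store(v, std::memory_order_relaxed);  // atomic in-place update
        uint32_t out = slot->load(std::memory_order_relaxed);
        munmap(mem, sizeof(std::atomic<uint32_t>));
        return out;
      }

      int main() {
        assert(WriteAndReadSlot(42) == 42);
        return 0;
      }
      ```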
      Closes https://github.com/facebook/rocksdb/pull/3629
      
      Differential Revision: D7463144
      
      Pulled By: ajkr
      
      fbshipit-source-id: c8f3e82c93e045a90055e2468316be155633bd8b
    • Skip duplicate bloom keys when whole_key and prefix are mixed · bc0da4b5
      Committed by Maysam Yabandeh
      Summary:
      Currently we rely on FilterBitsBuilder to skip duplicate keys. It does that by comparing the hash of the key to the hash of the last added entry. This logic breaks, however, when whole_key_filtering is mixed with prefix blooms, as their additions to FilterBitsBuilder will be interleaved. The patch fixes that by comparing the last whole key and the last prefix with the whole key and prefix of the new key, respectively, and skipping the call to FilterBitsBuilder if it is a duplicate.
      Closes https://github.com/facebook/rocksdb/pull/3764
      
      Differential Revision: D7744413
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 15df73bbbafdfd754d4e1f42ea07f47b03bc5eb8
  10. 24 April 2018, 3 commits
    • Support lowering CPU priority of background threads · 090c78a0
      Committed by Gabriel Wicke
      Summary:
      Background activities like compaction can negatively affect
      latency of higher-priority tasks like request processing. To avoid this,
      rocksdb already lowers the IO priority of background threads on Linux
      systems. While this takes care of typical IO-bound systems, it does not
      help much when CPU (temporarily) becomes the bottleneck. This is
      especially likely when using more expensive compression settings.
      
      This patch adds an API to allow for lowering the CPU priority of
      background threads, modeled on the IO priority API. Benchmarks (see
      below) show significant latency and throughput improvements when CPU
      bound. As a result, workloads with some CPU usage bursts should benefit
      from lower latencies at a given utilization, or should be able to push
      utilization higher at a given request latency target.
      
      A useful side effect is that compaction CPU usage is now easily visible
      in common tools, allowing for an easier estimation of the contribution
      of compaction vs. request processing threads.
      
      As with IO priority, the implementation is limited to Linux, degrading
      to a no-op on other systems.
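      On Linux the mechanism boils down to adjusting the calling thread's nice value, roughly as below (a sketch only; the API added by the patch is modeled on the IO-priority one, and the names here are hypothetical):

      ```cpp
      #include <cassert>
      #include <sys/resource.h>

      // Lower the CPU scheduling priority of the calling thread by raising
      // its nice value; degrades to a no-op on non-Linux systems, mirroring
      // the patch. On Linux, PRIO_PROCESS with who == 0 targets the calling
      // thread, since nice values are per-thread there.
      bool LowerThreadCpuPriority() {
      #ifdef __linux__
        return setpriority(PRIO_PROCESS, 0, 19) == 0 &&  // 19 = lowest
               getpriority(PRIO_PROCESS, 0) == 19;
      #else
        return true;  // no-op elsewhere
      #endif
      }

      int main() {
        assert(LowerThreadCpuPriority());
        return 0;
      }
      ```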
      Closes https://github.com/facebook/rocksdb/pull/3763
      
      Differential Revision: D7740096
      
      Pulled By: gwicke
      
      fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c
    • Improve write time breakdown stats · affe01b0
      Committed by Mike Kolupaev
      Summary:
      There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes.
      
      These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing.
      Closes https://github.com/facebook/rocksdb/pull/3602
      
      Differential Revision: D7251562
      
      Pulled By: al13n321
      
      fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d
    • Revert "Skip deleted WALs during recovery" · d5afa737
      Committed by Siying Dong
      Summary:
      This reverts commit 73f21a7b.
      
      It breaks compatibility. When a DB is created using a build with this new change, opening the DB and reading the data will fail with this error:
      
      "Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory"
      
      This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but no such file exists.
      Closes https://github.com/facebook/rocksdb/pull/3762
      
      Differential Revision: D7730035
      
      Pulled By: siying
      
      fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77
  11. 21 April 2018, 5 commits
    • Avoid directory renames in BackupEngine · a8a28da2
      Committed by Andrew Kryczka
      Summary:
      We used to name private directories like "1.tmp" while BackupEngine populated them, and then rename without the ".tmp" suffix (i.e., rename "1.tmp" to "1") after all files were copied. On glusterfs, directory renames like this require operations across many hosts, and partial failures have caused operational problems.
      
      Fortunately we don't need to rename private directories. We already have a meta-file that uses the tempfile-rename pattern to commit a backup atomically after all its files have been successfully copied. So we can copy private files directly to their final location, and there is no longer a directory rename.
      Closes https://github.com/facebook/rocksdb/pull/3749
      
      Differential Revision: D7705610
      
      Pulled By: ajkr
      
      fbshipit-source-id: fd724a28dd2bf993ce323a5f2cb7e7d6980cc346
    • Disable EnvPosixTest::FilePermission · 2e72a589
      Committed by Yi Wu
      Summary:
      The test is flaky in our CI but could not be reproduced manually on the same CI host. Disabling it.
      Closes https://github.com/facebook/rocksdb/pull/3753
      
      Differential Revision: D7716320
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 6bed3b05880c1d24e8dc86bc970e5181bc98fb45
    • WritePrepared Txn: rollback via commit · bb2a2ec7
      Committed by Maysam Yabandeh
      Summary:
      Currently WritePrepared rolls back a transaction with prepare sequence number prepare_seq by i) writing a single rollback batch with rollback_seq, ii) adding <rollback_seq, rollback_seq> to the commit cache, and iii) removing prepare_seq from the PrepareHeap.
      This is correct assuming that no snapshot is taken while a transaction is being rolled back, which is the case for the way MySQL does rollback, i.e., after recovery. Otherwise, if max_evicted_seq advances past prepare_seq, a live snapshot might assume the data is committed since it does not find it in the CommitCache.
      The change is to simply add <prepare_seq, rollback_seq> to the commit cache before removing prepare_seq from the PrepareHeap. This way, if max_evicted_seq advances past prepare_seq, the existing mechanism that checks evicted entries against live snapshots will make sure the live snapshot does not see the data of the rolled-back transaction.
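      The ordering change can be modeled with toy structures (greatly simplified; the container types and visibility rule here are illustrative, not the real WritePrepared internals):

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <map>
      #include <set>

      // Publish <prepare_seq, rollback_seq> to the commit cache BEFORE
      // removing prepare_seq from the prepare heap, so a live snapshot that
      // no longer sees prepare_seq as pending finds it in the commit cache
      // and judges visibility by rollback_seq.
      struct TxnState {
        std::set<uint64_t> prepare_heap;            // outstanding prepares
        std::map<uint64_t, uint64_t> commit_cache;  // prepare_seq -> commit_seq

        void Rollback(uint64_t prepare_seq, uint64_t rollback_seq) {
          commit_cache[prepare_seq] = rollback_seq;  // step 1: publish mapping
          prepare_heap.erase(prepare_seq);           // step 2: then unprepare
        }

        // A snapshot at snap_seq sees prepare_seq only if its commit entry
        // exists and is <= snap_seq.
        bool VisibleTo(uint64_t prepare_seq, uint64_t snap_seq) const {
          auto it = commit_cache.find(prepare_seq);
          return it != commit_cache.end() && it->second <= snap_seq;
        }
      };

      int main() {
        TxnState s;
        s.prepare_heap.insert(10);
        s.Rollback(/*prepare_seq=*/10, /*rollback_seq=*/15);
        // Snapshot taken at seq 12, before the rollback batch: must not see it.
        assert(!s.VisibleTo(10, 12));
        // A snapshot after the rollback batch sees the rollback data.
        assert(s.VisibleTo(10, 20));
        return 0;
      }
      ```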
      Closes https://github.com/facebook/rocksdb/pull/3745
      
      Differential Revision: D7696193
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: c9a2d46341ddc03554dded1303520a1cab74ef9c
    • Add a stat for MultiGet keys found, update memtable hit/miss stats · dbdaa466
      Committed by Anand Ananthabhotla
      Summary:
      1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the
      number of keys successfully read
      2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(). It was being done in
      DBImpl::GetImpl(), but not in MultiGet
      Closes https://github.com/facebook/rocksdb/pull/3730
      
      Differential Revision: D7677364
      
      Pulled By: anand1976
      
      fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5
    • WritePrepared Txn: enable TryAgain for duplicates at the end of the batch · c3d1e36c
      Committed by Maysam Yabandeh
      Summary:
      WriteBatch::Iterate will retry with a larger sequence number if the memtable reports a duplicate. This is signaled with a TryAgain status. So far the assumption was that the last entry in the batch will never return TryAgain, which is correct when the WAL is created via WritePrepared, since it always appends a batch separator if a natural one does not exist. However, when reading a WAL generated by WriteCommitted, this batch separator might not exist. Although WritePrepared is not supposed to be able to read a WAL generated by WriteCommitted, we should avoid confusing scenarios in which the behavior becomes unpredictable. The patch fixes that by allowing TryAgain even for the last entry of the write batch.
      Closes https://github.com/facebook/rocksdb/pull/3747
      
      Differential Revision: D7708391
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: bfaddaa9b14a4cdaff6977f6f63c789a6ab1ee0d