1. 27 Apr 2018 (1 commit)
  2. 26 Apr 2018 (5 commits)
  3. 25 Apr 2018 (2 commits)
    • A
      Add crash-recovery correctness check to db_stress · a4fb1f8c
      Authored by Andrew Kryczka
      Summary:
      Previously, our `db_stress` tool held the expected state of the DB in memory, so after crash-recovery there was no way to verify data correctness. This PR adds an option, `--expected_values_file`, which specifies a file holding the expected values.
      
      In black-box testing, the `db_stress` process can be killed arbitrarily, so updates to the `--expected_values_file` must be atomic. We achieve this by `mmap`ing the file and relying on `std::atomic<uint32_t>` for atomicity. Strictly speaking, this does not fully guarantee what we want, as `std::atomic<uint32_t>` could in theory be implemented as multiple stores guarded by a mutex. We can verify the assumption by checking `std::atomic::is_always_lock_free`.
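      
      A minimal sketch of this pattern, assuming POSIX (the path, slot count, and direct cast of the mapping are illustrative simplifications, not `db_stress`'s exact code):
      
      ```cpp
      #include <atomic>
      #include <cstddef>
      #include <cstdint>
      #include <fcntl.h>
      #include <sys/mman.h>
      #include <unistd.h>
      
      int main() {
        // If this fails, a store could be compiled into a mutex-guarded
        // multi-word write, which would not be crash-atomic.
        static_assert(std::atomic<uint32_t>::is_always_lock_free,  // C++17
                      "need lock-free 32-bit atomics for crash atomicity");
      
        const size_t kNumSlots = 1024;
        const size_t kLen = kNumSlots * sizeof(std::atomic<uint32_t>);
        int fd = open("/tmp/expected_values", O_CREAT | O_RDWR, 0644);
        if (fd < 0 || ftruncate(fd, kLen) != 0) return 1;
      
        void* base = mmap(nullptr, kLen, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) return 1;
      
        // Each store hits the shared mapping atomically: a killed process
        // leaves either the old or the new value in the file, never a torn write.
        auto* expected = static_cast<std::atomic<uint32_t>*>(base);
        expected[42].store(7, std::memory_order_relaxed);
      
        munmap(base, kLen);
        close(fd);
        return 0;
      }
      ```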
      
      For the `mmap`'d file, we didn't have an existing way to expose its contents as a raw memory buffer. This PR adds it in the `Env::NewMemoryMappedFileBuffer` function, and `MemoryMappedFileBuffer` class.
      
      `db_crashtest.py` is updated to use an expected values file for black-box testing. On the first iteration (when the DB is created), an empty file is provided as `db_stress` will populate it when it runs. On subsequent iterations, that same filename is provided so `db_stress` can check the data is as expected on startup.
      Closes https://github.com/facebook/rocksdb/pull/3629
      
      Differential Revision: D7463144
      
      Pulled By: ajkr
      
      fbshipit-source-id: c8f3e82c93e045a90055e2468316be155633bd8b
      a4fb1f8c
    • M
      Skip duplicate bloom keys when whole_key and prefix are mixed · bc0da4b5
      Authored by Maysam Yabandeh
      Summary:
      Currently we rely on FilterBitsBuilder to skip duplicate keys. It does that by comparing the hash of the new key to the hash of the last added entry. This logic breaks, however, when whole_key_filtering is mixed with prefix blooms, as their additions to FilterBitsBuilder are interleaved. The patch fixes that by comparing the last whole key and the last prefix with the new key's whole key and prefix respectively, and skipping the call to FilterBitsBuilder for whichever one is a duplicate.
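      
      A hypothetical sketch of the fixed duplicate check (class and member names are illustrative, not RocksDB's actual ones):
      
      ```cpp
      #include <string>
      
      class BloomBlockBuilderSketch {
       public:
        // Whole keys and prefixes are tracked separately, so their interleaved
        // additions no longer defeat the duplicate check.
        void Add(const std::string& whole_key, const std::string& prefix) {
          if (whole_key != last_whole_key_) {
            AddToFilter(whole_key);
            last_whole_key_ = whole_key;
          }
          if (prefix != last_prefix_) {
            AddToFilter(prefix);
            last_prefix_ = prefix;
          }
        }
      
       private:
        void AddToFilter(const std::string& /*entry*/) {
          // would forward to FilterBitsBuilder::AddKey
        }
        std::string last_whole_key_;
        std::string last_prefix_;
      };
      ```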
      Closes https://github.com/facebook/rocksdb/pull/3764
      
      Differential Revision: D7744413
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 15df73bbbafdfd754d4e1f42ea07f47b03bc5eb8
      bc0da4b5
  4. 24 Apr 2018 (3 commits)
    • G
      Support lowering CPU priority of background threads · 090c78a0
      Authored by Gabriel Wicke
      Summary:
      Background activities like compaction can negatively affect
      latency of higher-priority tasks like request processing. To avoid this,
      rocksdb already lowers the IO priority of background threads on Linux
      systems. While this takes care of typical IO-bound systems, it does not
      help much when CPU (temporarily) becomes the bottleneck. This is
      especially likely when using more expensive compression settings.
      
      This patch adds an API to allow for lowering the CPU priority of
      background threads, modeled on the IO priority API. Benchmarks (see
      below) show significant latency and throughput improvements when CPU
      bound. As a result, workloads with some CPU usage bursts should benefit
      from lower latencies at a given utilization, or should be able to push
      utilization higher at a given request latency target.
      
      A useful side effect is that compaction CPU usage is now easily visible
      in common tools, allowing for an easier estimation of the contribution
      of compaction vs. request processing threads.
      
      As with IO priority, the implementation is limited to Linux, degrading
      to a no-op on other systems.
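      
      The underlying Linux mechanism is a per-thread `setpriority` call, roughly as in this sketch (the mechanism only, not RocksDB's actual internal code):
      
      ```cpp
      #include <sys/resource.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      // Lower the calling thread's CPU priority; a no-op elsewhere, matching
      // the degradation described above.
      void LowerThreadCpuPriority() {
      #ifdef __linux__
        // PRIO_PROCESS with a kernel thread id affects only that thread on Linux.
        pid_t tid = static_cast<pid_t>(syscall(SYS_gettid));
        setpriority(PRIO_PROCESS, tid, 19);  // nice 19 = lowest SCHED_OTHER priority
      #endif
      }
      ```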
      Closes https://github.com/facebook/rocksdb/pull/3763
      
      Differential Revision: D7740096
      
      Pulled By: gwicke
      
      fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c
      090c78a0
    • M
      Improve write time breakdown stats · affe01b0
      Authored by Mike Kolupaev
      Summary:
      There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes.
      
      These stats got a bit out of date: they now categorize some interesting things as "everything else" and also do some double counting. This PR fixes that and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for flushes/compactions to be scheduled. These will probably be enough to explain all of the occasional abnormally slow (multi-second) writes we're seeing.
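      
      For reference, a sketch of reading the write-path breakdown from `PerfContext` after a write; field names are assumed for a RocksDB build of roughly this vintage, and the two new stats are omitted since their identifiers are not given above:
      
      ```cpp
      #include <iostream>
      #include "rocksdb/db.h"
      #include "rocksdb/perf_context.h"
      #include "rocksdb/perf_level.h"
      
      void ProfileOneWrite(rocksdb::DB* db) {
        rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableTime);
        rocksdb::get_perf_context()->Reset();
      
        db->Put(rocksdb::WriteOptions(), "key", "value");
      
        auto* ctx = rocksdb::get_perf_context();
        std::cout << "WAL write:       " << ctx->write_wal_time << " ns\n"
                  << "memtable insert: " << ctx->write_memtable_time << " ns\n"
                  << "throttling:      " << ctx->write_delay_time << " ns\n";
        rocksdb::SetPerfLevel(rocksdb::PerfLevel::kDisable);
      }
      ```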
      Closes https://github.com/facebook/rocksdb/pull/3602
      
      Differential Revision: D7251562
      
      Pulled By: al13n321
      
      fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d
      affe01b0
    • S
      Revert "Skip deleted WALs during recovery" · d5afa737
      Authored by Siying Dong
      Summary:
      This reverts commit 73f21a7b.
      
      It breaks compatibility. When a DB is created using a build with this new change, opening the DB with an older build and reading the data will fail with this error:
      
      "Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory"
      
      This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but no such file exists.
      Closes https://github.com/facebook/rocksdb/pull/3762
      
      Differential Revision: D7730035
      
      Pulled By: siying
      
      fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77
      d5afa737
  5. 21 Apr 2018 (7 commits)
    • A
      Avoid directory renames in BackupEngine · a8a28da2
      Authored by Andrew Kryczka
      Summary:
      We used to name private directories like "1.tmp" while BackupEngine populated them, and then rename without the ".tmp" suffix (i.e., rename "1.tmp" to "1") after all files were copied. On glusterfs, directory renames like this require operations across many hosts, and partial failures have caused operational problems.
      
      Fortunately we don't need to rename private directories. We already have a meta-file that uses the tempfile-rename pattern to commit a backup atomically after all its files have been successfully copied, so we can copy private files directly to their final location, eliminating the directory rename entirely.
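      
      For context, a minimal sketch of the tempfile-rename commit pattern the meta-file relies on (paths are illustrative; a durable implementation also fsyncs the temp file and its directory around the rename):
      
      ```cpp
      #include <cstdio>
      #include <fstream>
      #include <string>
      
      bool CommitMetaFile(const std::string& final_path, const std::string& contents) {
        const std::string tmp_path = final_path + ".tmp";
        {
          std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
          out << contents;
          out.flush();
          if (!out) return false;  // partial write: final_path is untouched
        }
        // rename(2) is atomic within a filesystem: readers see either no
        // meta-file or a complete one, which is what makes the commit atomic.
        return std::rename(tmp_path.c_str(), final_path.c_str()) == 0;
      }
      ```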
      Closes https://github.com/facebook/rocksdb/pull/3749
      
      Differential Revision: D7705610
      
      Pulled By: ajkr
      
      fbshipit-source-id: fd724a28dd2bf993ce323a5f2cb7e7d6980cc346
      a8a28da2
    • Y
      Disable EnvPosixTest::FilePermission · 2e72a589
      Authored by Yi Wu
      Summary:
      The test is flaky in our CI but could not be reproduced manually on the same CI host. Disabling it.
      Closes https://github.com/facebook/rocksdb/pull/3753
      
      Differential Revision: D7716320
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 6bed3b05880c1d24e8dc86bc970e5181bc98fb45
      2e72a589
    • M
      WritePrepared Txn: rollback via commit · bb2a2ec7
      Authored by Maysam Yabandeh
      Summary:
      Currently WritePrepared rolls back a transaction with prepare sequence number prepare_seq by i) writing a single rollback batch with rollback_seq, ii) adding <rollback_seq, rollback_seq> to the commit cache, iii) removing prepare_seq from the PrepareHeap.
      This is correct assuming no snapshot is taken while a transaction is being rolled back, which is how MySQL does rollback (only after recovery). Otherwise, if max_evicted_seq advances past prepare_seq, a live snapshot might assume the data is committed, since it does not find it in the CommitCache.
      The change is to simply add <prepare_seq, rollback_seq> to the commit cache before removing prepare_seq from the PrepareHeap. In this way, if max_evicted_seq advances past prepare_seq, the existing mechanism that checks evicted entries against live snapshots will make sure that a live snapshot does not see the data of the rolled-back transaction.
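      
      A toy sketch of the corrected ordering (std::map/std::set stand-ins; the real commit cache is a fixed-size lock-free array, and the helper name is hypothetical):
      
      ```cpp
      #include <cstdint>
      #include <map>
      #include <set>
      
      std::map<uint64_t, uint64_t> commit_cache;  // seq -> commit/rollback seq
      std::set<uint64_t> prepare_heap;            // outstanding prepared seqs
      
      void RollbackSketch(uint64_t prepare_seq, uint64_t rollback_seq) {
        commit_cache.emplace(rollback_seq, rollback_seq);
        // The fix: publish <prepare_seq, rollback_seq> BEFORE erasing
        // prepare_seq, so a live snapshot that no longer finds prepare_seq in
        // the heap still finds it in the commit cache and judges visibility
        // correctly even after max_evicted_seq advances past it.
        commit_cache.emplace(prepare_seq, rollback_seq);
        prepare_heap.erase(prepare_seq);
      }
      ```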
      Closes https://github.com/facebook/rocksdb/pull/3745
      
      Differential Revision: D7696193
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: c9a2d46341ddc03554dded1303520a1cab74ef9c
      bb2a2ec7
    • A
      Add a stat for MultiGet keys found, update memtable hit/miss stats · dbdaa466
      Authored by Anand Ananthabhotla
      Summary:
      1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the
      number of keys successfully read
      2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(); they were previously updated in
      DBImpl::GetImpl() but not in MultiGet (see the usage sketch below)
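      
      A usage sketch, assuming statistics were enabled at open time via `options.statistics = rocksdb::CreateDBStatistics()`; the enum identifier for the new ticker is inferred from its string name above:
      
      ```cpp
      #include <iostream>
      #include <string>
      #include <vector>
      #include "rocksdb/db.h"
      #include "rocksdb/statistics.h"
      
      void DumpMultiGetStats(rocksdb::DB* db, rocksdb::Statistics* stats) {
        std::vector<rocksdb::Slice> keys{"k1", "k2"};
        std::vector<std::string> values;
        db->MultiGet(rocksdb::ReadOptions(), keys, &values);
      
        std::cout << "multiget keys found: "
                  << stats->getTickerCount(rocksdb::NUMBER_MULTIGET_KEYS_FOUND) << "\n"
                  << "memtable hits:       "
                  << stats->getTickerCount(rocksdb::MEMTABLE_HIT) << "\n"
                  << "memtable misses:     "
                  << stats->getTickerCount(rocksdb::MEMTABLE_MISS) << "\n";
      }
      ```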
      Closes https://github.com/facebook/rocksdb/pull/3730
      
      Differential Revision: D7677364
      
      Pulled By: anand1976
      
      fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5
      dbdaa466
    • M
      WritePrepared Txn: enable TryAgain for duplicates at the end of the batch · c3d1e36c
      Authored by Maysam Yabandeh
      Summary:
      WriteBatch::Iterate will retry with a larger sequence number if the memtable reports a duplicate; this is signaled with a TryAgain status. So far the assumption was that the last entry in the batch will never return TryAgain, which holds when the WAL is created via WritePrepared, since it always appends a batch separator if a natural one does not exist. However, when reading a WAL generated by WriteCommitted, this batch separator might not exist. Although WritePrepared is not supposed to read a WAL generated by WriteCommitted, we should avoid confusing scenarios in which the behavior becomes unpredictable. The patch fixes that by allowing TryAgain even for the last entry of the write batch.
      Closes https://github.com/facebook/rocksdb/pull/3747
      
      Differential Revision: D7708391
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: bfaddaa9b14a4cdaff6977f6f63c789a6ab1ee0d
      c3d1e36c
    • M
      Propagate fill_cache config to partitioned index iterator · 17e04039
      Authored by Maysam Yabandeh
      Summary:
      Currently the partitioned index iterator creates a new ReadOptions, which ignores the fill_cache config set in the ReadOptions passed by the user. The patch propagates fill_cache from the user's ReadOptions to that of the partitioned index iterator.
      It also clarifies the contract of fill_cache: i) it does not apply to filters; ii) the block cache is still charged for the size of the data block, and the block is still pinned if it is already in the block cache.
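      
      The user-facing setting whose propagation this fixes, in a short sketch:
      
      ```cpp
      #include <memory>
      #include "rocksdb/db.h"
      
      // With fill_cache = false, blocks read by the scan (after this patch,
      // including partitioned index blocks) are not inserted into the block cache.
      void ScanWithoutPollutingCache(rocksdb::DB* db) {
        rocksdb::ReadOptions read_options;
        read_options.fill_cache = false;  // e.g., for a one-off analytical scan
      
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // process it->key() / it->value()
        }
      }
      ```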
      Closes https://github.com/facebook/rocksdb/pull/3739
      
      Differential Revision: D7678308
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 53ed96424ae922e499e2d4e3580ddc3f0db893da
      17e04039
    • P
      Fix GitHub issue #3716: gcc-8 warnings · dee95a1a
      Authored by przemyslaw.skibinski@percona.com
      Summary:
      Fix the following gcc-8 warnings:
      - conflicting C language linkage declaration [-Werror]
      - writing to an object with no trivial copy-assignment [-Werror=class-memaccess]
      - array subscript -1 is below array bounds [-Werror=array-bounds]
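      
      As an illustration of the class-memaccess case (a generic example, not the actual RocksDB code): gcc-8 warns when `memset`/`memcpy` touch an object whose copy-assignment is non-trivial, and the fix is to value-initialize instead.
      
      ```cpp
      #include <string>
      
      struct Entry {
        std::string key;  // non-trivial member: memset on Entry is undefined behavior
        int refs = 0;
      };
      
      void ResetEntry(Entry* e) {
        // Before (gcc-8 -Werror=class-memaccess): memset(e, 0, sizeof(*e));
        *e = Entry{};  // runs proper constructors and assignment instead
      }
      ```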
      
      Solves https://github.com/facebook/rocksdb/issues/3716
      Closes https://github.com/facebook/rocksdb/pull/3736
      
      Differential Revision: D7684161
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 47c0423d26b74add251f1d3595211eee1e41e54a
      dee95a1a
  6. 20 Apr 2018 (3 commits)
  7. 19 Apr 2018 (3 commits)
    • Y
      Add block cache related DB properties · ad511684
      Authored by Yi Wu
      Summary:
      Add DB properties "rocksdb.block-cache-capacity", "rocksdb.block-cache-usage", "rocksdb.block-cache-pinned-usage" to show block cache usage.
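      
      Querying them is a standard `GetProperty` call, sketched here with the property names given above:
      
      ```cpp
      #include <iostream>
      #include <string>
      #include "rocksdb/db.h"
      
      void PrintBlockCacheUsage(rocksdb::DB* db) {
        std::string capacity, usage, pinned;
        db->GetProperty("rocksdb.block-cache-capacity", &capacity);
        db->GetProperty("rocksdb.block-cache-usage", &usage);
        db->GetProperty("rocksdb.block-cache-pinned-usage", &pinned);
        std::cout << "capacity=" << capacity << " usage=" << usage
                  << " pinned=" << pinned << "\n";
      }
      ```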
      Closes https://github.com/facebook/rocksdb/pull/3734
      
      Differential Revision: D7657180
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: dd34a019d5878dab539c51ee82669e97b2b745fd
      ad511684
    • A
      include thread-pool priority in thread names · 3cea6139
      Authored by Andrew Kryczka
      Summary:
      Previously threads were named "rocksdb:bg<index in thread pool>", so the first thread in all thread pools would be named "rocksdb:bg0". Users want to be able to distinguish threads used for flush (high-pri) vs regular compaction (low-pri) vs compaction to the bottom level (bottom-pri). So I changed the thread naming convention to include the thread-pool priority.
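      
      Under the hood this relies on the Linux thread-naming facility; a sketch of the mechanism follows (the exact post-change name format is an assumption here):
      
      ```cpp
      #include <algorithm>
      #include <pthread.h>  // pthread_setname_np is a GNU extension
      #include <string>
      
      void NameThreadByPoolPriority(const std::string& pool /* "high", "low", ... */,
                                    int index) {
        std::string name = "rocksdb:" + pool + std::to_string(index);
        // Linux truncates thread names to 15 characters plus the terminating NUL.
        name.resize(std::min<std::string::size_type>(name.size(), 15));
        pthread_setname_np(pthread_self(), name.c_str());
      }
      ```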
      Closes https://github.com/facebook/rocksdb/pull/3702
      
      Differential Revision: D7581415
      
      Pulled By: ajkr
      
      fbshipit-source-id: ce04482b6acd956a401ef22dc168b84f76f7d7c1
      3cea6139
    • M
      Improve db_stress with transactions · 6d06be22
      Authored by Maysam Yabandeh
      Summary:
      db_stress was already capable of running transactions by setting use_txn. Running it under stress revealed a couple of problems, fixed in this patch:
      - An uncommitted transaction must be either rolled back or committed after recovery.
      - The current implementation of WritePrepared transactions cannot handle a cf drop before a crash. Clarified that in the comments and added safety checks. When running with use_txn, clear_column_family_one_in must be set to 0.
      Closes https://github.com/facebook/rocksdb/pull/3733
      
      Differential Revision: D7654419
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a024bad80a9dc99677398c00d29ff17d4436b7f3
      6d06be22
  8. 18 Apr 2018 (1 commit)
  9. 17 Apr 2018 (3 commits)
  10. 16 Apr 2018 (4 commits)
  11. 14 Apr 2018 (5 commits)
    • A
      Implemented Knuth shuffle to construct permutation for selecting no_o… · 28087acd
      Authored by Amy Tai
      Summary:
      …verwrite_keys. Also changed each no_overwrite_key set to an unordered set, since otherwise the Knuth shuffle only yields a 2x speedup: insertion (and the subsequent internal sorting) into an ordered set is the bottleneck.
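      
      For reference, a minimal Knuth (Fisher-Yates) shuffle of key indices, the O(n) technique this commit adopts (function name is illustrative):
      
      ```cpp
      #include <cstdint>
      #include <numeric>
      #include <random>
      #include <utility>
      #include <vector>
      
      std::vector<uint64_t> RandomPermutation(uint64_t n, std::mt19937_64& rng) {
        std::vector<uint64_t> perm(n);
        std::iota(perm.begin(), perm.end(), 0);
        for (uint64_t i = n; i > 1; --i) {
          // Pick j uniformly from [0, i-1] and swap it into position i-1; every
          // permutation of [0, n) is produced with equal probability.
          std::uniform_int_distribution<uint64_t> dist(0, i - 1);
          std::swap(perm[i - 1], perm[dist(rng)]);
        }
        return perm;
      }
      ```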
      
      With this change, each iteration of permutation construction and prefix selection takes around 40 secs, as opposed to 360 secs previously. However, this still means that with the default of 10 CFs per blackbox test case, the test is going to time out given the default interval of 200 secs.
      
      Also, there is currently an assertion error affecting all blackbox tests in db_crashtest.py; this assertion error will be fixed in a future PR.
      Closes https://github.com/facebook/rocksdb/pull/3699
      
      Differential Revision: D7624616
      
      Pulled By: amytai
      
      fbshipit-source-id: ea64fbe83407ff96c1c0ecabbc6c830576939393
      28087acd
    • X
      Make database files' permissions configurable · a0102aa6
      Authored by Xiaofei Du
      Summary: Closes https://github.com/facebook/rocksdb/pull/3709
      
      Differential Revision: D7610227
      
      Pulled By: xiaofeidu008
      
      fbshipit-source-id: 88a52f0f9f96e2195fccde995cf9760b785e9f07
      a0102aa6
    • Z
      add kEntryRangeDeletion · 31ee4bf2
      Authored by zhangjinpeng1987
      Summary:
      When there are many range deletions in a range, we want to trigger manual compaction on that range to reclaim disk space as soon as possible and speed up reads.
      After this change, we can collect information about range deletions and store it in user properties, which can guide our manual compaction.
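      
      A hedged sketch of how such properties could be collected, assuming the new kEntryRangeDeletion is surfaced through TablePropertiesCollector::AddUserKey (the collector name and property key are illustrative):
      
      ```cpp
      #include <cstdint>
      #include <string>
      #include "rocksdb/table_properties.h"
      
      class RangeDelCountCollector : public rocksdb::TablePropertiesCollector {
       public:
        rocksdb::Status AddUserKey(const rocksdb::Slice& /*key*/,
                                   const rocksdb::Slice& /*value*/,
                                   rocksdb::EntryType type,
                                   rocksdb::SequenceNumber /*seq*/,
                                   uint64_t /*file_size*/) override {
          if (type == rocksdb::kEntryRangeDeletion) ++num_range_deletions_;
          return rocksdb::Status::OK();
        }
        rocksdb::Status Finish(rocksdb::UserCollectedProperties* props) override {
          props->emplace("user.num-range-deletions",
                         std::to_string(num_range_deletions_));
          return rocksdb::Status::OK();
        }
        rocksdb::UserCollectedProperties GetReadableProperties() const override {
          return {{"user.num-range-deletions", std::to_string(num_range_deletions_)}};
        }
        const char* Name() const override { return "RangeDelCountCollector"; }
      
       private:
        uint64_t num_range_deletions_ = 0;
      };
      ```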
      Closes https://github.com/facebook/rocksdb/pull/3695
      
      Differential Revision: D7570322
      
      Pulled By: ajkr
      
      fbshipit-source-id: c358fa43b0aac6cc954d2eadc7d3bd8015373369
      31ee4bf2
    • S
      Merge raw and shared pointer log method impls · 1f5457ef
      Authored by Steven Fackler
      Summary:
      Calling rocksdb::Log, rocksdb::Info, etc with a `shared_ptr<Logger>` should behave the same as calling those functions with a `Logger *`. This PR achieves it by making the `shared_ptr<Logger>` versions delegate to the `Logger *` versions.
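      
      The delegation pattern in a standalone sketch (stub Logger and simplified signatures; the real functions also take an InfoLogLevel and forward through a va_list helper in the same way):
      
      ```cpp
      #include <cstdarg>
      #include <cstdio>
      #include <memory>
      
      struct Logger {
        virtual void Logv(const char* format, va_list ap) { vprintf(format, ap); }
        virtual ~Logger() = default;
      };
      
      // va_list core shared by both public overloads.
      static void InfoV(Logger* logger, const char* format, va_list ap) {
        if (logger != nullptr) logger->Logv(format, ap);
      }
      
      void Info(Logger* logger, const char* format, ...) {
        va_list ap;
        va_start(ap, format);
        InfoV(logger, format, ap);
        va_end(ap);
      }
      
      // The shared_ptr overload just unwraps and delegates, guaranteeing
      // behavior identical to the raw-pointer version.
      void Info(const std::shared_ptr<Logger>& logger, const char* format, ...) {
        va_list ap;
        va_start(ap, format);
        InfoV(logger.get(), format, ap);
        va_end(ap);
      }
      ```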
      
      Closes #3689
      Closes https://github.com/facebook/rocksdb/pull/3710
      
      Differential Revision: D7595557
      
      Pulled By: ajkr
      
      fbshipit-source-id: 64dd7f20fd42dc821bac7b8032705c35b483e00d
      1f5457ef
    • Y
      Improve accuracy of I/O stats collection of external SST ingestion. · c81b0abe
      Authored by Yanqin Jin
      Summary:
      RocksDB supports ingestion of external SSTs. If ingestion_options.move_files is true, RocksDB first tries to hard-link the external SSTs when performing ingestion. If an external SST file resides on a different FS, or the underlying FS does not support hard links, then RocksDB performs an actual file copy. However, no matter which choice is made, the current code increases bytes-written when updating compaction stats, which is inaccurate when RocksDB does NOT copy the file.
      
      Rename a sync point.
      Closes https://github.com/facebook/rocksdb/pull/3713
      
      Differential Revision: D7604151
      
      Pulled By: riversand963
      
      fbshipit-source-id: dd0c0d9b9a69c7d9ffceafc3d9c23371aa413586
      c81b0abe
  12. 13 Apr 2018 (2 commits)
  13. 12 Apr 2018 (1 commit)
    • M
      WritePrepared Txn: fix smallest_prep atomicity issue · 6f5e6445
      Authored by Maysam Yabandeh
      Summary:
      We introduced the smallest_prep optimization in commit b225de7e, which stores the smallest uncommitted sequence number along with the snapshot. This enables readers that read from the snapshot to skip further checks and safely assume the data is committed if its sequence number is less than the smallest uncommitted number at the time the snapshot was taken. The problem was that the smallest uncommitted number and the snapshot must be taken atomically, and the lack of atomicity had led to readers using a smallest uncommitted number obtained after the snapshot was taken, hence mistakenly skipping some data.
      This patch fixes the problem by i) separating the removal of prepare entries from the AddCommitted function, ii) removing the prepare entries AFTER the committed sequence number is published, and iii) reading the smallest uncommitted number (from the prepare list) BEFORE taking a snapshot. This guarantees that the smallest uncommitted number accompanying a snapshot is less than or equal to the value that would have been obtained atomically.
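      
      A toy illustration of ordering point iii (all names hypothetical; the real code reads from the prepare list under its own synchronization):
      
      ```cpp
      #include <atomic>
      #include <cstdint>
      
      std::atomic<uint64_t> smallest_uncommitted{0};  // maintained by writers
      std::atomic<uint64_t> latest_seq{0};            // last visible sequence number
      
      struct SnapshotSketch {
        uint64_t snapshot_seq;
        uint64_t min_uncommitted;  // reads below this skip the commit-cache check
      };
      
      SnapshotSketch TakeSnapshotSketch() {
        SnapshotSketch s;
        // Order matters: read smallest uncommitted BEFORE fixing the snapshot
        // sequence, so the stored bound can only be an underestimate, which is
        // the safe direction.
        s.min_uncommitted = smallest_uncommitted.load(std::memory_order_acquire);
        s.snapshot_seq = latest_seq.load(std::memory_order_acquire);
        return s;
      }
      ```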
      
      Tested by running MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest that was failing sporadically.
      Closes https://github.com/facebook/rocksdb/pull/3703
      
      Differential Revision: D7581934
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: dc9d6f4fb477eba75d4d5927326905b548a96a32
      6f5e6445