1. 29 June 2018 (5 commits)
    • Z
      fix clang analyzer warnings (#4072) · b3efb1cb
      Committed by Zhongyi Xie
      Summary:
      clang analyze is giving the following warnings:
      > db/compaction_job.cc:1178:16: warning: Called C++ object pointer is null
          } else if (meta->smallest.size() > 0) {
                     ^~~~~~~~~~~~~~~~~~~~~
      db/compaction_job.cc:1201:33: warning: Access to field 'marked_for_compaction' results in a dereference of a null pointer (loaded from variable 'meta')
          meta->marked_for_compaction = sub_compact->builder->NeedCompact();
          ~~~~
      db/version_set.cc:2770:26: warning: Called C++ object pointer is null
              uint32_t cf_id = last_writer->cfd->GetID();
                               ^~~~~~~~~~~~~~~~~~~~~~~~~
      Closes https://github.com/facebook/rocksdb/pull/4072
      
      Differential Revision: D8685852
      
      Pulled By: miasantreble
      
      fbshipit-source-id: b0e2fd9dfc1cbba2317723e09886384b9b1c9085
      b3efb1cb
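      To illustrate the class of fix involved, here is a minimal sketch (the type and function names are simplified stand-ins, not the actual compaction_job.cc code): the analyzer stops warning once every dereference of the pointer is dominated by a null check.

      ```cpp
      #include <cassert>
      #include <string>

      // Simplified stand-ins for the real types; names are illustrative only.
      struct FileMetaData {
        std::string smallest;
        bool marked_for_compaction = false;
      };

      // The analyzer flags dereferences of `meta` on paths where it may be
      // null. Checking the pointer up front makes every dereference below
      // provably safe.
      bool FinishOutputFile(FileMetaData* meta, bool need_compact) {
        if (meta == nullptr) {
          return false;  // analyzer now sees all dereferences below as safe
        }
        if (!meta->smallest.empty()) {
          meta->marked_for_compaction = need_compact;
        }
        return true;
      }
      ```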
    • M
      WriteUnPrepared: Add new WAL marker kTypeBeginUnprepareXID (#4069) · 8ad63a4b
      Committed by Manuel Ung
      Summary:
      This adds a new WAL marker of type kTypeBeginUnprepareXID.
      
      Also, DBImpl now contains a field called batch_per_txn (meaning one WriteBatch per transaction, or possibly multiple WriteBatches). This would also indicate that this DB is using WriteUnprepared policy.
      
      Recovery code would be able to make use of this extra field on DBImpl in a separate diff. For now, it is just used to determine whether the WAL is compatible or not.
      Closes https://github.com/facebook/rocksdb/pull/4069
      
      Differential Revision: D8675099
      
      Pulled By: lth
      
      fbshipit-source-id: ca27cae1738e46d65f2bb92860fc759deb874749
      8ad63a4b
    • A
      Prefetch cache lines for filter lookup (#4068) · 25403c22
      Committed by Andrew Kryczka
      Summary:
      Since the filter data is unaligned, even though we ensure all probes are within a span of `cache_line_size` bytes, those bytes can span two cache lines. In that case I doubt hardware prefetching does a great job considering we don't necessarily access those two cache lines in order. This guess seems correct since adding explicit prefetch instructions reduced filter lookup overhead by 19.4%.
      Closes https://github.com/facebook/rocksdb/pull/4068
      
      Differential Revision: D8674189
      
      Pulled By: ajkr
      
      fbshipit-source-id: 747427d9a17900151c17820488e3f7efe06b1871
      25403c22
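      A sketch of the idea, assuming GCC/Clang's `__builtin_prefetch` builtin (this is not RocksDB's actual filter code; the function names are illustrative): when the unaligned probe span may straddle two 64-byte cache lines, both lines are prefetched explicitly.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <vector>

      constexpr std::uintptr_t kCacheLineSize = 64;

      // Because filter data is unaligned, a probe span of up to
      // kCacheLineSize bytes may straddle two cache lines. Prefetching both
      // lines hides the miss latency for the second one.
      inline void PrefetchProbeSpan(const char* data, size_t offset,
                                    size_t span) {
        const char* start = data + offset;
        const char* end = start + span - 1;
      #if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(start);
        // If the span crosses into the next cache line, prefetch it too.
        if ((reinterpret_cast<std::uintptr_t>(start) / kCacheLineSize) !=
            (reinterpret_cast<std::uintptr_t>(end) / kCacheLineSize)) {
          __builtin_prefetch(end);
        }
      #endif
      }

      // A toy probe: test one bit of the filter data.
      bool ProbeBit(const std::vector<char>& filter, size_t bit) {
        return (filter[bit / 8] >> (bit % 8)) & 1;
      }
      ```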
    • A
      Allow DB resume after background errors (#3997) · 52d4c9b7
      Committed by Anand Ananthabhotla
      Summary:
      Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
      1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
      2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
      3. Provide an API for the user to clear the error and resume the DB instance
      
      This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
      Closes https://github.com/facebook/rocksdb/pull/3997
      
      Differential Revision: D8653831
      
      Pulled By: anand1976
      
      fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
      52d4c9b7
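      The three parts can be sketched as a toy model (the names, the severity classification, and the API shape here are illustrative, not RocksDB's actual error-handling code): background errors are recorded and classified in one place, and only recoverable errors can be cleared by the user-facing resume call.

      ```cpp
      #include <cassert>

      enum class Code { kOk, kNoSpace, kCorruption };
      enum class Severity { kNone, kSoftError, kFatalError };

      struct BgStatus {
        Code code = Code::kOk;
        Severity severity = Severity::kNone;
      };

      class ToyDB {
       public:
        // Single place where background errors are recorded and classified.
        void SetBackgroundError(Code c) {
          bg_error_.code = c;
          bg_error_.severity = (c == Code::kNoSpace) ? Severity::kSoftError
                                                     : Severity::kFatalError;
        }
        bool WritesAllowed() const {
          return bg_error_.severity == Severity::kNone;
        }
        // User-facing resume: only soft (recoverable) errors can be cleared.
        bool Resume() {
          if (bg_error_.severity == Severity::kSoftError) {
            bg_error_ = BgStatus{};
            return true;
          }
          return bg_error_.severity == Severity::kNone;
        }

       private:
        BgStatus bg_error_;
      };
      ```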
    • Y
      Support group commits of version edits (#3944) · 26d67e35
      Committed by Yanqin Jin
      Summary:
      This PR supports the group commit of multiple version edit entries corresponding to different column families. Column family drop/creation still cannot be grouped. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
      Closes https://github.com/facebook/rocksdb/pull/3944
      
      Differential Revision: D8432536
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8f11bd05193b6c0d9272d82e44b676abfac113cb
      26d67e35
  2. 28 June 2018 (10 commits)
    • M
      Remove ReadOnly part of PinnableSliceAndMmapReads from Lite (#4070) · 0a5b5d88
      Committed by Maysam Yabandeh
      Summary:
      Lite does not support readonly DBs.
      Closes https://github.com/facebook/rocksdb/pull/4070
      
      Differential Revision: D8677858
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 536887d2363ee2f5d8e1ea9f1a511e643a1707fa
      0a5b5d88
    • T
      Suppress leak warning for clang(LLVM) asan (#4066) · b557499e
      Committed by Taewook Oh
      Summary:
      Instead of the __SANITIZE_ADDRESS__ macro, LLVM uses __has_feature(address_sanitizer) to check whether ASAN is enabled for the build. I tested it with a MySQL sanitizer build that uses RocksDB as a submodule.
      Closes https://github.com/facebook/rocksdb/pull/4066
      
      Reviewed By: riversand963
      
      Differential Revision: D8668941
      
      Pulled By: taewookoh
      
      fbshipit-source-id: af4d1da180c1470d257a228f431eebc61490bc36
      b557499e
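      The portable detection pattern described above can be sketched as follows (a minimal example, not the exact RocksDB macro block): GCC defines `__SANITIZE_ADDRESS__`, while Clang/LLVM exposes `__has_feature(address_sanitizer)`, so a fallback definition of `__has_feature` keeps the check valid on both compilers.

      ```cpp
      #include <cassert>

      // Fallback for compilers that lack __has_feature entirely.
      #ifndef __has_feature
      #define __has_feature(x) 0
      #endif

      // GCC path: __SANITIZE_ADDRESS__; Clang path: __has_feature(...).
      #if defined(__SANITIZE_ADDRESS__) || __has_feature(address_sanitizer)
      constexpr bool kAsanEnabled = true;
      #else
      constexpr bool kAsanEnabled = false;
      #endif
      ```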
    • Y
      Remove 'ALIGNAS' from StatisticsImpl. (#4061) · 7f850b88
      Committed by Yanqin Jin
      Summary:
      Remove over-alignment on `StatisticsImpl` whose benefit is vague and causes UBSAN check to fail due to `std::make_shared` not respecting the over-alignment requirement.
      
      Test plan
      ```
      $ make clean && COMPILE_WITH_UBSAN=1 OPT=-g make -j16 ubsan_check
      ```
      Closes https://github.com/facebook/rocksdb/pull/4061
      
      Differential Revision: D8656506
      
      Pulled By: riversand963
      
      fbshipit-source-id: db355ae9c7bdd2c9e9c5e63cabba13d8d82cc5f9
      7f850b88
    • Z
      PrefixMayMatch: remove unnecessary check for prefix_extractor_ (#4067) · 14f409c0
      Committed by Zhongyi Xie
      Summary:
      With https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is no longer really used in the block-based filter and full filter versions of `PrefixMayMatch`, because `prefix_extractor` is now passed as an argument. It is also now possible for `prefix_extractor_` to be initialized to nullptr when a non-standard prefix_extractor is used, and also under ROCKSDB_LITE. Removing these checks should not break any existing tests.
      Closes https://github.com/facebook/rocksdb/pull/4067
      
      Differential Revision: D8669002
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a
      14f409c0
    • Z
      Add bottommost_compression_opts for bottommost_compression (#3985) · 1f6efabe
      Committed by Zhichao Cao
      Summary:
      For `CompressionType` we have the options `compression` and `bottommost_compression`. Thus, to make the compression options consistent with the compression type when bottommost_compression is enabled, we add bottommost_compression_opts.
      Closes https://github.com/facebook/rocksdb/pull/3985
      
      Reviewed By: riversand963
      
      Differential Revision: D8385911
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc
      1f6efabe
    • M
      Pin mmap files in ReadOnlyDB (#4053) · 235ab9dd
      Committed by Maysam Yabandeh
      Summary:
      https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pinned mmap files that could be deleted by background compaction. This is, however, a non-issue for ReadOnlyDB when no compaction is running and max_open_files is -1. This patch re-enables the pinning feature for that case.
      Closes https://github.com/facebook/rocksdb/pull/4053
      
      Differential Revision: D8662546
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd
      235ab9dd
    • M
      Added PingCap's Rust RocksDB and ObjectiveRocks (#4065) · e8f9d7f0
      Committed by Maximilian Alexander
      Summary:
      1. I added PingCap's more up-to-date Rust binding of RocksDB
      2. I also added ObjectiveRocks, which is a very nice binding for _both_ Swift and Objective-C
      Closes https://github.com/facebook/rocksdb/pull/4065
      
      Differential Revision: D8670340
      
      Pulled By: siying
      
      fbshipit-source-id: 3db28bf3a464c3e050df52cc92b19248b7f43944
      e8f9d7f0
    • C
      Store timestamp in deadlock detection (#4060) · 818c84e1
      Committed by chouxi
      Summary:
      Add a timestamp to DeadlockInfo to record when the deadlock was detected on the RocksDB side.

      Test plan:
          `make check -j64`
      Closes https://github.com/facebook/rocksdb/pull/4060
      
      Differential Revision: D8655380
      
      Pulled By: chouxi
      
      fbshipit-source-id: f58e1aa5e09eb1d1eed0a181d4e2304aaf01efe8
      818c84e1
    • D
      Remove bogus gcc-8.1 warning (#3870) · e5ae1bb4
      Committed by Daniel Black
      Summary:
      Various rearrangements of the cch arithmetic failed, and replacing `= '\0'` with
      memset also failed to convince the compiler the buffer was nul-terminated. So we
      took the perverse option of changing strncpy to strcpy.

      Return null if memory couldn't be allocated.
      
      util/status.cc: In static member function ‘static const char* rocksdb::Status::CopyState(const char*)’:
      util/status.cc:28:15: error: ‘char* strncpy(char*, const char*, size_t)’ output truncated before terminating nul copying as many bytes from a string as its length [-Werror=stringop-truncation]
         std::strncpy(result, state, cch - 1);
         ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
      util/status.cc:19:18: note: length computed here
             std::strlen(state) + 1; // +1 for the null terminator
             ~~~~~~~~~~~^~~~~~~
      cc1plus: all warnings being treated as errors
      make: *** [Makefile:645: shared-objects/util/status.o] Error 1
      
      closes #2705
      Closes https://github.com/facebook/rocksdb/pull/3870
      
      Differential Revision: D8594114
      
      Pulled By: anand1976
      
      fbshipit-source-id: ab20f3a456a711e4d29144ebe630e4fe3c99ec25
      e5ae1bb4
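      The shape of the resulting code might look like this (a simplified sketch based on the description above, not the exact util/status.cc source): compute the length once, allocate length + 1, copy with strcpy so the compiler can see the result is nul-terminated, and return nullptr on allocation failure.

      ```cpp
      #include <cassert>
      #include <cstring>
      #include <new>

      // cch = count of chars, including the nul terminator.
      const char* CopyState(const char* state) {
        const size_t cch = std::strlen(state) + 1;  // +1 for the terminator
        char* result = new (std::nothrow) char[cch];
        if (result == nullptr) {
          return nullptr;  // allocation failed
        }
        std::strcpy(result, state);  // source is known nul-terminated
        return result;
      }
      ```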
    • M
      WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955) · a16e00b7
      Committed by Manuel Ung
      Summary:
      This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber`, which returns the largest visible sequence number for the current transaction if there is uncommitted data written to the DB. Otherwise, it returns zero, indicating no uncommitted data.

      These are the places where reads had to be modified:
      - Get and Seek/Next were updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible.
      - Prev did not need updates since it did not use the seek-to-sequence-number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
      - The reseeking-to-skip-keys optimization was also disabled for write-unprepared, since it is possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
      Closes https://github.com/facebook/rocksdb/pull/3955
      
      Differential Revision: D8286688
      
      Pulled By: lth
      
      fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
      a16e00b7
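      The visibility rule for Get and Seek/Next reduces to a one-liner (illustrative only; the function name mirrors the summary above, not the real iterator code): a zero `max_unprepared_seq` means no uncommitted data, so the seek falls back to the snapshot sequence number.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstdint>

      // Seek target for a read: the larger of the snapshot sequence number
      // and the largest sequence number the transaction itself wrote
      // (0 when there is no uncommitted data).
      uint64_t SeekSequence(uint64_t snapshot_seq, uint64_t max_unprepared_seq) {
        return std::max(snapshot_seq, max_unprepared_seq);
      }
      ```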
  3. 27 June 2018 (5 commits)
    • N
      Add table property tracking number of range deletions (#4016) · 17339dc2
      Committed by Nikhil Benesch
      Summary:
      Add a new table property, rocksdb.num.range-deletions, which tracks the
      number of range deletions in a block-based table. Range deletions are no
      longer counted in rocksdb.num.entries; as discovered in PR #3778, there
      are various code paths that implicitly assume that rocksdb.num.entries
      counts only true keys, not range deletions.
      
      /cc ajkr nvanbenschoten
      Closes https://github.com/facebook/rocksdb/pull/4016
      
      Differential Revision: D8527575
      
      Pulled By: ajkr
      
      fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
      17339dc2
    • Z
      use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899) · 408205a3
      Committed by Zhongyi Xie
      Summary:
      Previously, in https://github.com/facebook/rocksdb/pull/3601, the bloom filter would only be checked if the `prefix_extractor` in the mutable_cf_options matched the one found in the SST file.
      This PR relaxes the requirement by checking whether all keys in the range [user_key, iterate_upper_bound) share the same prefix after being transformed by the prefix extractor recorded in the SST file. If so, the bloom filter is considered compatible and will continue to be consulted.
      Closes https://github.com/facebook/rocksdb/pull/3899
      
      Differential Revision: D8157459
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41
      408205a3
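      A greatly simplified sketch of the idea (not RocksDB's actual logic; it assumes a fixed-length prefix extractor, and the exclusive upper bound makes the real check subtler than this): one sufficient condition for the range to share a prefix is that both bounds map to the same prefix.

      ```cpp
      #include <cassert>
      #include <string>

      // Sufficient (simplified) condition for all keys in
      // [user_key, upper_bound) to share one fixed-length prefix:
      // both bounds have the same first prefix_len bytes.
      bool SamePrefixRange(const std::string& user_key,
                           const std::string& upper_bound, size_t prefix_len) {
        if (user_key.size() < prefix_len || upper_bound.size() < prefix_len) {
          return false;
        }
        return user_key.compare(0, prefix_len, upper_bound, 0, prefix_len) == 0;
      }
      ```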
    • B
      Create lgtm.yml for LGTM.com C/C++ analysis (#4058) · 967aa815
      Committed by Bas van Schaik
      Summary:
      As discussed with thatsafunnyname [here](https://discuss.lgtm.com/t/c-c-lang-missing-for-facebook-rocksdb/1079): this configuration enables C/C++ analysis for RocksDB on LGTM.com.
      
      The initial commit will contain a build command (simple `make`) that previously resulted in a build error. The build log will then be available on LGTM.com for you to investigate (if you like). I'll immediately add a second commit to this PR to correct the build command to `make static_lib`, which worked when I tested it earlier today.
      
      If you like you can also enable automatic code review in pull requests. This will alert you to any new code issues before they actually get merged into `master`. Here's an example of how that works for the AMPHTML project: https://github.com/ampproject/amphtml/pull/13060. You can enable it yourself here: https://lgtm.com/projects/g/facebook/rocksdb/ci/.
      
      I'll also add a badge to your README.md in a separate commit — feel free to remove that from this PR if you don't like it.
      
      (Full disclosure: I'm part of the LGTM.com team 🙂. Ping samlanning)
      Closes https://github.com/facebook/rocksdb/pull/4058
      
      Differential Revision: D8648410
      
      Pulled By: ajkr
      
      fbshipit-source-id: 98d55fc19cff1b07268ac8425b63e764806065aa
      967aa815
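      For reference, a minimal `lgtm.yml` of the kind described might look like the following. This is a sketch following the LGTM.com configuration schema; the file actually merged in the PR may differ.

      ```yaml
      extraction:
        cpp:
          index:
            build_command:
              - make static_lib
      ```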
    • P
      Remove unused imports, from python scripts. (#4057) · 2694b6dc
      Committed by Peter (Stig) Edwards
      Summary:
      Also remove redefined variable.
      As reported on https://lgtm.com/projects/g/facebook/rocksdb/
      Closes https://github.com/facebook/rocksdb/pull/4057
      
      Differential Revision: D8648342
      
      Pulled By: ajkr
      
      fbshipit-source-id: afd2ba84d1364d316010179edd44777e64ca9183
      2694b6dc
    • A
      Fix universal compaction scheduling conflict with CompactFiles (#4055) · a8e503e5
      Committed by Andrew Kryczka
      Summary:
      Universal size-amp-triggered compaction was pulling the final sorted run into the compaction without checking whether any of its files are already being compacted. When all compactions are automatic, it is safe since it verifies the second-last sorted run is not already being compacted, which implies the last sorted run is also not being compacted (in automatic compaction multiple sorted runs are always compacted together). But with manual compaction, files in the last sorted run can be compacted independently, so the last sorted run also must be checked.
      
      We were seeing the below assertion failure in `db_stress`. Also the test case included in this PR repros the failure.
      
      ```
      db_universal_compaction_test: db/compaction.cc:312: void rocksdb::Compaction::MarkFilesBeingCompacted(bool): Assertion `mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted' failed.
      Aborted (core dumped)
      ```
      Closes https://github.com/facebook/rocksdb/pull/4055
      
      Differential Revision: D8630094
      
      Pulled By: ajkr
      
      fbshipit-source-id: ac3b30a874678b76e113d4f6c42c1260411b08f8
      a8e503e5
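      The added guard amounts to a simple scan (a simplified illustration; the real check lives in RocksDB's universal compaction picker and its types): only pull the last sorted run into the size-amp compaction if none of its files are already part of another, possibly manual, compaction.

      ```cpp
      #include <cassert>
      #include <vector>

      struct ToyFile {
        bool being_compacted = false;
      };

      // The last sorted run may only be included if no file in it is
      // already being compacted (e.g. by a manual CompactFiles call).
      bool CanIncludeLastSortedRun(const std::vector<ToyFile>& last_run) {
        for (const ToyFile& f : last_run) {
          if (f.being_compacted) {
            return false;
          }
        }
        return true;
      }
      ```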
  4. 26 June 2018 (3 commits)
  5. 24 June 2018 (2 commits)
  6. 23 June 2018 (3 commits)
  7. 22 June 2018 (6 commits)
    • Z
      option for timing measurement of non-blocking ops during compaction (#4029) · 795e663d
      Committed by Zhongyi Xie
      Summary:
      For example, calls to CompactionFilter are always timed, with no way for the user to disable the measurement.
      This PR disables the timer if `Statistics::stats_level_` (which is part of DBOptions) is `kExceptDetailedTimers`.
      Closes https://github.com/facebook/rocksdb/pull/4029
      
      Differential Revision: D8583670
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 913be9fe433ae0c06e88193b59d41920a532307f
      795e663d
    • A
      Cleanup staging directory at start of checkpoint (#4035) · 0a5b16c7
      Committed by Andrew Kryczka
      Summary:
      - Attempt to clean the checkpoint staging directory before starting a checkpoint. It was already cleaned up at the end of checkpoint. But it wasn't cleaned up in the edge case where the process crashed while staging checkpoint files.
      - Attempt to clean the checkpoint directory before calling `Checkpoint::Create` in `db_stress`. This handles the case where checkpoint directory was created by a previous `db_stress` run but the process crashed before cleaning it up.
      - Use `DestroyDB` for cleaning checkpoint directory since a checkpoint is a DB.
      Closes https://github.com/facebook/rocksdb/pull/4035
      
      Reviewed By: yiwu-arbug
      
      Differential Revision: D8580223
      
      Pulled By: ajkr
      
      fbshipit-source-id: 28c667400e249fad0fdedc664b349031b7b61599
      0a5b16c7
    • S
      Assert for Direct IO at the beginning in PositionedRead (#3891) · 645e57c2
      Committed by Sagar Vemuri
      Summary:
      Moved the direct-IO assertion to the top in `PosixSequentialFile::PositionedRead`, as it doesn't make sense to check for sector alignments before checking for direct IO.
      Closes https://github.com/facebook/rocksdb/pull/3891
      
      Differential Revision: D8267972
      
      Pulled By: sagar0
      
      fbshipit-source-id: 0ecf77c0fb5c35747a4ddbc15e278918c0849af7
      645e57c2
    • Y
      Update TARGETS file (#4028) · 58c22144
      Committed by Yi Wu
      Summary:
      -Wshorten-64-to-32 is an invalid flag in fbcode. Changing it to -Wnarrowing.
      Closes https://github.com/facebook/rocksdb/pull/4028
      
      Differential Revision: D8553694
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 1523cbcb4c76cf1d2b10a4d28b5f58c78e6cb876
      58c22144
    • Y
      Fix a warning (treated as error) caused by type mismatch. · 39749596
      Committed by Yanqin Jin
      Summary: Closes https://github.com/facebook/rocksdb/pull/4032
      
      Differential Revision: D8573061
      
      Pulled By: riversand963
      
      fbshipit-source-id: 112324dcb35956d6b3ec891073f4f21493933c8b
      39749596
    • S
      Improve direct IO range scan performance with readahead (#3884) · 7103559f
      Committed by Sagar Vemuri
      Summary:
      This PR extends the improvements in #3282 to also work when using Direct IO.
      We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash.
      
      **Description:**
      This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan.
      
      **Implementation Details:**
      - Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
      - `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
      - `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
      - Made sure not to re-read from the device partial chunks of data that were already available in the buffer.
      - Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.
      
      **Constraints:**
      - Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
      - Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously.
      - Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.
      
      **Benchmarks:**
      I used the same benchmark as used in #3282.
      Data fill:
      ```
      TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
      ```
      
      Do a long range scan: Seekrandom with large number of nexts
      ```
      TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
      ```
      
      ```
      Before:
      seekrandom   :   37939.906 micros/op 26 ops/sec;   29.2 MB/s (1636 of 1999 found)
      With this change:
      seekrandom   :   8527.720 micros/op 117 ops/sec;  129.7 MB/s (6530 of 7999 found)
      ```
      ~4.5X perf improvement. Taken on an average of 3 runs.
      Closes https://github.com/facebook/rocksdb/pull/3884
      
      Differential Revision: D8082143
      
      Pulled By: sagar0
      
      fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
      7103559f
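      The readahead growth policy described above (start at 8 KB, double on each additional sequential IO, cap at 256 KB) can be sketched as follows; the constants mirror the summary, while the function name is illustrative rather than RocksDB's actual code.

      ```cpp
      #include <cassert>
      #include <cstddef>

      constexpr size_t kInitReadahead = 8 * 1024;    // initial readahead
      constexpr size_t kMaxReadahead = 256 * 1024;   // growth cap

      // Double the readahead size on each additional sequential IO,
      // saturating at kMaxReadahead.
      size_t NextReadaheadSize(size_t current) {
        size_t next = current * 2;
        return next > kMaxReadahead ? kMaxReadahead : next;
      }
      ```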
  8. 21 June 2018 (2 commits)
    • Y
      Add file name info to SequentialFileReader. (#4026) · 524c6e6b
      Committed by Yanqin Jin
      Summary:
      We potentially need this information for tracing, profiling and diagnosis.
      Closes https://github.com/facebook/rocksdb/pull/4026
      
      Differential Revision: D8555214
      
      Pulled By: riversand963
      
      fbshipit-source-id: 4263e06c00b6d5410b46aa46eb4e358ff2161dd2
      524c6e6b
    • A
      Support file ingestion in stress test (#4018) · 14cee194
      Committed by Andrew Kryczka
      Summary:
      Once per `ingest_external_file_one_in` operations, uses SstFileWriter to create a file containing `ingest_external_file_width` consecutive keys. The file name includes the thread ID to avoid clashes. The file is then added to the DB using `IngestExternalFile`.

      We can't enable it by default in the crash test because `nooverwritepercent` and `test_batches_snapshot` must both be zero for the DB's whole lifetime. Perhaps we should set up a separate test with that config, as range deletion also requires it.
      Closes https://github.com/facebook/rocksdb/pull/4018
      
      Differential Revision: D8507698
      
      Pulled By: ajkr
      
      fbshipit-source-id: 1437ea26fd989349a9ce8b94117241c65e40f10f
      14cee194
  9. 20 June 2018 (4 commits)