1. 05 Nov, 2021: 4 commits
    • Fix -Werror=type-limits seen in Travis (#9128) · 2a3511a0
      Authored by Peter Dillinger
      Summary:
      Work around annoying compiler warning-as-error from https://github.com/facebook/rocksdb/issues/9113
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9128
      
      Test Plan: CI
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D32181499
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d7e5f7857a29f7ba47c49c3aee7150b5763b65d9
    • Deallocate payload of BlockBasedTableBuilder::Rep::FilterBlockBuilder earlier for Full/PartitionedFilter (#9070) · 1ababeb7
      Authored by Hui Xiao
      
      Summary:
      Note: This PR is the 1st part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073).
      
      Context:
      Previously, the payload (i.e., filter data) within the `BlockBasedTableBuilder::Rep::FilterBlockBuilder` object was not deallocated until `BlockBasedTableBuilder` itself was deallocated, even though it is no longer useful once its related `filter_content` has been written.
      - Transferred the payload (i.e., the filter data) out of the `BlockBasedTableBuilder::Rep::FilterBlockBuilder` object
      - For PartitionedFilter (see the sketch after this list):
         - Unified the `filters` and `filter_gc` lists into one `std::deque<FilterEntry> filters` by adding a new field `last_filter_entry_key` and storing the `std::unique_ptr filter_data` together with the `Slice filter` in the same entry
         - Reset `last_filter_data` in the case where `filters` is empty, which should hold since by then we would have finished using all the `Slice filter`s
      - Deallocated the payload by letting it go out of scope as soon as we are done using the `filter_content` associated with it
      - This is an internal interface change at the level of `FilterBlockBuilder::Finish()`, which also touches the inherited interface in `BlockBasedFilterBlockBuilder`; there, the payload transfer is ignored.
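
      A minimal sketch of the unified entry layout described above, assuming a simplified `FilterEntry`; the surrounding builder class and the actual RocksDB definitions are omitted:

      ```
      #include <deque>
      #include <memory>
      #include <string>

      #include "rocksdb/slice.h"

      // Each entry owns its payload; dropping the entry frees the filter data.
      struct FilterEntry {
        std::string key;                            // last key of the partition
        std::unique_ptr<const char[]> filter_data;  // owns the payload
        rocksdb::Slice filter;                      // view into filter_data
      };

      // Unified list replacing the separate `filters` and `filter_gc` lists:
      // popping an entry after writing its contents deallocates the payload.
      std::deque<FilterEntry> filters;
      ```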
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9070
      
      Test Plan: - The main focus is to catch segmentation faults during `FilterBlockBuilder::Finish()` and `BlockBasedTableBuilder::Finish()` as well as interface mismatches. Relying on existing CI tests is enough, as an `assert(false)` was temporarily added to verify that the new ownership-transferring logic indeed runs
      
      Reviewed By: pdillinger
      
      Differential Revision: D31884933
      
      Pulled By: hx235
      
      fbshipit-source-id: f73ecfbea13788d4fc058013ace27230110b52f4
    • Sanitize negative request bytes in GenericRateLimiter::Request and clarify API (#9112) · a64c8ca7
      Authored by Hui Xiao
      Summary:
      Context:
      Surprisingly, there isn't any sanitization against a negative `int64_t bytes` in `GenericRateLimiter::Request(int64_t bytes, const Env::IOPriority pri, Statistics* stats)`. A negative `bytes` can be passed in and incorrectly increases `available_bytes_`, since subtracting a negative `bytes` adds to `available_bytes_`, such as [here](https://github.com/facebook/rocksdb/blob/main/util/rate_limiter.cc#L138) and [here](https://github.com/facebook/rocksdb/blob/main/util/rate_limiter.cc#L283).
      - Sanitized negative request bytes by rounding them up to 0 (see the sketch below)
      - Added notes to the public and internal API
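
      A minimal sketch of the sanitization, assuming a simplified `Request` signature (the real method also takes the priority and statistics arguments):

      ```
      #include <algorithm>
      #include <cstdint>

      void Request(int64_t bytes) {
        // Round negative requests up to 0 so that a later
        // `available_bytes_ -= bytes` can never increase available_bytes_.
        bytes = std::max(static_cast<int64_t>(0), bytes);
        // ... proceed with the normal rate-limiting path ...
      }
      ```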
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9112
      
      Test Plan: - Rely on existing tests
      
      Reviewed By: ajkr
      
      Differential Revision: D32085364
      
      Pulled By: hx235
      
      fbshipit-source-id: b1b6066b2dd5ffc7bcbfb07069ca65a33578251b
    • Some checksum code refactoring (#9113) · dfedc74d
      Authored by Peter Dillinger
      Summary:
      To prepare for adding checksum to footer and "context aware"
      checksums. This also brings closely related code much closer together.
      
      Recently added `BlockBasedTableBuilder::ComputeBlockTrailer` for testing
      is made obsolete in the refactoring, as testing the checksums can happen
      at a lower level of abstraction.
      
      Also now checking for unrecognized checksum type on reading footer,
      rather than later on use.
      
      Also removed an obsolete function declaration.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9113
      
      Test Plan:
      existing tests worked before refactoring to remove
      `ComputeBlockTrailer`. And then refactored+improved tests using it.
      
      Reviewed By: mrambacher
      
      Differential Revision: D32090149
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 2879da683c1498ea85a3b70dace9b6d9f6b47b6e
2. 04 Nov, 2021: 5 commits
    • Re-enable s390x+cmake* Travis jobs (#9110) · 312d9c47
      Authored by Adam Retter
      Summary:
      Revert "Temporarily disable s390x+cmake* Travis jobs (https://github.com/facebook/rocksdb/issues/9095)"
      This reverts commit f2d11b3f.
      
      I have now uploaded the CMake deb for s390x provided by jonathan-albrecht-ibm to our S3 bucket.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9110
      
      Reviewed By: ajkr
      
      Differential Revision: D32082903
      
      Pulled By: riversand963
      
      fbshipit-source-id: b7243d19fc133e665a8654e3b528c4f53d5b11d1
    • Fixed a bug in CompactionIterator when write-prepared transaction is used (#9060) · 9b53f14a
      Authored by Yanqin Jin
      Summary:
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060
      
      RocksDB bottommost level compaction may zero out an internal key's sequence if
      the key's sequence is in the earliest_snapshot.
      In write-prepared transaction, checking the visibility of a certain sequence in
      a specific released snapshot may return a "snapshot released" result.
      Therefore, it is possible that, after a certain sequence of events, a PUT
      has its sequence zeroed out, but a subsequent SingleDelete of the same key
      will still be output with its original sequence. This violates the
      ascending order of keys and leads to incorrect results.
      
      The solution is to use an extra variable `last_key_seq_zeroed_` to track the
      information about visibility in earliest snapshot. With this variable, we can
      know for sure that a SingleDelete is in the earliest snapshot even if the said
      snapshot is released during compaction before processing the SD.
      
      Reviewed By: ltamasi
      
      Differential Revision: D31813016
      
      fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1
    • Fixing child jobs of rocksdb for duplo deprecation (#9117) · 56810142
      Authored by Jack Feng
      Summary:
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9117
      
      This fixes the loosely formatted JSON child job spec, i.e., trailing commas
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D32119606
      
      fbshipit-source-id: 0ef571ccbef0e638df18f17288611d3b5774f129
    • Skip directory fsync for filesystem btrfs (#8903) · 29102641
      Authored by Jay Zhuang
      Summary:
      Directory fsync can be expensive on btrfs, and it may not be needed.
      Here are the 4 directory-fsync cases (a sketch of the default for case 2 follows the list):
      1. creating a new file: dir-fsync is not needed on btrfs, as long as the
         new file itself is synced.
      2. renaming a file: dir-fsync is not needed if the renamed file is
         synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
         to sync the file on btrfs. By default, it just calls dir-fsync.
      3. deleting files: dir-fsync is forced by setting
         `IOOptions.force_dir_fsync = true`.
      4. renaming multiple files (like backup and checkpoint): dir-fsync is
         forced, the same as above.
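
      A hedged sketch of the default behavior in case 2, assuming a simplified directory abstraction; the actual RocksDB interface may differ:

      ```
      #include <string>

      #include "rocksdb/io_status.h"

      class Directory {
       public:
        virtual ~Directory() = default;
        virtual rocksdb::IOStatus Fsync() = 0;  // plain dir-fsync
        // By default, syncing after a file rename is just a dir-fsync; a
        // btrfs-aware filesystem can override this to sync the renamed file
        // itself and skip the directory fsync.
        virtual rocksdb::IOStatus FsyncAfterFileRename(
            const std::string& /*filename*/) {
          return Fsync();
        }
      };
      ```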
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903
      
      Test Plan: run tests on btrfs and non btrfs
      
      Reviewed By: ajkr
      
      Differential Revision: D30885059
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
    • Refactor the detailed consistency checks and the SST saving logic in VersionBuilder (#9099) · 08172278
      Authored by Levi Tamasi
      Summary:
      The patch refactors the parts of `VersionBuilder` that deal with SST file
      comparisons. Specifically, it makes the following changes:
      * Turns `NewestFirstBySeqNo` and `BySmallestKey` from free-standing
      functions into function objects (see the sketch after this list).
      Note: `BySmallestKey` has a pointer to the `InternalKeyComparator`,
      while `NewestFirstBySeqNo` is completely stateless.
      * Eliminates the wrapper `FileComparator`, which was essentially an
      unnecessary DIY virtual function call mechanism.
      * Refactors `CheckConsistencyDetails` and `SaveSSTFilesTo` using helper
      function templates that take comparator/checker function objects. Using
      static polymorphism eliminates the need to make runtime decisions about
      which comparator to use.
      * Extends some error messages returned by the consistency checks and
      makes them more uniform.
      * Removes some incomplete/redundant consistency checks from `VersionBuilder`
      and `FilePicker`.
      * Improves const correctness in several places.
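
      A hedged sketch of the two comparators as function objects, with simplified types (the actual RocksDB definitions compare more fields):

      ```
      #include <cstdint>

      struct FileMetaData {
        uint64_t largest_seqno = 0;
        // ... smallest/largest keys, file number, etc. ...
      };

      // Completely stateless.
      struct NewestFirstBySeqNo {
        bool operator()(const FileMetaData* lhs, const FileMetaData* rhs) const {
          return lhs->largest_seqno > rhs->largest_seqno;
        }
      };

      class InternalKeyComparator;  // opaque in this sketch

      // Carries a pointer to the InternalKeyComparator.
      class BySmallestKey {
       public:
        explicit BySmallestKey(const InternalKeyComparator* cmp) : cmp_(cmp) {}
        bool operator()(const FileMetaData* lhs, const FileMetaData* rhs) const;

       private:
        const InternalKeyComparator* cmp_;
      };

      // Helpers take the comparator as a template parameter, so the choice is
      // made at compile time instead of through a virtual wrapper.
      template <typename Compare>
      void SaveSSTFilesTo(/* VersionStorageInfo* vstorage, */ const Compare& cmp);
      ```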
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9099
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D32027503
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 621326ae41f4f55f7ad6a91abbd6e666d5c7857c
3. 03 Nov, 2021: 5 commits
    • Don't call OnTableFileCreated with OK for empty+deleted file (#9118) · 2b60621f
      Authored by Peter Dillinger
      Summary:
      EventListener::OnTableFileCreated was previously called with OK
      status and file_size==0 in cases where no SST file contents were written
      (because there was no content to add) and the empty file was deleted before
      calling the listener. This could lead to a stress test assertion failure
      added in https://github.com/facebook/rocksdb/issues/9054.
      
      This changes the status to Aborted, to align with the API doc:
      "... if the file is successfully created. Now it will also be called on
      failure case. User can check info.status to see if it succeeded or not."
      For internal purposes, this case is considered "success" but for
      listener purposes, no SST file is (successfully) created.
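
      A minimal sketch of a listener that follows the clarified contract, checking `info.status` instead of assuming success (the listener class name is illustrative):

      ```
      #include "rocksdb/listener.h"

      class FileCreationLogger : public rocksdb::EventListener {
       public:
        void OnTableFileCreated(
            const rocksdb::TableFileCreationInfo& info) override {
          if (!info.status.ok()) {
            // With this change, an empty file deleted before the callback
            // arrives here as Aborted rather than OK with file_size == 0.
            return;
          }
          // ... record info.file_path, info.file_size, etc. ...
        }
      };
      ```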
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9118
      
      Test Plan: test case added + existing db_stress
      
      Reviewed By: ajkr, riversand963
      
      Differential Revision: D32120232
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a804e2e0a52598018d3b182da97804d402ffcdfa
    • Fix TSAN report on MemPurge test (#9115) · 21f8a57f
      Authored by Peter Dillinger
      Summary:
      TSAN reported a data race on count variables in the MemPurgeBasic
      test. This suggests the test could fail if mempurges were slow enough
      that they don't complete before the count variables are checked, but
      injecting a long sleep into MemPurge (outside the DB mutex) confirms that
      blocked writes ensure enough mempurges/flushes happen to make the test
      pass. All the possible values observed in testing should still make the
      test pass.
      
      So this change makes the variables atomic so that an up-to-date value is
      always read and the TSAN report is suppressed. I have also used
      `.exchange(0)` to make the checking less stateful by "popping off" all the
      accumulated counts.
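
      A minimal sketch of the counter pattern described above (names are illustrative, not the actual test members):

      ```
      #include <atomic>
      #include <cassert>

      std::atomic<int> mempurge_count{0};

      void OnMemPurgeCompleted() { mempurge_count.fetch_add(1); }

      void CheckCounts() {
        // Atomically read and reset, "popping off" the accumulated count so
        // each check starts fresh and TSAN sees no unsynchronized access.
        const int count = mempurge_count.exchange(0);
        assert(count > 0);
      }
      ```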
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9115
      
      Test Plan: updated test, watch for any flakiness
      
      Reviewed By: riversand963
      
      Differential Revision: D32114432
      
      Pulled By: pdillinger
      
      fbshipit-source-id: c985609d39896a0d8f69ebc87b221e688609bdd8
    • Clarify setting `CompressionOptions::max_dict_bytes > 0` will charge block cache (#9119) · 67a7b74b
      Authored by Andrew Kryczka
      Summary:
      Add it to the API comment.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9119
      
      Reviewed By: hx235
      
      Differential Revision: D32124238
      
      Pulled By: ajkr
      
      fbshipit-source-id: d1f82037417d883f2000f2d62995a7708dda77c6
    • Some API clarifications (#9080) · 82afa018
      Authored by Peter Dillinger
      Summary:
      * Clarify that RocksDB is not exception safe on many of our callback
      and extension interfaces
      * Clarify FSRandomAccessFile::MultiRead implementations must accept
      non-sorted inputs (see https://github.com/facebook/rocksdb/issues/8953)
      * Clarify ConcurrentTaskLimiter and SstFileManager are not (currently)
      extensible interfaces
      * Mark WriteBufferManager as `final`, so it is then clearly not a
      callback interface, even though it smells like one
      * Clarify TablePropertiesCollector Status returns are mostly ignored
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9080
      
      Test Plan: comments only (except WriteBufferManager final)
      
      Reviewed By: ajkr
      
      Differential Revision: D31968782
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 11b648ce3ce3c5e5bdc02d2eafc7ea4b864bd1d2
    • Make FileSystem a Customizable Class (#8649) · f72c834e
      Authored by mrambacher
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8649
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D32036059
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 4f1e7557ecac52eb849b83ae02b8d7d232112295
4. 02 Nov, 2021: 10 commits
    • Mention PR 9100 in HISTORY.md (#9111) · cfc57f55
      Authored by Levi Tamasi
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9111
      
      Reviewed By: riversand963
      
      Differential Revision: D32076310
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 81e30a02ded87c0f1a42985db42e80b62235ba11
    • Regression tests for tickets fixed by previous change. (#9019) · ec9082d6
      Authored by Alan Paxton
      Summary:
      closes https://github.com/facebook/rocksdb/issues/5891
      closes https://github.com/facebook/rocksdb/issues/2001
      
      Java's BytewiseComparator is now unsigned-compliant, consistent with the default C++ comparator, which has always been unsigned. Consequently, the 2 tickets reporting the previously broken state can be closed.

      This test confirms that the above issues were in fact resolved
      by a change made between 6.2.2 and 6.22.1,
      namely https://github.com/facebook/rocksdb/commit/7242dae7,
      which, as part of its effect, changed the Java bytewise comparators.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9019
      
      Reviewed By: pdillinger
      
      Differential Revision: D31610910
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 664230f1377a1aa270136edd63eea2c206b907e9
    • Add new API CacheReservationManager::GetDummyEntrySize() (#9072) · 560fe702
      Authored by Hui Xiao
      Summary:
      Note: it might conflict with another CRM-related PR, https://github.com/facebook/rocksdb/pull/9071, and so will be merged after that one.

      Context:
      As `CacheReservationManager` is being used by more memory users, it is convenient to retrieve the dummy entry size from `CacheReservationManager` instead of hard-coding `256 * 1024` when writing tests. It also allows more flexibility to change the dummy entry size in our implementation.

      A follow-up PR is needed to replace the hard-coded dummy entry size values in `db_test2.cc`, `db_write_buffer_manager_test.cc`, `write_buffer_manager_test.cc`, `table_test.cc` and the ones introduced in https://github.com/facebook/rocksdb/pull/9072#issue-1034326069.
      - Exposed the private static constexpr `kDummyEntrySize` through the public static `CacheReservationManager::GetDummyEntrySize()` (see the sketch below)
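
      A minimal sketch of the accessor, assuming the dummy entry size is the `256 * 1024` mentioned above:

      ```
      #include <cstddef>

      class CacheReservationManager {
       public:
        // Public accessor so tests no longer hard-code 256 * 1024.
        static constexpr std::size_t GetDummyEntrySize() {
          return kDummyEntrySize;
        }

       private:
        static constexpr std::size_t kDummyEntrySize = 256 * 1024;
      };
      ```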
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9072
      
      Test Plan:
      - Passing new tests
      - Passing existing tests
      
      Reviewed By: ajkr
      
      Differential Revision: D32043684
      
      Pulled By: hx235
      
      fbshipit-source-id: ddefc6921c052adab6a2cda2394eb26da3076a50
    • Try to start TTL earlier when kMinOverlappingRatio is used (#8749) · a2b9be42
      Authored by sdong
      Summary:
      Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions, which are often bursty. This commit tries to mitigate that by picking those files earlier in the normal compaction-picking process. This is only implemented for kMinOverlappingRatio with leveled compaction, as it is the default value and it is more complicated to change other styles.

      When a file has aged more than ttl/2, RocksDB starts to boost its compaction priority in the normal compaction-picking process (see the sketch below), in the hope that by the time TTL is reached, very few extra compactions are needed.

      In order for this to work, another change is made: during a compaction, if an output-level file is older than ttl/2, cut output files based on the original boundaries (if it is not in the last level). This is to make sure that after an old file is moved to the next level and new data is merged in from the upper level, the new data falling into this range isn't reset with an old timestamp. Without this change, in many cases most files in a level would keep their old timestamps even though they contain newer data, and we would be stuck in that state.
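
      A hedged sketch of the boosting idea, assuming a simple linear boost; the actual picking heuristic in RocksDB is more involved:

      ```
      #include <cstdint>

      double BoostedScore(uint64_t file_age_sec, uint64_t ttl_sec,
                          double base_score) {
        if (ttl_sec > 0 && file_age_sec > ttl_sec / 2) {
          // Boost priority as a file ages past ttl/2, so it tends to be
          // picked by normal compactions before a bursty TTL compaction.
          const double age_ratio =
              static_cast<double>(file_age_sec) / static_cast<double>(ttl_sec);
          return base_score * (1.0 + age_ratio);
        }
        return base_score;
      }
      ```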
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749
      
      Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D30735261
      
      fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e
    • Minor improvement to #8428 (Account for dictionary-building buffer in global memory limit) (#9032) · a5ec5e3e
      Authored by hx235
      Summary:
      Summary/Context:
      - Renamed `cache_rev_mng` to `compression_dict_buffer_cache_res_mgr`
         - This distinguishes it from other potential `cache_res_mgr`s in `BlockBasedTableBuilder` and uses correct shorthand for the words "reservation" and "manager"
      - Added `table_options.block_cache == nullptr` in addition to `table_options.no_block_cache == true` as conditions under which we don't create a `CacheReservationManager`
         - Theoretically `table_options.no_block_cache == true` is equivalent to `table_options.block_cache == nullptr` by API. But since a segmentation fault would result from passing `nullptr` into `CacheReservationManager`'s constructor, it does not hurt to directly verify `table_options.block_cache != nullptr` before passing it in
      - Renamed `is_cache_full` to `exceeds_global_block_cache_limit`
         - This hides the implementation detail of cache reservation and emphasizes the concept/design intent of capping memory within the global block cache limit
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9032
      
      Test Plan: - Passing existing tests
      
      Reviewed By: ajkr
      
      Differential Revision: D32005807
      
      Pulled By: hx235
      
      fbshipit-source-id: 619fd17bb924199de3db5924d8ab7dae53b1efa2
    • fix histogram NUM_FILES_IN_SINGLE_COMPACTION (#9026) · 230c98f3
      Authored by leipeng
      Summary:
      Previously, the histogram `NUM_FILES_IN_SINGLE_COMPACTION` only counted files in the first level of the compaction input; this fix counts files in all levels of the compaction input (see the sketch below).
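
      A minimal sketch of the fix, assuming simplified input structures (not the actual RocksDB types):

      ```
      #include <cstddef>
      #include <vector>

      struct LevelFiles {
        std::vector<int> files;  // stand-in for the FileMetaData* list
      };

      // Value recorded into NUM_FILES_IN_SINGLE_COMPACTION.
      std::size_t NumInputFiles(const std::vector<LevelFiles>& inputs) {
        std::size_t num = 0;
        for (const auto& level : inputs) {  // previously: inputs[0] only
          num += level.files.size();
        }
        return num;
      }
      ```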
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9026
      
      Reviewed By: ajkr
      
      Differential Revision: D31668241
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: c02f6c4a5df9fbf0b7510036594811152e8738af
    • Add a consistency check that prevents the overflow of garbage in blob files (#9100) · b1c27a52
      Authored by Levi Tamasi
      Summary:
      The number or total size of garbage blobs in any given blob file can
      never exceed the number or total size of all blobs in the file. (This
      would be a similar error to e.g. attempting to remove from the LSM tree
      an SST file that has already been removed.) The patch builds on
      https://github.com/facebook/rocksdb/issues/9085 and adds a
      consistency check to `VersionBuilder` that prevents the above from
      happening.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9100
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D32048982
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 6f7e0793bf534ad04c3359cc0f696b8e4e5ef81c
    • Java wrapper for blob_gc_force_threshold as blobGarbageCollectionForceThreshold (#9109) · 73e6b89f
      Authored by Alan Paxton
      Summary:
      Extra option added as a supplement to https://github.com/facebook/rocksdb/pull/8999
      
      Closes https://github.com/facebook/rocksdb/issues/8221
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9109
      
      Reviewed By: mrambacher
      
      Differential Revision: D32065039
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 6c484050a30fe0523850a8a3c95dc85b0a501362
    • remove bad extra RecordTick(stats_, WRITE_WITH_WAL) (#9064) · 2b70224f
      Authored by leipeng
      Summary:
      This PR fixes an incorrect extra bump of the ticker `WRITE_WITH_WAL`.

      `RecordTick(WRITE_WITH_WAL)` is already called later in `WriteToWAL` and `ConcurrentWriteToWAL`.

      Fixes:
      1. Delete these two extra `RecordTick(WRITE_WITH_WAL)` calls
      2. Fix the corresponding test case
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9064
      
      Reviewed By: ajkr
      
      Differential Revision: D31944459
      
      Pulled By: riversand963
      
      fbshipit-source-id: f1aa8d2a4320456bc357bc5b0902032f7dcad086
    • Update buckify scripts (#9104) · 0e12b1d6
      Authored by Yanqin Jin
      Summary:
      https://github.com/facebook/rocksdb/commit/49af999954c0c130fefdb5f4bafc919c18341521
      updates the RocksDB buckifier script directly via fbcode. We need to make
      sure that the following command, run in the RocksDB repo, generates the same
      TARGETS file.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9104
      
      Test Plan:
      ```
      $python buckifier/buckify_rocksdb.py
      ```
      Verify that TARGETS file does not have uncommitted changes.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D32055387
      
      Pulled By: riversand963
      
      fbshipit-source-id: 19cf1b8145095b6df625958458189680e543e3ba
5. 01 Nov, 2021: 4 commits
    • InternalStats::DumpCFMapStat: fix sum.w_amp (#9065) · 01bd86ad
      Authored by leipeng
      Summary:
      The aggregated `w_amp` would be a very large number (`bytes_written + bytes_written_blob`) when there has been no flush and no ingest.

      This PR sets the aggregated `w_amp` to zero in that case, which conforms to the per-level `w_amp` computation (see the sketch below).
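
      A minimal sketch of the guarded computation, with illustrative variable names:

      ```
      #include <cstdint>

      double SumWriteAmp(uint64_t bytes_written, uint64_t bytes_written_blob,
                         uint64_t bytes_flushed_and_ingested) {
        if (bytes_flushed_and_ingested == 0) {
          // No flush and no ingest: report 0, matching per-level w_amp.
          return 0.0;
        }
        return static_cast<double>(bytes_written + bytes_written_blob) /
               static_cast<double>(bytes_flushed_and_ingested);
      }
      ```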
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9065
      
      Reviewed By: ajkr
      
      Differential Revision: D31943994
      
      Pulled By: riversand963
      
      fbshipit-source-id: acbef5e331debebfad09e0e0d8d0885ebbc00609
    • Avoid div-by-zero error in db_stress (#9086) · d2635054
      Authored by Yanqin Jin
      Summary:
      If a column family has 0 levels, then the existing `TestCompactFiles(...)` may hit
      a divide-by-zero. To fix, return early if the cf is empty (see the sketch below).
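
      A minimal sketch of the early return, assuming a hypothetical helper that picks a level by modulo:

      ```
      #include <cstddef>

      bool PickLevel(std::size_t num_levels, std::size_t rand_val,
                     std::size_t* out_level) {
        if (num_levels == 0) {
          return false;  // empty cf: avoid rand_val % num_levels
        }
        *out_level = rand_val % num_levels;
        return true;
      }
      ```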
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9086
      
      Test Plan: TBD
      
      Reviewed By: ajkr
      
      Differential Revision: D31986799
      
      Pulled By: riversand963
      
      fbshipit-source-id: 48f7dfb2b2b47cfc1315cb71ca80eb230d947f17
    • Attempt to deflake ListenerTest.MultiCF (#9084) · 8e59a1dc
      Authored by Yanqin Jin
      Summary:
      EventListenerTest.MultiCF uses TestFlushListener, which has members
      flushed_dbs_ and flushed_column_family_names_ that are not protected by
      locks. This implicitly means that we need to ensure the methods
      accessing these data structures run in a single-threaded way. In other
      tests, e.g. MultiDBMultiListeners, we use TEST_WaitForFlushMemTable() to
      wait until all memtables of a given column family are flushed, hence no
      pending flush threads will concurrently call OnFlushCompleted() and
      cause a data race on flushed_dbs_. To fix the test failure, we should do
      the same for MultiCF.
      
      Example data race stack traces reported by TSAN
      ```
      Read of size 8 at 0x7b6000002840 by main thread:
          #0 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::size() const /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/stl_vector.h:655:40
          #1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() /home/circleci/project/db/listener_test.cc:380:7

      Previous write of size 8 at 0x7b6000002840 by thread T2:
          #0 void std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::_M_emplace_back_aux<rocksdb::DB* const&>(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/vector.tcc:442:26
          #1 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::push_back(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/stl_vector.h:923:4
          #2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) /home/circleci/project/db/listener_test.cc:255:18
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9084
      
      Test Plan: ./listener_test --gtest_filter=EventListenerTest.MultiCF
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D31952259
      
      Pulled By: riversand963
      
      fbshipit-source-id: 94a7f29e4e9466ead42418944eb2247fc32bd499
    • Attempt to deflake DBFlushTest.FireOnFlushCompletedAfterCommittedResult (#9083) · 8f4f3023
      Authored by Yanqin Jin
      Summary:
      DBFlushTest.FireOnFlushCompletedAfterCommittedResult uses test sync
      points to coordinate the interleaving of different threads. Before this PR,
      the test writes some data to the memtable, triggers a manual flush, and
      triggers a second manual flush after the first bg flush thread starts
      executing. Though unlikely, it is possible for the second bg flush
      thread to run faster than the first bg flush thread and dequeue the flush
      queue first. In this case, the original test will fail.
      The fix is to wait until the first bg flush thread dequeues the flush
      queue before triggering the second manual flush.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9083
      
      Test Plan: ./db_flush_test --gtest_filter=DBFlushTest.FireOnFlushCompletedAfterCommittedResult
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D31951239
      
      Pulled By: riversand963
      
      fbshipit-source-id: f32d7cdabe6ad6808fd18e54e663936dc0a9edb4
6. 30 Oct, 2021: 4 commits
    • internal_repo_rocksdb/repo · 49af9999
      Authored by CodemodService Bot
      Reviewed By: DrMarcII
      
      Differential Revision: D32033741
      
      fbshipit-source-id: af12d9d72f109a4a2837cb64e02fa0dbc9175711
    • Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) · 44d04582
      Authored by Levi Tamasi
      Summary:
      The current VersionBuilder code on mainline keeps track of blob file related
      changes ("delta") induced by a series of `VersionEdit`s in the form of
      `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta`
      contains the amount of additional garbage generated by compactions, as well
      as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves,
      since in that case the newly linked and unlinked SSTs cancel each other out.
      However, this representation does not allow us to easily tell whether a certain
      blob file is obsolete after applying a set of `VersionEdit`s or not. In order to
      solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition
      to the delta, also contains the materialized state after applying a set of version edits
      (i.e. the total amount of garbage and the resulting set of linked SSTs). This will
      enable us to add further consistency checks and to improve certain pieces of
      functionality where knowing up front which blob files get obsoleted is beneficial.
      (Note: this patch is just the refactoring part; I plan to create separate PRs for
      the enhancements.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085
      
      Test Plan: Ran `make check` and the stress tests in BlobDB mode.
      
      Reviewed By: riversand963
      
      Differential Revision: D31980867
      
      Pulled By: ltamasi
      
      fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
    • Fix a compaction bug for write-prepared txn (#9061) · fdf2a0d7
      Authored by Yanqin Jin
      Summary:
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061
      
      In write-prepared txn, checking a sequence's visibility in a released (old)
      snapshot may return "Snapshot released". Suppose we have two snapshots:
      
      ```
      earliest_snap < earliest_write_conflict_snap
      ```
      
      If we release `earliest_write_conflict_snap` but keep `earliest_snap` during
      bottommost level compaction, then it is possible that a certain sequence of
      events can lead to a PUT being seq-zeroed followed by a SingleDelete of the
      same key. This violates the ascending order of keys and will cause data
      inconsistency.
      
      Reviewed By: ltamasi
      
      Differential Revision: D31813017
      
      fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285
    • Temporarily disable s390x+cmake* Travis jobs (#9095) · f2d11b3f
      Authored by Jonathan Albrecht
      Summary:
      Temporarily disable s390x+cmake* jobs until a cmake-3.14.5-Linux-s390x.deb can be installed to https://rocksdb-deps.s3-us-west-2.amazonaws.com.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9095
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D32025417
      
      Pulled By: riversand963
      
      fbshipit-source-id: eefb9737937987c7d9273482a89e4d2266cd5375
7. 29 Oct, 2021: 2 commits
    • Fix EnvLibrados and add to CI (#9088) · 92e23996
      Authored by Peter Dillinger
      Summary:
      This feature was not part of any common or CI build, so no
      surprise it broke. Now we can at least ensure compilation. I don't know
      how to run the test successfully (missing config file) so it is bypassed
      for now.
      
      Fixes https://github.com/facebook/rocksdb/issues/9078
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9088
      
      Test Plan: CI
      
      Reviewed By: mrambacher
      
      Differential Revision: D32009467
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 3e0d1e5fde7f0ece703d48a81479e1cc7392c25c
    • Implement XXH3 block checksum type (#9069) · a7d4bea4
      Authored by Peter Dillinger
      Summary:
      XXH3 is the latest hash function in the xxHash family and is extremely fast
      on large data, easily faster than crc32c on almost any x86_64 hardware. In
      integrating this hash function, I have handled the compression-type byte
      in a non-standard way to avoid using the streaming API (extra data
      movement and active code size because of hash function complexity). This
      approach got a thumbs-up from Yann Collet.
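
      A hedged sketch of one way to fold the trailing compression-type byte into a one-shot XXH3 hash without the streaming API; this only illustrates the idea and is not necessarily the exact scheme adopted here:

      ```
      #include <cstddef>
      #include <cstdint>

      #include "xxhash.h"  // XXH3_64bits

      uint32_t BlockChecksumXXH3(const char* block_data, size_t len,
                                 char compression_type_byte) {
        // One-shot hash over the block contents (no streaming state).
        uint64_t h = XXH3_64bits(block_data, len);
        // Mix in the type byte afterwards instead of hashing len + 1 bytes.
        h += static_cast<uint8_t>(compression_type_byte);
        h *= UINT64_C(0x9E3779B185EBCA87);  // 64-bit golden-ratio multiplier
        return static_cast<uint32_t>(h >> 32);
      }
      ```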
      
      Existing functionality change:
      * reject bad ChecksumType in options with InvalidArgument
      
      This change was split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksums are
      likely to be handled through different configuration than ChecksumType.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069
      
      Test Plan:
      tests updated, and substantially expanded. Unit tests now check
      that we don't accidentally change the values generated by the checksum
      algorithms ("schema test") and that we properly handle
      invalid/unrecognized checksum types in options or in file footer.
      
      DBTestBase::ChangeOptions (etc.) updated from two to one configuration
      changing from default CRC32c ChecksumType. The point of this test code
      is to detect possible interactions among features, and the likelihood of
      some bad interaction being detected by including configurations other
      than XXH3 and CRC32c--and then not detected by stress/crash test--is
      extremely low.
      
      Stress/crash test also updated (manual run long enough to see it accepts
      new checksum type). db_bench also updated for microbenchmarking
      checksums.
      
       ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
      
      ./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
      crc32c       :       0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
      xxhash       :       0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
      xxhash64     :       0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
      xxh3         :       0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
      crc32c       :       0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
      xxhash       :       0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
      xxhash64     :       0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
      xxh3         :       0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
      crc32c       :       0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
      xxhash       :       0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
      xxhash64     :       0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
      xxh3         :       0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)
      
      As you can see, especially once warmed up, xxh3 is fastest.
      
       ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
      
      Test
      
          for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done
      
      Results (ops/sec)
      
          for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
      
      results-0 252118 # kNoChecksum
      results-1 251588 # kCRC32c
      results-2 251863 # kxxHash
      results-3 252016 # kxxHash64
      results-4 252038 # kXXH3
      
      Reviewed By: mrambacher
      
      Differential Revision: D31905249
      
      Pulled By: pdillinger
      
      fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
8. 28 Oct, 2021: 4 commits
    • Prevent corruption with parallel manual compactions and `change_level == true` (#9077) · f24c39ab
      Authored by Andrew Kryczka
      Summary:
      The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well.
      
      Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial-move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., it has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()` to prepare for refitting. In the old code, B could still proceed to register a manual compaction even though A had disabled manual compaction.
      
      The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps.
      
      Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).
      
      The fix is simply to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled (see the sketch below).
      
      Thanks to siying for finding the bug.
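
      A hedged sketch of the guard, with simplified names; the real registration path involves more state:

      ```
      #include <atomic>

      class CompactionScheduler {
       public:
        void DisableManualCompaction() { manual_compaction_paused_.fetch_add(1); }
        void EnableManualCompaction() { manual_compaction_paused_.fetch_sub(1); }

        // Called from RunManualCompaction(): with the fix, B cannot register
        // while A has manual compactions disabled for refitting.
        bool TryRegisterManualCompaction() {
          return manual_compaction_paused_.load() == 0;
        }

       private:
        std::atomic<int> manual_compaction_paused_{0};
      };
      ```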
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077
      
      Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while the LSM state change for RefitLevel() is being logged. But the fix is to make such a compaction not get picked/scheduled in the first place, so any repro of that scenario would end up hanging in RefitLevel()'s logging phase. So instead I just verified that no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.
      
      Reviewed By: siying
      
      Differential Revision: D31921908
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
    • Clarify caching behavior for index and filter partitions (#9068) · 5bf9a7d5
      Authored by Peter Dillinger
      Summary:
      Somewhat confusingly, index and filter partition blocks are
      never owned by table readers, even with
      cache_index_and_filter_blocks=false. They still go into block cache
      (possibly pinned by table reader) if there is a block cache. If no block
      cache, they are only loaded transiently on demand.
      
      This PR primarily clarifies the options APIs and some internal code
      comments.
      
      Also, this closes a hypothetical data corruption vulnerability where
      some but not all index partitions are pinned. I haven't been able to
      reproduce a case where it can happen (the failure seems to propagate
      into aborting the table open), but it's worth patching nonetheless.
      
      Fixes https://github.com/facebook/rocksdb/issues/8979
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9068
      
      Test Plan:
      existing tests :-/  I could cover the new code using sync
      points, but then I'd have to very carefully relax my `assert(false)`
      
      Reviewed By: ajkr
      
      Differential Revision: D31898284
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f2511a7d3a36bc04b627935d8e6cfea6422f98be
    • Fix incorrect order of comments in win_thread.cc (#9033) · 82846f41
      Authored by Calin Culianu
      Summary:
      The comments in the `#endif` section at the end of the file were in the
      wrong order.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9033
      
      Reviewed By: mrambacher
      
      Differential Revision: D31935856
      
      Pulled By: ajkr
      
      fbshipit-source-id: 24aca039993d6e27022cfe8d6434e90f2934c87c
    • Make format-diff.sh locale-independent (#9079) · 4ec31dc8
      Authored by Peter Dillinger
      Summary:
      Force the POSIX locale for calls to 'git remote' that might have
      locale-dependent formatting, as shown in a comment on https://github.com/facebook/rocksdb/issues/8731
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9079
      
      Test Plan:
      manual (haven't tried on a machine with a non-English default
      locale)
      
      Reviewed By: ltamasi
      
      Differential Revision: D31943092
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7dbe5915824f39f73b412cc3d1a86a2521cf76c1
9. 27 Oct, 2021: 1 commit
10. 23 Oct, 2021: 1 commit