1. 25 Aug 2021, 1 commit
    • Refactor WriteBufferManager::CacheRep into CacheReservationManager (#8506) · 74cfe7db
      Hui Xiao committed
      Summary:
      Context:
      To cap various kinds of memory usage under the single limit of the block cache capacity, we charge that memory to the block cache by inserting and releasing dummy entries in it. CacheReservationManager is a class (not thread-safe) responsible for inserting/removing dummy entries to reserve cache space for the memory used by its user.
      
      - Refactored the inner private class CacheRep of WriteBufferManager into public CacheReservationManager class for reusability such as for https://github.com/facebook/rocksdb/pull/8428
      
      - Encapsulated implementation details of cache key generation and dummy entries insertion/release in cache reservation as discussed in https://github.com/facebook/rocksdb/pull/8506#discussion_r666550838
      
      - Consolidated increase/decrease cache reservation into one API - UpdateCacheReservation.
      
      - Changed the dummy entry release algorithm used when decreasing the cache reservation so that it releases dummy entries in a loop, symmetric with the dummy entry insertion algorithm.
      
      - Made the dummy entry release algorithm in delayed decrease mode more aggressive, so that the cache reservation shrinks further when memory usage is unlikely to grow back.
      
        Previously, the algorithm released only one dummy entry when new_mem_used < 3/4 * cache_allocated_size_ and cache_allocated_size_ - kSizeDummyEntry > new_mem_used.
        Now, the algorithm releases as many dummy entries as possible, in a loop, when new_mem_used < 3/4 * cache_allocated_size_.
      
      - Updated WriteBufferManager's test cases to adapt to the release algorithm changes above, and left comments on some test cases for clarity.
      
      - Replaced the previous cache key prefix generation (utilizing object address related to the cache client) with one that utilizes Cache->NewID() to prevent cache-key collision among dummy entry clients sharing the same cache.
      
        The specific collision we are preventing happens when an object address is reused for a new cache-key prefix while an old cache key using that same address in its prefix still exists in the cache. This can happen because, under the LRU cache policy, a cache entry may be released some time after the cache client object that owns it is deallocated. Meanwhile, the address of that deallocated client object can be reused by another client object to generate a new cache-key prefix.
      
        This prefix generation can be made obsolete after Peter's unification of all the code generating cache key, mentioned in https://github.com/facebook/rocksdb/pull/8506#discussion_r667265255
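
      A minimal sketch of the reserve/release arithmetic described above, assuming a dummy entry size of 256 KiB. `ReservationSim` is a hypothetical class that only counts dummy entries instead of inserting them into a real `Cache`, so it is not the CacheReservationManager API:

      ```cpp
      #include <cassert>
      #include <cstddef>

      // Hypothetical simulation: counts dummy entries instead of inserting them
      // into a real block cache.
      class ReservationSim {
       public:
        static constexpr std::size_t kSizeDummyEntry = 256 * 1024;  // assumed size

        explicit ReservationSim(bool delayed_decrease)
            : delayed_decrease_(delayed_decrease) {}

        void Update(std::size_t new_mem_used) {
          if (new_mem_used > allocated_) {
            // Insert dummy entries until the reservation covers new_mem_used.
            while (new_mem_used > allocated_) {
              ++num_dummy_;
              allocated_ += kSizeDummyEntry;
            }
          } else if (new_mem_used < allocated_) {
            // Delayed mode: keep the reservation while usage is >= 3/4 of it.
            if (delayed_decrease_ && new_mem_used >= allocated_ / 4 * 3) return;
            // Loop-release down to the smallest multiple of kSizeDummyEntry
            // that still covers new_mem_used.
            while (new_mem_used + kSizeDummyEntry <= allocated_) {
              --num_dummy_;
              allocated_ -= kSizeDummyEntry;
            }
          }
        }

        std::size_t num_dummy() const { return num_dummy_; }

       private:
        bool delayed_decrease_;
        std::size_t allocated_ = 0;
        std::size_t num_dummy_ = 0;
      };

      int main() {
        const std::size_t k = ReservationSim::kSizeDummyEntry;
        ReservationSim sim(/*delayed_decrease=*/true);
        sim.Update(10 * k);
        assert(sim.num_dummy() == 10);
        sim.Update(8 * k);  // still >= 3/4 of 10 entries: nothing released
        assert(sim.num_dummy() == 10);
        sim.Update(2 * k + 1);  // below 3/4: release in a loop down to 3 entries
        assert(sim.num_dummy() == 3);
      }
      ```

      With `delayed_decrease` set, the drop to 8 entries' worth of usage releases nothing, and the later drop releases seven entries in one call; under the previous algorithm described above, that call would have released only one.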
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8506
      
      Test Plan:
      - Passes the added unit tests in cache_reservation_manager_test.cc
      - Passes the existing and adjusted tests in write_buffer_manager_test.cc
      
      Reviewed By: ajkr
      
      Differential Revision: D29644135
      
      Pulled By: hx235
      
      fbshipit-source-id: 0fc93fbfe4a40bb41be85c314f8f2bafa8b741f7
  2. 24 Aug 2021, 5 commits
    • Deflake write-prepared and write-unprepared tests (#8696) · c521f22a
      Andrew Kryczka committed
      Summary:
      `JobContext::job_snapshot` references DB state, but it could be destroyed
      by a BG thread after the signal/unlock that allows shutdown to proceed,
      so the snapshot could outlive the DB. We would then see an error like this (valgrind):
      
      ```
      ==354104== Thread 2:
      ==354104== Invalid read of size 8
      ==354104==    at 0x694C4D: rocksdb::ManagedSnapshot::~ManagedSnapshot() (snapshot_impl.cc:20)
      ==354104==    by 0x58F5BA: operator() (unique_ptr.h:81)
      ==354104==    by 0x58F5BA: operator() (unique_ptr.h:75)
      ==354104==    by 0x58F5BA: ~unique_ptr (unique_ptr.h:292)
      ==354104==    by 0x58F5BA: rocksdb::JobContext::~JobContext() (job_context.h:221)
      ==354104==    by 0x5F155E: rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) (db_impl_compaction_flush.cc:2696)
      ==354104==    by 0x5F1BC2: rocksdb::DBImpl::BGWorkCompaction(void*) (db_impl_compaction_flush.cc:2468)
      ==354104==    by 0x83707A: operator() (std_function.h:688)
      ==354104==    by 0x83707A: rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) (threadpool_imp.cc:266)
      ==354104==    by 0x8373ED: rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*) (threadpool_imp.cc:307)
      ==354104==    by 0x492A800: execute_native_thread_routine (in /usr/local/fbcode/platform009/lib/libstdc++.so.6.0.28)
      ==354104==    by 0x4A5020B: start_thread (in /usr/local/fbcode/platform009/lib/libpthread-2.30.so)
      ==354104==    by 0x4CF281E: clone (in /usr/local/fbcode/platform009/lib/libc-2.30.so)
      ```
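
      A hypothetical, single-threaded sketch of the hazard: `FakeDB`, `FakeSnapshot`, and `signal_done` are illustrative stand-ins (not RocksDB types, and not the code changed in this PR). Anything whose destructor touches DB state must be released before the signal that lets shutdown free the DB:

      ```cpp
      #include <cassert>
      #include <functional>
      #include <memory>

      // Hypothetical stand-in for DB state referenced by a snapshot.
      struct FakeDB {
        int live_snapshots = 0;
      };

      // Hypothetical stand-in for a snapshot whose destructor touches the DB.
      class FakeSnapshot {
       public:
        explicit FakeSnapshot(FakeDB* db) : db_(db) { ++db_->live_snapshots; }
        ~FakeSnapshot() { --db_->live_snapshots; }  // invalid if db_ was freed

       private:
        FakeDB* db_;
      };

      // signal_done models the signal/unlock that lets shutdown proceed; once it
      // runs, the DB may be destroyed at any time.
      void BackgroundJob(FakeDB* db, const std::function<void()>& signal_done) {
        auto snapshot = std::make_unique<FakeSnapshot>(db);
        // ... background work that needs the snapshot ...
        snapshot.reset();  // release DB-referencing state first
        signal_done();     // only then allow shutdown to free the DB
      }

      int main() {
        auto db = std::make_unique<FakeDB>();
        FakeDB* raw = db.get();
        int live_at_signal = -1;
        BackgroundJob(raw, [&] {
          live_at_signal = raw->live_snapshots;
          db.reset();  // shutdown frees the DB immediately
        });
        assert(live_at_signal == 0);
        assert(db == nullptr);
      }
      ```

      Moving `snapshot.reset()` after `signal_done()` (or relying on the destructor at scope exit) would make `~FakeSnapshot` write to freed memory, the same kind of invalid access valgrind reports above.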
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8696
      
      Test Plan: unable to reproduce
      
      Reviewed By: pdillinger
      
      Differential Revision: D30505277
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5a99f34137cd14d06b0f624add6d37a70a61135d
    • Refactor TraceAnalyzer to use `TraceRecord::Handler` to avoid casting. (#8678) · f6437ea4
      Merlin Mao committed
      Summary:
      `TraceAnalyzer` privately inherits `TraceRecord::Handler` and `WriteBatch::Handler`.
      
      `trace_analyzer_test` passes with this change.
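
      A minimal sketch of the handler (visitor) pattern that removes the need for downcasting, assuming two hypothetical record types. The class names are illustrative and do not reproduce the actual methods of `TraceRecord::Handler`:

      ```cpp
      #include <cassert>
      #include <memory>
      #include <string>
      #include <utility>
      #include <vector>

      class GetRecord;
      class WriteRecord;

      // Hypothetical handler interface: one overload per record type.
      class RecordHandler {
       public:
        virtual ~RecordHandler() = default;
        virtual void Handle(const GetRecord& record) = 0;
        virtual void Handle(const WriteRecord& record) = 0;
      };

      class Record {
       public:
        virtual ~Record() = default;
        // Double dispatch: the record picks the matching overload, so callers
        // never need to downcast.
        virtual void Accept(RecordHandler& handler) const = 0;
      };

      class GetRecord : public Record {
       public:
        explicit GetRecord(std::string k) : key(std::move(k)) {}
        void Accept(RecordHandler& handler) const override { handler.Handle(*this); }
        std::string key;
      };

      class WriteRecord : public Record {
       public:
        explicit WriteRecord(int n) : num_ops(n) {}
        void Accept(RecordHandler& handler) const override { handler.Handle(*this); }
        int num_ops;
      };

      // The analyzer privately inherits the handler, so Handle() is not part of
      // its public interface.
      class Analyzer : private RecordHandler {
       public:
        void Analyze(const Record& record) { record.Accept(*this); }
        int gets = 0;
        int write_ops = 0;

       private:
        void Handle(const GetRecord&) override { ++gets; }
        void Handle(const WriteRecord& record) override { write_ops += record.num_ops; }
      };

      int main() {
        std::vector<std::unique_ptr<Record>> trace;
        trace.push_back(std::make_unique<GetRecord>("k1"));
        trace.push_back(std::make_unique<WriteRecord>(3));
        Analyzer analyzer;
        for (const auto& record : trace) analyzer.Analyze(*record);
        assert(analyzer.gets == 1 && analyzer.write_ops == 3);
      }
      ```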
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8678
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30459814
      
      Pulled By: autopear
      
      fbshipit-source-id: a27f59ac4600f7c3682830c9b1d9dc79e53425be
    • Add extra information to RemoteCompaction APIs (#8680) · 249b1078
      Jay Zhuang committed
      Summary:
      Currently, RemoteCompaction APIs provide only `job_id`. The main problem
      with `job_id` is that it cannot uniquely identify a compaction job across
      DB instances or across sessions. This change also provides the DB id and
      session id to the user, which makes building a cross-DB compaction
      service easier.
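
      A hypothetical sketch of how a compaction service could combine these fields into a key. `RemoteJobInfo` and `UniqueJobKey` are illustrative names; the actual struct and field names in the RemoteCompaction API may differ:

      ```cpp
      #include <cassert>
      #include <string>

      // Hypothetical subset of per-job information a compaction service receives.
      struct RemoteJobInfo {
        std::string db_id;
        std::string db_session_id;
        int job_id;
      };

      // job_id alone can repeat across DB instances and sessions, so the key
      // combines it with the DB id and the session id.
      std::string UniqueJobKey(const RemoteJobInfo& info) {
        return info.db_id + "/" + info.db_session_id + "/" +
               std::to_string(info.job_id);
      }

      int main() {
        RemoteJobInfo a{"db-A", "session-1", 7};
        RemoteJobInfo b{"db-A", "session-2", 7};  // same job_id, new session
        assert(UniqueJobKey(a) != UniqueJobKey(b));
        assert(UniqueJobKey(a) == "db-A/session-1/7");
      }
      ```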
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8680
      
      Test Plan: unittest
      
      Reviewed By: ajkr
      
      Differential Revision: D30444859
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: fdf107f4286564049637f154193c6d94c3c59448
    • Allow intentionally swallowed errors in BlockBasedFilterBlockReader (#8695) · 1a5eb33d
      Peter Dillinger committed
      Summary:
      Avoids the "Didn't get expected error from Get" failure that the crash
      test reports once block-based filters are enabled in it, as done in https://github.com/facebook/rocksdb/issues/8679.
      This applies the IGNORE_STATUS_IF_ERROR pattern from
      full_filter_block.cc to block_based_filter_block.cc.
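
      A hypothetical sketch of an intentionally swallowed filter error: when the filter cannot be read, the reader reports "may match" so the lookup proceeds, and it notifies the test harness that the error was ignored. The callback only stands in for whatever IGNORE_STATUS_IF_ERROR does internally; all names here are illustrative:

      ```cpp
      #include <cassert>
      #include <functional>
      #include <string>
      #include <utility>

      // Hypothetical minimal status type (not rocksdb::Status).
      struct Status {
        bool is_ok = true;
        bool ok() const { return is_ok; }
      };

      // Hypothetical filter reader used only for illustration.
      class FilterReaderSketch {
       public:
        FilterReaderSketch(bool filter_readable, std::function<void()> on_ignored)
            : readable_(filter_readable), on_ignored_(std::move(on_ignored)) {}

        // Returns false only when the filter proves the key is absent.
        bool KeyMayMatch(const std::string& key) const {
          Status s = LoadFilter();
          if (!s.ok()) {
            // Swallow the error, but tell the (assumed) fault-injection harness
            // that an error was deliberately ignored.
            on_ignored_();
            return true;  // "may match": the lookup continues to the data blocks
          }
          return key != "absent";  // stand-in for probing a real filter
        }

       private:
        Status LoadFilter() const {
          Status s;
          s.is_ok = readable_;
          return s;
        }

        bool readable_;
        std::function<void()> on_ignored_;
      };

      int main() {
        int ignored = 0;
        FilterReaderSketch broken(false, [&] { ++ignored; });
        assert(broken.KeyMayMatch("absent"));  // error swallowed
        assert(ignored == 1);
        FilterReaderSketch healthy(true, [&] { ++ignored; });
        assert(!healthy.KeyMayMatch("absent"));  // filter rules the key out
        assert(ignored == 1);
      }
      ```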
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8695
      
      Test Plan: watch for resolution of crash test runs
      
      Reviewed By: ltamasi
      
      Differential Revision: D30496748
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f7808fcf14c0e787fe81da03fa8303244590d273
    • Fix typo in 6.24.0 HISTORY.md (#8694) · 0637c8d3
      Peter Dillinger committed
      Summary:
      Fix a typo.
      
      Also clarified the change to the C API signatures.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8694
      
      Test Plan: visual
      
      Reviewed By: ltamasi
      
      Differential Revision: D30492882
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ac6dc3dcefa01c91fd87fc7f50279ea5e13fa41d
  3. 23 Aug 2021, 1 commit
  4. 21 Aug 2021, 8 commits
    • Update version.h and HISTORY.md for the 6.24 release (#8688) · 8c9e6897
      Levi Tamasi committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8688
      
      Reviewed By: ajkr, riversand963
      
      Differential Revision: D30467746
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 0fce0d42fe2fe3cb56d7a89607154b3b957f09b6
    • Embed original file number in SST table properties (#8686) · 04db7648
      Peter Dillinger committed
      Summary:
      I very recently realized that with https://github.com/facebook/rocksdb/issues/8669 we cannot later add
      file numbers to external SST files (so that more can share db session
      ids for better uniqueness properties), because of forward compatibility.
      We would have a version of RocksDB that assumes session IDs are unique
      on external SST files and therefore can't really break that invariant in
      future files.
      
      This change adds a table property for "orig_file_number", which is
      populated by normal SST files and also by external SST files generated by
      SstFileWriter. SstFileWriter now keeps a db_session_id for the life of the
      object and increments its own file numbers for embedding in table
      properties. (They are arguably "fake" file numbers because these numbers
      are not embedded in the file name.)
      
      While updating block_based_table_builder, I removed several unnecessary
      fields from Rep, because following the pattern would have created
      another unnecessary field.
      
      This change also updates block_based_table_reader to use this new
      property when available, which means that for newer SST files, we can
      determine the stable/original <db_session_id,file_number> unique
      identifier using just the file contents, not the file name. (It's a bit
      complicated; detailed comments in block_based_table_reader.)
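
      A simplified sketch of the idea, with hypothetical names: a writer keeps one session id and numbers its own files, and a reader prefers the embedded original file number over the number in the file name. The real reader logic is more involved (see the comments in block_based_table_reader), and treating 0 as "property absent" is an assumption of this sketch:

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <string>
      #include <utility>

      // Hypothetical subset of table properties; in this sketch,
      // orig_file_number == 0 means the property is absent (older file).
      struct PropsSketch {
        std::string db_session_id;
        uint64_t orig_file_number;
      };

      // Models an SstFileWriter-like object that keeps one session id for its
      // lifetime and assigns its own ("fake") file numbers.
      class ExternalWriterSketch {
       public:
        explicit ExternalWriterSketch(std::string session_id)
            : session_id_(std::move(session_id)) {}
        PropsSketch NextFileProps() { return PropsSketch{session_id_, ++next_number_}; }

       private:
        std::string session_id_;
        uint64_t next_number_ = 0;
      };

      // Prefer the embedded original file number; fall back to the number parsed
      // from the current file name when the property is missing.
      std::pair<std::string, uint64_t> StableId(const PropsSketch& props,
                                                uint64_t number_from_name) {
        uint64_t n = props.orig_file_number != 0 ? props.orig_file_number
                                                 : number_from_name;
        return {props.db_session_id, n};
      }

      int main() {
        ExternalWriterSketch writer("session-x");
        PropsSketch first = writer.NextFileProps();
        PropsSketch second = writer.NextFileProps();
        assert(first.db_session_id == second.db_session_id);
        assert(first.orig_file_number == 1 && second.orig_file_number == 2);
        // After ingestion the file gets a new name/number, but the id is unchanged.
        assert(StableId(first, 42).second == 1);
        PropsSketch old_file{"session-y", 0};
        assert(StableId(old_file, 7).second == 7);
      }
      ```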
      
      Also added DB host id to properties listing by sst_dump, which could be
      useful in debugging.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8686
      
      Test Plan: substantially overhauled the StableCacheKeys test for this change
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30457742
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 2e5ae7dddeb94fb9d8eac8a928486aed8b8cd445
    • Upgrade xxhash, add Hash128 (#8634) · 22161b75
      Peter Dillinger committed
      Summary:
      Because we expect to use a 128-bit hash, the xxhash library is
      upgraded to the current dev version (2c611a76f914828bed675f0f342d6c4199ffee1e)
      as of Aug 6 so that we can use the production version of XXH3_128bits
      as the new Hash128 function (added in hash128.h).
      
      To make this work, however, we have to carve out the "preview" version
      of XXH3 that is used in new SST Bloom and Ribbon filters, since that
      will not get maintenance in xxhash releases. I have consolidated all the
      relevant code into xxph3.h and made it "inline only" (no .cc file). The
      working name for this hash function is changed from XXH3p to XXPH3
      (XX Preview Hash) because the latter is easier to get working with no
      symbol name conflicts between the headers.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8634
      
      Test Plan:
      No change in existing functionality is expected. For Hash128,
      added some unit tests based on those for Hash64 to check some basic
      properties and to ensure that the values do not change accidentally.
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30173490
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 06aa542a7a28b353bc2c865b9b2f8bdfe44158e4
    • Add Bloom/Ribbon hybrid API support (#8679) · 2a383f21
      Peter Dillinger committed
      Summary:
      This is essentially a resurrection and fix of the part of
      https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
      when configuring a Ribbon filter, you can specify an LSM level before which
      Bloom will be used instead of Ribbon. But Bloom is only considered for
      Leveled and Universal compaction styles and for files going into a known LSM
      level. This way, SST file writer, FIFO compaction, etc. use the Ribbon filter as
      you would expect with NewRibbonFilterPolicy.
      
      So that this can be controlled with a single int value and so that flushes
      can be distinguished from intra-L0, we consider flush to go to level -1 for
      the purposes of this option. (Explained in API comment.)
      
      I also expect the most common and recommended Ribbon configuration to
      use Bloom during flush, to minimize slowing down writes and because according
      to my estimates, Ribbon only pays off if the structure lives in memory for
      more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
      to be this mild hybrid configuration. I don't really want to add something like
      NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
      flush, Ribbon otherwise) should be considered a natural choice.
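
      A sketch of the level threshold rule as described above, with flush treated as level -1. The enums, the `UseBloom` helper, and the parameter name `bloom_before_level` are illustrative assumptions, and the sketch omits constructing the actual filters:

      ```cpp
      #include <cassert>
      #include <climits>

      enum class CompactionStyle { kLevel, kUniversal, kFIFO };
      enum class Reason { kFlush, kCompaction, kOther };

      // Hypothetical helper: returns true if Bloom (rather than Ribbon) would be
      // chosen for a new SST file under the rule described above.
      // level_at_creation < 0 means the target level is unknown.
      bool UseBloom(CompactionStyle style, Reason reason, int level_at_creation,
                    int bloom_before_level) {
        int levelish = INT_MAX;  // unknown level or other style: never Bloom
        if (style == CompactionStyle::kLevel ||
            style == CompactionStyle::kUniversal) {
          if (reason == Reason::kFlush) {
            levelish = -1;  // flush is treated as level -1
          } else if (level_at_creation >= 0) {
            levelish = level_at_creation;
          }
        }
        return levelish < bloom_before_level;
      }

      int main() {
        // Threshold 0: Bloom for flush only, Ribbon otherwise.
        assert(UseBloom(CompactionStyle::kLevel, Reason::kFlush, 0, 0));
        assert(!UseBloom(CompactionStyle::kLevel, Reason::kCompaction, 0, 0));  // intra-L0
        assert(!UseBloom(CompactionStyle::kFIFO, Reason::kFlush, 0, 0));
        assert(!UseBloom(CompactionStyle::kLevel, Reason::kOther, -1, 0));  // unknown level
        // Threshold 2: Bloom for flush, L0 and L1.
        assert(UseBloom(CompactionStyle::kUniversal, Reason::kCompaction, 1, 2));
        assert(!UseBloom(CompactionStyle::kUniversal, Reason::kCompaction, 2, 2));
      }
      ```

      Treating flush as level -1 is what lets a single integer distinguish flushes from intra-L0 compactions: with a threshold of 0, only flushes fall below it.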
      
      C APIs also updated, but because they don't support overloading,
      rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
      rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
      configuration. While touching C API, I changed bits per key options from
      int to double.
      
      BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
      unused fields from BloomFilterPolicy.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679
      
      Test Plan: new + updated tests, including crash test
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D30445797
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
    • Add `IteratorTraceExecutionResult` for iterator related trace records. (#8687) · baf22b4e
      Merlin Mao committed
      Summary:
      - Allow getting `Valid()`, `status()`, `key()` and `value()` of an iterator from `IteratorTraceExecutionResult`.
      - Move the lower bound and upper bound from `IteratorSeekQueryTraceRecord` to `IteratorQueryTraceRecord`.
      
      Added a test in `DBTest2.TraceAndReplay`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8687
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30457630
      
      Pulled By: autopear
      
      fbshipit-source-id: be433099a25895b3aa6f0c00f95ad7b1d7489c1d
    • Add a PerfContext counter for secondary cache hits (#8685) · f35042ca
      anand76 committed
      Summary:
      Add a PerfContext counter for secondary cache hits.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8685
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30453957
      
      Pulled By: anand1976
      
      fbshipit-source-id: 42888a3ced240e1c44446d52d3b04adfb01f5665
    • Update the block_read_count/block_read_byte counters in MultiGet (#8676) · 22f2936b
      anand76 committed
      Summary:
      MultiGet in block based table reader doesn't use BlockFetcher. As a result, the block_read_count and block_read_byte PerfContext counters were not being updated. This fixes that by updating them in MultiRead.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8676
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30428680
      
      Pulled By: anand1976
      
      fbshipit-source-id: 21846efe92588fc17123665dd06733693a40126d
    • Fix blob callback in compaction and atomic flush (#8681) · 5efec84c
      Akanksha Mahajan committed
      Summary:
      Pass BlobFileCompletionCallback to atomic flush and compaction jobs,
      where it was previously nullptr (the default parameter).
      The integrated BlobDB uses BlobFileCompletionCallback to report new blob files to
      SstFileManager.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8681
      
      Test Plan: CircleCI jobs
      
      Reviewed By: ltamasi
      
      Differential Revision: D30445998
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: ba48093843864faec57f1f365cce7b5a569c4021
  5. 20 Aug 2021, 2 commits
    • Add iterator's lower and upper bounds to `TraceRecord` (#8677) · ff895338
      Merlin Mao committed
      Summary:
      Trace file V2 added lower/upper bounds to `Iterator::Seek()` and `Iterator::SeekForPrev()`, but they were not used anywhere during the execution of a `TraceRecord`. Now, if they are set, they are applied to the `ReadOptions` used for `Iterator::Seek()` and `Iterator::SeekForPrev()`.
      
      Added test cases in `DBTest2.TraceAndManualReplay`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8677
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30438255
      
      Pulled By: autopear
      
      fbshipit-source-id: 82563006be0b69155990e506a74951c18af8d288
    • Fix some minor issues in the Customizable infrastructure (#8566) · 9eb002fc
      mrambacher committed
      Summary:
      - Fix an issue with OptionType::Vector when the nested item is a Customizable with no names
      - Fix an issue so that OptionType::Vector appropriately wraps the elements in a Vector
      - Fix an issue where a nested Customizable object with a null immutable object still appeared in the mutable options
      - Fix/Add tests for null/empty customizable objects
      - Move the RegisterTestObjects from customizable_test into testutil.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8566
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30303724
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 33fa8ea2a3b663210cb356da05e64aab7585b1b5
  6. 19 Aug 2021, 3 commits
    • Add condition on NotifyOnFlushComplete that FlushJob was not mempurge. Add... · c625b8d0
      Baptiste Lemaire committed
      Add condition on NotifyOnFlushComplete that FlushJob was not mempurge. Add event listeners to mempurge tests. (#8672)
      
      Summary:
      Previously, when a `FlushJob` was redirected to a MemPurge, the function `DBImpl::NotifyOnFlushComplete` was still called, which created a series of issues because the JobInfo was not correctly collected from the memtables.
      This diff corrects these two issues (`FlushJobInfo` collection in `FlushJob::MemPurge`, and no call to `DBImpl::NotifyOnFlushComplete` after a successful mempurge).
      Event listeners were added to the unit tests to cover these situations.
      Surprisingly, none of the crash tests caught this issue; I will try to add event listeners to crash tests in the future.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8672
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D30383109
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 35a8d4295886923ee4049a6447f00022cb221c73
    • Allow Replayer to report the results of TraceRecords. (#8657) · d10801e9
      Merlin Mao committed
      Summary:
      `Replayer::Execute()` can directly return the result (e.g., request latency, DB::Get() return code, returned value, etc.).
      `Replayer::Replay()` reports the results via a callback function.
      
      New interface:
      `TraceRecordResult` in "rocksdb/trace_record_result.h".
      
      `DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.
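
      A hypothetical sketch of the two reporting styles: `Execute()` hands back the result of one record, and `Replay()` forwards each result to a user callback. The types below are stand-ins, not the actual `Replayer` or `TraceRecordResult` interfaces:

      ```cpp
      #include <cassert>
      #include <functional>
      #include <memory>
      #include <string>
      #include <utility>
      #include <vector>

      // Hypothetical stand-ins for a status and a per-record result.
      struct StatusSketch {
        bool ok;
      };
      struct ResultSketch {
        std::string op;
        long latency_us;
      };

      using ResultCallback =
          std::function<void(StatusSketch, std::unique_ptr<ResultSketch>&&)>;

      class ReplayerSketch {
       public:
        explicit ReplayerSketch(std::vector<std::string> ops) : ops_(std::move(ops)) {}

        // Executes one record and returns its result directly.
        StatusSketch Execute(const std::string& op,
                             std::unique_ptr<ResultSketch>* result) {
          result->reset(new ResultSketch{op, 10});  // fake latency
          return StatusSketch{true};
        }

        // Replays every record and reports each result through the callback.
        StatusSketch Replay(const ResultCallback& callback) {
          for (const auto& op : ops_) {
            std::unique_ptr<ResultSketch> result;
            StatusSketch s = Execute(op, &result);
            if (callback) callback(s, std::move(result));
          }
          return StatusSketch{true};
        }

       private:
        std::vector<std::string> ops_;
      };

      int main() {
        ReplayerSketch replayer(std::vector<std::string>{"Get", "Write"});
        std::vector<std::string> seen;
        replayer.Replay([&](StatusSketch s, std::unique_ptr<ResultSketch>&& r) {
          assert(s.ok);
          seen.push_back(r->op);
        });
        assert(seen.size() == 2 && seen[0] == "Get" && seen[1] == "Write");
      }
      ```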
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657
      
      Reviewed By: ajkr
      
      Differential Revision: D30290216
      
      Pulled By: autopear
      
      fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
    • Stable cache keys on ingested SST files (#8669) · b6269b07
      Peter Dillinger committed
      Summary:
      Extends https://github.com/facebook/rocksdb/issues/8659 to work for ingested external SST files, even
      the same file ingested into different DBs sharing a block cache.
      
      Note: These new cache keys are currently only enabled when FileSystem
      does not provide GetUniqueId. For now, they are typically larger,
      so slightly less efficient.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8669
      
      Test Plan: Extended unit test
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30398532
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 1f13e2af4b8bfff5741953a69466e9589fbc23c7
  7. 18 Aug 2021, 3 commits
    • Fix bug caused by releasing snapshot(s) during compaction (#8608) · 2b367fa8
      Yanqin Jin committed
      Summary:
      In debug mode, we see the following assertion failure:
      
      ```
      db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \
      Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed.
      ```
      
      It is caused by releasing the earliest snapshot during compaction, between the execution of
      `NextFromInput()` and `PrepareOutput()`.
      
      In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`,
      a subsequent range scan may return an incorrect result if assertions are disabled, as in opt builds:
      the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also
      written to the SST file after compaction. Due to the logic of DBIter, the PUT will not be
      skipped and will be returned by the iterator in a range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened.
      
      Fix by taking a more conservative approach: make compaction zero out sequence number only
      if key is in the earliest snapshot when the compaction starts.
      
      Another assertion failure is:
      ```
      Assertion `current_user_key_snapshot_ == last_snapshot' failed.
      ```
      
      It's caused by releasing the snapshot between the PUT and SingleDelete during compaction.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608
      
      Test Plan: make check
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D30145645
      
      Pulled By: riversand963
      
      fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81
    • Add statistics support to integrated BlobDB (#8667) · 6878cedc
      Levi Tamasi committed
      Summary:
      The patch adds statistics support to the integrated BlobDB implementation,
      namely the tickers `BLOB_DB_BLOB_FILE_BYTES_READ` and
      `BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED`, and the histograms
      `BLOB_DB_(DE)COMPRESSION_MICROS`. (Some other statistics, like
      `BLOB_DB_BLOB_FILE_BYTES_WRITTEN`, `BLOB_DB_BLOB_FILE_SYNCED`,
      `BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS` were already supported.)
      Note that the vast majority of the old BlobDB's tickers/histograms are not
      really applicable to the new implementation, since they e.g. pertain to calling
      dedicated BlobDB APIs (which the integrated BlobDB does not have) or are
      tied to the legacy BlobDB's design of writing blob files synchronously when
      a write API is called. Such statistics are marked "legacy BlobDB only" in
      `statistics.h`.
      
      Fixes https://github.com/facebook/rocksdb/issues/8645 .
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8667
      
      Test Plan: Ran `make check` and tested the new statistics using `db_bench`.
      
      Reviewed By: riversand963
      
      Differential Revision: D30356884
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 5f8a833faee60401c5643c2f0a6c0415488190a4
    • Exclude property kLiveSstFilesSizeAtTemperature from stress_test (#8668) · 0729b287
      Jay Zhuang committed
      Summary:
      This property is excluded just like the other per-level properties.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8668
      
      Test Plan: stress_test
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30360967
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 70da2557b95c55e8081b04ebf1a909a0fe69488f
  8. 17 Aug 2021, 2 commits
    • Add a stat to count secondary cache hits (#8666) · add68bd2
      anand76 committed
      Summary:
      Add a stat for secondary cache hits. The `Cache::Lookup` API had an unused `stats` parameter. This PR uses it to pass a pointer to a `Statistics` object, which `LRUCache` uses to record the stat.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8666
      
      Test Plan: Updated a unit test in lru_cache_test
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30353816
      
      Pulled By: anand1976
      
      fbshipit-source-id: 2046f78b460428877a26ffdd2bb914ae47dfbe77
    • Stable cache keys using DB session ids in SSTs (#8659) · a207c278
      Peter Dillinger committed
      Summary:
      Use DB session ids in SST table properties to make cache keys
      stable across DB re-open and copy / move / restore / etc.
      
      These new cache keys are currently only enabled when FileSystem does not
      provide GetUniqueId. For now, they are typically larger, so slightly
      less efficient.
      
      Relevant to https://github.com/facebook/rocksdb/issues/7405
      
      This change has a minor regression in PersistentCache functionality:
      metaindex blocks are no longer cached in PersistentCache. Table properties
      blocks already were not, but ideally should be. I didn't spend effort to
      fix & test these issues because we don't believe PersistentCache is used much,
      if at all, and we expect SecondaryCache to replace it. (Though PRs are welcome.)
      
      FIXME: there is more to be fixed for stable cache keys on external SST files
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8659
      
      Test Plan:
      new unit test added, which fails when disabling new
      functionality
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30297705
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e8539a5c8802a79340405629870f2e3fb3822d3a
  9. 16 Aug 2021, 4 commits
  10. 14 Aug 2021, 1 commit
    • Improve MemPurge sampling (#8656) · e51be2c5
      Baptiste Lemaire committed
      Summary:
      Previously, the `MemPurge` sampling function was assessing whether a random entry from a memtable was garbage or not by simply querying the given memtable (see https://github.com/facebook/rocksdb/issues/8628 for more details).
      In this diff, I am updating the sampling function by querying not only the memtable the entry was drawn from, but also all subsequent memtables that have a greater memtable ID.
      I also added the size of the value for KV entries in the payload/useful payload estimates (which was also one of the reasons why sampling was not as good as mempurging all the time in terms of L0 SST files reduction).
      Once these changes were made, I was able to clean obsolete objects and functions from the `MemtableList` struct, and did a bit of cleanup everywhere.
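
      A simplified sketch of the updated garbage test, assuming each memtable is modeled as a map holding only the latest value per key (so deletions and multiple versions within one memtable are ignored). A sampled entry counts as garbage if any memtable with a greater ID contains the same key, and useful payload counts both key and value bytes. All names are hypothetical:

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <map>
      #include <string>
      #include <utility>
      #include <vector>

      // Hypothetical memtable model.
      struct MemtableSketch {
        int id;                                      // larger id = newer memtable
        std::map<std::string, std::string> entries;  // latest value per key
      };

      // An entry sampled from memtable `from_id` is garbage if a newer memtable
      // also holds the key (the newer write supersedes it).
      bool IsGarbage(const std::string& key, int from_id,
                     const std::vector<MemtableSketch>& mems) {
        for (const auto& m : mems) {
          if (m.id > from_id && m.entries.count(key) > 0) return true;
        }
        return false;
      }

      // Sums key + value bytes of sampled (memtable id, key) pairs that are
      // still useful.
      std::size_t UsefulSampledBytes(
          const std::vector<std::pair<int, std::string>>& samples,
          const std::vector<MemtableSketch>& mems) {
        std::size_t bytes = 0;
        for (const auto& s : samples) {
          if (IsGarbage(s.second, s.first, mems)) continue;
          for (const auto& m : mems) {
            if (m.id != s.first) continue;
            auto it = m.entries.find(s.second);
            if (it != m.entries.end()) bytes += it->first.size() + it->second.size();
          }
        }
        return bytes;
      }

      int main() {
        std::vector<MemtableSketch> mems = {{1, {{"a", "old"}, {"b", "xyz"}}},
                                            {2, {{"a", "new"}}}};
        assert(IsGarbage("a", 1, mems));  // overwritten in memtable 2
        assert(!IsGarbage("b", 1, mems));
        assert(!IsGarbage("a", 2, mems));
        // (1,"a") is garbage; (1,"b") and (2,"a") contribute 4 bytes each.
        assert(UsefulSampledBytes({{1, "a"}, {1, "b"}, {2, "a"}}, mems) == 8);
      }
      ```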
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8656
      
      Reviewed By: pdillinger
      
      Differential Revision: D30288583
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 7646a545ec56f4715949daa59ab5eee74540feb3
  11. 13 Aug 2021, 1 commit
    • Code cleanup for trace replayer (#8652) · 74a652a4
      Merlin Mao committed
      Summary:
      - Remove extra `;` in trace_record.h
      - Remove some unnecessary `assert` in trace_record_handler.cc
      - Initialize `env_` after `exec_handler_` in `ReplayerImpl`, so that the db is asserted when the handler is created, before `db->GetEnv()` is called.
      - Update history to include the new `TraceReader::Reset()`
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8652
      
      Reviewed By: ajkr
      
      Differential Revision: D30276872
      
      Pulled By: autopear
      
      fbshipit-source-id: 476ee162e0f241490c6209307448343a5b326b37
  12. 12 Aug 2021, 4 commits
    • Make TraceRecord and Replayer public (#8611) · f58d2767
      Merlin Mao committed
      Summary:
      New public interfaces:
      `TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
      `Replayer`, available in `rocksdb/utilities/replayer.h`.
      
      Users can call `DB::NewDefaultReplayer()` to create a Replayer that replays a trace file automatically or manually.
      
      Unit tests:
      - `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
      - `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611
      
      Reviewed By: ajkr
      
      Differential Revision: D30266329
      
      Pulled By: autopear
      
      fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
    • Re-add retired mempurge flag definitions for legacy-options-file temporary support. (#8650) · a53563d8
      Baptiste Lemaire committed
      Summary:
      Current internal regression tests pass an old option flag, `experimental_allow_mempurge`, to a more recently built DB.
      This flag was retired and removed in a recent PR (https://github.com/facebook/rocksdb/issues/8628), and therefore the following error comes up: `Failed: Invalid argument: Could not find option: : experimental_allow_mempurge`.
      In this PR, I reintroduce the two flags retired in https://github.com/facebook/rocksdb/issues/8628, `experimental_allow_mempurge` and `experimental_mempurge_policy`, in `db_options.cc` and mark them both as `kDeprecated`.
      This is a temporary fix to buy time to find a long-term solution, which will hopefully consist of ignoring unrecognized options prefixed with `experimental_`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8650
      
      Reviewed By: pdillinger
      
      Differential Revision: D30257307
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 35303655fd2dd9789fd9e3c450e9d8009f3c1f54
    • Update and enhance check_format_compatible.sh (#8651) · 6450e9fc
      Peter Dillinger committed
      Summary:
      The last few releases were not added to this test. This
      change fixes that.
      
      This change also fixes the problem of older branches not understanding
      ROCKSDB_NO_FBCODE and referencing compilers no longer supported.
      During the test, build_detect_platform is patched to force no FBCODE
      compiler usage. (We should not need to update old branches perpetually.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8651
      
      Test Plan: local run reproduces regression described in https://github.com/facebook/rocksdb/issues/8650
      
      Reviewed By: jay-zhuang, zhichao-cao
      
      Differential Revision: D30261872
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 02b447d224d7e0eb8613c63185437ded146713bc
    • J
      Add suggestion for btrfs user to disable preallocation (#8646) · 87e23587
      Jay Zhuang committed
      Summary:
      Add a comment for `options.allow_fallocate` noting that space
      preallocated on btrfs is not freed, with a suggestion to disable
      preallocation.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8646
      
      Test Plan: No code change
      
      Reviewed By: ajkr
      
      Differential Revision: D30240050
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 75b7190bc8276ce8d8ac2d0cb9064b386cbf4768
  13. 11 Aug 2021, 2 commits
    • B
      Memtable sampling for mempurge heuristic. (#8628) · e3a96c48
      Baptiste Lemaire committed
      Summary:
      Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
      This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compares this useful payload estimate with the `double experimental_mempurge_threshold` value.
      Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
      At the moment, a `double experimental_mempurge_threshold` value other than 0.0 or `DBL_MAX` is only supported with the `SkipList` memtable representation.
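      A minimal sketch of the decision rule described above follows. It assumes that the estimate is the useful fraction of the sampled bytes and that mempurge is chosen when that estimate falls below the threshold. The exact units RocksDB compares against are not stated here, so `SampledEntry`, `EstimateUsefulFraction` and this `MemPurgeDecider` signature are all hypothetical, and the `SkipList`-only restriction is not modeled.

```cpp
#include <cassert>
#include <cfloat>
#include <cstddef>
#include <vector>

// Hypothetical sampled memtable entry.
struct SampledEntry {
  std::size_t bytes;
  bool useful;  // false if the entry is garbage (e.g. overwritten or deleted)
};

// Estimates the useful share of the memtable payload from a sample.
double EstimateUsefulFraction(const std::vector<SampledEntry>& sample) {
  std::size_t total = 0;
  std::size_t useful = 0;
  for (const SampledEntry& e : sample) {
    total += e.bytes;
    if (e.useful) useful += e.bytes;
  }
  return total == 0 ? 0.0 : static_cast<double>(useful) / total;
}

bool MemPurgeDecider(const std::vector<SampledEntry>& sample,
                     double threshold) {
  // 0.0 (the default) deactivates mempurge entirely.
  if (threshold <= 0.0) return false;
  // Any finite estimate is below DBL_MAX, so DBL_MAX always mempurges.
  return EstimateUsefulFraction(sample) < threshold;
}
```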
      Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
      The unit tests have been adapted to support this new API.
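      The two sampling strategies can be sketched over a plain index range that stands in for the memtable. Drawing a random index plays the role of `SkipList::Iterator::RandomSeek()`. The single pass uses selection sampling (Knuth's Algorithm S), which is a swapped-in technique. The switch-over point of more than half the entries is an assumption, not the heuristic from the PR. `UniqueRandomSample` here is a hypothetical free function and not the `MemTable` method.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <unordered_set>

// Returns `target` distinct indices in [0, num_entries), chosen uniformly
// at random (or all indices if target >= num_entries).
std::unordered_set<std::size_t> UniqueRandomSample(std::size_t num_entries,
                                                   std::size_t target,
                                                   std::mt19937_64& rng) {
  std::unordered_set<std::size_t> picked;
  if (target >= num_entries) {
    for (std::size_t i = 0; i < num_entries; ++i) picked.insert(i);
    return picked;
  }
  if (target > num_entries / 2) {
    // Dense sample: one pass over every entry, keeping entry i with
    // probability needed / remaining (selection sampling).
    std::size_t needed = target;
    for (std::size_t i = 0; i < num_entries && needed > 0; ++i) {
      std::size_t remaining = num_entries - i;
      std::uniform_int_distribution<std::size_t> d(0, remaining - 1);
      if (d(rng) < needed) {
        picked.insert(i);
        --needed;
      }
    }
  } else {
    // Sparse sample: repeated random "seeks", discarding duplicates.
    std::uniform_int_distribution<std::size_t> seek(0, num_entries - 1);
    while (picked.size() < target) picked.insert(seek(rng));
  }
  return picked;
}
```

      Random seeks are cheap when the sample is small, but duplicates become frequent as the target approaches the total, which is why a full pass wins for dense samples.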
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628
      
      Reviewed By: pdillinger
      
      Differential Revision: D30149315
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
    • L
      Attempt to deflake DBTestXactLogIterator.TransactionLogIteratorCorruptedLog (#8627) · f63331eb
      Levi Tamasi committed
      Summary:
      The patch attempts to deflake `DBTestXactLogIterator.TransactionLogIteratorCorruptedLog`
      by disabling file deletions while retrieving the list of WAL files and truncating the first WAL file.
      This is to prevent the `PurgeObsoleteFiles` call triggered by `GetSortedWalFiles` from
      invalidating the result of `GetSortedWalFiles`. The patch also cleans up the test case a bit
      and changes it to use `test::TruncateFile` instead of calling the `truncate` syscall directly.
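      The race and its fix can be illustrated with a toy model. Listing WAL files also triggers a purge of obsolete ones, so the returned list may name files that no longer exist, unless deletions are disabled for the duration. `ToyWalManager` and `ScopedDisableFileDeletions` are hypothetical and are not RocksDB's `DB` API. The RAII guard is simply one idiomatic way to make sure deletions are re-enabled.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model (hypothetical): obsolete WAL files are purged whenever files are
// listed, unless file deletions are currently disabled.
class ToyWalManager {
 public:
  void AddWal(const std::string& name, bool obsolete) { wals_[name] = obsolete; }
  void DisableFileDeletions() { ++disable_count_; }
  void EnableFileDeletions() { --disable_count_; }
  bool Exists(const std::string& name) const { return wals_.count(name) > 0; }

  // Returns the sorted WAL names, then purges, so the returned list can be
  // stale by the time the caller uses it.
  std::vector<std::string> GetSortedWalFiles() {
    std::vector<std::string> names;
    for (const auto& kv : wals_) names.push_back(kv.first);
    PurgeObsoleteFiles();
    return names;
  }

 private:
  void PurgeObsoleteFiles() {
    if (disable_count_ > 0) return;
    for (auto it = wals_.begin(); it != wals_.end();) {
      if (it->second) {
        it = wals_.erase(it);
      } else {
        ++it;
      }
    }
  }

  std::map<std::string, bool> wals_;  // name -> obsolete?
  int disable_count_ = 0;
};

// Keeps deletions disabled for the enclosing scope.
class ScopedDisableFileDeletions {
 public:
  explicit ScopedDisableFileDeletions(ToyWalManager* m) : m_(m) {
    m_->DisableFileDeletions();
  }
  ~ScopedDisableFileDeletions() { m_->EnableFileDeletions(); }
  ScopedDisableFileDeletions(const ScopedDisableFileDeletions&) = delete;
  ScopedDisableFileDeletions& operator=(const ScopedDisableFileDeletions&) =
      delete;

 private:
  ToyWalManager* m_;
};
```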
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8627
      
      Test Plan: `make check`
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D30147002
      
      Pulled By: ltamasi
      
      fbshipit-source-id: db11072a4ad8900a2f859cb5294e22b1888c23f6
  14. 10 Aug 2021, 3 commits
    • A
      Simplify GenericRateLimiter algorithm (#8602) · 82b81dc8
      Andrew Kryczka committed
      Summary:
      `GenericRateLimiter` slow path handles requests that cannot be satisfied
      immediately.  Such requests enter a queue, and their thread stays in `Request()`
      until they are granted or the rate limiter is stopped.  These threads are
      responsible for unblocking themselves.  The work to do so is split into two main
      duties.
      
      (1) Waiting for the next refill time.
      (2) Refilling the bytes and granting requests.
      
      Prior to this PR, the slow path logic involved a leader election algorithm to
      pick one thread to perform (1) followed by (2).  It elected the thread whose
      request was at the front of the highest priority non-empty queue since that
      request was most likely to be granted.  This algorithm was efficient in terms of
      reducing intermediate wakeups, where an intermediate wakeup is a thread waking
      up only to resume waiting after finding its request is not granted.  However,
      the conceptual complexity of this algorithm was too high.  It took me a long
      time to draw a timeline to understand how it works for just one edge case, yet
      there were many such edge cases.
      
      This PR drops the leader election to reduce conceptual complexity.  Now, the two
      duties can be performed by whichever thread acquires the lock first.  The risk
      of this change is an increase in the number of intermediate wakeups; however, we
      took steps to mitigate that.
      
      - `wait_until_refill_pending_` flag ensures only one thread performs (1). This
        prevents the thundering herd problem at the next refill time. The remaining
        threads wait on their condition variable with an unbounded duration; thus, we
        must remember to notify them to ensure forward progress.
      - (1) is typically done by a thread at the front of a queue. This is trivial
        when the queues are initially empty, as the first request that arrives must
        be the only entry in its queue. When queues are initially non-empty, we
        achieve this by having (2) notify a thread at the front of a queue
        (preferring higher priority) to perform the next duty.
      - We do not require any additional wakeup for (2). Typically it will just be
        done by the thread that finished (1).
      
      Combined, the second and third bullet points above suggest the refill/granting
      will typically be done by a request at the front of its queue.  This is
      important because one wakeup is saved when a granted request happens to be in an
      already running thread.
      
      Note that a few cases still lead to intermediate wakeups.  The first two are
      existing issues that also apply to the old algorithm; the third (including both
      subpoints) is new.
      
      - No request may be granted (only possible when the rate limit dynamically
        decreases).
      - Requests from a different queue may be granted.
      - (2) may be run by a non-front request thread, causing it not to be granted
        even if some requests in that same queue are granted. This can happen for a
        couple of (unlikely) reasons.
        - A new request may sneak in and grab the lock at the refill time, before the
          thread finishing (1) can wake up and grab it.
        - A new request may sneak in, grab the lock and execute (1) before (2)'s
          chosen candidate can wake up and grab the lock. Then that non-front request
          thread performing (1) can carry over to perform (2).
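      The leaderless scheme described above can be condensed into a much-simplified sketch. It uses one FIFO queue instead of per-priority queues, makes all-or-nothing grants, and does not carry over unused budget. All of these are simplifications rather than `GenericRateLimiter`'s actual behavior. `SketchRateLimiter` and its members are hypothetical. Only the `wait_until_refill_pending_` idea and the two duties come from the description.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>

// Hypothetical single-priority sketch of a leaderless rate-limiter slow path.
class SketchRateLimiter {
 public:
  using Clock = std::chrono::steady_clock;

  SketchRateLimiter(std::int64_t refill_bytes, Clock::duration refill_period)
      : refill_bytes_(refill_bytes),
        refill_period_(refill_period),
        available_(refill_bytes),
        next_refill_(Clock::now() + refill_period) {
    assert(refill_bytes > 0);
  }

  // Blocks until `bytes` (capped at one refill's worth) has been granted.
  void Request(std::int64_t bytes) {
    bytes = std::min(bytes, refill_bytes_);
    std::unique_lock<std::mutex> lock(mu_);
    // Fast path: nobody is queued and the current budget suffices.
    if (queue_.empty() && available_ >= bytes) {
      available_ -= bytes;
      return;
    }
    Req r;
    r.bytes = bytes;
    queue_.push_back(&r);
    while (!r.granted) {
      const Clock::time_point now = Clock::now();
      if (now >= next_refill_) {
        // Duty (2): whichever thread holds the lock refills and grants.
        RefillAndGrantLocked(now);
      } else if (!wait_until_refill_pending_) {
        // Duty (1): exactly one thread sleeps until the refill time.
        wait_until_refill_pending_ = true;
        const Clock::time_point deadline = next_refill_;
        r.cv.wait_until(lock, deadline);
        wait_until_refill_pending_ = false;
        // If another thread already granted us, pass duty (1) to the new
        // front; otherwise the remaining waiters could all sleep unboundedly.
        if (r.granted && !queue_.empty()) queue_.front()->cv.notify_one();
      } else {
        // Someone else is watching the clock: wait until explicitly notified.
        r.cv.wait(lock);
      }
    }
  }

 private:
  struct Req {
    std::int64_t bytes = 0;
    bool granted = false;
    std::condition_variable cv;
  };

  // Caller holds mu_. Grants queued requests in FIFO order.
  void RefillAndGrantLocked(Clock::time_point now) {
    available_ = refill_bytes_;  // unused budget is not carried over here
    next_refill_ = now + refill_period_;
    while (!queue_.empty() && queue_.front()->bytes <= available_) {
      Req* front = queue_.front();
      queue_.pop_front();
      available_ -= front->bytes;
      front->granted = true;
      front->cv.notify_one();
    }
    // Wake the new front (if any) so it can take over duty (1).
    if (!queue_.empty()) queue_.front()->cv.notify_one();
  }

  const std::int64_t refill_bytes_;
  const Clock::duration refill_period_;
  std::mutex mu_;
  std::deque<Req*> queue_;
  std::int64_t available_;
  Clock::time_point next_refill_;
  bool wait_until_refill_pending_ = false;
};
```

      This sketch adds one extra notification that the description does not spell out. A thread finishing duty (1) may find it was already granted by a thread that "stole" duty (2). In that case it wakes the new queue front, because otherwise every remaining waiter could be parked in an unbounded wait with nobody watching the refill time.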
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8602
      
      Test Plan:
      - Use existing tests. The edge cases listed in the comment are all performance
        related; I could not really think of any related to correctness. The logic
        looks the same whether a thread wakes up/finishes its work early/on-time/late,
        or whether the thread is chosen vs. "steals" the work.
      - Verified write throughput and CPU overhead are basically the same with and
        without this change, even in a rate limiter heavy workload:
      
      Test command:
      ```
      $ rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=fillrandom -num_multi_db=64 -num_low_pri_threads=64 -num_high_pri_threads=64 -write_buffer_size=262144 -target_file_size_base=262144 -max_bytes_for_level_base=1048576 -rate_limiter_bytes_per_sec=16777216 -key_size=24 -value_size=1000 -num=10000 -compression_type=none -rate_limiter_refill_period_us=1000
      ```
      
      Results before this PR:
      
      ```
      fillrandom   :     108.463 micros/op 9219 ops/sec;    9.0 MB/s
      7.40user 8.84system 1:26.20elapsed 18%CPU (0avgtext+0avgdata 256140maxresident)k
      ```
      
      Results after this PR:
      
      ```
      fillrandom   :     108.108 micros/op 9250 ops/sec;    9.0 MB/s
      7.45user 8.23system 1:26.68elapsed 18%CPU (0avgtext+0avgdata 255688maxresident)k
      ```
      
      Reviewed By: hx235
      
      Differential Revision: D30048013
      
      Pulled By: ajkr
      
      fbshipit-source-id: 6741bba9d9dfbccab359806d725105817fef818b
    • L
      rocksdb: don't call LZ4_loadDictHC with null dictionary · a756fb9c
      Lucian Grijincu committed
      Summary: UBSAN revealed a pointer underflow when `LZ4HC_init_internal` is called with a null `start`.
      
      Reviewed By: ajkr
      
      Differential Revision: D30181874
      
      fbshipit-source-id: ca9bbac1a85c58782871d7f153af733b000cc66c
    • J
      Add a unit test for tiered storage universal compaction (#8631) · 61f83dfe
      Jay Zhuang committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8631
      
      Reviewed By: siying
      
      Differential Revision: D30200385
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 0fa2bb15e74ff81762d767f234078e0fe0106c55