1. 07 Jul, 2021 · 1 commit
    • Fix clang_analyzer failure (#8492) · 714ce504
      Committed by Baptiste Lemaire
      Summary:
      Previously, the following command:
      ```USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j$(nproc) analyze```
      was raising an error/warning that `new_mem` could potentially be a `nullptr`. This error appeared due to code changes from https://github.com/facebook/rocksdb/issues/8454, including an if-statement containing "`... && new_mem != nullptr && ...`", which made the analyzer believe that past this `if`-statement, `new_mem == nullptr` was a possible scenario.
      This code patch simply introduces `assert`s and removes this condition in the `if`-statement.
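      A minimal sketch of the pattern (the function and types below are illustrative stand-ins, not the actual RocksDB code): replacing the null check inside the `if` with an `assert` documents the invariant and lets the analyzer rule out the null path.

      ```cpp
      #include <cassert>
      #include <string>

      // Illustrative stand-in for the memtable type.
      struct MemTable { std::string data; };

      // Before the patch: `if (... && new_mem != nullptr && ...)` made the
      // analyzer assume a null path existed after the branch. After the
      // patch, the invariant is asserted instead, so the analyzer knows
      // new_mem cannot be null past this point.
      bool InstallNewMemtable(MemTable* new_mem, bool flush_requested) {
        assert(new_mem != nullptr);  // invariant guaranteed by the caller
        if (flush_requested) {
          new_mem->data.clear();  // safe: no null check needed here
          return true;
        }
        return false;
      }
      ```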
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8492
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D29571275
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 75d72246b70ebbbae7dea11ccb5778686d8bcbea
  2. 02 Jul, 2021 · 1 commit
    • Memtable "MemPurge" prototype (#8454) · 9dc887ec
      Committed by Baptiste Lemaire
      Summary:
      Implement an experimental feature called "MemPurge", which consists of purging "garbage" bytes out of a memtable and reusing the memtable struct instead of making it immutable and eventually flushing its content to storage.
      The prototype is deactivated by default and is not intended for production use; it is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. At this early stage, when the allow_mempurge flag is set to `true`, all flush operations are rerouted to perform a MemPurge. This is a temporary design decision that will give us time to explore meaningful heuristics for using MemPurge at the right time for relevant workloads. Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, and `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time.
      Three unit tests are added to `db_flush_test.cc` to check that MemPurge works correctly (and that the previously mentioned operations are fully supported and thoroughly tested).
      One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.
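      As described, the prototype is gated behind a single option; a sketch of the opt-in (the struct here is a stand-in for `rocksdb::Options`; only the flag name comes from the summary):

      ```cpp
      #include <cassert>

      // Stand-in for rocksdb::Options; only the field name below is taken
      // from the summary, the surrounding struct is illustrative.
      struct Options {
        bool experimental_allow_mempurge = false;  // prototype off by default
      };
      ```

      With the flag set to `true`, every flush is rerouted to a MemPurge, per the temporary design decision above.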
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454
      
      Reviewed By: anand1976
      
      Differential Revision: D29433971
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
  3. 06 May, 2021 · 1 commit
    • Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262) · 8948dc85
      Committed by mrambacher
      Summary:
      The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions.  This change cleans that up by introducing an ImmutableOptions struct.  Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).
      
      Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR.  All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.
      
      Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262
      
      Reviewed By: pdillinger
      
      Differential Revision: D28226540
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
  4. 22 Apr, 2021 · 1 commit
    • Stall writes in WriteBufferManager when memory_usage exceeds buffer_size (#7898) · 596e9008
      Committed by Akanksha Mahajan
      Summary:
      When a WriteBufferManager is shared across DBs and column families
      to keep memory usage under a limit, OOMs have been observed when flushes cannot
      finish but writes keep inserting into memtables.
      To avoid OOMs, when memory usage goes beyond buffer_size_ and a DB tries to write,
      this change stalls incoming writers until flush is completed and memory usage
      drops.
      
      Design: Stall condition: when total memory usage exceeds `WriteBufferManager::buffer_size_`
      (memory_usage() >= buffer_size_), `WriteBufferManager::ShouldStall()` returns true.

      DBImpl first blocks incoming/future writers by calling write_thread_.BeginWriteStall()
      (which adds a dummy stall object to the writer's queue).
      The DB is then blocked in state State::Blocked (the current write doesn't go
      through). The WBMStallInterface object maintained by every DB instance is added to the queue of
      the WriteBufferManager.

      If multiple DBs try to write during this stall, they will also be
      blocked when WriteBufferManager::ShouldStall() returns true.
      
      End stall condition: when flush is finished and memory usage goes down, the stall ends only if the memory
      waiting to be flushed is less than buffer_size/2. This lower limit gives flushes time
      to complete and avoids continuous stalling if memory usage remains close to buffer_size.

      WriteBufferManager::EndWriteStall() is called,
      which removes all instances from its queue and signals them to continue.
      Their state is changed to State::Running and they are unblocked. DBImpl
      then signals all incoming writers of that DB to continue by calling
      write_thread_.EndWriteStall() (which removes the dummy stall object from the
      queue).

      Each DB instance creates a WBMStallInterface, which is an interface to block and
      signal DBs during a stall.
      When a DB needs to be blocked or signalled by the WriteBufferManager, its
      state_for_wbm_ state is changed accordingly (RUNNING or BLOCKED).
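      The stall and end-stall thresholds described above can be sketched as follows (a toy model; the real WriteBufferManager also manages a mutex and the queue of stalled DB instances):

      ```cpp
      #include <cassert>
      #include <cstddef>

      // Hedged sketch of the stall conditions only; names are illustrative.
      class WriteBufferManagerSketch {
       public:
        explicit WriteBufferManagerSketch(size_t buffer_size)
            : buffer_size_(buffer_size), memory_usage_(0) {}

        void ReserveMem(size_t bytes) { memory_usage_ += bytes; }
        void FreeMem(size_t bytes) { memory_usage_ -= bytes; }

        // Stall begins once usage reaches the full buffer size.
        bool ShouldStall() const { return memory_usage_ >= buffer_size_; }

        // Stall ends only after usage drops below half the buffer size,
        // giving flushes headroom and avoiding continuous stall/unstall.
        bool ShouldEndStall() const { return memory_usage_ < buffer_size_ / 2; }

       private:
        size_t buffer_size_;
        size_t memory_usage_;
      };
      ```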
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7898
      
      Test Plan: Added a new test db/db_write_buffer_manager_test.cc
      
      Reviewed By: anand1976
      
      Differential Revision: D26093227
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 2bbd982a3fb7033f6de6153aa92a221249861aae
  5. 08 Apr, 2021 · 1 commit
    • Fix flush reason attribution (#8150) · 48cd7a3a
      Committed by Giuseppe Ottaviano
      Summary:
      Current flush reason attribution is misleading or incorrect (depending on what the original intention was):
      
      - Flush due to WAL reaching its maximum size is attributed to `kWriteBufferManager`
      - Flushes due to full write buffer and write buffer manager are not distinguishable, both are attributed to `kWriteBufferFull`
      
      This changes the first to a new flush reason `kWALFull`, and splits the second between `kWriteBufferManager` and `kWriteBufferFull`.
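      The resulting attribution can be sketched as a small enum (enumerator names follow the summary; the actual flush-reason enum in RocksDB contains more values):

      ```cpp
      // Sketch of the attribution after this change: WAL-full flushes get
      // their own reason instead of being blamed on the write buffer manager.
      enum class FlushReason {
        kWriteBufferFull,     // a single write buffer filled up
        kWriteBufferManager,  // shared memory limit reached
        kWALFull,             // WAL reached its maximum size (new in this PR)
      };
      ```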
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8150
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D27569645
      
      Pulled By: ot
      
      fbshipit-source-id: 7e3c8ca186a6e71976e6b8e937297eebd4b769cc
  6. 19 Mar, 2021 · 1 commit
    • Revamp WriteController (#8064) · e7a60d01
      Committed by Peter Dillinger
      Summary:
      WriteController had a number of issues:
      * It could introduce a delay of 1ms even if the write rate never exceeded the
      configured delayed_write_rate.
      * The DB-wide delayed_write_rate could be exceeded in a number of ways
      with multiple column families:
        * Wiping all pending delay "debts" when another column family joins
        the delay with GetDelayToken().
        * Resetting last_refill_time_ to (now + sleep amount) means each
        column family can write with delayed_write_rate for large writes.
        * Updating bytes_left_ for a partial refill without updating
        last_refill_time_ would essentially give out random bonuses,
        especially to medium-sized writes.
      
      Now the code is much simpler, with these issues fixed. See comments in
      the new code and new (replacement) tests.
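      A hedged sketch of fixed rate accounting in the spirit described above: a single "next free time" advanced at the configured rate, so joining writers get no wiped debts and partial refills hand out no bonus bytes. All names here are illustrative, not the actual WriteController API.

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Toy delay calculator: charges each write forward in time so the
      // long-run rate never exceeds rate_bytes_per_sec.
      class DelaySketch {
       public:
        explicit DelaySketch(uint64_t rate_bytes_per_sec)
            : rate_(rate_bytes_per_sec) {}

        // Returns how many microseconds a write of `bytes` must wait.
        uint64_t GetDelayMicros(uint64_t now_micros, uint64_t bytes) {
          if (next_free_micros_ < now_micros) next_free_micros_ = now_micros;
          uint64_t delay = next_free_micros_ - now_micros;
          // Advance the shared "next free time" at the configured rate;
          // a single time base means no per-CF bonuses or debt wipes.
          next_free_micros_ += bytes * 1000000 / rate_;
          return delay;
        }

       private:
        uint64_t rate_;
        uint64_t next_free_micros_ = 0;
      };
      ```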
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8064
      
      Test Plan: new tests, better than old tests
      
      Reviewed By: mrambacher
      
      Differential Revision: D27064936
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 497c23fe6819340b8f3d440bd634d8a2bc47323f
  7. 15 Mar, 2021 · 1 commit
    • Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033) · 3dff28cf
      Committed by mrambacher
      Summary:
      For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>.  The shared ptr has some performance degradation on certain hardware classes.
      
      For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere.  For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it.  The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.
      
      There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold.  In those cases, the shared pointer was preserved.
      
      Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:
      
      ```
      6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found)
      6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found)
      PR:   readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found)
      ```
      
      (Note that the times for 6.18 are prior to revert of the SystemClock).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033
      
      Reviewed By: pdillinger
      
      Differential Revision: D27014563
      
      Pulled By: mrambacher
      
      fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
  8. 26 Jan, 2021 · 1 commit
    • Add a SystemClock class to capture the time functions of an Env (#7858) · 12f11373
      Committed by mrambacher
      Summary:
      Introduces a SystemClock class to RocksDB and uses it. This class contains the time-related functions of an Env, and these functions can be redirected from the Env to the SystemClock.
      
      Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
      
      There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
      
      Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
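      A minimal sketch of the idea (method names mirror the summary; the real SystemClock interface has more methods and different signatures): time functions live behind a small interface, so a mock clock can be substituted and tested uniformly.

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Illustrative interface capturing an Env's time functions.
      class SystemClock {
       public:
        virtual ~SystemClock() = default;
        virtual uint64_t NowMicros() = 0;
        virtual void SleepForMicroseconds(int micros) = 0;
      };

      // A mock clock: "sleeping" just advances virtual time, which is the
      // kind of unification of Mock Timer behavior the summary suggests.
      class MockClock : public SystemClock {
       public:
        uint64_t NowMicros() override { return now_; }
        void SleepForMicroseconds(int micros) override { now_ += micros; }
       private:
        uint64_t now_ = 0;
      };
      ```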
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
      
      Reviewed By: pdillinger
      
      Differential Revision: D26006406
      
      Pulled By: mrambacher
      
      fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
  9. 23 Dec, 2020 · 2 commits
  10. 19 Dec, 2020 · 1 commit
    • Track WAL obsoletion when updating empty CF's log number (#7781) · fbce7a38
      Committed by Cheng Chang
      Summary:
      In the write path, there is an optimization: when a new WAL is created during SwitchMemtable, we update the internal log number of the empty column families to the new WAL. `FindObsoleteFiles` marks a WAL as obsolete if the WAL's log number is less than `VersionSet::MinLogNumberWithUnflushedData`. After updating the empty column families' internal log number, `VersionSet::MinLogNumberWithUnflushedData` might change, so some WALs might become obsolete to be purged from disk.
      
      For example, consider there are 3 column families: 0, 1, 2:
      1. initially, all the column families' log number is 1;
      2. write some data to cf0, and flush cf0, but the flush is pending;
      3. now a new WAL 2 is created;
      4. write data to cf1 and WAL 2, now cf0's log number is 1, cf1's log number is 2, cf2's log number is 2 (because cf1 and cf2 are empty, so their log numbers will be set to the highest log number);
      5. now cf0's flush hasn't finished, flush cf1, a new WAL 3 is created, and cf1's flush finishes, now cf0's log number is 1, cf1's log number is 3, cf2's log number is 3, since WAL 1 still contains data for the unflushed cf0, no WAL can be deleted from disk;
      6. now cf0's flush finishes, cf0's log number is 2 (because when cf0 was switching memtable, WAL 3 does not exist yet), cf1's log number is 3, cf2's log number is 3, so WAL 1 can be purged from disk now, but WAL 2 still cannot because `MinLogNumberToKeep()` is 2;
      7. write data to cf2 and WAL 3, because cf0 is empty, its log number is updated to 3, so now cf0's log number is 3, cf1's log number is 3, cf2's log number is 3;
      8. now if the background threads want to purge obsolete files from disk, WAL 2 can be purged because `MinLogNumberToKeep()` is 3. But there are only two flush results written to MANIFEST: the first is for flushing cf1, with `MinLogNumberToKeep` 1; the second is for flushing cf0, with `MinLogNumberToKeep` 2. So without this PR, if the DB crashes at this point and tries to recover, `WalSet` will still expect WAL 2 to exist.
      
      When WAL tracking is enabled, we assume WALs will only become obsolete after a flush result is written to MANIFEST in `MemtableList::TryInstallMemtableFlushResults` (or its atomic flush counterpart). The above situation breaks this assumption.
      
      This PR tracks WAL obsoletion if necessary before updating the empty column families' log numbers.
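      The log-number bookkeeping in the walkthrough can be sketched as follows (an illustrative helper, not the actual RocksDB function): the obsoletion threshold is simply the minimum internal log number across all column families, so raising an empty CF's log number can raise the minimum and make older WALs obsolete.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstdint>
      #include <vector>

      // A WAL with log number < this value contains no unflushed data and
      // is eligible for deletion.
      uint64_t MinLogNumberWithUnflushedData(
          const std::vector<uint64_t>& cf_log_numbers) {
        return *std::min_element(cf_log_numbers.begin(), cf_log_numbers.end());
      }
      ```

      Using the walkthrough's numbers: at step 6 the CF log numbers are {2, 3, 3}, so the minimum is 2 and WAL 1 can go; at step 7 they become {3, 3, 3} and WAL 2 also becomes obsolete.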
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7781
      
      Test Plan:
      watch existing tests and stress tests to pass.
      `make -j48 blackbox_crash_test` on devserver
      
      Reviewed By: ltamasi
      
      Differential Revision: D25631695
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: ca7fff967bdb42204b84226063d909893bc0a4ec
  11. 10 Dec, 2020 · 1 commit
    • Add further tests to ASSERT_STATUS_CHECKED (2) (#7698) · 8ff6557e
      Committed by Adam Retter
      Summary:
      Second batch of adding more tests to ASSERT_STATUS_CHECKED.
      
      * external_sst_file_basic_test
      * checkpoint_test
      * db_wal_test
      * db_block_cache_test
      * db_logical_block_size_cache_test
      * db_blob_index_test
      * optimistic_transaction_test
      * transaction_test
      * point_lock_manager_test
      * write_prepared_transaction_test
      * write_unprepared_transaction_test
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7698
      
      Reviewed By: cheng-chang
      
      Differential Revision: D25441664
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 9e78867f32321db5d4833e95eb96c5734526ef00
  12. 08 Dec, 2020 · 1 commit
    • Change ErrorHandler methods to return const Status& (#7539) · db03172d
      Committed by mrambacher
      Summary:
      This change eliminates the need for many of the PermitUncheckedError calls on return from ErrorHandler methods.  The calls are no longer needed because the status is returned as a reference rather than a copy.  Additionally, this means that the originating status (recovery_error_, bg_error_) is not implicitly cleared as a result of calling one of these methods.

      For this class, I do not know whether the proper behavior is to call PermitUncheckedError in the destructor or to clear the checked state when the status is cleared.  I tested it both ways.  Without the code in the destructor, the status needs to be cleared in at least some of the places where it is set to OK.  When running tests, I found no instances where this class was destructed with a non-OK, non-checked Status.
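      A toy model of why returning `const Status&` helps (this Status class is a deliberately simplified stand-in, not RocksDB's): a returned copy is a new object whose "checked" bit must be dealt with separately, while a reference lets callers check the stored bg_error_ itself.

      ```cpp
      #include <cassert>

      // Toy Status with a "must check" bit.
      class Status {
       public:
        bool ok() const {
          checked_ = true;  // inspecting the status marks it checked
          return code_ == 0;
        }
        bool checked() const { return checked_; }
        void SetError() { code_ = 1; checked_ = false; }
       private:
        int code_ = 0;
        mutable bool checked_ = false;
      };

      class ErrorHandlerSketch {
       public:
        const Status& SetBGError() {
          bg_error_.SetError();
          return bg_error_;  // reference to the stored status, not a copy
        }
       private:
        Status bg_error_;
      };
      ```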
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7539
      
      Reviewed By: anand1976
      
      Differential Revision: D25340565
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 1730c035c81a475875ea745226112030ec25136c
  13. 03 Dec, 2020 · 1 commit
    • Fix assertion failure in bg flush (#7362) · e062a719
      Committed by Yanqin Jin
      Summary:
      https://github.com/facebook/rocksdb/issues/7340 reports and reproduces an assertion failure caused by a combination of the following:
      - atomic flush is disabled.
      - a column family can appear multiple times in the flush queue at the same time. This behavior was introduced in release 5.17.
      
      Consequently, it is possible that two flushes race with each other. One bg flush thread flushes all memtables. The other thread calls `FlushMemTableToOutputFile()` afterwards, and hits the assertion error below.
      
      ```
        assert(cfd->imm()->NumNotFlushed() != 0);
        assert(cfd->imm()->IsFlushPending());
      ```
      
      Fix this by reverting the behavior. In non-atomic-flush case, a column family can appear in the flush queue at most once at the same time.
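      The reverted behavior can be sketched with a guard flag (types and names here are illustrative): a column family is enqueued for flush at most once at a time, so two bg flush threads can no longer pick up the same CF.

      ```cpp
      #include <cassert>
      #include <deque>

      struct ColumnFamilySketch {
        bool queued_for_flush;  // true while the CF sits in the flush queue
        int id;
      };

      void ScheduleFlush(std::deque<ColumnFamilySketch*>& queue,
                         ColumnFamilySketch* cfd) {
        if (cfd->queued_for_flush) return;  // already pending: don't enqueue twice
        cfd->queued_for_flush = true;
        queue.push_back(cfd);
      }
      ```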
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7362
      
      Test Plan:
      make check
      Also run stress test successfully for 10 times.
      ```
      make crash_test
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D25172996
      
      Pulled By: riversand963
      
      fbshipit-source-id: f1559b6366cc609e961e3fc83fae548f1fad08ce
  14. 07 Nov, 2020 · 1 commit
    • Track WAL in MANIFEST: LogAndApply WAL events to MANIFEST (#7601) · 1e40696d
      Committed by Cheng Chang
      Summary:
      When a WAL is synced, an edit is written to MANIFEST.
      After flushing memtables, the obsoleted WALs are piggybacked to MANIFEST while writing the new L0 files to MANIFEST.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7601
      
      Test Plan:
      `track_and_verify_wals_in_manifest` is enabled by default for all tests extending `DBBasicTest`, and in db_stress_test.
      Unit test `wal_edit_test`, `version_edit_test`, and `version_set_test` are also updated.
      Watch all tests to pass.
      
      Reviewed By: ltamasi
      
      Differential Revision: D24553957
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: 66a569ff1bdced38e22900bd240b73113906e040
  15. 13 Oct, 2020 · 1 commit
  16. 08 Oct, 2020 · 1 commit
  17. 07 Oct, 2020 · 1 commit
    • Fix StallWrite crash with mixed of slowdown/no_slowdown writes (#7508) · 53089038
      Committed by Jay Zhuang
      Summary:
      `BeginWriteStall()` removes no_slowdown writes from the write
      list and updates `link_newer`, which makes `CreateMissingNewerLinks()`
      think the whole write list already has valid `link_newer` links, so it fails to create links
      for all writers.
      This caused a flaky test and a SegFault in release builds.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7508
      
      Test Plan: Add unittest to reproduce the issue.
      
      Reviewed By: anand1976
      
      Differential Revision: D24126601
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: f8ac5dba653f7ee1b0950296427d4f5f8ee34a06
  18. 03 Oct, 2020 · 3 commits
  19. 02 Oct, 2020 · 1 commit
  20. 30 Sep, 2020 · 1 commit
  21. 17 Sep, 2020 · 1 commit
  22. 15 Sep, 2020 · 1 commit
    • Add a new IOStatus subcode to indicate that writes are fenced off (#7374) · 18a3227b
      Committed by anand76
      Summary:
      In a distributed file system, directory ownership is enforced by fencing
      off the previous owner once they've been preempted by a new owner. This
      PR adds an IOStatus subcode for `StatusCode::IOError` to indicate this.
      Once this error is returned for a file write, the DB is put in read-only
      mode and is not allowed to resume in read-write mode.
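      A hedged sketch of the intended state transition (the enum and struct below are illustrative, not RocksDB's actual IOStatus API): a fenced-off write error makes the read-only state permanent, unlike other background errors.

      ```cpp
      #include <cassert>

      enum class SubCode { kNone, kIOFenced };  // illustrative names

      struct DBStateSketch {
        bool read_only = false;
        bool resumable = true;
        void OnWriteError(SubCode sc) {
          read_only = true;
          // A fenced-off owner must never resume in read-write mode.
          if (sc == SubCode::kIOFenced) resumable = false;
        }
      };
      ```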
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7374
      
      Test Plan: Add new unit tests in ```error_handler_fs_test```
      
      Reviewed By: riversand963
      
      Differential Revision: D23687777
      
      Pulled By: anand1976
      
      fbshipit-source-id: bef948642089dc0af399057864d9a8ca339e8b2f
  23. 22 Aug, 2020 · 1 commit
    • Bug Fix for memtables not trimmed down. (#7296) · 38446126
      Committed by Akanksha Mahajan
      Summary:
      When a memtable is trimmed in MemTableListVersion, the memtable
      is only added to the delete list if it holds
      the last reference. However, it is not the last reference, because one is also held
      by the super version; and the super version will not be switched while the
      delete list is empty. So the memtable is never destroyed and memory
      usage increases beyond write_buffer_size +
      max_write_buffer_size_to_maintain.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7296
      
      Test Plan:
      1.  ./db_bench -benchmarks=randomtransaction
      -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1
      -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000
      --transaction_set_snapshot
      
      Reviewed By: ltamasi
      
      Differential Revision: D23267395
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 3a8d437fe9f4015f851ff84c0e29528aa946b650
  24. 08 Aug, 2020 · 1 commit
  25. 16 Jul, 2020 · 1 commit
    • Auto resume the DB from Retryable IO Error (#6765) · a10f12ed
      Committed by Zhichao Cao
      Summary:
      In the current codebase, in the write path, if a Retryable IO Error happens, SetBGError is called. The Retryable IO Error is converted to a hard error and the DB is put in read-only mode; the user or application needs to resume it. In this PR, if a Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). options.max_bgerror_resume_count controls whether auto resume is enabled (if max_bgerror_resume_count <= 0, auto resume is not enabled). options.bgerror_resume_retry_interval controls the time interval before calling Resume again if the previous resume fails due to a Retryable IO Error. If a non-retryable error happens during resume, auto resume terminates.
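      The retry policy can be sketched as follows (an illustrative model; the real implementation runs in a background thread and also honors `bgerror_resume_retry_interval` between attempts):

      ```cpp
      #include <cassert>

      // Retry Resume() up to max_bgerror_resume_count times.
      // Returns the number of attempts used on success, or -1 if the DB
      // stays read-only. `resume` returns true when a resume succeeds.
      int AutoResume(int max_bgerror_resume_count, bool (*resume)(int attempt)) {
        for (int i = 0; i < max_bgerror_resume_count; ++i) {
          if (resume(i)) return i + 1;  // resumed after i+1 attempts
        }
        return -1;  // gave up (or auto resume disabled if count <= 0)
      }
      ```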
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765
      
      Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check
      
      Reviewed By: anand1976
      
      Differential Revision: D21916789
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
  26. 03 Jul, 2020 · 1 commit
  27. 29 May, 2020 · 1 commit
    • Add timestamp to delete (#6253) · 961c7590
      Committed by Yanqin Jin
      Summary:
      Preliminary user-timestamp support for delete.
      
      If ["a", ts=100] exists, you can delete it by calling `DB::Delete(write_options, key)` in which `write_options.timestamp` points to a `ts` higher than 100.
      
      Implementation
      A new ValueType, i.e. `kTypeDeletionWithTimestamp` is added for deletion marker with timestamp.
      The reason for a separate `kTypeDeletionWithTimestamp`: RocksDB may drop tombstones (keys with kTypeDeletion) when compacting them to the bottom level. This is OK and useful when timestamps are disabled. When timestamps are enabled, if we were to reuse `kTypeDeletion`, we might drop a tombstone that has a more recent timestamp, causing deleted keys to re-appear.
      
      Test plan (dev server)
      ```
      make check
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6253
      
      Reviewed By: ltamasi
      
      Differential Revision: D20995328
      
      Pulled By: riversand963
      
      fbshipit-source-id: a9e5c22968ad76f98e3dc6ee0151265a3f0df619
  28. 28 Apr, 2020 · 1 commit
  29. 28 Mar, 2020 · 1 commit
    • Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487) · 42468881
      Committed by Zhichao Cao
      Summary:
      In the current code base, we use Status to get and store the status returned from a call. Specifically, for IO-related functions, the current Status cannot reflect IO error details such as error scope, the error's retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower levels of the write path and converted to Status.

      The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify Retryable IO Errors as HardError and set bg_error_ to HardError. In this case, the DB instance becomes read-only. The user is informed via the Status and needs to take action to deal with it (e.g., call db->Resume()).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
      
      Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
      
      Reviewed By: anand1976
      
      Differential Revision: D20685017
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
  30. 03 Mar, 2020 · 1 commit
  31. 21 Feb, 2020 · 1 commit
    • Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      Committed by sdong
      Summary:
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide users a tool to solve the problem, the RocksDB namespace is changed to a flag that can be overridden at build time.
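      A sketch of how such a build-time namespace override typically looks (the default macro value matches the summary; the `answer` function is purely illustrative). Building with, say, `-DROCKSDB_NAMESPACE=myrocks` would place all symbols in a distinct namespace, so two differently-built copies can coexist in one process:

      ```cpp
      #include <cassert>

      // The namespace token becomes a macro with an overridable default.
      #ifndef ROCKSDB_NAMESPACE
      #define ROCKSDB_NAMESPACE rocksdb
      #endif

      namespace ROCKSDB_NAMESPACE {
      inline int answer() { return 42; }  // illustrative symbol
      }  // namespace ROCKSDB_NAMESPACE
      ```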
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
  32. 19 Feb, 2020 · 1 commit
    • Fix concurrent full purge and WAL recycling (#5900) · c6abe30e
      Committed by Andrew Kryczka
      Summary:
      We were removing the file from `log_recycle_files_` before renaming it
      with `ReuseWritableFile()`. Since `ReuseWritableFile()` occurs outside
      the DB mutex, it was possible for a concurrent full purge to sneak in
      and delete the file before it could be renamed. Consequently, `SwitchMemtable()`
      would fail and the DB would enter read-only mode.
      
      The fix is to hold the old file number in `log_recycle_files_` until
      after the file has been renamed. Full purge uses that list to decide
      which files to keep, so it can no longer delete a file pending recycling.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5900
      
      Test Plan: new unit test
      
      Differential Revision: D19771719
      
      Pulled By: ajkr
      
      fbshipit-source-id: 094346349ca3fb499712e62de03905acc30b5ce8
  33. 11 Feb, 2020 · 1 commit
  34. 28 Jan, 2020 · 1 commit
  35. 14 Dec, 2019 · 1 commit
    • Do not create/install new SuperVersion if nothing was deleted during memtable trim (#6169) · 6d54eb3d
      Committed by Levi Tamasi
      Summary:
      We have observed an increase in CPU load caused by frequent calls to
      `ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory`
      when using `max_write_buffer_size_to_maintain` to limit the amount of
      memtable history maintained for transaction conflict checking. As it turns out,
      this is caused by the code creating and installing a new `SuperVersion` even if
      no memtables were actually trimmed. The patch adds a check to avoid this.
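      The fix can be sketched as follows (illustrative types; the real code trims MemTableListVersion history and installs a SuperVersion under the DB mutex): report whether the trim actually removed anything, and only then pay for a new SuperVersion.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <vector>

      struct MemTableSketch { size_t size; };

      // Trim history until it fits under max_bytes. Returns true only if
      // something was actually removed -- the condition guarding the
      // (expensive) SuperVersion install.
      bool TrimHistory(std::vector<MemTableSketch>& history, size_t max_bytes) {
        size_t total = 0;
        for (const auto& m : history) total += m.size;
        bool trimmed = false;
        while (total > max_bytes && !history.empty()) {
          total -= history.back().size;
          history.pop_back();
          trimmed = true;
        }
        return trimmed;  // install a new SuperVersion only if true
      }
      ```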
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6169
      
      Test Plan:
      Compared `perf` output for
      
      ```
      ./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32
      ```
      
      before and after the change. With the fix, the call chain `rocksdb::DBImpl::TrimMemtableHistory` ->
      `rocksdb::ColumnFamilyData::InstallSuperVersion` -> `rocksdb::ThreadLocalPtr::StaticMeta::Scrape`
      no longer registers in the `perf` report.
      
      Differential Revision: D19031509
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 02686fce594e5b50eba0710e4b28a9b808c8aa20
  36. 13 Dec, 2019 · 2 commits
    • Support concurrent CF iteration and drop (#6147) · c2029f97
      Committed by Jermy Li
      Summary:
      It is easy to cause a core dump when closing a ColumnFamilyHandle with unreleased iterators, especially when iterator release is controlled by the Java GC when using JNI.

      This patch fixes concurrent CF iteration and drop: we let iterators (actually the SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early.

      Fixes https://github.com/facebook/rocksdb/issues/5982
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6147
      
      Differential Revision: D18926378
      
      fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15
    • wait pending memtable writes on file ingestion or compact range (#6113) · a8445912
      Committed by Connor
      Summary:
      This PR fixes two unordered_write related issues:
      - the ingestion job may skip the necessary memtable flush https://github.com/facebook/rocksdb/issues/6026
      - compact range may cause the memtable to be flushed before pending unordered writes have finished:
          1. `CompactRange` triggers a memtable flush but doesn't wait for pending writes
          2. there are some pending writes, but the memtable is already flushed
          3. the WAL related to the memtable is removed (note that the pending writes were recorded in that WAL)
          4. the pending writes go to a newly created memtable
          5. there is a restart
          6. the previous pending writes are lost, because the WAL was removed but they weren't included in an SST
      
      **How to solve:**
      - Wait for pending memtable writes before the ingestion job checks the memtable key range.
      - Wait for pending memtable writes before flushing the memtable.

      **Note that `CompactRange` calls `RangesOverlapWithMemtables` too without waiting for pending writes, but I'm not sure whether it affects correctness.**
      
      **Test Plan:**
      make check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6113
      
      Differential Revision: D18895674
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: da22b4476fc7e06c176020e7cc171eb78189ecaf