1. 19 Dec, 2018 1 commit
  2. 29 Nov, 2018 1 commit
    • A
      Clean up FragmentedRangeTombstoneList (#4692) · 8fe1e06c
      Abhishek Madan committed
      Summary:
      Removed `one_time_use` flag, which removed the need for some
      tests, and changed all `NewRangeTombstoneIterator` methods to return
      `FragmentedRangeTombstoneIterators`.
      
      These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
      and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692
      
      Differential Revision: D13106570
      
      Pulled By: abhimadan
      
      fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
      8fe1e06c
  3. 22 Nov, 2018 1 commit
    • A
      Introduce RangeDelAggregatorV2 (#4649) · 457f77b9
      Abhishek Madan committed
      Summary:
      The old RangeDelAggregator did expensive pre-processing work
      to create a collapsed, binary-searchable representation of range
      tombstones. With FragmentedRangeTombstoneIterator, much of this work is
      now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
      in each iterator to find a covering tombstone in ShouldDelete, while
      doing minimal work in AddTombstones. The old RangeDelAggregator is still
      used during flush/compaction for now, though RangeDelAggregatorV2 will
      support those uses in a future PR.
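
The seek-based lookup described above can be illustrated with a minimal, self-contained sketch. It is not the RocksDB implementation: `TombstoneFragment`, `CoveredBy`, and `SimpleRangeDelAggregator` are hypothetical names, and each fragment is assumed to carry a single sequence number, which is a simplification.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified fragment: covers user keys in [start, end) and
// hides entries whose sequence number is lower than `seq`.
struct TombstoneFragment {
  std::string start;
  std::string end;
  uint64_t seq;
};

// Binary search for the single fragment that can contain `key`. Requires
// `fragments` to be sorted by `start` and non-overlapping.
inline bool CoveredBy(const std::vector<TombstoneFragment>& fragments,
                      const std::string& key, uint64_t key_seq) {
  auto it = std::upper_bound(
      fragments.begin(), fragments.end(), key,
      [](const std::string& k, const TombstoneFragment& f) {
        return k < f.start;
      });
  if (it == fragments.begin()) return false;  // key sorts before every fragment
  --it;  // last fragment whose start is <= key
  return key < it->end && key_seq < it->seq;
}

class SimpleRangeDelAggregator {
 public:
  // Cheap: the list is stored as-is, with no collapsing across lists.
  void AddTombstones(std::vector<TombstoneFragment> fragments) {
    assert(std::is_sorted(fragments.begin(), fragments.end(),
                          [](const TombstoneFragment& a,
                             const TombstoneFragment& b) {
                            return a.start < b.start;
                          }));
    lists_.push_back(std::move(fragments));
  }

  // One seek per stored list.
  bool ShouldDelete(const std::string& key, uint64_t key_seq) const {
    for (const auto& list : lists_) {
      if (CoveredBy(list, key, key_seq)) return true;
    }
    return false;
  }

 private:
  std::vector<std::vector<TombstoneFragment>> lists_;
};
```

The design trade-off in this sketch is that adding tombstones costs almost nothing, while each `ShouldDelete` call pays one binary search per stored list.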
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649
      
      Differential Revision: D13146964
      
      Pulled By: abhimadan
      
      fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
      457f77b9
  4. 13 Nov, 2018 1 commit
    • Y
      Remove redundant member var and set options (#4631) · 05dec0c7
      Yanqin Jin committed
      Summary:
      In the past, both `DBImpl::atomic_flush_` and
      `DBImpl::immutable_db_options_.atomic_flush` existed. However, we failed to set
      `immutable_db_options_.atomic_flush`, and instead used `DBImpl::atomic_flush_`, which
      was set correctly. This did not lead to incorrect behavior, but it duplicated
      information.
      
      Since `immutable_db_options_` is always there and has `atomic_flush`, we should
      use it as the source of truth and remove `DBImpl::atomic_flush_`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4631
      
      Differential Revision: D12928371
      
      Pulled By: riversand963
      
      fbshipit-source-id: f85a811959d3828aad4a3a1b05f71facf19c636d
      05dec0c7
  5. 10 Nov, 2018 1 commit
    • S
      Update all unique/shared_ptr instances to be qualified with namespace std (#4638) · dc352807
      Sagar Vemuri committed
      Summary:
      Ran the following commands to recursively change all the files under RocksDB:
      ```
      find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
      ```
      Running `make format` updated some formatting on the files touched.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
      
      Differential Revision: D12934992
      
      Pulled By: sagar0
      
      fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
      dc352807
  6. 06 Nov, 2018 1 commit
    • A
      Add DB property for SST files kept from deletion (#4618) · fffac43c
      Andrew Kryczka committed
      Summary:
      This property can help debug why SST files aren't being deleted. Previously we only had the property "rocksdb.is-file-deletions-enabled". However, even when that returned true, obsolete SSTs may still not be deleted due to the coarse-grained mechanism we use to prevent newly created SSTs from being accidentally deleted. That coarse-grained mechanism uses a lower bound file number for SSTs that should not be deleted, and this property exposes that lower bound.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4618
      
      Differential Revision: D12898179
      
      Pulled By: ajkr
      
      fbshipit-source-id: fe68acc041ddbcc9276bbd48976524d95aafc776
      fffac43c
  7. 27 Oct, 2018 1 commit
  8. 16 Oct, 2018 1 commit
  9. 11 Oct, 2018 1 commit
    • P
      support OnCompactionBegin (#4431) · 09814f2c
      Peter Pei committed
      Summary:
      fix #4288
      
      Add `OnCompactionBegin` support to `rocksdb::EventListener`.
      
      Currently, we only have these three callbacks:
      
      - OnFlushBegin
      - OnFlushCompleted
      - OnCompactionCompleted
      
      As paolococchi requested in #4288, and ajkr agreed, we should also support `OnCompactionBegin`.
      
      This PR attempts to implement support for `OnCompactionBegin`.
      
      Hope it is useful to you.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431
      
      Differential Revision: D10055515
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334
      09814f2c
  10. 10 Oct, 2018 1 commit
    • A
      Handle mixed slowdown/no_slowdown writer properly (#4475) · 854a4be0
      Anand Ananthabhotla committed
      Summary:
      There is a bug when the write queue leader is blocked on a write
      delay/stop, and the queue has writers with WriteOptions::no_slowdown set
      to true. They are not woken up until the write stall is cleared.
      
      The fix introduces a dummy writer inserted at the tail to indicate a
      write stall and prevent further inserts into the queue, and a condition
      variable that writers who can tolerate slowdown wait on before adding
      themselves to the queue. The leader calls WriteThread::BeginWriteStall()
      to add the dummy writer and then walk the queue to fail any writers with
      no_slowdown set. Once the stall clears, the leader calls
      WriteThread::EndWriteStall() to remove the dummy writer and signal the
      condition variable.
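
The queue handling described above can be modeled in a single-threaded sketch. Everything here is an illustrative assumption rather than the actual `WriteThread` code: the `stalled_` flag stands in for the dummy writer at the tail, and the `parked_` list stands in for writers blocked on the condition variable.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical writer record; `failed` is set when the writer is rejected
// because of an ongoing stall.
struct Writer {
  std::string name;
  bool no_slowdown;
  bool failed;
};

class StallAwareQueue {
 public:
  // Returns true if the writer joined the queue.
  bool Enqueue(Writer* w) {
    assert(w != nullptr);
    if (stalled_) {
      if (w->no_slowdown) {
        w->failed = true;      // cannot tolerate the stall: fail immediately
      } else {
        parked_.push_back(w);  // would wait on the condition variable
      }
      return false;
    }
    queue_.push_back(w);
    return true;
  }

  // Called by the leader when it hits a write delay/stop: mark the stall and
  // walk the queue, failing every writer that set no_slowdown.
  void BeginWriteStall() {
    stalled_ = true;
    std::deque<Writer*> kept;
    for (Writer* w : queue_) {
      if (w->no_slowdown) {
        w->failed = true;
      } else {
        kept.push_back(w);
      }
    }
    queue_.swap(kept);
  }

  // Called once the stall clears: lift the marker and admit parked writers.
  void EndWriteStall() {
    stalled_ = false;
    for (Writer* w : parked_) queue_.push_back(w);
    parked_.clear();
  }

  std::size_t QueueSize() const { return queue_.size(); }

 private:
  bool stalled_ = false;
  std::deque<Writer*> queue_;
  std::deque<Writer*> parked_;
};
```

In this model a `no_slowdown` writer never waits: it is failed either during the queue walk or at enqueue time while the stall is active.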
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4475
      
      Differential Revision: D10285827
      
      Pulled By: anand1976
      
      fbshipit-source-id: 747465e5e7f07a829b1fb0bc1afcd7b93f4ab1a9
      854a4be0
  11. 09 Oct, 2018 2 commits
    • Z
      move dump stats to a separate thread (#4382) · cac87fcf
      Zhongyi Xie committed
      Summary:
      Currently, statistics are supposed to be dumped to the info log at intervals of `options.stats_dump_period_sec`. However, the implementation ties the dump to the compaction thread, so if the database has been serving very light traffic, the stats may not get dumped at all.
      We decided to separate stats dumping into a new timed thread using `TimerQueue`, which is already used in blob_db. This will allow us to schedule new timed tasks with more deterministic behavior.
      
      Tested with db_bench using `--stats_dump_period_sec=20` in command line:
      > LOG:2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      LOG:2018/09/17-14:08:05.643286 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      LOG:2018/09/17-14:08:25.691325 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      LOG:2018/09/17-14:08:45.740989 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      
      LOG content:
      > 2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      2018/09/17-14:07:45.575080 7fe99fbfe700 [WARN] [db/db_impl.cc:606]
      ** DB Stats **
      Uptime(secs): 20.0 total, 20.0 interval
      Cumulative writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5.57 GB, 285.01 MB/s
      Cumulative WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 GB, 285.01 MB/s
      Cumulative stall: 00:00:0.012 H:M:S, 0.1 percent
      Interval writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5700.71 MB, 285.01 MB/s
      Interval WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 MB, 285.01 MB/s
      Interval stall: 00:00:0.012 H:M:S, 0.1 percent
      ** Compaction Stats [default] **
      Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4382
      
      Differential Revision: D9933051
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 6d12bb1e4977674eea4bf2d2ac6d486b814bb2fa
      cac87fcf
    • D
      Fix DBImpl::GetColumnFamilyHandleUnlocked race condition (#4391) · 27090ae8
      DorianZheng committed
      Summary:
      - Fix DBImpl API race condition
      
      The timeline of the execution flow is as follows:
      ```
      timeline              user_thread1                      user_thread2
      t1   |     cfh = GetColumnFamilyHandleUnlocked(0)
      t2   |     id1 = cfh->GetID()
      t3   |                                                GetColumnFamilyHandleUnlocked(1)
      t4   |     id2 = cfh->GetID()
           V
      ```
      The original implementation returns a pointer to a stateful variable, so the returned `ColumnFamilyHandle` changes when another thread calls `GetColumnFamilyHandleUnlocked` with a different column family ID.
      
      - Expose ColumnFamily ID to compaction event listener
      
      - Fix the return status of `DBImpl::GetLatestSequenceForKey`
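
The aliasing problem behind the first item can be reproduced without threads. The sketch below uses hypothetical `Db` and `Handle` types and a made-up `kNumColumnFamilies` bound; it only illustrates why returning a pointer to shared mutable state is unsafe, and shows one possible remedy (a per-call owned object), not necessarily the exact change made in this commit.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

constexpr uint32_t kNumColumnFamilies = 4;  // hypothetical fixed count

struct Handle {
  uint32_t id;
  uint32_t GetID() const { return id; }
};

class Db {
 public:
  // Problematic shape: every call overwrites and returns the same member, so
  // an earlier caller's pointer silently starts describing a different
  // column family.
  Handle* GetHandleShared(uint32_t id) {
    assert(id < kNumColumnFamilies);
    shared_.id = id;
    return &shared_;
  }

  // One possible remedy: each caller owns an independent object.
  std::unique_ptr<Handle> GetHandleOwned(uint32_t id) const {
    assert(id < kNumColumnFamilies);
    return std::unique_ptr<Handle>(new Handle{id});
  }

 private:
  Handle shared_{0};
};
```

Calling `GetHandleShared(0)`, then `GetHandleShared(1)`, and then reading the first pointer's `GetID()` follows the t1 to t4 timeline above in a single thread.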
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4391
      
      Differential Revision: D10221243
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: dec60ee9ff0c8261a2f2413a8506ec1063991993
      27090ae8
  12. 27 Sep, 2018 1 commit
    • Y
      Improve log handling when recover without flush (#4405) · dc813e4b
      Yi Wu committed
      Summary:
      Improve log handling when avoid_flush_during_recovery=true.
      1. Restore total_log_size_ after recovery by summing up existing log sizes. Fixes #4253.
      2. Truncate the last existing log, since this log can contain preallocated space that would be wasteful to keep. This prevents a crash loop in the user application from creating many logs of non-trivial size that ultimately take up all disk space.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4405
      
      Differential Revision: D9953933
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 967780fee8acec7f358b6eb65190fb4684f82e56
      dc813e4b
  13. 18 Sep, 2018 1 commit
  14. 16 Sep, 2018 1 commit
    • A
      Auto recovery from out of space errors (#4164) · a27fce40
      Anand Ananthabhotla committed
      Summary:
      This commit implements automatic recovery from a Status::NoSpace() error
      during background operations such as write callback, flush and
      compaction. The broad design is as follows -
      1. Compaction errors are treated as soft errors and don't put the
      database in read-only mode. A compaction is delayed until enough free
      disk space is available to accommodate the compaction outputs, which is
      estimated based on the input size. This means that users can continue to
      write, and we rely on the WriteController to delay or stop writes if the
      compaction debt becomes too high due to persistent low disk space
      condition
      2. Errors during write callback and flush are treated as hard errors,
      i.e. the database is put in read-only mode and goes back to read-write
      only after certain recovery actions are taken.
      3. Both types of recovery rely on the SstFileManagerImpl to poll for
      sufficient disk space. We assume that there is a 1-1 mapping between an
      SFM and the underlying OS storage container. For cases where multiple
      DBs are hosted on a single storage container, the user is expected to
      allocate a single SFM instance and use the same one for all the DBs. If
      no SFM is specified by the user, DBImpl::Open() will allocate one, but
      this will be one per DB and each DB will recover independently. The
      recovery implemented by SFM is as follows -
        a) On the first occurrence of an out of space error during compaction,
        subsequent
        compactions will be delayed until the disk free space check indicates
        enough available space. The required space is computed as the sum of
        input sizes.
        b) The free space check requirement will be removed once the amount of
        free space is greater than the size reserved by in progress
        compactions when the first error occurred
        c) If the out of space error is a hard error, a background thread in
        SFM will poll for sufficient headroom before triggering the recovery
        of the database and putting it in read-write mode. The headroom is
        calculated as the sum of the write_buffer_size of all the DB instances
        associated with the SFM
      4. EventListener callbacks will be called at the start and completion of
      automatic recovery. Users can disable the auto recovery in the start
      callback, and later initiate it manually by calling DB::Resume()
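
The two space checks in the design above reduce to simple sums. The following is a minimal sketch with hypothetical function names; it assumes that "enough" means free space greater than or equal to the requirement, which the summary does not spell out.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

inline uint64_t SumBytes(const std::vector<uint64_t>& sizes) {
  return std::accumulate(sizes.begin(), sizes.end(), uint64_t{0});
}

// Soft-error path: the compaction output size is estimated as the sum of the
// input file sizes, and the compaction is delayed until that much is free.
inline bool EnoughRoomForCompactionSketch(
    uint64_t free_bytes, const std::vector<uint64_t>& input_file_sizes) {
  assert(!input_file_sizes.empty());  // a compaction has at least one input
  return free_bytes >= SumBytes(input_file_sizes);
}

// Hard-error path: the headroom needed before recovery is the sum of
// write_buffer_size over all DB instances that share the SFM.
inline bool EnoughHeadroomForRecoverySketch(
    uint64_t free_bytes, const std::vector<uint64_t>& write_buffer_sizes) {
  assert(!write_buffer_sizes.empty());  // at least one DB uses the SFM
  return free_bytes >= SumBytes(write_buffer_sizes);
}
```

For example, with 100 MB free, a compaction over inputs of 64 MB and 48 MB would be delayed, while recovery for two DBs with 32 MB write buffers each could proceed.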
      
      Todo:
      1. More extensive testing
      2. Add disk full condition to db_stress (follow-on PR)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
      
      Differential Revision: D9846378
      
      Pulled By: anand1976
      
      fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
      a27fce40
  15. 30 Aug, 2018 1 commit
    • M
      Avoiding write stall caused by manual flushes (#4297) · 927f2749
      Mikhail Antonov committed
      Summary:
      At the moment it seems possible to cause a write stall by calling flush (either manually via DB::Flush(), or from Backup Engine directly calling FlushMemTable()) while a background flush may already be happening.
      
      One way to fix this: DBImpl::CompactRange() already checks for a possible stall and delays the flush if needed before it proceeds to call FlushMemTable(). We can simply move this delay logic to a separate method and call it from FlushMemTable().
      
      This is a draft patch for a first look; need to check tests/update SyncPoints and most certainly would need to add allow_write_stall method to FlushOptions().
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297
      
      Differential Revision: D9420705
      
      Pulled By: mikhail-antonov
      
      fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc
      927f2749
  16. 25 Aug, 2018 1 commit
    • Y
      Refactor flush request queueing and processing (#3952) · 7daae512
      Yanqin Jin committed
      Summary:
      RocksDB currently queues individual column families for flushing. This is not sufficient to support the needs of some applications that want to enforce order/dependency between column families, given that multiple foreground and background activities can trigger flushing in RocksDB.
      
      This PR aims to address this limitation. Each flush request is described as a `FlushRequest` that can contain multiple column families. A background flushing thread pops one flush request from the queue at a time and processes it.
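
A rough sketch of the queueing model described above follows. The shape of `FlushRequest` here (column family name paired with a memtable ID) and the `FlushQueue` class are assumptions made for illustration; the real types operate on internal column family objects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical request shape: one entry per column family, pairing its name
// with the largest memtable ID that should be flushed.
using FlushRequest = std::vector<std::pair<std::string, uint64_t>>;

class FlushQueue {
 public:
  void Schedule(FlushRequest req) {
    assert(!req.empty());  // a request names at least one column family
    queue_.push_back(std::move(req));
  }

  // The background thread takes one whole request at a time, so all column
  // families in it are handled together.
  bool PopFirst(FlushRequest* out) {
    assert(out != nullptr);
    if (queue_.empty()) return false;
    *out = std::move(queue_.front());
    queue_.pop_front();
    return true;
  }

  std::size_t Size() const { return queue_.size(); }

 private:
  std::deque<FlushRequest> queue_;
};
```

Because a request is popped as a unit, a caller that needs two column families flushed together can express that by placing both in one `FlushRequest` instead of scheduling them separately.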
      
      This PR does not enable atomic_flush yet, but is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3952
      
      Differential Revision: D8529933
      
      Pulled By: riversand963
      
      fbshipit-source-id: 78908a21e389a3a3f7de2a79bae0cd13af5f3539
      7daae512
  17. 24 Aug, 2018 1 commit
  18. 11 Aug, 2018 1 commit
  19. 01 Aug, 2018 1 commit
    • S
      Trace and Replay for RocksDB (#3837) · 12b6cdee
      Sagar Vemuri committed
      Summary:
      A framework for tracing and replaying RocksDB operations.
      
      A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench.
      
      - Column-families are supported
      - Multi-threaded tracing is supported.
      - TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it.
      - This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations.
      
      Currently supported DB operations:
      - Writes:
      -- Put
      -- Merge
      -- Delete
      -- SingleDelete
      -- DeleteRange
      -- Write
      - Reads:
      -- Get (point lookups)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837
      
      Differential Revision: D7974837
      
      Pulled By: sagar0
      
      fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72
      12b6cdee
  20. 24 Jul, 2018 1 commit
    • M
      WriteUnPrepared: Implement unprepared batches for transactions (#4104) · ea212e53
      Manuel Ung committed
      Summary:
      This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch.
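
The check-before-modify pattern in the paragraph above can be sketched as follows. `SizeBoundedTxn` is a hypothetical class: it tracks only byte counts, and "writing an unprepared batch" is modeled as incrementing a counter and resetting the size.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

class SizeBoundedTxn {
 public:
  explicit SizeBoundedTxn(std::size_t max_write_batch_size)
      : max_write_batch_size_(max_write_batch_size) {
    assert(max_write_batch_size > 0);
  }

  // Each modifying method first checks whether the current batch has
  // exceeded the threshold, and only then appends the new operation.
  void Put(const std::string& key, const std::string& value) {
    MaybeFlushUnpreparedBatch();
    batch_bytes_ += key.size() + value.size();
  }

  void Delete(const std::string& key) {
    MaybeFlushUnpreparedBatch();
    batch_bytes_ += key.size();
  }

  std::size_t batch_bytes() const { return batch_bytes_; }
  std::size_t unprepared_batches() const { return unprepared_batches_; }

 private:
  void MaybeFlushUnpreparedBatch() {
    if (batch_bytes_ > max_write_batch_size_) {
      ++unprepared_batches_;  // stands in for writing the batch to the DB
      batch_bytes_ = 0;
    }
  }

  std::size_t max_write_batch_size_;
  std::size_t batch_bytes_ = 0;
  std::size_t unprepared_batches_ = 0;
};
```

Since the check runs before the append, the in-memory batch can briefly exceed the threshold by the size of one operation; it is written out on the next modifying call.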
      
      Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`.
      
      A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104
      
      Differential Revision: D8785717
      
      Pulled By: lth
      
      fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91
      ea212e53
  21. 07 Jul, 2018 1 commit
    • M
      WriteUnPrepared: Add support for recovering WriteUnprepared transactions (#4078) · b9846370
      Manuel Ung committed
      Summary:
      This adds support for recovering WriteUnprepared transactions through the following changes:
      - The information in `RecoveredTransaction` is extended so that it can reference multiple batches.
      - `MarkBeginPrepare` is extended with a bool indicating whether it is an unprepared begin, and this is passed down to `InsertRecoveredTransaction` to indicate whether the current transaction is prepared or not.
      - `WriteUnpreparedTxnDB::Initialize` is overridden so that it will rollback unprepared transactions from the recovered transactions. This can be done without updating the prepare heap/commit map, because this is before the DB has finished initializing, and after writing the rollback batch, those data structures should not contain information about the rolled back transaction anyway.
      
      Commit/Rollback of live transactions is still unimplemented and will come later.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4078
      
      Differential Revision: D8703382
      
      Pulled By: lth
      
      fbshipit-source-id: 7e0aada6c23bd39299f1f20d6c060492e0e6b60a
      b9846370
  22. 29 Jun, 2018 2 commits
    • M
      WriteUnPrepared: Add new WAL marker kTypeBeginUnprepareXID (#4069) · 8ad63a4b
      Manuel Ung committed
      Summary:
      This adds a new WAL marker of type kTypeBeginUnprepareXID.
      
      Also, DBImpl now contains a field called batch_per_txn (meaning one WriteBatch per transaction, or possibly multiple WriteBatches). This would also indicate that this DB is using WriteUnprepared policy.
      
      Recovery code would be able to make use of this extra field on DBImpl in a separate diff. For now, it is just used to determine whether the WAL is compatible or not.
      Closes https://github.com/facebook/rocksdb/pull/4069
      
      Differential Revision: D8675099
      
      Pulled By: lth
      
      fbshipit-source-id: ca27cae1738e46d65f2bb92860fc759deb874749
      8ad63a4b
    • A
      Allow DB resume after background errors (#3997) · 52d4c9b7
      Anand Ananthabhotla committed
      Summary:
      Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
      1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
      2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
      3. Provide an API for the user to clear the error and resume the DB instance
      
      This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
      Closes https://github.com/facebook/rocksdb/pull/3997
      
      Differential Revision: D8653831
      
      Pulled By: anand1976
      
      fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
      52d4c9b7
  23. 16 Jun, 2018 1 commit
  24. 15 May, 2018 1 commit
  25. 04 May, 2018 1 commit
    • S
      Skip deleted WALs during recovery · d5954929
      Siying Dong committed
      Summary:
      This patch records the min log number to keep in the manifest while flushing SST files, so that any WAL older than it is ignored during recovery. This avoids scenarios where there is a gap between the WAL files fed to the recovery procedure. The gap could happen by, for example, out-of-order WAL deletion. Such a gap could cause problems in 2PC recovery, where the prepare and commit entries are placed into two separate WALs: a gap in the WALs could result in the WAL with the commit entry not being processed, breaking the 2PC recovery logic.
      
      Before this commit, for the 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in the memtable. With this commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for the newly flushed SST file. This precomputed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until the next flush because the commit entry will stay in the memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. That is not done yet anyway. Even if we did it, the only thing we lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway because the obsolete file clean-up function is only executed after flush or compaction.)
      
      This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
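
The recovery-side effect of the recorded value can be illustrated with a small filter. `WalsToReplay` is a hypothetical helper, not a RocksDB function; it assumes the WAL numbers found on disk are already sorted in ascending order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the WAL numbers that recovery should process: everything at or
// above the min log number recorded in the manifest. Older WALs are skipped
// even if they still exist on disk.
inline std::vector<uint64_t> WalsToReplay(
    const std::vector<uint64_t>& sorted_wal_numbers,
    uint64_t min_log_number_to_keep) {
  assert(std::is_sorted(sorted_wal_numbers.begin(), sorted_wal_numbers.end()));
  auto first = std::lower_bound(sorted_wal_numbers.begin(),
                                sorted_wal_numbers.end(),
                                min_log_number_to_keep);
  return std::vector<uint64_t>(first, sorted_wal_numbers.end());
}
```

With WALs 3, 4, 7 and 9 on disk and a recorded minimum of 7, only 7 and 9 are replayed, so a gap left by out-of-order deletion of 5 and 6 no longer matters.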
      Closes https://github.com/facebook/rocksdb/pull/3765
      
      Differential Revision: D7747618
      
      Pulled By: siying
      
      fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
      d5954929
  26. 28 Apr, 2018 1 commit
    • H
      Add max_subcompactions as a compaction option · ed7a95b2
      Huachao Huang committed
      Summary:
      Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default.
      This adds a `max_subcompactions` option to `CompactionOptions` so that we can choose a proper concurrency dynamically.
      Closes https://github.com/facebook/rocksdb/pull/3775
      
      Differential Revision: D7792357
      
      Pulled By: ajkr
      
      fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c
      ed7a95b2
  27. 27 Apr, 2018 1 commit
  28. 08 Apr, 2018 1 commit
    • M
      WritePrepared Txn: add stats · bde1c1a7
      Maysam Yabandeh committed
      Summary:
      Adding some stats that would be helpful to monitor whether the DB has gone into unlikely states that would hurt performance. These are mostly cases where we end up needing to acquire a mutex.
      Closes https://github.com/facebook/rocksdb/pull/3683
      
      Differential Revision: D7529393
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9
      bde1c1a7
  29. 06 Apr, 2018 1 commit
    • P
      Support for Column family specific paths. · 446b32cf
      Phani Shekhar Mantripragada committed
      Summary:
      In this change, an option to set different paths for different column families is added.
      This option is set via the cf_paths setting of ColumnFamilyOptions. It works in a similar fashion to the db_paths setting. cf_paths is a vector of DbPath values, each of which contains a pair of the absolute path and target size. Multiple levels in a column family can go to different paths if cf_paths has more than one path.
      To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
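
The fallback order described above (cf_paths, then db_paths, then db_name) can be sketched as a small resolver. The `DbPath` struct below is a local stand-in and `EffectivePaths` is a hypothetical helper; using the maximum `uint64_t` as the target size for the db_name fallback is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Local stand-in for a path entry: absolute path plus target size in bytes.
struct DbPath {
  std::string path;
  uint64_t target_size;
};

// Picks the paths a column family would use: its own cf_paths if set,
// otherwise the DB-wide db_paths, otherwise the DB directory itself.
inline std::vector<DbPath> EffectivePaths(const std::vector<DbPath>& cf_paths,
                                          const std::vector<DbPath>& db_paths,
                                          const std::string& db_name) {
  assert(!db_name.empty());
  if (!cf_paths.empty()) return cf_paths;
  if (!db_paths.empty()) return db_paths;
  return {DbPath{db_name, std::numeric_limits<uint64_t>::max()}};
}
```

A column family that sets no cf_paths therefore behaves exactly as it did before this change.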
      
      Changes :
      1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions.  This member is used to identify the path information whenever files are accessed.
      2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
      3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
      4) Unit tests are added appropriately.
      Closes https://github.com/facebook/rocksdb/pull/3102
      
      Differential Revision: D6951697
      
      Pulled By: ajkr
      
      fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
      446b32cf
  30. 03 Apr, 2018 2 commits
    • M
      WritePrepared Txn: smallest_prepare optimization · b225de7e
      Maysam Yabandeh committed
      Summary:
      This is an optimization to reduce lookups in the CommitCache when querying IsInSnapshot. The optimization takes the sequence number of the smallest uncommitted data at the time the snapshot was taken, and if the sequence number of the read data is lower than that number, it treats the data as committed.
      To implement this optimization two changes are required: i) The AddPrepared function must be called sequentially to avoid out of order insertion in the PrepareHeap (otherwise the top of the heap does not indicate the smallest prepare in future too), ii) non-2PC transactions also call AddPrepared if they do not commit in one step.
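
The fast path can be sketched as below. `SnapshotInfo`, `IsInSnapshotSketch`, and the `slow_lookup` callback are hypothetical; the callback stands in for the CommitCache lookup, and the counter exists only to show how often the slow path is taken.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical snapshot record: its sequence number plus the smallest
// uncommitted (prepared) sequence number observed when it was taken.
struct SnapshotInfo {
  uint64_t seq;
  uint64_t min_uncommitted;
};

inline bool IsInSnapshotSketch(
    uint64_t data_seq, const SnapshotInfo& snap,
    const std::function<bool(uint64_t, uint64_t)>& slow_lookup,
    int* slow_lookups) {
  assert(slow_lookups != nullptr);
  // Fast path: anything below the smallest uncommitted sequence number was
  // already committed when the snapshot was taken.
  if (data_seq < snap.min_uncommitted) return true;
  // Otherwise fall back to the (more expensive) commit lookup.
  ++*slow_lookups;
  return slow_lookup(data_seq, snap.seq);
}
```

The fast path is only sound if the recorded minimum is trustworthy, which is why the summary requires prepared entries to be added in order.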
      Closes https://github.com/facebook/rocksdb/pull/3649
      
      Differential Revision: D7388630
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: b79506238c17467d590763582960d4d90181c600
      b225de7e
    • A
      Enable cancelling manual compactions if they hit the sfm size limit · 1579626d
      Amy Tai committed
      Summary:
      Manual compactions should be cancelled, just like scheduled compactions are cancelled, if sfm->EnoughRoomForCompaction is not true.
      Closes https://github.com/facebook/rocksdb/pull/3670
      
      Differential Revision: D7457683
      
      Pulled By: amytai
      
      fbshipit-source-id: 669b02fdb707f75db576d03d2c818fb98d1876f5
      1579626d
  31. 29 Mar, 2018 2 commits
    • M
      WritePrepared Txn: make recoverable state visible after flush · 0377ff9d
      Maysam Yabandeh committed
      Summary:
      Currently, if the CommitTimeWriteBatch is set to be used only as state required for recovery, the user cannot see it in the DB until the DB is restarted, even though the state is already inserted into the DB after the memtable flush. It would be useful for debugging to make this state visible to the user after the flush by committing it. The patch does so by invoking a callback that commits the recoverable state.
      Closes https://github.com/facebook/rocksdb/pull/3661
      
      Differential Revision: D7424577
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 137f9408662f0853938b33fa440f27f04c1bbf5c
      0377ff9d
    • Y
      Fix race condition causing double deletion of ssts · 1f5def16
      Yanqin Jin committed
      Summary:
      Possible interleaved execution of a background compaction thread calling `FindObsoleteFiles (no full scan) / PurgeObsoleteFiles` and a user thread calling `FindObsoleteFiles (full scan) / PurgeObsoleteFiles` can lead to a race condition in which RocksDB attempts to delete a file twice. The second attempt will fail and return `IO error`. This may occur with other files, but this PR targets SST files.
      Also add a unit test to verify that this PR fixes the issue.
      
      The newly added unit test `obsolete_files_test` has a test case for this scenario, implemented in `ObsoleteFilesTest#RaceForObsoleteFileDeletion`. `TestSyncPoint`s are used to coordinate the interleaving of the `user_thread` and the background compaction thread. They execute as follows:
      ```
      timeline              user_thread                background_compaction thread
      t1   |                                          FindObsoleteFiles(full_scan=false)
      t2   |     FindObsoleteFiles(full_scan=true)
      t3   |                                          PurgeObsoleteFiles
      t4   |     PurgeObsoleteFiles
           V
      ```
      When `user_thread` invokes `FindObsoleteFiles` with full scan, it collects ALL files in the RocksDB directory, including the ones that the background compaction thread has collected in its job context. Then `user_thread` will see an IO error when trying to delete these files in `PurgeObsoleteFiles` because the background compaction thread has already deleted them in `PurgeObsoleteFiles`.
      To fix this, we make RocksDB remember which (SST) files have been found by threads after calling `FindObsoleteFiles` (see `DBImpl#files_grabbed_for_purge_`). Therefore, when another thread calls `FindObsoleteFiles` with full scan, it will not collect such files.
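
The bookkeeping idea can be shown with a single-threaded sketch that replays the t1 to t4 timeline above. `ObsoleteFileTracker` is a hypothetical class; it assumes the caller already holds whatever lock protects the set, and it models a file deletion as a counter increment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

class ObsoleteFileTracker {
 public:
  // Returns only the candidates that no other caller has claimed yet, and
  // records them as claimed by this caller.
  std::vector<uint64_t> FindObsoleteFiles(
      const std::vector<uint64_t>& candidates) {
    std::vector<uint64_t> claimed;
    for (uint64_t number : candidates) {
      if (grabbed_.insert(number).second) claimed.push_back(number);
    }
    return claimed;
  }

  // Deletes the claimed files and releases the claims.
  void PurgeObsoleteFiles(const std::vector<uint64_t>& files) {
    for (uint64_t number : files) {
      assert(grabbed_.count(number) == 1);  // only purge what was claimed
      grabbed_.erase(number);
      ++deletions_;  // stands in for the actual file deletion
    }
  }

  std::size_t deletions() const { return deletions_; }

 private:
  std::set<uint64_t> grabbed_;
  std::size_t deletions_ = 0;
};
```

If the background thread claims file 7 first, a later full scan that sees files 7 and 8 gets back only file 8, so each file is deleted exactly once.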
      
      ajkr could you take a look and comment? Thanks!
      Closes https://github.com/facebook/rocksdb/pull/3638
      
      Differential Revision: D7384372
      
      Pulled By: riversand963
      
      fbshipit-source-id: 01489516d60012e722ee65a80e1449e589ce26d3
      1f5def16
  32. 27 Mar, 2018 1 commit
    • M
      Fix race condition via concurrent FlushWAL · 35a4469b
      Maysam Yabandeh committed
      Summary:
      Currently log_writer->AddRecord in WriteImpl is protected from concurrent calls via FlushWAL only if the two_write_queues_ option is set. The patch fixes the problem by i) skipping log_writer->AddRecord in FlushWAL if manual_wal_flush is not set, and ii) protecting log_writer->AddRecord in WriteImpl via log_write_mutex_ if manual_wal_flush_ is set but two_write_queues_ is not.
      
      Fixes #3599
      Closes https://github.com/facebook/rocksdb/pull/3656
      
      Differential Revision: D7405608
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: d6cc265051c77ae49c7c6df4f427350baaf46934
      35a4469b
  33. 23 Mar, 2018 3 commits
    • Z
      FlushReason improvement · 1cbc96d2
      Committed by Zhongyi Xie
      Summary:
      Right now the flush reason "SuperVersion Change" covers several different scenarios, which makes it too vague to be useful. For example, the following db_bench job should report "Write Buffer Full" as its flush reason:
      
      > $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304
      $ grep 'flush_reason' /dev/shm/dbbench/LOG
      ...
      2018/03/06-17:30:42.543638 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242543634, "job": 192, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018024, "flush_reason": "SuperVersion Change"}
      2018/03/06-17:30:42.569541 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242569536, "job": 193, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"}
      2018/03/06-17:30:42.596396 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242596392, "job": 194, "event": "flush_started", "num_memtables": 1, "num_entries": 7008, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "SuperVersion Change"}
      2018/03/06-17:30:42.622444 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242622440, "job": 195, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"}
      
      With the fix:
      > 2018/03/19-14:40:02.341451 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602341444, "job": 98, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018008, "flush_reason": "Write Buffer Full"}
      2018/03/19-14:40:02.379655 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602379642, "job": 100, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018016, "flush_reason": "Write Buffer Full"}
      2018/03/19-14:40:02.418479 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602418474, "job": 101, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"}
      2018/03/19-14:40:02.455084 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602455079, "job": 102, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "Write Buffer Full"}
      2018/03/19-14:40:02.492293 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602492288, "job": 104, "event": "flush_started", "num_memtables": 1, "num_entries": 7007, "num_deletes": 0, "memory_usage": 1018056, "flush_reason": "Write Buffer Full"}
      2018/03/19-14:40:02.528720 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602528715, "job": 105, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"}
      2018/03/19-14:40:02.566255 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602566238, "job": 107, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018112, "flush_reason": "Write Buffer Full"}
      Closes https://github.com/facebook/rocksdb/pull/3627
      
      Differential Revision: D7328772
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 67c94065fbdd36930f09930aad0aaa6d2c152bb8
      1cbc96d2
    • A
      Rename function for handling WAL write error · 4d51feab
      Committed by Andrew Kryczka
      Summary:
      The function was misnamed. It actually updates `bg_error_` if `PreprocessWrite()` or `WriteToWAL()` fails, and is not related to the user callback.
      Closes https://github.com/facebook/rocksdb/pull/3485
      
      Differential Revision: D6955787
      
      Pulled By: ajkr
      
      fbshipit-source-id: bd7afc3fdb7a52830c021cbfc25fcbc3ab7d5e10
      4d51feab
    • M
      WritePrepared Txn: fix race condition on publishing seq · 7429b20e
      Committed by Maysam Yabandeh
      Summary:
      This commit fixes a race condition on calling SetLastPublishedSequence. The function must be called only from the 2nd write queue when two_write_queues is enabled. However, there was a bug that would also call it from the main write queue if CommitTimeWriteBatch is provided to the commit request and the use_only_the_last_commit_time_batch_for_recovery optimization is not enabled. To fix that, we penalize the commit request in such cases by doing an additional write solely to publish the seq number from the 2nd queue.
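      The condition under which the extra write is needed can be written out as a small predicate. The function name and parameter names below are hypothetical and only restate the three conditions from the summary; they are not taken from the RocksDB source.

      ```cpp
      #include <cassert>

      // Hypothetical helper: true when a commit must issue an additional write
      // on the 2nd queue purely to publish the sequence number.
      constexpr bool NeedsExtraPublishWrite(
          bool two_write_queues, bool has_commit_time_batch,
          bool use_only_last_commit_time_batch_for_recovery) {
        return two_write_queues && has_commit_time_batch &&
               !use_only_last_commit_time_batch_for_recovery;
      }

      // The penalized case described in the summary.
      static_assert(NeedsExtraPublishWrite(true, true, false),
                    "commit-time batch without the optimization needs the extra write");
      // With the optimization enabled, no extra write is needed.
      static_assert(!NeedsExtraPublishWrite(true, true, true),
                    "optimization enabled: no extra write");
      ```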
      Closes https://github.com/facebook/rocksdb/pull/3641
      
      Differential Revision: D7361508
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: bf8f7a27e5cccf5425dccbce25eb0032e8e5a4d7
      7429b20e
  34. 16 Mar, 2018 1 commit