1. 24 Feb, 2018 1 commit
    • A
      Fix the Logger::Close() and DBImpl::Close() design pattern · dfbe52e0
      Anand Ananthabhotla committed
      Summary:
      The recent Logger::Close() and DBImpl::Close() implementations rely on
      calling the CloseImpl() virtual function from the destructor, which does
      not work: during destruction, a virtual call never dispatches to a
      derived-class override. Refactor the implementation to have a private
      close helper function in derived classes that can be called by both
      CloseImpl() and the destructor.
      Closes https://github.com/facebook/rocksdb/pull/3528
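
The pattern described above can be sketched in isolation. This is a minimal illustration, not the RocksDB code: `SketchLogger`, `SketchFileLogger`, `CloseHelper`, and `g_events` are hypothetical names introduced only for this example.

```cpp
#include <string>
#include <vector>

// Records which cleanup code ran, so the behaviour can be inspected.
static std::vector<std::string> g_events;

class SketchLogger {
 public:
  virtual ~SketchLogger() {
    // By the time this destructor runs, the derived part is already gone,
    // so this call resolves to SketchLogger::CloseImpl, never to an override.
    if (!closed_) CloseImpl();
  }
  void Close() {
    if (!closed_) {
      closed_ = true;
      CloseImpl();  // normal virtual dispatch: reaches the derived override
    }
  }

 protected:
  virtual void CloseImpl() { g_events.push_back("base CloseImpl"); }
  bool closed_ = false;
};

class SketchFileLogger : public SketchLogger {
 public:
  ~SketchFileLogger() override {
    // The derived destructor calls the private helper directly.
    if (!closed_) {
      closed_ = true;
      CloseHelper();
    }
  }

 protected:
  void CloseImpl() override { CloseHelper(); }

 private:
  // Shared by CloseImpl() and the destructor.
  void CloseHelper() { g_events.push_back("file closed"); }
};

// Destroys a logger that was never closed explicitly.
std::vector<std::string> DestroyWithoutClose() {
  g_events.clear();
  { SketchFileLogger logger; }
  return g_events;
}

// Closes explicitly, then destroys; the helper must run only once.
std::vector<std::string> CloseThenDestroy() {
  g_events.clear();
  {
    SketchFileLogger logger;
    logger.Close();
  }
  return g_events;
}
```

In both paths the derived cleanup runs exactly once, because the derived destructor no longer depends on virtual dispatch from the base destructor.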
      
      Reviewed By: gfosco
      
      Differential Revision: D7049303
      
      Pulled By: anand1976
      
      fbshipit-source-id: 76a64cbf403209216dfe4864ecf96b5d7f3db9f4
      dfbe52e0
  2. 23 Feb, 2018 2 commits
  3. 13 Feb, 2018 1 commit
    • A
      Add delay before flush in CompactRange to avoid write stalling · ee1c8026
      Andrew Kryczka committed
      Summary:
      - Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur.
      - Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true.
      - Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3).
      Closes https://github.com/facebook/rocksdb/pull/3381
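
A rough sketch of the "predict before flushing" idea follows. It assumes a heavily simplified stall model with only two thresholds; `StallState`, `StallLimits`, `IsStalled`, and `FlushWouldStall` are hypothetical names, and the real `GetWriteStallConditionAndCause` considers more inputs than shown here.

```cpp
// Simplified, hypothetical model of the state that decides write stalls.
struct StallState {
  int unflushed_memtables;  // memtables not yet flushed
  int l0_files;             // files currently in level 0
};

// Limits named after the corresponding options; the comparison rule
// below is an assumption made for illustration.
struct StallLimits {
  int max_write_buffer_number;
  int level0_slowdown_writes_trigger;
};

// Pure check with no side effects, so it can be reused for prediction.
bool IsStalled(const StallState& s, const StallLimits& lim) {
  return s.unflushed_memtables >= lim.max_write_buffer_number ||
         s.l0_files >= lim.level0_slowdown_writes_trigger;
}

// Would one more immutable memtable and one more L0 file stall writes?
bool FlushWouldStall(const StallState& s, const StallLimits& lim) {
  StallState after = s;
  after.unflushed_memtables += 1;
  after.l0_files += 1;
  return IsStalled(after, lim);
}
```

A caller in the spirit of `CompactRange` would wait on a condition variable while `FlushWouldStall` returns true, re-checking whenever a flush finishes, a compaction finishes, or options change.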
      
      Differential Revision: D6754983
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972
      ee1c8026
  4. 10 Feb, 2018 1 commit
  5. 06 Feb, 2018 1 commit
  6. 31 Jan, 2018 1 commit
  7. 24 Jan, 2018 1 commit
  8. 18 Jan, 2018 1 commit
    • A
      fix live WALs purged while file deletions disabled · 46e599fc
      Andrew Kryczka committed
      Summary:
      When calling `DisableFileDeletions` followed by `GetSortedWalFiles`, we guarantee the files returned by the latter call won't be deleted until after file deletions are re-enabled. However, `GetSortedWalFiles` didn't omit files already planned for deletion via `PurgeObsoleteFiles`, so the guarantee could be broken.
      
      We fix it by making `GetSortedWalFiles` wait for the number of pending purges to hit zero if file deletions are disabled. This condition is eventually met since `PurgeObsoleteFiles` is guaranteed to be called for the existing pending purges, and new purges cannot be scheduled while file deletions are disabled. Once the condition is met, `GetSortedWalFiles` simply returns the contents of the DB and archive directories, which nobody can delete (except for the deletion scheduler, for which I plan to fix this bug later) until deletions are re-enabled.
      Closes https://github.com/facebook/rocksdb/pull/3341
      
      Differential Revision: D6681131
      
      Pulled By: ajkr
      
      fbshipit-source-id: 90b1e2f2362ea9ef715623841c0826611a817634
      46e599fc
  9. 17 Jan, 2018 1 commit
    • A
      Add a Close() method to DB to return status when closing a db · d0f1b49a
      Anand Ananthabhotla committed
      Summary:
      Currently, the only way to close an open DB is to destroy the DB
      object. There is no way for the caller to know the status. In one
      instance, the destructor encountered an error due to failure to
      close a log file on HDFS. In order to prevent silent failures, we add
      DB::Close() that calls CloseImpl() which must be implemented by its
      descendants.
      The main failure point in the destructor is closing the log file. This
      patch also adds a Close() entry point to Logger in order to get status.
      When DBOptions::info_log is allocated and owned by the DBImpl, it is
      explicitly closed by DBImpl::CloseImpl().
      Closes https://github.com/facebook/rocksdb/pull/3348
      
      Differential Revision: D6698158
      
      Pulled By: anand1976
      
      fbshipit-source-id: 9468e2892553eb09c4c41b8723f590c0dbd8ab7d
      d0f1b49a
  10. 12 Jan, 2018 1 commit
  11. 19 Dec, 2017 1 commit
  12. 16 Dec, 2017 1 commit
    • Y
      BlobDB: Remove the need to get sequence number per write · 237b2925
      Yi Wu committed
      Summary:
      Previously we stored the sequence number range of each blob file, and used that range to check whether the file could be visible to a snapshot. But this adds complexity to the code, since the sequence number is only available after a write. (The current implementation gets the sequence number by calling GetLatestSequenceNumber(), which is wrong.) With this patch, we no longer store the sequence number range, and check snapshot_sequence < obsolete_sequence to decide whether the file is visible to a snapshot (previously we checked first_sequence <= snapshot_sequence < obsolete_sequence).
      Closes https://github.com/facebook/rocksdb/pull/3274
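
The two visibility rules can be compared side by side in a small sketch. `VisibleOldRule` and `VisibleNewRule` are hypothetical helper names; only the inequalities come from the summary above.

```cpp
#include <cstdint>

using SequenceNumber = uint64_t;

// Previous rule: first_sequence <= snapshot_sequence < obsolete_sequence.
bool VisibleOldRule(SequenceNumber first_sequence,
                    SequenceNumber obsolete_sequence,
                    SequenceNumber snapshot_sequence) {
  return first_sequence <= snapshot_sequence &&
         snapshot_sequence < obsolete_sequence;
}

// New rule: only the upper bound is checked, so no per-write
// sequence number needs to be recorded for the file.
bool VisibleNewRule(SequenceNumber obsolete_sequence,
                    SequenceNumber snapshot_sequence) {
  return snapshot_sequence < obsolete_sequence;
}
```

The new rule is more conservative: every snapshot the old rule treated as seeing the file still does, and snapshots older than the file's first write now also keep it alive.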
      
      Differential Revision: D6571497
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: ca06479dc1fcd8782f6525b62b7762cd47d61909
      237b2925
  13. 12 Dec, 2017 1 commit
  14. 07 Dec, 2017 1 commit
    • A
      Preserve overlapping file endpoint invariant · 78d1a5ec
      Andrew Kryczka committed
      Summary:
      Fix for #2833.
      
      - In `DeleteFilesInRange`, use `GetCleanInputsWithinInterval` instead of `GetOverlappingInputs` to make sure we get a clean cut set of files to delete.
      - In `GetCleanInputsWithinInterval`, support nullptr as `begin_key` or `end_key`.
      - In `GetOverlappingInputsRangeBinarySearch`, move the assertion for non-empty range away from `ExtendFileRangeWithinInterval`, which should be allowed to return an empty range (via `end_index < begin_index`).
      Closes https://github.com/facebook/rocksdb/pull/2843
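
To illustrate the difference between "overlapping" and "within the interval", here is a simplified sketch. It ignores the clean-cut handling of user keys shared between adjacent files, and `FileRange` and `FilesWithinInterval` are hypothetical names rather than the RocksDB functions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Key range covered by one file (hypothetical, user keys only).
struct FileRange {
  std::string smallest;
  std::string largest;
};

// Returns indices of files lying entirely inside [begin, end].
// A nullptr bound means "unbounded" on that side.
// The result may be empty, which is a valid outcome.
std::vector<std::size_t> FilesWithinInterval(
    const std::vector<FileRange>& files, const std::string* begin,
    const std::string* end) {
  std::vector<std::size_t> result;
  for (std::size_t i = 0; i < files.size(); ++i) {
    const bool after_begin = begin == nullptr || files[i].smallest >= *begin;
    const bool before_end = end == nullptr || files[i].largest <= *end;
    if (after_begin && before_end) result.push_back(i);
  }
  return result;
}
```

A file that merely overlaps the interval is left out, so deleting the returned files never removes keys outside the requested range.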
      
      Differential Revision: D5772387
      
      Pulled By: ajkr
      
      fbshipit-source-id: e554e8461823c6be82b21a9262a2da02b3957881
      78d1a5ec
  15. 01 Dec, 2017 1 commit
    • M
      WritePrepared Txn: PreReleaseCallback · 18dcf7f9
      Maysam Yabandeh committed
      Summary:
      Add PreReleaseCallback, to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePreparedTxn to i) update the commit map, and ii) update the last published sequence number in the 2nd write queue. It also ensures that all commits go to the 2nd queue.
      These changes ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default), the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
      Closes https://github.com/facebook/rocksdb/pull/3205
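
The ordering guarantee can be sketched with a toy write path. `WriteQueueSketch` and its members are hypothetical and stand in for the real `WriteImpl`; the only point shown is that the callback runs before the sequence number becomes visible to readers.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical miniature of a write path; not the RocksDB WriteImpl.
class WriteQueueSketch {
 public:
  // Invoked after the write is applied but before its sequence
  // number is published to readers.
  using PreReleaseCallback = std::function<void(uint64_t seq)>;

  uint64_t Write(const PreReleaseCallback& pre_release) {
    const uint64_t seq = ++last_allocated_;
    // ... the write itself would be applied here ...
    if (pre_release) pre_release(seq);  // e.g. update a commit map
    last_published_ = seq;              // readers may now observe seq
    return seq;
  }

  uint64_t last_published() const { return last_published_; }

 private:
  uint64_t last_allocated_ = 0;
  uint64_t last_published_ = 0;
};
```

Because the callback finishes before `last_published_` advances, a reader that takes a snapshot at the published sequence number always finds the matching commit-map entry.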
      
      Differential Revision: D6438959
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
      18dcf7f9
  16. 17 Nov, 2017 1 commit
  17. 11 Nov, 2017 1 commit
  18. 03 Nov, 2017 1 commit
    • Y
      Blob DB: fix snapshot handling · 7bfa8803
      Yi Wu committed
      Summary:
      Blob db keeps a blob file if data in the file is visible to an active snapshot. Before this patch it checked whether there is an active snapshot with a sequence number greater than the earliest sequence in the file. This is problematic since we take a snapshot on every read: if reads keep arriving, old blob files will never be cleaned up. Change to check whether an active snapshot falls in the range [earliest_sequence, obsolete_sequence), where the obsolete sequence is
      1. if data is relocated to another file by garbage collection, the latest sequence at the time garbage collection finishes
      2. otherwise, the latest sequence of the file
      Closes https://github.com/facebook/rocksdb/pull/3087
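
The range check can be sketched as follows, assuming the active snapshots are available as a sorted list of sequence numbers. `FileVisibleToAnySnapshot` is a hypothetical helper, not a BlobDB function.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// True if some active snapshot lies in [earliest_seq, obsolete_seq).
// `snapshots` must be sorted in ascending order.
bool FileVisibleToAnySnapshot(const std::vector<uint64_t>& snapshots,
                              uint64_t earliest_seq, uint64_t obsolete_seq) {
  // First snapshot that is >= earliest_seq.
  auto it = std::lower_bound(snapshots.begin(), snapshots.end(), earliest_seq);
  return it != snapshots.end() && *it < obsolete_seq;
}
```

With this rule, a snapshot taken after the file became obsolete no longer pins it, which is what lets old blob files be cleaned up under a steady stream of reads.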
      
      Differential Revision: D6182519
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45
      7bfa8803
  19. 02 Nov, 2017 2 commits
    • M
      WritePrepared Txn: ValidateSnapshot · 02693f64
      Maysam Yabandeh committed
      Summary:
      Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function.
      Closes https://github.com/facebook/rocksdb/pull/3101
      
      Differential Revision: D6199405
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240
      02693f64
    • M
      Added support for differential snapshots · 7fe3b328
      Mikhail Antonov committed
      Summary:
      The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, i.e. a snapshot of the DB changes between two points in time (one can think of it as the diff between two sequence numbers, or the diff D which can be thought of as an SST file or just a set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
      
      This feature would be useful for various distributed storage layers built on top of RocksDB, as it should help reduce the resources (time and network bandwidth) needed to recover and rebuild DB instances as replicas in the context of distributed storage.
      
      From the API standpoint, that would look like a client app requesting an iterator between (start seqnum) and the current DB state, and reading the "diff".
      
      This is a very early draft PR for initial review and discussion of the approach; I'm going to rework some parts and keep updating the PR.
      
      For now, what's done here according to initial discussions:
      
      Preserving deletes:
       - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot, it wouldn't get dropped by a compaction. This is done by adding a new param to Options (preserve deletes flag) and a new variable to DBImpl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
       - I also added a new API call for clients to be able to advance this cutoff seqnum after which we preserve deletes; I assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp <-> seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
       - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
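
The compaction-side rule from the list above might be sketched like this. `MayDropTombstone` and its parameters are hypothetical; the sketch only encodes the stated condition that tombstones with seqnum > cutoff seqnum are not dropped.

```cpp
#include <cstdint>

// Hypothetical helper: may compaction drop this tombstone?
// `otherwise_droppable` stands for the usual eligibility checks,
// which are not modelled here.
bool MayDropTombstone(uint64_t tombstone_seq, uint64_t preserve_cutoff_seq,
                      bool otherwise_droppable) {
  // Tombstones newer than the cutoff are kept so that a later
  // incremental snapshot can still report the delete.
  if (tombstone_seq > preserve_cutoff_seq) return false;
  return otherwise_droppable;
}
```

Advancing the cutoff is what eventually releases preserved tombstones, which is why the summary leaves that decision to the client.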
      
      Iterator changes:
       - A couple of params were added to ReadOptions, to optionally allow the client to request internal keys instead of user keys (so that the client can get the latest value of a key, be it a delete marker or a put), as well as a min timestamp and a min seqnum.
      
      TableCache changes:
       - I modified the table_cache code to be able to quickly exclude SST files from the iterator's heap if creation_time on the file is less than iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with a more or less long iterator time span.
      
      What's left:
      
       - Still looking at how best to plug that into the DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and the iter->key() call generally returns the user key. Can we add a new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store the actual seqnum there, but I do need to store the type.
      Closes https://github.com/facebook/rocksdb/pull/2999
      
      Differential Revision: D6175602
      
      Pulled By: mikhail-antonov
      
      fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
      7fe3b328
  20. 01 Nov, 2017 1 commit
  21. 30 Oct, 2017 1 commit
  22. 27 Oct, 2017 1 commit
    • Y
      Blob DB: Inline small values in base DB · 5a2a6483
      Yi Wu committed
      Summary:
      Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values.
      
      Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches.
      
      See blob_index.h for the new blob index format. There are 4 cases when writing a new key:
      * small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue)
      * small value w/ TTL: put (type, expiration, value) to base db.
      * large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db.
      * large value w/ TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db.
      Closes https://github.com/facebook/rocksdb/pull/3066
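
The four write cases can be summarised as a small decision function. This is a sketch: `EntryKind` and `ChooseEntryKind` are hypothetical names, and treating a value as "small" when its size is strictly below `min_blob_size` is an assumption about the boundary.

```cpp
#include <cstddef>

// The four cases listed in the summary (hypothetical enum).
enum class EntryKind {
  kPlainValue,        // small, no TTL: normal value in base db
  kInlinedWithTtl,    // small, TTL: (type, expiration, value)
  kBlobIndex,         // large, no TTL: (type, file, offset, size, compression)
  kBlobIndexWithTtl,  // large, TTL: adds expiration to the blob index
};

EntryKind ChooseEntryKind(std::size_t value_size, std::size_t min_blob_size,
                          bool has_ttl) {
  // Assumption: values strictly smaller than min_blob_size are inlined.
  const bool is_small = value_size < min_blob_size;
  if (is_small) {
    return has_ttl ? EntryKind::kInlinedWithTtl : EntryKind::kPlainValue;
  }
  return has_ttl ? EntryKind::kBlobIndexWithTtl : EntryKind::kBlobIndex;
}
```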
      
      Differential Revision: D6142115
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0
      5a2a6483
  23. 26 Oct, 2017 1 commit
    • A
      single-file bottom-level compaction when snapshot released · 9b18cc23
      Andrew Kryczka committed
      Summary:
      When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.
      
      - Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
      - Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
      - Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
      - Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
      Closes https://github.com/facebook/rocksdb/pull/3009
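
The marking and threshold caching described above can be sketched as follows. This is an assumption-laden simplification: `BottommostFile`, `MarkResult`, `MarkBottommostFiles`, and `NeedsRecompute` are hypothetical names, and a file is treated as markable purely when its largest sequence number is below the oldest snapshot.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct BottommostFile {
  uint64_t largest_seqno;  // newest sequence number stored in the file
};

struct MarkResult {
  std::vector<std::size_t> marked;  // indices of files to recompact
  uint64_t threshold;  // oldest snapshot must exceed this to mark another file
};

MarkResult MarkBottommostFiles(const std::vector<BottommostFile>& files,
                               uint64_t oldest_snapshot_seq) {
  MarkResult result{{}, std::numeric_limits<uint64_t>::max()};
  for (std::size_t i = 0; i < files.size(); ++i) {
    if (files[i].largest_seqno < oldest_snapshot_seq) {
      // No snapshot can still need the overwritten/deleted keys.
      result.marked.push_back(i);
    } else {
      // Remember the smallest seqno that still blocks marking.
      result.threshold = std::min(result.threshold, files[i].largest_seqno);
    }
  }
  return result;
}

// Cheap check run on snapshot release: recompute only when the new
// oldest snapshot passes the cached threshold.
bool NeedsRecompute(uint64_t new_oldest_snapshot_seq, uint64_t threshold) {
  return new_oldest_snapshot_seq > threshold;
}
```

Caching the threshold keeps most snapshot releases at a single comparison instead of a pass over all bottommost files.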
      
      Differential Revision: D6062044
      
      Pulled By: ajkr
      
      fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
      9b18cc23
  24. 24 Oct, 2017 1 commit
  25. 21 Oct, 2017 1 commit
    • A
      remove unused code · f8b5bb2f
      Andrew Kryczka committed
      Summary:
      fixup 6a541afc. This code didn't do anything because (1) `bytes_per_sync` is assigned in `EnvOptions`'s constructor; and (2) `OptimizeForCompactionTableWrite`'s return value was ignored, even though its only purpose is to return something.
      Closes https://github.com/facebook/rocksdb/pull/3055
      
      Differential Revision: D6114132
      
      Pulled By: ajkr
      
      fbshipit-source-id: ea4831770930e9cf83518e13eb2e1934d1f5487c
      f8b5bb2f
  26. 20 Oct, 2017 1 commit
  27. 19 Oct, 2017 1 commit
  28. 18 Oct, 2017 1 commit
    • Y
      Blob DB: Store blob index as kTypeBlobIndex in base db · eaaef911
      Yi Wu committed
      Summary:
      Blob db inserts the blob index into base db as kTypeBlobIndex type, to tell apart values written by plain rocksdb from those written by blob db. This makes it possible to migrate from an existing rocksdb to blob db.
      
      Also, with this patch blob db garbage collection moves away from OptimisticTransaction. Instead it uses a custom write callback to achieve behavior similar to OptimisticTransaction. This is because we need to pass the is_blob_index flag to DBImpl::Get, but OptimisticTransaction doesn't support it.
      Closes https://github.com/facebook/rocksdb/pull/3000
      
      Differential Revision: D6050044
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 61dc72ab9977625e75f78cd968e7d8a3976e3632
      eaaef911
  29. 12 Oct, 2017 1 commit
  30. 10 Oct, 2017 1 commit
    • Y
      WritePrepared Txn: Iterator · 8c392a31
      Yi Wu committed
      Summary:
      On iterator creation, take a snapshot, create a ReadCallback, and pass the ReadCallback to the underlying DBIter to check whether a key is committed.
      Closes https://github.com/facebook/rocksdb/pull/2981
      
      Differential Revision: D6001471
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 3565c4cdaf25370ba47008b0e0cb65b31dfe79fe
      8c392a31
  31. 06 Oct, 2017 1 commit
  32. 04 Oct, 2017 1 commit
    • Y
      Add ValueType::kTypeBlobIndex · d1cab2b6
      Yi Wu committed
      Summary:
      Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to
      1. Make it possible to open an existing rocksdb instance as blob db. Existing values will be of kTypeValue type, while values inserted by blob db will be of kTypeBlobIndex.
      2. Make rocksdb able to detect whether the db contains values written by blob db, and if so return an error.
      3. Make it possible to have blob db optionally store a value in an SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type).
      
      The root db (DBImpl) basically treats kTypeBlobIndex entries as normal values on write. On Get, if is_blob is provided, it returns whether the value read is of kTypeBlobIndex type; if is_blob is not provided, it returns Status::NotSupported(). On scan, an allow_blob flag is passed, and if the flag is true, whether the value is of kTypeBlobIndex type is returned via iter->IsBlob().
      
      Changes on blob db side will be in a separate patch.
      Closes https://github.com/facebook/rocksdb/pull/2886
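
One possible reading of the Get behaviour is sketched below. `StoredType`, `GetStatus`, and `GetSketch` are hypothetical names, and the sketch assumes that the not-supported result applies only when a blob index is actually encountered without `is_blob` being provided.

```cpp
// Hypothetical stand-ins for the value type and status.
enum class StoredType { kValue, kBlobIndex };
enum class GetStatus { kOk, kNotSupported };

// `is_blob == nullptr` means the caller cannot interpret blob indexes.
GetStatus GetSketch(StoredType stored_type, bool* is_blob) {
  if (is_blob == nullptr) {
    // A plain reader must not silently receive a blob index as a value.
    return stored_type == StoredType::kBlobIndex ? GetStatus::kNotSupported
                                                 : GetStatus::kOk;
  }
  // A blob-aware caller is told which kind of entry it read.
  *is_blob = (stored_type == StoredType::kBlobIndex);
  return GetStatus::kOk;
}
```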
      
      Differential Revision: D5838431
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca
      d1cab2b6
  33. 29 Sep, 2017 1 commit
  34. 28 Sep, 2017 1 commit
    • Q
      Make bytes_per_sync and wal_bytes_per_sync mutable · 6a541afc
      Quinn Jarrell committed
      Summary:
      SUMMARY
      Moves the bytes_per_sync and wal_bytes_per_sync options from immutable options to mutable options. Also, if wal_bytes_per_sync is changed, the WAL file and memtables are flushed.
      TEST PLAN
      ran make check
      all passed
      
      Two new tests, SetBytesPerSync and SetWalBytesPerSync, check that after issuing SetOptions with a new value for the variable, the db options have the new value.
      Closes https://github.com/facebook/rocksdb/pull/2893
      
      Reviewed By: yiwu-arbug
      
      Differential Revision: D5845814
      
      Pulled By: TheRushingWookie
      
      fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd
      6a541afc
  35. 20 Sep, 2017 1 commit
  36. 19 Sep, 2017 1 commit
    • M
      WritePrepared Txn: Advance seq one per batch · 60beefd6
      Maysam Yabandeh committed
      Summary:
      By default the seq number in the DB is increased once per written key. WritePrepared txns require the seq to be increased once per entire batch, so that the seq can be used as the prepare timestamp by which the transaction is identified. We also need to increase the seq for the commit marker, since that gives a unique id to the commit timestamp of each transaction.
      
      Two unit tests are added to verify our understanding of how the seq should be increased. The recovery path requires much more work and is left to another patch.
      Closes https://github.com/facebook/rocksdb/pull/2885
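
The two numbering schemes can be contrasted in a few lines. The helper names are hypothetical, and the assumption that the commit marker takes exactly the next sequence number after the prepared batch is made only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Default scheme: every key in the batch consumes one sequence number.
uint64_t LastSeqAfterPerKeyWrite(uint64_t last_seq, std::size_t keys_in_batch) {
  return last_seq + keys_in_batch;
}

// WritePrepared scheme: the whole batch shares one sequence number,
// which doubles as the prepare timestamp of the transaction.
uint64_t PrepareSeqForBatch(uint64_t last_seq) { return last_seq + 1; }

// Assumption: the commit marker takes the next number after the
// prepare seq, giving the commit its own unique timestamp.
uint64_t CommitSeqForBatch(uint64_t prepare_seq) { return prepare_seq + 1; }
```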
      
      Differential Revision: D5837843
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a08960b93d727e1cf438c254d0c2636fb133cc1c
      60beefd6
  37. 13 Sep, 2017 1 commit
    • A
      Fix naming in InternalKey · 5785b1fc
      Amy Xu committed
      Summary:
      - Swapped all instances of SetMinPossibleForUserKey and SetMaxPossibleForUserKey so that the names agree with InternalKeyComparator's comparison logic
      Closes https://github.com/facebook/rocksdb/pull/2868
      
      Differential Revision: D5804152
      
      Pulled By: axxufb
      
      fbshipit-source-id: 80be35e04f2e8abc35cc64abe1fecb03af24e183
      5785b1fc
  38. 12 Sep, 2017 1 commit
    • M
      write-prepared txn: call IsInSnapshot · f46464d3
      Maysam Yabandeh committed
      Summary:
      This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
      Closes https://github.com/facebook/rocksdb/pull/2850
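
The read-path check can be sketched with a toy version list. `Version` and `ReadVisible` are hypothetical; the callback plays the role that the summary assigns to ReadCallback, accepting or rejecting each candidate by sequence number.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// One stored version of a key (hypothetical).
struct Version {
  uint64_t seq;
  std::string value;
};

// `versions` is ordered newest first. Returns the first version the
// callback accepts, or nullptr if every version is rejected.
// An empty callback accepts everything.
const Version* ReadVisible(const std::vector<Version>& versions,
                           const std::function<bool(uint64_t)>& is_visible) {
  for (const Version& v : versions) {
    if (!is_visible || is_visible(v.seq)) return &v;
    // Rejected: move on to the next (older) version.
  }
  return nullptr;
}
```

A snapshot-style callback such as `[](uint64_t seq) { return seq <= 25; }` makes the reader skip newer versions and return the newest one at or below sequence number 25.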
      
      Differential Revision: D5787375
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
      f46464d3