1. 07 Dec, 2017 1 commit
    • A
      Preserve overlapping file endpoint invariant · 78d1a5ec
      Andrew Kryczka committed
      Summary:
      Fix for #2833.
      
      - In `DeleteFilesInRange`, use `GetCleanInputsWithinInterval` instead of `GetOverlappingInputs` to make sure we get a clean cut set of files to delete.
      - In `GetCleanInputsWithinInterval`, support nullptr as `begin_key` or `end_key`.
      - In `GetOverlappingInputsRangeBinarySearch`, move the assertion for non-empty range away from `ExtendFileRangeWithinInterval`, which should be allowed to return an empty range (via `end_index < begin_index`).
      Closes https://github.com/facebook/rocksdb/pull/2843
      
      Differential Revision: D5772387
      
      Pulled By: ajkr
      
      fbshipit-source-id: e554e8461823c6be82b21a9262a2da02b3957881
      78d1a5ec
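      The clean-cut requirement in this fix can be illustrated with a simplified, standalone model. This is only a sketch and not RocksDB's implementation: `FileRange` and `CleanFilesWithinInterval` are hypothetical names, keys are plain strings, a `nullptr` bound is treated as unbounded, and an empty result is reported as `second < first`.

      ```cpp
      #include <cassert>
      #include <string>
      #include <utility>
      #include <vector>

      // Hypothetical stand-in for one file's user-key range in a sorted level.
      struct FileRange {
        std::string smallest;
        std::string largest;
      };

      // Returns the inclusive index range [first, second] of files that lie fully
      // inside [begin, end] and share no endpoint key with an excluded neighbor.
      // A nullptr bound is treated as unbounded. An empty result has second < first.
      std::pair<int, int> CleanFilesWithinInterval(
          const std::vector<FileRange>& files, const std::string* begin,
          const std::string* end) {
        assert(begin == nullptr || end == nullptr || *begin <= *end);
        const int n = static_cast<int>(files.size());
        int first = 0;
        int last = n - 1;
        // Drop files that start before `begin` or finish after `end`.
        while (begin != nullptr && first < n && files[first].smallest < *begin) {
          ++first;
        }
        while (end != nullptr && last >= 0 && files[last].largest > *end) {
          --last;
        }
        // Shrink while a boundary file shares its endpoint key with an excluded
        // neighbor; deleting only one of the two would not be a clean cut.
        while (first <= last && first > 0 &&
               files[first - 1].largest == files[first].smallest) {
          ++first;
        }
        while (first <= last && last + 1 < n &&
               files[last].largest == files[last + 1].smallest) {
          --last;
        }
        return std::make_pair(first, last);
      }
      ```

      In this model a file whose smallest key equals the largest key of an excluded neighbor is left out of the deletion set, so the result may legitimately be empty.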
  2. 01 Dec, 2017 1 commit
    • M
      WritePrepared Txn: PreReleaseCallback · 18dcf7f9
      Maysam Yabandeh committed
      Summary:
      Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePreparedTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
      These changes ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default), the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
      Closes https://github.com/facebook/rocksdb/pull/3205
      
      Differential Revision: D6438959
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
      18dcf7f9
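      The ordering guarantee described in this commit (callback runs after the write but before the sequence number is published) can be sketched with a toy model. `ToyDB` and `WriteWithPreRelease` are hypothetical names used only for illustration; they do not mirror the real `WriteImpl` signature.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <functional>

      // Toy model of a DB that separates allocating a sequence number from
      // publishing it to readers. Not RocksDB's DBImpl.
      struct ToyDB {
        uint64_t last_allocated_seq = 0;
        uint64_t last_published_seq = 0;
      };

      // Hypothetical write path: allocate a seq, apply the write, run the
      // pre-release callback, and only then publish the seq to readers.
      uint64_t WriteWithPreRelease(ToyDB& db,
                                   const std::function<void(uint64_t)>& pre_release) {
        const uint64_t seq = ++db.last_allocated_seq;
        // ... the WAL/memtable write would happen here ...
        if (pre_release) {
          pre_release(seq);  // e.g. update a commit map before readers can see seq
        }
        assert(db.last_published_seq < seq);
        db.last_published_seq = seq;
        return seq;
      }
      ```

      Because the callback runs before `last_published_seq` is bumped, any state it updates is in place by the time a reader can observe the new sequence number.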
  3. 17 Nov, 2017 1 commit
  4. 11 Nov, 2017 1 commit
  5. 03 Nov, 2017 1 commit
    • Y
      Blob DB: fix snapshot handling · 7bfa8803
      Yi Wu committed
      Summary:
      Blob db keeps a blob file if data in the file is visible to an active snapshot. Before this patch, it checked whether any active snapshot had a sequence number greater than the earliest sequence in the file. This is problematic because we take a snapshot on every read: if reads keep arriving, old blob files are never cleaned up. Change to check whether any active snapshot falls in the range [earliest_sequence, obsolete_sequence), where obsolete_sequence is
      1. if data is relocated to another file by garbage collection, the latest sequence at the time garbage collection finishes
      2. otherwise, the latest sequence of the file
      Closes https://github.com/facebook/rocksdb/pull/3087
      
      Differential Revision: D6182519
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45
      7bfa8803
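      The new visibility check is a half-open range test over the set of active snapshots. A minimal sketch, assuming snapshots are available as an ordered set of sequence numbers (`VisibleToActiveSnapshot` is a hypothetical helper, not the Blob DB function):

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <set>

      // Returns true if some active snapshot falls in
      // [earliest_sequence, obsolete_sequence), i.e. the blob file must be kept.
      bool VisibleToActiveSnapshot(const std::set<uint64_t>& active_snapshots,
                                   uint64_t earliest_sequence,
                                   uint64_t obsolete_sequence) {
        assert(earliest_sequence <= obsolete_sequence);
        // First snapshot that is >= earliest_sequence, if any.
        std::set<uint64_t>::const_iterator it =
            active_snapshots.lower_bound(earliest_sequence);
        return it != active_snapshots.end() && *it < obsolete_sequence;
      }
      ```

      With this check, a snapshot taken after the file became obsolete no longer pins the file, which is what lets old blob files be reclaimed under a steady stream of reads.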
  6. 02 Nov, 2017 2 commits
    • M
      WritePrepared Txn: ValidateSnapshot · 02693f64
      Maysam Yabandeh committed
      Summary:
      Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function.
      Closes https://github.com/facebook/rocksdb/pull/3101
      
      Differential Revision: D6199405
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240
      02693f64
    • M
      Added support for differential snapshots · 7fe3b328
      Mikhail Antonov committed
      Summary:
      The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, i.e. a snapshot of the DB changes between two points in time (one can think of it as the diff between two sequence numbers, or the diff D, which can be thought of as an SST file or just a set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
      
      This feature would be useful for various distributed storage layers built on top of RocksDB, as it should help reduce the resources (time and network bandwidth) needed to recover and rebuild DB instances as replicas in the context of distributed storage.
      
      From the API standpoint, that would look like a client app requesting an iterator between (start seqnum) and the current DB state, and reading the "diff".
      
      This is a very early draft PR for initial review and discussion of the approach; I'm going to rework some parts and keep updating the PR.
      
      For now, what's done here according to initial discussions:
      
      Preserving deletes:
       - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot, it wouldn't get dropped by a compaction. This is done by adding a new param to Options (preserve deletes flag) and a new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
       - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; I assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp <--> seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
       - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
      
      Iterator changes:
       - A couple of params added to ReadOptions to optionally allow the client to request internal keys instead of user keys (so that the client can get the latest value of a key, be it a delete marker or a put), as well as min timestamp and min seqnum.
      
      TableCache changes:
       - I modified table_cache code to be able to quickly exclude SST files from the iterator's heap if creation_time on the file is less than iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with a more or less long iterator time span.
      
      What's left:
      
       - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
      Closes https://github.com/facebook/rocksdb/pull/2999
      
      Differential Revision: D6175602
      
      Pulled By: mikhail-antonov
      
      fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
      7fe3b328
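      The "Preserving deletes" rule from this commit boils down to one extra condition in the compaction path. A minimal sketch, where `CanDropTombstone` and its parameters are hypothetical and `otherwise_droppable` stands for whatever conditions already make a tombstone eligible for removal:

      ```cpp
      #include <cstdint>

      // A tombstone is dropped only if it is already eligible for deletion and,
      // when the preserve-deletes flag is on, its seqnum is not above the cutoff.
      constexpr bool CanDropTombstone(bool otherwise_droppable, bool preserve_deletes,
                                      uint64_t tombstone_seq, uint64_t cutoff_seq) {
        return otherwise_droppable &&
               !(preserve_deletes && tombstone_seq > cutoff_seq);
      }
      ```

      As the client advances the cutoff seqnum, older tombstones fall at or below it and become droppable again by later compactions.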
  7. 01 Nov, 2017 1 commit
  8. 30 Oct, 2017 1 commit
  9. 27 Oct, 2017 1 commit
    • Y
      Blob DB: Inline small values in base DB · 5a2a6483
      Yi Wu committed
      Summary:
      Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values.
      
      Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches.
      
      See blob_index.h for the new blob index format. There are 4 cases when writing a new key:
      * small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue)
      * small value w/ TTL: put (type, expiration, value) to base db.
      * large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db.
      * large value w/ TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db.
      Closes https://github.com/facebook/rocksdb/pull/3066
      
      Differential Revision: D6142115
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0
      5a2a6483
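      The four write cases listed in this commit can be summarized as a small decision function. This is a sketch only: the `Encoding` enum and `ChooseEncoding` are hypothetical names, and treating "small" as `value_size < min_blob_size` is an assumption, since the message does not state how the boundary value is handled.

      ```cpp
      #include <cstddef>

      // Hypothetical labels for the four cases described in the commit message.
      enum class Encoding {
        kPlainValue,   // small, no TTL: normal value in base db
        kInlinedTTL,   // small, TTL: (type, expiration, value) in base db
        kBlob,         // large, no TTL: value in blob log, index in base db
        kBlobTTL       // large, TTL: value in blob log, index with expiration
      };

      // Picks the encoding from the value size and whether a TTL was supplied.
      // Assumption: values strictly smaller than min_blob_size count as small.
      constexpr Encoding ChooseEncoding(std::size_t value_size,
                                        std::size_t min_blob_size, bool has_ttl) {
        return value_size < min_blob_size
                   ? (has_ttl ? Encoding::kInlinedTTL : Encoding::kPlainValue)
                   : (has_ttl ? Encoding::kBlobTTL : Encoding::kBlob);
      }
      ```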
  10. 26 Oct, 2017 1 commit
    • A
      single-file bottom-level compaction when snapshot released · 9b18cc23
      Andrew Kryczka committed
      Summary:
      When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.
      
      - Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
      - Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
      - Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
      - Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
      Closes https://github.com/facebook/rocksdb/pull/3009
      
      Differential Revision: D6062044
      
      Pulled By: ajkr
      
      fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
      9b18cc23
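      The threshold caching described for `ReleaseSnapshot()` can be sketched as follows. The model is deliberately simplified and makes assumptions not stated in the commit: each bottommost file is represented only by its largest sequence number, and a file is marked once the oldest snapshot is newer than that number. `MarkResult`, `ComputeMarkedFiles` and `NeedsRecompute` are hypothetical names.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstddef>
      #include <cstdint>
      #include <limits>
      #include <vector>

      struct MarkResult {
        std::vector<std::size_t> marked;  // indexes of files to recompact
        // Smallest value the oldest snapshot must exceed to mark another file.
        uint64_t threshold = std::numeric_limits<uint64_t>::max();
      };

      // Marks every file whose newest entry is older than the oldest snapshot and
      // remembers the threshold at which the next file would become markable.
      MarkResult ComputeMarkedFiles(const std::vector<uint64_t>& largest_seqnos,
                                    uint64_t oldest_snapshot) {
        MarkResult result;
        for (std::size_t i = 0; i < largest_seqnos.size(); ++i) {
          if (largest_seqnos[i] < oldest_snapshot) {
            result.marked.push_back(i);
          } else {
            result.threshold = std::min(result.threshold, largest_seqnos[i]);
          }
        }
        assert(result.marked.size() <= largest_seqnos.size());
        return result;
      }

      // On snapshot release, recompute only if the new oldest snapshot passes the
      // cached threshold; otherwise no additional file can have become markable.
      bool NeedsRecompute(const MarkResult& cached, uint64_t new_oldest_snapshot) {
        return new_oldest_snapshot > cached.threshold;
      }
      ```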
  11. 24 Oct, 2017 1 commit
  12. 21 Oct, 2017 1 commit
    • A
      remove unused code · f8b5bb2f
      Andrew Kryczka committed
      Summary:
      Fixup for 6a541afc. This code didn't do anything because (1) `bytes_per_sync` is assigned in `EnvOptions`'s constructor; and (2) `OptimizeForCompactionTableWrite`'s return value was ignored, even though its only purpose is to return something.
      Closes https://github.com/facebook/rocksdb/pull/3055
      
      Differential Revision: D6114132
      
      Pulled By: ajkr
      
      fbshipit-source-id: ea4831770930e9cf83518e13eb2e1934d1f5487c
      f8b5bb2f
  13. 20 Oct, 2017 1 commit
  14. 19 Oct, 2017 1 commit
  15. 18 Oct, 2017 1 commit
    • Y
      Blob DB: Store blob index as kTypeBlobIndex in base db · eaaef911
      Yi Wu committed
      Summary:
      Blob db inserts the blob index into base db as kTypeBlobIndex type, to tell apart values written by plain rocksdb and by blob db. This makes it possible to migrate from an existing rocksdb to blob db.
      
      Also, with this patch blob db garbage collection moves away from OptimisticTransaction. Instead it uses a custom write callback to achieve behavior similar to OptimisticTransaction. This is because we need to pass the is_blob_index flag to DBImpl::Get, but OptimisticTransaction doesn't support it.
      Closes https://github.com/facebook/rocksdb/pull/3000
      
      Differential Revision: D6050044
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 61dc72ab9977625e75f78cd968e7d8a3976e3632
      eaaef911
  16. 12 Oct, 2017 1 commit
  17. 10 Oct, 2017 1 commit
    • Y
      WritePrepared Txn: Iterator · 8c392a31
      Yi Wu committed
      Summary:
      On iterator creation, take a snapshot, create a ReadCallback, and pass the ReadCallback to the underlying DBIter to check whether a key is committed.
      Closes https://github.com/facebook/rocksdb/pull/2981
      
      Differential Revision: D6001471
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 3565c4cdaf25370ba47008b0e0cb65b31dfe79fe
      8c392a31
  18. 06 Oct, 2017 1 commit
  19. 04 Oct, 2017 1 commit
    • Y
      Add ValueType::kTypeBlobIndex · d1cab2b6
      Yi Wu committed
      Summary:
      Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to
      1. Make it possible to open an existing rocksdb instance as blob db. Existing values will be of kTypeValue type, while values inserted by blob db will be of kTypeBlobIndex.
      2. Make rocksdb able to detect whether the db contains values written by blob db, and if so return an error.
      3. Make it possible for blob db to optionally store a value in an SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type).
      
      The root db (DBImpl) basically treats kTypeBlobIndex as a normal value on write. On Get, if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return a Status::NotSupported() status if is_blob is not provided. On scan, the allow_blob flag is passed, and if the flag is true, return whether the value is of kTypeBlobIndex type via iter->IsBlob().
      
      Changes on blob db side will be in a separate patch.
      Closes https://github.com/facebook/rocksdb/pull/2886
      
      Differential Revision: D5838431
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca
      d1cab2b6
  20. 29 Sep, 2017 1 commit
  21. 28 Sep, 2017 1 commit
    • Q
      Make bytes_per_sync and wal_bytes_per_sync mutable · 6a541afc
      Quinn Jarrell committed
      Summary:
      Moves the bytes_per_sync and wal_bytes_per_sync options from immutable options to mutable options. Also, if wal_bytes_per_sync is changed, the WAL file and memtables are flushed.
      Test Plan:
      Ran make check; all passed.
      
      Two new tests, SetBytesPerSync and SetWalBytesPerSync, check that after issuing setoptions with a new value for the variable, the db options have the new value.
      Closes https://github.com/facebook/rocksdb/pull/2893
      
      Reviewed By: yiwu-arbug
      
      Differential Revision: D5845814
      
      Pulled By: TheRushingWookie
      
      fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd
      6a541afc
  22. 20 Sep, 2017 1 commit
  23. 19 Sep, 2017 1 commit
    • M
      WritePrepared Txn: Advance seq one per batch · 60beefd6
      Maysam Yabandeh committed
      Summary:
      By default the seq number in DB is increased once per written key. WritePrepared txns require the seq to be increased once per entire batch, so that the seq can be used as the prepare timestamp by which the transaction is identified. We also need to increase seq for the commit marker, since it gives a unique id to the commit timestamp of transactions.
      
      Two unit tests are added to verify our understanding of how the seq should be increased. The recovery path requires much more work and is left to another patch.
      Closes https://github.com/facebook/rocksdb/pull/2885
      
      Differential Revision: D5837843
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a08960b93d727e1cf438c254d0c2636fb133cc1c
      60beefd6
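      The difference between the two sequence-number policies is easy to show with arithmetic. In this sketch `SeqAfterWrite` is a hypothetical helper, and the commit marker is modeled as a one-entry batch, which is an assumption made for illustration.

      ```cpp
      #include <cstdint>

      // Returns the last sequence number after writing one batch.
      // Default mode: one seq per key. Seq-per-batch mode: one seq per batch.
      constexpr uint64_t SeqAfterWrite(uint64_t last_seq, uint64_t keys_in_batch,
                                       bool seq_per_batch) {
        return last_seq + (seq_per_batch ? 1 : keys_in_batch);
      }
      ```

      Starting from seq 100, a three-key batch ends at 103 in the default mode but at 101 in seq-per-batch mode, so 101 alone identifies the prepared batch; the commit marker then takes 102.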
  24. 13 Sep, 2017 1 commit
    • A
      Fix naming in InternalKey · 5785b1fc
      Amy Xu committed
      Summary:
      - Switched all instances of SetMinPossibleForUserKey and SetMaxPossibleForUserKey in accordance with InternalKeyComparator's comparison logic
      Closes https://github.com/facebook/rocksdb/pull/2868
      
      Differential Revision: D5804152
      
      Pulled By: axxufb
      
      fbshipit-source-id: 80be35e04f2e8abc35cc64abe1fecb03af24e183
      5785b1fc
  25. 12 Sep, 2017 1 commit
    • M
      write-prepared txn: call IsInSnapshot · f46464d3
      Maysam Yabandeh committed
      Summary:
      This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
      Closes https://github.com/facebook/rocksdb/pull/2850
      
      Differential Revision: D5787375
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
      f46464d3
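      The read-path behavior described in this commit (reject a value, then move on to the next one) can be sketched with a toy version list. `Version` and `ReadNewestVisible` are hypothetical, and the callback is modeled as a plain `std::function` rather than RocksDB's ReadCallback class.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstdint>
      #include <functional>
      #include <string>
      #include <vector>

      // One stored version of a key: its sequence number and value.
      struct Version {
        uint64_t seq;
        std::string value;
      };

      // Scans versions from newest to oldest and returns the first one accepted by
      // the optional callback, or nullptr if every version is rejected.
      const std::string* ReadNewestVisible(
          const std::vector<Version>& versions,
          const std::function<bool(uint64_t)>* callback) {
        // Precondition: versions are ordered newest first.
        assert(std::is_sorted(versions.begin(), versions.end(),
                              [](const Version& a, const Version& b) {
                                return a.seq > b.seq;
                              }));
        for (const Version& v : versions) {
          if (callback == nullptr || (*callback)(v.seq)) {
            return &v.value;
          }
          // Rejected: fall through to the next (older) version.
        }
        return nullptr;
      }
      ```

      A write-prepared transaction would supply a callback that accepts only sequence numbers committed within its read snapshot; without a callback the newest version is returned as before.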
  26. 08 Sep, 2017 1 commit
  27. 01 Sep, 2017 1 commit
  28. 31 Aug, 2017 2 commits
  29. 25 Aug, 2017 1 commit
  30. 19 Aug, 2017 1 commit
    • A
      perf_context measure user bytes read · ed0a4c93
      Andrew Kryczka committed
      Summary:
      With this PR, we can measure read-amp for queries where perf_context is enabled as follows:
      
      ```
      SetPerfLevel(kEnableCount);
      Get(1, "foo");
      double read_amp = static_cast<double>(get_perf_context()->block_read_byte) / get_perf_context()->get_read_bytes;
      SetPerfLevel(kDisable);
      ```
      
      Our internal infra enables perf_context for a sampling of queries. So we'll be able to compute the read-amp for the sample set, which can give us a good estimate of read-amp.
      Closes https://github.com/facebook/rocksdb/pull/2749
      
      Differential Revision: D5647240
      
      Pulled By: ajkr
      
      fbshipit-source-id: ad73550b06990cf040cc4528fa885360f308ec12
      ed0a4c93
  31. 10 Aug, 2017 1 commit
    • A
      add VerifyChecksum() to db.h · 7848f0b2
      Aaron G committed
      Summary:
      We need a tool to check for sst file corruption in the db.
      It checks all the sst files in the current version and reads all the blocks (data, meta, index) with checksum verification. If any verification fails, the function returns a non-OK status.
      Closes https://github.com/facebook/rocksdb/pull/2498
      
      Differential Revision: D5324269
      
      Pulled By: lightmark
      
      fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b
      7848f0b2
  32. 09 Aug, 2017 1 commit
    • C
      Try to repair db with wal_dir option, avoid leaking some WAL files · d97a72d6
      Chang Liu committed
      Summary:
      We should search wal_dir in the Repairer::FindFiles function, and avoid using
      LogFileName(dbname, number) to get a WAL file's name, since that produces a wrong
      WAL filename, as follows:
      
      ```
      [WARN] [/home/liuchang/Workspace/rocksdb/db/repair.cc:310] Log #3: ignoring conversion error: IO error: While opening a file for sequentially reading: /tmp/rocksdbtest-1000/repair_test/000003.log: No such file or directory
      ```
      I have added a new test case to repair_test.cc, which tries to repair the db with all WAL options.
      Signed-off-by: Chang Liu <liuchang0812@gmail.com>
      Closes https://github.com/facebook/rocksdb/pull/2692
      
      Differential Revision: D5575888
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5b93e9f85cddc01663ccecd87631fa723ac466a3
      d97a72d6
  33. 04 Aug, 2017 1 commit
    • A
      Introduce bottom-pri thread pool for large universal compactions · cc01985d
      Andrew Kryczka committed
      Summary:
      When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.
      
      This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.
      
      - Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
      - Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
      - Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
      - Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
      - Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
      Closes https://github.com/facebook/rocksdb/pull/2580
      
      Differential Revision: D5422916
      
      Pulled By: ajkr
      
      fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
      cc01985d
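      The forwarding rule from this commit can be written as a small routing decision. `Pool` and `PickPool` are hypothetical names for illustration; the real scheduling logic involves more state than this sketch covers.

      ```cpp
      // Which thread pool a picked compaction should run in.
      enum class Pool { kLow, kBottom };

      // Universal compactions that include the last level go to the bottom-pri
      // pool, but only when the user configured it with a positive thread count.
      constexpr Pool PickPool(bool is_universal, bool includes_bottom_level,
                              int bottom_pool_threads) {
        return (is_universal && includes_bottom_level && bottom_pool_threads > 0)
                   ? Pool::kBottom
                   : Pool::kLow;
      }
      ```

      With zero bottom-pri threads every compaction stays in the low-pri pool, which matches the feature being disabled by default.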
  34. 25 Jul, 2017 1 commit
    • S
      Add Iterator::Refresh() · e67b35c0
      Siying Dong committed
      Summary:
      Add and implement Iterator::Refresh(). When this function is called, if the super version hasn't changed, update the sequence number of the iterator to the latest one and invalidate the iterator. If the super version has changed, recreate the whole iterator. This can help users reuse the iterator more easily.
      Closes https://github.com/facebook/rocksdb/pull/2621
      
      Differential Revision: D5464500
      
      Pulled By: siying
      
      fbshipit-source-id: f548bd35e85c1efca2ea69273802f6704eba6ba9
      e67b35c0
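      The two branches of Refresh() described in this commit can be modeled with a toy iterator. `ToyIterator` and the free function `Refresh` are hypothetical; `rebuild_count` exists only so the sketch can show when the expensive path was taken.

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Toy iterator state: which super version it was built on, the sequence
      // number it reads at, and whether it is currently positioned.
      struct ToyIterator {
        uint64_t super_version_number;
        uint64_t sequence;
        bool valid;
        int rebuild_count;
      };

      // Cheap path: same super version, so only bump the sequence number.
      // Expensive path: super version changed, so the iterator is rebuilt.
      // Either way the iterator is invalidated and must be re-seeked afterwards.
      void Refresh(ToyIterator& it, uint64_t current_super_version,
                   uint64_t latest_sequence) {
        assert(latest_sequence >= it.sequence);
        if (it.super_version_number != current_super_version) {
          it.super_version_number = current_super_version;
          ++it.rebuild_count;
        }
        it.sequence = latest_sequence;
        it.valid = false;
      }
      ```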
  35. 22 Jul, 2017 2 commits
  36. 16 Jul, 2017 1 commit
  37. 25 Jun, 2017 1 commit
    • M
      Optimize for serial commits in 2PC · 499ebb3a
      Maysam Yabandeh committed
      Summary:
      Throughput: 46k tps in our sysbench settings (filling the details later)
      
      The idea is to have the simplest change that gives us a reasonable boost
      in 2PC throughput.
      
      Major design changes:
      1. The WAL file internal buffer is not flushed after each write. Instead
      it is flushed before critical operations (WAL copy via fs) or when
      FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
      via mutex_.
      2. Use two sequence numbers: last seq, and last seq for write. Last seq
      is the last visible sequence number for reads. Last seq for write is the
      next sequence number that should be used to write to WAL/memtable. This
      allows a memtable write to proceed in parallel with WAL writes.
      3. BatchGroup is not used for writes. This means that we can have
      parallel writers which changes a major assumption in the code base. To
      accommodate for that i) allow only 1 WriteImpl that intends to write to
      memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
      come via group commit phase which is serial anyway, ii) make all the
      parts in the code base that assumed to be the only writer (via
      EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
      protected via a stat_mutex_.
      
      Note: the first commit has the approach figured out but is not clean.
      Submitting the PR anyway to get early feedback on the approach. If
      we are ok with the approach I will go ahead with these updates:
      0) Rebase with Yi's pipelining changes
      1) Currently batching is disabled by default to make sure that it will be
      consistent with all unit tests. Will make this optional via a config.
      2) A couple of unit tests are disabled. They need to be updated with the
      serial commit of 2PC taken into account.
      3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
      releasing mutex_ beforehand (the same way EnterUnbatched does). This
      needs to be cleaned up.
      Closes https://github.com/facebook/rocksdb/pull/2345
      
      Differential Revision: D5210732
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
      499ebb3a