1. 02 Nov, 2017 4 commits
    • A
      release 5.9 · cd124215
      Andrew Kryczka committed
      Summary:
      updated HISTORY.md and version.h for the release.
      Closes https://github.com/facebook/rocksdb/pull/3110
      
      Differential Revision: D6218645
      
      Pulled By: ajkr
      
      fbshipit-source-id: 99ab8473e9088b02d7596e92351cce7a60a99e93
      cd124215
    • M
      WritePrepared Txn: ValidateSnapshot · 02693f64
      Maysam Yabandeh committed
      Summary:
      Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function.
      Closes https://github.com/facebook/rocksdb/pull/3101
      
      Differential Revision: D6199405
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240
      02693f64
    • M
      Added support for differential snapshots · 7fe3b328
      Mikhail Antonov committed
      Summary:
      The motivation for this PR is to add support to RocksDB for differential (incremental) snapshots, i.e., a snapshot of the DB changes between two points in time (one can think of it as the diff between two sequence numbers, or the diff D, which can be thought of as an SST file or just a set of KVs that can be applied to sequence number S1 to bring the database to its state at sequence number S2).
      
      This feature would be useful for various distributed storage layers built on top of RocksDB, as it should help reduce the resources (time and network bandwidth) needed to recover and rebuild DB instances as replicas in the context of distributed storage.
      
      From the API standpoint, that would look like a client app requesting an iterator between (start seqnum) and the current DB state, and reading the "diff".
      
      This is a very early draft PR for initial review and discussion of the approach; I'm going to rework some parts and keep updating the PR.
      
      For now, what's done here according to initial discussions:
      
      Preserving deletes:
       - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot, it wouldn't get dropped by a compaction. This is done by adding a new param to Options (a preserve deletes flag) and a new variable to DBImpl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
       - I also added a new API call that lets clients advance this cutoff seqnum, at or below which deletes may be dropped; I assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp <-> seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
       - The compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
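
The tombstone rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in compaction_iterator.cc: the `Tombstone` struct, the `otherwise_droppable` flag and the `MayDropTombstone` helper are hypothetical names introduced here, and the exact boundary handling is taken from the "seqnum > cutoff seqnum" wording of the summary.

```cpp
#include <cstdint>

// Hypothetical stand-ins for illustration; not the RocksDB definitions.
typedef uint64_t SequenceNumber;

struct Tombstone {
  SequenceNumber seqnum;
  bool otherwise_droppable;  // compaction would normally be allowed to drop it
};

// Returns true if a compaction may drop the tombstone.
// With preserve_deletes enabled, tombstones newer than the cutoff are kept
// even when they would otherwise be eligible for deletion.
inline bool MayDropTombstone(const Tombstone& t, bool preserve_deletes,
                             SequenceNumber cutoff_seqnum) {
  if (!t.otherwise_droppable) return false;
  if (preserve_deletes && t.seqnum > cutoff_seqnum) return false;
  return true;
}
```

Advancing the cutoff (the new client-facing call mentioned above) would then make older tombstones droppable again on the next compaction.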
      
      Iterator changes:
       - A couple of params were added to ReadOptions to optionally allow the client to request internal keys instead of user keys (so that the client can get the latest value of a key, be it a delete marker or a put), as well as a min timestamp and min seqnum.
      
      TableCache changes:
       - I modified the table_cache code to be able to quickly exclude SST files from the iterator's heap if creation_time on the file is less than iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compaction), but not so much for universal compaction with a more or less long iterator time span.
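
As a rough sketch of the file-exclusion idea (assuming nothing about the real table_cache code), the filter compares each file's creation_time with iter_start_ts. `SstFileMeta` and `FilesForIterator` are hypothetical names used only for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-file metadata; only the fields the sketch needs.
struct SstFileMeta {
  uint64_t file_number;
  uint64_t creation_time;  // seconds since epoch
};

// Returns the file numbers an iterator would still have to open.
// Files created before iter_start_ts are skipped entirely; with
// iter_start_ts == 0 nothing is excluded.
std::vector<uint64_t> FilesForIterator(const std::vector<SstFileMeta>& files,
                                       uint64_t iter_start_ts) {
  std::vector<uint64_t> selected;
  for (std::size_t i = 0; i < files.size(); ++i) {
    if (files[i].creation_time < iter_start_ts) continue;  // too old, skip
    selected.push_back(files[i].file_number);
  }
  return selected;
}
```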
      
      What's left:
      
       - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
      Closes https://github.com/facebook/rocksdb/pull/2999
      
      Differential Revision: D6175602
      
      Pulled By: mikhail-antonov
      
      fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
      7fe3b328
    • M
      WritePrepared Txn: Optimize for recoverable state · 17731a43
      Maysam Yabandeh committed
      Summary:
      GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery. So it does not need to be stored in the memtable right after each commit.
      This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush.
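
The write path described above can be modelled with a toy class, shown here only to make the three cases (commit, flush, recovery) concrete. `ToyDB` and all of its members are hypothetical; the real change lives in the write and flush paths of the DB and is considerably more involved.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model: a WAL, a memtable, flushed data, and a buffer of
// recoverable state that has only been written to the WAL so far.
class ToyDB {
 public:
  // Commit-time state goes to the WAL only and is buffered in memory.
  void WriteRecoverableState(const std::string& key, const std::string& value) {
    wal_.push_back(std::make_pair(key, value));
    pending_.push_back(std::make_pair(key, value));
  }

  // Right before a flush, the buffered state is applied to the memtable,
  // so it survives even though the WAL is deleted afterwards.
  void Flush() {
    for (std::size_t i = 0; i < pending_.size(); ++i) {
      memtable_[pending_[i].first] = pending_[i].second;
    }
    pending_.clear();
    sst_.insert(memtable_.begin(), memtable_.end());
    memtable_.clear();
    wal_.clear();
  }

  // Recovery replays the WAL into a fresh memtable.
  void CrashAndRecover() {
    memtable_.clear();
    pending_.clear();
    for (std::size_t i = 0; i < wal_.size(); ++i) {
      memtable_[wal_[i].first] = wal_[i].second;
    }
  }

  bool InMemtable(const std::string& key) const { return memtable_.count(key) > 0; }
  bool InSst(const std::string& key) const { return sst_.count(key) > 0; }

 private:
  std::vector<std::pair<std::string, std::string> > wal_;
  std::vector<std::pair<std::string, std::string> > pending_;
  std::map<std::string, std::string> memtable_;
  std::map<std::string, std::string> sst_;
};
```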
      Closes https://github.com/facebook/rocksdb/pull/3071
      
      Differential Revision: D6148023
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7
      17731a43
  2. 01 Nov, 2017 4 commits
  3. 31 Oct, 2017 2 commits
  4. 30 Oct, 2017 1 commit
  5. 29 Oct, 2017 1 commit
  6. 28 Oct, 2017 7 commits
    • A
      always drop tombstones compacted to bottommost level · 6a9335db
      Andrew Kryczka committed
      Summary:
      Problem was in bottommost compaction, when an L0->L0 compaction happened and L0 was bottommost. Then we'd preserve tombstones according to `Compaction::KeyNotExistsBeyondOutputLevel`, while zeroing seqnum according to `CompactionIterator::PrepareOutput`, thus triggering the assertion in `PrepareOutput`. To fix, we can just drop tombstones in L0->L0 when the output is "bottommost", i.e., the compaction includes the oldest L0 file and there's nothing at lower levels.
      Closes https://github.com/facebook/rocksdb/pull/3085
      
      Differential Revision: D6175742
      
      Pulled By: ajkr
      
      fbshipit-source-id: 8ab19a2e001496f362e9eb0a71757e2f6ecfdb3b
      6a9335db
    • Y
      TableProperty::oldest_key_time defaults to 0 · 84a04af9
      Yi Wu committed
      Summary:
      We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0.
      
      Also revert db_sst_test back to before #2842.
      Closes https://github.com/facebook/rocksdb/pull/3079
      
      Differential Revision: D6165702
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0
      84a04af9
    • I
      Mark files as trash by using .trash extension · 05993155
      Islam AbdelRahman committed
      Summary:
      SstFileManager moves files that need to be deleted into a trash directory.
      Deprecate this behaviour and instead add a ".trash" extension to files that need to be deleted.
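
A small sketch of the naming convention, assuming the marker is a plain suffix appended to the original path. `TrashName` and `IsTrashFile` are hypothetical helpers, not SstFileManager functions.

```cpp
#include <string>

// Hypothetical helper: the name a file gets when it is marked as trash.
inline std::string TrashName(const std::string& path) {
  return path + ".trash";
}

// Hypothetical helper: true if the path carries the ".trash" extension.
inline bool IsTrashFile(const std::string& path) {
  const std::string ext = ".trash";
  return path.size() > ext.size() &&
         path.compare(path.size() - ext.size(), ext.size(), ext) == 0;
}
```

Marking a file in place this way avoids moving it into a separate directory; a background deleter can later find the marked files by extension.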
      Closes https://github.com/facebook/rocksdb/pull/2970
      
      Differential Revision: D5976805
      
      Pulled By: IslamAbdelRahman
      
      fbshipit-source-id: 27374ece4315610b2792c30ffcd50232d4c9a343
      05993155
    • Y
      Blob DB: update blob file format · 3ebb7ba7
      Yi Wu committed
      Summary:
      Changing the blob file format, plus some code cleanup around the change. The changes to the blob log format are:
      * Remove the timestamp field from the blob file header, blob file footer and blob records. The field is not used and is often confused with the expiration field.
      * The blob file header now comes with a column family id, which is always equal to the default column family id. It leaves room for future support of column families.
      * The compression field in the blob file header is now a standalone byte (instead of being compactly encoded with the flags field).
      * The blob file footer now comes with its own CRC.
      * Key length is now uint64_t instead of uint32_t.
      * The blob CRC now checksums both key and value (instead of value only).
      * Some reordering of the fields.
      
      The list of cleanups:
      * Better inline comments in blob_log_format.h
      * Rename ttlrange_t and snrange_t to ExpirationRange and SequenceRange respectively.
      * Simplify blob_db::Reader
      * Move CRC checking logic inside blob_log_format.cc
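
To illustrate why checksumming the key as well as the value matters, here is a self-contained sketch using a bitwise CRC-32C. It does not reproduce the record layout in blob_log_format.h or any masking the real code may apply; `Crc32cExtend`, `BlobRecordCrc` and `ValueOnlyCrc` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
// `crc` is the result of a previous call, or 0 to start a new checksum.
inline uint32_t Crc32cExtend(uint32_t crc, const std::string& data) {
  crc = ~crc;
  for (std::size_t i = 0; i < data.size(); ++i) {
    crc ^= static_cast<unsigned char>(data[i]);
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// New scheme in this sketch: the checksum covers the key and then the value.
inline uint32_t BlobRecordCrc(const std::string& key, const std::string& value) {
  return Crc32cExtend(Crc32cExtend(0, key), value);
}

// Old scheme in this sketch: the checksum covers the value only,
// so a corrupted key goes unnoticed.
inline uint32_t ValueOnlyCrc(const std::string& value) {
  return Crc32cExtend(0, value);
}
```

With the value-only checksum, two records with different keys and the same value share a CRC; with the combined checksum, a single corrupted key byte changes the result.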
      Closes https://github.com/facebook/rocksdb/pull/3081
      
      Differential Revision: D6171304
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: e4373e0d39264441b7e2fbd0caba93ddd99ea2af
      3ebb7ba7
    • D
      Enable cacheline_aligned_alloc() to allocate from jemalloc if enabled. · 682db813
      Dmitri Smirnov committed
      Summary:
      Reuse WITH_JEMALLOC option in preparation for module search unification.
        Move jemalloc overrides into a separate .cc
        Remove obsolete JEMALLOC_NOINIT option.
      Closes https://github.com/facebook/rocksdb/pull/3078
      
      Differential Revision: D6174826
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 9970a0289b4490272d15853920d9d7531af91140
      682db813
    • P
      Fix coverity uninitialized fields warnings in lru_cache · d9240b54
      Prashant D committed
      Summary:
      Coverity uninitialized member variable warnings in lru_cache
      Closes https://github.com/facebook/rocksdb/pull/3082
      
      Differential Revision: D6173062
      
      Pulled By: sagar0
      
      fbshipit-source-id: 7bcfc653457bd362d46045d06527838c9a6adad6
      d9240b54
    • P
      Fix coverity issues column_family, compaction_db/iterator · 50e95a63
      Prashant D committed
      Summary:
      db/column_family.h:
      ColumnFamilyHandleInternal()
      
      CID 1322806 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
      2. uninit_member: Non-static class member internal_cfd_ is not initialized in this constructor nor in any functions that it calls.
          : ColumnFamilyHandleImpl(nullptr, nullptr, nullptr) {}
      
      db/compacted_db_impl.cc:
      CompactedDBImpl::CompactedDBImpl(
        const DBOptions& options, const std::string& dbname)
        : DBImpl(options, dbname) {
      2. uninit_member: Non-static class member cfd_ is not initialized in this constructor nor in any functions that it calls.
      4. uninit_member: Non-static class member version_ is not initialized in this constructor nor in any functions that it calls.
      
      CID 1396120 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
      6. uninit_member: Non-static class member user_comparator_ is not initialized in this constructor nor in any functions that it calls.
      }
      
      db/compaction_iterator.cc:
      9. uninit_member: Non-static class member current_user_key_sequence_ is not initialized in this constructor nor in any functions that it calls.
      11. uninit_member: Non-static class member current_user_key_snapshot_ is not initialized in this constructor nor in any functions that it calls.
      
      CID 1419855 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
      13. uninit_member: Non-static class member current_key_committed_ is not initialized in this constructor nor in any functions that it calls.
      Closes https://github.com/facebook/rocksdb/pull/3084
      
      Differential Revision: D6172999
      
      Pulled By: sagar0
      
      fbshipit-source-id: 084d73393faf8022c01359cfb445807b6a782460
      50e95a63
  7. 27 Oct, 2017 4 commits
    • P
      Fix coverity uninitialized fields warnings · 47166bae
      Prashant D committed
      Pulled By: ajkr
      
      Differential Revision: D6170448
      
      fbshipit-source-id: 5fd6d1608fc0df27c94d9f5059315ce7f79b8f5c
      47166bae
    • P
      Fix coverity issue for MutableDBOptions default constructor · 67b29e26
      Prashant D committed
      Summary:
      MutableDBOptions::MutableDBOptions()
          : max_background_jobs(2),
            base_background_compactions(-1),
            max_background_compactions(-1),
            avoid_flush_during_shutdown(false),
            delayed_write_rate(2 * 1024U * 1024U),
            max_total_wal_size(0),
            delete_obsolete_files_period_micros(6ULL * 60 * 60 * 1000000),
            stats_dump_period_sec(600),
      2. uninit_member: Non-static class member bytes_per_sync is not initialized in this constructor nor in any functions that it calls.
      
      CID 1419857 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
      4. uninit_member: Non-static class member wal_bytes_per_sync is not initialized in this constructor nor in any functions that it calls.
            max_open_files(-1) {}
      Closes https://github.com/facebook/rocksdb/pull/3069
      
      Differential Revision: D6170424
      
      Pulled By: ajkr
      
      fbshipit-source-id: 1f94e86b87611ad2330b8b1707911150978d68b8
      67b29e26
    • A
      implement lower bound for iterators · 95667383
      Andrew Kryczka committed
      Summary:
      - for `SeekToFirst()`, just convert it to a regular `Seek()` if lower bound is specified
      - for operations that iterate backwards over user keys (`SeekForPrev`, `SeekToLast`, `Prev`), change `PrevInternal` to check whether user key went below lower bound every time the user key changes -- same approach we use to ensure we stay within a prefix when `prefix_same_as_start=true`.
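
The two rules above can be sketched on top of a sorted `std::map`. This is only an illustration of the approach, assuming an inclusive lower bound; `BoundedIterator` is a hypothetical class and does not mirror DBIter's internals.

```cpp
#include <iterator>
#include <map>
#include <string>

// Toy iterator over a sorted map with an optional inclusive lower bound.
class BoundedIterator {
 public:
  typedef std::map<std::string, std::string> Map;

  // lower_bound may be nullptr (no bound); it must outlive the iterator.
  BoundedIterator(const Map& data, const std::string* lower_bound)
      : data_(data), lower_(lower_bound), it_(data.end()) {}

  // With a lower bound, SeekToFirst() is just a regular seek to the bound.
  void SeekToFirst() {
    it_ = lower_ ? data_.lower_bound(*lower_) : data_.begin();
  }

  void SeekToLast() {
    it_ = data_.empty() ? data_.end() : std::prev(data_.end());
    InvalidateIfBelowLowerBound();
  }

  // Backward step: re-check the bound every time the key changes.
  void Prev() {
    if (!Valid()) return;
    if (it_ == data_.begin()) {
      it_ = data_.end();
      return;
    }
    --it_;
    InvalidateIfBelowLowerBound();
  }

  bool Valid() const { return it_ != data_.end(); }
  const std::string& key() const { return it_->first; }

 private:
  void InvalidateIfBelowLowerBound() {
    if (Valid() && lower_ && it_->first < *lower_) it_ = data_.end();
  }

  const Map& data_;
  const std::string* lower_;
  Map::const_iterator it_;
};
```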
      Closes https://github.com/facebook/rocksdb/pull/3074
      
      Differential Revision: D6158654
      
      Pulled By: ajkr
      
      fbshipit-source-id: cb0e3a922e2650d2cd4d1c6e1c0f1e8b729ff518
      95667383
    • Y
      Blob DB: Inline small values in base DB · 5a2a6483
      Yi Wu committed
      Summary:
      Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values.
      
      Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches.
      
      See blob_index.h for the new blob index format. There are 4 cases when writing a new key:
      * small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue)
      * small value w/ TTL: put (type, expiration, value) to base db.
      * large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db.
      * large value w/ TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db.
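
The four cases above reduce to a two-way decision on value size and TTL, sketched below. The enum and `ClassifyWrite` are hypothetical names, and treating a value as small when its size is strictly below `min_blob_size` is an assumption of this sketch rather than something the summary states.

```cpp
#include <cstdint>

// Hypothetical labels for the four write paths listed above.
enum class BlobWriteKind {
  kInlineValue,         // small value, no TTL: normal value in base db
  kInlineValueWithTtl,  // small value, TTL: (type, expiration, value)
  kBlobIndex,           // large value, no TTL: value in blob log, index in base db
  kBlobIndexWithTtl     // large value, TTL: as above plus expiration
};

inline BlobWriteKind ClassifyWrite(uint64_t value_size, uint64_t min_blob_size,
                                   bool has_ttl) {
  const bool is_small = value_size < min_blob_size;  // boundary is an assumption
  if (is_small) {
    return has_ttl ? BlobWriteKind::kInlineValueWithTtl
                   : BlobWriteKind::kInlineValue;
  }
  return has_ttl ? BlobWriteKind::kBlobIndexWithTtl : BlobWriteKind::kBlobIndex;
}
```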
      Closes https://github.com/facebook/rocksdb/pull/3066
      
      Differential Revision: D6142115
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0
      5a2a6483
  8. 26 Oct, 2017 3 commits
    • A
      single-file bottom-level compaction when snapshot released · 9b18cc23
      Andrew Kryczka committed
      Summary:
      When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.
      
      - Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
      - Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
      - Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
      - Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
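
The marking and threshold caching described above might look roughly like this. It is a simplified sketch: it ignores whether a file actually contains deleted or overwritten keys, and `BottomFile`, `MarkResult`, `ComputeMarked` and `NeedsRecompute` are hypothetical names rather than the `VersionStorageInfo` members.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical view of a bottommost file.
struct BottomFile {
  uint64_t file_number;
  uint64_t largest_seqno;
};

struct MarkResult {
  std::vector<uint64_t> marked;  // files that can be recompacted now
  uint64_t mark_threshold;       // oldest snapshot must exceed this to mark more
};

// A file is marked once every key in it is older than the oldest snapshot.
// For the remaining files, remember the smallest largest_seqno as the threshold.
MarkResult ComputeMarked(const std::vector<BottomFile>& files,
                         uint64_t oldest_snapshot_seqnum) {
  MarkResult result;
  result.mark_threshold = std::numeric_limits<uint64_t>::max();
  for (std::size_t i = 0; i < files.size(); ++i) {
    if (files[i].largest_seqno < oldest_snapshot_seqnum) {
      result.marked.push_back(files[i].file_number);
    } else {
      result.mark_threshold =
          std::min(result.mark_threshold, files[i].largest_seqno);
    }
  }
  return result;
}

// On snapshot release, recomputation is only needed when the new oldest
// snapshot has moved past the cached threshold.
inline bool NeedsRecompute(const MarkResult& previous,
                           uint64_t new_oldest_snapshot_seqnum) {
  return new_oldest_snapshot_seqnum > previous.mark_threshold;
}
```

Caching the threshold is what lets most snapshot releases skip the scan over bottommost files.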
      Closes https://github.com/facebook/rocksdb/pull/3009
      
      Differential Revision: D6062044
      
      Pulled By: ajkr
      
      fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
      9b18cc23
    • S
      Return write error on reaching blob dir size limit · 96e3a600
      Sagar Vemuri committed
      Summary:
      I found that we continue accepting writes even when the blob db goes beyond the configured blob directory size limit. Now we return an error for writes once the `blob_dir_size` limit is reached, provided `is_fifo` is set to false. (We cannot just drop any file when `is_fifo` is true.)
      
      Deleting the oldest file when `is_fifo` is true will be handled in a later PR.
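
A minimal sketch of the check, assuming the decision depends only on the current total size, the incoming write size and the two options named above. `BlobDirState`, `WriteResult` and `CheckBlobWrite` are hypothetical, and treating `blob_dir_size == 0` as "no limit" is an assumption of the sketch.

```cpp
#include <cstdint>

// Hypothetical result type for the sketch.
enum class WriteResult { kOk, kNoSpace };

// Hypothetical snapshot of the state the check needs.
struct BlobDirState {
  uint64_t total_blob_size;  // bytes currently used by blob files
  uint64_t blob_dir_size;    // configured limit; 0 is treated as "no limit" here
  bool is_fifo;
};

// Rejects the write when it would push the directory past its limit and
// FIFO eviction is not enabled.
inline WriteResult CheckBlobWrite(const BlobDirState& state, uint64_t new_bytes) {
  if (state.blob_dir_size > 0 && !state.is_fifo &&
      state.total_blob_size + new_bytes > state.blob_dir_size) {
    return WriteResult::kNoSpace;
  }
  return WriteResult::kOk;
}
```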
      Closes https://github.com/facebook/rocksdb/pull/3060
      
      Differential Revision: D6136156
      
      Pulled By: sagar0
      
      fbshipit-source-id: 2f11cb3f2eedfa94524fbfa2613dd64bfad7a23c
      96e3a600
    • I
      Fix tombstone scans in SeekForPrev outside prefix · addfe1ef
      Islam AbdelRahman committed
      Summary:
      When doing a Seek() or SeekForPrev() we should stop the moment we see a key with a different prefix than the start key if ReadOptions::prefix_same_as_start was set to true.
      
      Right now we don't stop if we encounter a tombstone outside the prefix while executing SeekForPrev().
      Closes https://github.com/facebook/rocksdb/pull/3067
      
      Differential Revision: D6149638
      
      Pulled By: IslamAbdelRahman
      
      fbshipit-source-id: 7f659862d2bf552d3c9104a360c79439ceba2f18
      addfe1ef
  9. 25 Oct, 2017 1 commit
  10. 24 Oct, 2017 4 commits
    • Z
      added missing subcodes and improved error message for missing enum values · 57fcdc26
      zawlazaw committed
      Summary:
      Java's `Status.SubCode` was out of sync with `include/rocksdb/status.h:SubCode`.
      
      When running out of disk space, this led to an `IllegalArgumentException` because of an invalid status code, rather than just returning the corresponding status code without an exception.
      
      I added the missing status codes.
      
      With this change, we keep the behaviour of throwing an `IllegalArgumentException` in case of newly added status codes that are defined in C but not in Java.
      
      We could think of an alternative strategy: add in Java another code "UnknownCode" which acts as a catch-all for all those status codes that are not yet mirrored from C to Java. This approach would never throw an exception but simply return a non-OK status-code.
      
      I think the current approach of throwing an Exception in case of a C/Java inconsistency is fine, but if you have some opinion on the alternative strategy, then feel free to comment here.
      Closes https://github.com/facebook/rocksdb/pull/3050
      
      Differential Revision: D6129682
      
      Pulled By: sagar0
      
      fbshipit-source-id: f2bf44caad650837cffdcb1f93eb793b43580c66
      57fcdc26
    • Y
      Add DB::Properties::kEstimateOldestKeyTime · 66a2c44e
      Yi Wu committed
      Summary:
      With FIFO compaction we would like to get the oldest data time for monitoring. The problem is that we don't have a timestamp for each key in the DB. As an approximation, we expose the earliest SST file "creation_time" property.
      
      My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
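
The approximation amounts to taking a minimum over per-file creation times, as in the sketch below. `EstimateOldestKeyTime` here is a hypothetical free function, and declining to produce an estimate when any file reports a creation time of 0 (unknown) is an assumption of the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true and stores the earliest creation time in *estimate, or returns
// false when no estimate can be made (no files, or a file with an unknown
// creation time, represented as 0 in this sketch).
bool EstimateOldestKeyTime(const std::vector<uint64_t>& sst_creation_times,
                           uint64_t* estimate) {
  if (sst_creation_times.empty()) return false;
  uint64_t oldest = sst_creation_times[0];
  for (std::size_t i = 0; i < sst_creation_times.size(); ++i) {
    if (sst_creation_times[i] == 0) return false;  // unknown creation time
    if (sst_creation_times[i] < oldest) oldest = sst_creation_times[i];
  }
  *estimate = oldest;
  return true;
}
```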
      Closes https://github.com/facebook/rocksdb/pull/2842
      
      Differential Revision: D5770600
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
      66a2c44e
    • D
      Fix unused var warnings in Release mode · d2a65c59
      Dmitri Smirnov committed
      Summary:
      MSVC does not support the unused attribute at this time. A separate assignment line fixes the issue, probably because MSVC counts it as a use, so it no longer complains about the unused variable.
      Closes https://github.com/facebook/rocksdb/pull/3048
      
      Differential Revision: D6126272
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 4907865db45fd75a39a15725c0695aaa17509c1f
      d2a65c59
    • M
      Enable two write queues for transactions · 63822eb7
      Maysam Yabandeh committed
      Summary:
      Enable concurrent_prepare flag for WritePrepared transactions and extend the existing transaction tests with this config.
      Closes https://github.com/facebook/rocksdb/pull/3046
      
      Differential Revision: D6106534
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 88c8d21d45bc492beb0a131caea84a2ac5e7d38c
      63822eb7
  11. 21 Oct, 2017 6 commits
  12. 20 Oct, 2017 2 commits
    • S
      Make FIFO compaction options dynamically configurable · f0804db7
      Sagar Vemuri committed
      Summary:
      ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now.
      
      Some of the ways in which the fifo compaction options can be set are:
      - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})`
      - `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})`
      - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})`
      - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})`
      
      Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes.
      Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`.
      The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now.
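
To show the shape of the nested option strings used in the `SetOptions` examples above, here is a small stand-alone parser. It is not the `ParseStruct` added by this change; `ParseNestedOptions` is a hypothetical function that only splits `{name=value;...}` into a map and does no whitespace trimming or type conversion.

```cpp
#include <map>
#include <string>

// Parses "{k1=v1;k2=v2;}" into name/value pairs. The trailing ';' is optional.
// Returns false if the braces are missing or a field has no "name=" part.
bool ParseNestedOptions(const std::string& input,
                        std::map<std::string, std::string>* out) {
  if (input.size() < 2 || input[0] != '{' || input[input.size() - 1] != '}') {
    return false;
  }
  const std::string body = input.substr(1, input.size() - 2);
  std::string::size_type pos = 0;
  while (pos < body.size()) {
    std::string::size_type end = body.find(';', pos);
    if (end == std::string::npos) end = body.size();
    const std::string field = body.substr(pos, end - pos);
    pos = end + 1;
    if (field.empty()) continue;  // tolerate ";;"
    const std::string::size_type eq = field.find('=');
    if (eq == std::string::npos || eq == 0) return false;
    (*out)[field.substr(0, eq)] = field.substr(eq + 1);
  }
  return true;
}
```

A generic splitter of this kind is what makes it cheap to support other nested option structs later: only the per-field conversion differs.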
      Closes https://github.com/facebook/rocksdb/pull/3006
      
      Differential Revision: D6058619
      
      Pulled By: sagar0
      
      fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92
      f0804db7
    • D
      Enable MSVC W4 with a few exceptions. Fix warnings and bugs · ebab2e2d
      Dmitri Smirnov committed
      Summary: Closes https://github.com/facebook/rocksdb/pull/3018
      
      Differential Revision: D6079011
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57
      ebab2e2d
  13. 19 Oct, 2017 1 commit
    • S
      Update RocksDB Authors File · b7499945
      Sagar Vemuri committed
      Summary: Update RocksDB Authors File.
      
      Reviewed By: yiwu-arbug
      
      Differential Revision: D6075453
      
      fbshipit-source-id: dff52f483aab33c41de391f145a8273acfd6cbde
      b7499945