1. 16 Jun, 2018 1 commit
  2. 22 May, 2018 1 commit
    • Z
      Move prefix_extractor to MutableCFOptions · c3ebc758
      Zhongyi Xie committed
      Summary:
      Currently it is not possible to change the bloom filter config without restarting the DB, which causes a lot of operational complexity for users.
      This PR aims to make it possible to dynamically change bloom filter config.
      Closes https://github.com/facebook/rocksdb/pull/3601
      
      Differential Revision: D7253114
      
      Pulled By: miasantreble
      
      fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
      c3ebc758
  3. 17 May, 2018 1 commit
    • M
      Change and clarify the relationship between Valid(), status() and Seek*() for... · 8bf555f4
      Mike Kolupaev committed
      Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs
      
      Summary:
      Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
       * If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
       * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
      
      However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
      
      This PR changes the convention to:
       * If status() is not ok, Valid() always returns false.
       * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
      
      This does sacrifice the two use cases listed above, but siying said it's ok.
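The two rules of the new convention can be illustrated with a toy iterator. This is only a sketch: `ToyIterator`, its simplified `Status` struct, and the `CountFrom` helper are hypothetical stand-ins and not RocksDB classes. It assumes a single "corrupted" key that triggers an error when the iterator lands on it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for rocksdb::Status (illustration only).
struct Status {
  bool ok_flag = true;
  bool ok() const { return ok_flag; }
};

// Toy iterator over sorted keys. Landing on `bad_key` simulates a read error.
class ToyIterator {
 public:
  ToyIterator(std::vector<std::string> keys, std::string bad_key)
      : keys_(std::move(keys)), bad_key_(std::move(bad_key)) {}

  // Rule 2: any seek resets the status before repositioning.
  void Seek(const std::string& target) {
    status_ = Status{};
    pos_ = 0;
    while (pos_ < keys_.size() && keys_[pos_] < target) ++pos_;
    CheckCurrent();
  }

  void Next() {
    if (!Valid()) return;
    ++pos_;
    CheckCurrent();
  }

  // Rule 1: a non-ok status always implies Valid() == false.
  bool Valid() const { return status_.ok() && pos_ < keys_.size(); }
  Status status() const { return status_; }

 private:
  void CheckCurrent() {
    if (pos_ < keys_.size() && keys_[pos_] == bad_key_) status_.ok_flag = false;
  }

  std::vector<std::string> keys_;
  std::string bad_key_;
  std::size_t pos_ = 0;
  Status status_;
};

// Caller pattern: loop while Valid(), then check status() to tell
// "reached the end" apart from "stopped on an error".
inline std::size_t CountFrom(ToyIterator& it, const std::string& start, bool* ok) {
  std::size_t n = 0;
  for (it.Seek(start); it.Valid(); it.Next()) ++n;
  *ok = it.status().ok();
  return n;
}
```

With this shape, a caller never needs to check `status()` inside the loop body; one check after the loop is enough.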
      
      Overview of the changes:
       * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
       * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
       * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
      
      To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
      
      Iterators that didn't need changes:
       * status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name in anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
       * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
      
      Iterators with changes (see inline comments for details):
       * DBIter - an overhaul:
          - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
          - It had a few code paths silently discarding subiterator's status. The stress test caught a few.
          - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
          - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
          - It used to not reset status on seek for some types of errors.
          - Some simplifications and better comments.
          - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
       * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
       * ForwardIterator - changed to the new convention, also slightly simplified.
       * ForwardLevelIterator - fixed some bugs and simplified.
       * LevelIterator - simplified.
       * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
       * BlockBasedTableIterator - minor changes.
       * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
       * PlainTableIterator - some seeks used to not reset status.
       * CuckooTableIterator - tiny code cleanup.
       * ManagedIterator - fixed some bugs.
       * BaseDeltaIterator - changed to the new convention and fixed a bug.
       * BlobDBIterator - seeks used to not reset status.
       * KeyConvertingIterator - some small change.
      Closes https://github.com/facebook/rocksdb/pull/3810
      
      Differential Revision: D7888019
      
      Pulled By: al13n321
      
      fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
      8bf555f4
  4. 05 May, 2018 1 commit
  5. 04 May, 2018 1 commit
    • S
      Skip deleted WALs during recovery · d5954929
      Siying Dong committed
      Summary:
      This patch records the min log number to keep in the manifest while flushing SST files, so that recovery ignores any WAL older than it. This avoids scenarios in which there is a gap between the WAL files fed to the recovery procedure. Such a gap could happen through, for example, out-of-order WAL deletion. It could cause problems in 2PC recovery, where the prepare and commit entries are placed in two separate WALs: a gap in the WALs could result in not processing the WAL with the commit entry, breaking the 2PC recovery logic.
      
      Before the commit, for the 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in the memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for the newly flushed SST file. This precomputed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until the next flush because the commit entry will stay in the memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. That is not done yet. Even if we did it, the only thing we lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway because the obsolete file clean-up function is only executed after flush or compaction.)
      
      This min log number to keep is stored in the manifest using the safely ignorable custom field of the AddFile entry, in order to guarantee that a DB generated by a newer release can be opened by previous releases no older than 4.2.
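The effect on recovery can be sketched as a simple filter. This is a simplified model, not the actual recovery code: `WalsToReplay` is a hypothetical helper, and it assumes that every WAL numbered below the recorded minimum is skipped.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Given the WAL numbers found on disk and the "min log number to keep" read
// from the manifest, return the WALs that recovery should replay.
// Assumption: WALs numbered below the recorded minimum are ignored.
inline std::vector<uint64_t> WalsToReplay(const std::vector<uint64_t>& wals_on_disk,
                                          uint64_t min_log_number_to_keep) {
  std::vector<uint64_t> result;
  for (uint64_t wal : wals_on_disk) {
    if (wal >= min_log_number_to_keep) result.push_back(wal);
  }
  return result;
}
```

For example, if WALs 3, 7 and 8 remain on disk because 5 was deleted out of order, and the manifest records 7 as the minimum, only 7 and 8 are replayed, so the gap left by 5 is never observed.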
      Closes https://github.com/facebook/rocksdb/pull/3765
      
      Differential Revision: D7747618
      
      Pulled By: siying
      
      fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
      d5954929
  6. 24 Apr, 2018 1 commit
    • S
      Revert "Skip deleted WALs during recovery" · d5afa737
      Siying Dong committed
      Summary:
      This reverts commit 73f21a7b.
      
      It breaks compatibility. When a DB is created using a build with this new change, opening the DB and reading the data will fail with this error:
      
      "Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory"
      
      This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file.
      Closes https://github.com/facebook/rocksdb/pull/3762
      
      Differential Revision: D7730035
      
      Pulled By: siying
      
      fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77
      d5afa737
  7. 13 Apr, 2018 1 commit
  8. 10 Apr, 2018 1 commit
  9. 06 Apr, 2018 1 commit
    • P
      Support for Column family specific paths. · 446b32cf
      Phani Shekhar Mantripragada committed
      Summary:
      This change adds an option to set different paths for different column families.
      The option is set via the cf_paths setting of ColumnFamilyOptions and works in a similar fashion to the db_paths setting. cf_paths is a vector of DbPath values, each of which contains a pair of the absolute path and target size. Multiple levels in a column family can go to different paths if cf_paths has more than one path.
      To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
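The fallback order described above (cf_paths, then db_paths, then db_name) can be sketched as follows. `EffectivePaths` is a hypothetical helper, `DbPath` here is a local struct mirroring the (path, target size) pair, and the unlimited target size used for the db_name fallback is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Local mirror of the (absolute path, target size) pair described above.
struct DbPath {
  std::string path;
  uint64_t target_size;
};

// Pick the paths a column family should use:
// 1. its own cf_paths if set, 2. else the DB-wide db_paths,
// 3. else db_name as the only path (target size assumed unlimited).
inline std::vector<DbPath> EffectivePaths(const std::vector<DbPath>& cf_paths,
                                          const std::vector<DbPath>& db_paths,
                                          const std::string& db_name) {
  if (!cf_paths.empty()) return cf_paths;
  if (!db_paths.empty()) return db_paths;
  return {DbPath{db_name, std::numeric_limits<uint64_t>::max()}};
}
```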
      
      Changes :
      1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions.  This member is used to identify the path information whenever files are accessed.
      2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
      3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
      4) Unit tests are added appropriately.
      Closes https://github.com/facebook/rocksdb/pull/3102
      
      Differential Revision: D6951697
      
      Pulled By: ajkr
      
      fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
      446b32cf
  10. 03 Apr, 2018 1 commit
    • S
      Level Compaction with TTL · 04c11b86
      Sagar Vemuri committed
      Summary:
      Level Compaction with TTL.
      
      As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.
      
      Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
      This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.
      
      This functionality can be controlled by the newly introduced column family option -- ttl.
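A minimal sketch of the selection rule, assuming each file carries a creation time and that a ttl of 0 means the feature is disabled. `FileInfo` and `FilesExpiredByTtl` are hypothetical names, not the picker's real data structures.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct FileInfo {
  uint64_t file_number;
  uint64_t creation_time;  // seconds since epoch (assumed field)
};

// Return the file numbers whose age exceeds ttl at time `now`.
// Assumption: ttl == 0 disables the feature.
inline std::vector<uint64_t> FilesExpiredByTtl(const std::vector<FileInfo>& files,
                                               uint64_t now, uint64_t ttl) {
  std::vector<uint64_t> expired;
  if (ttl == 0 || now < ttl) return expired;
  const uint64_t cutoff = now - ttl;
  for (const FileInfo& f : files) {
    if (f.creation_time < cutoff) expired.push_back(f.file_number);
  }
  return expired;
}
```

The returned files would then be candidates for compaction when no other background work is pending.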
      
      TODO for later:
      - Make ttl mutable
      - Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
      - Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
      Closes https://github.com/facebook/rocksdb/pull/3591
      
      Differential Revision: D7275442
      
      Pulled By: sagar0
      
      fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
      04c11b86
  11. 31 Mar, 2018 1 commit
    • M
      Skip deleted WALs during recovery · 73f21a7b
      Maysam Yabandeh committed
      Summary:
      This patch records the deleted WAL numbers in the manifest in order to ignore them and any WAL older than them during recovery. This avoids scenarios in which there is a gap between the WAL files fed to the recovery procedure. Such a gap could happen through, for example, out-of-order WAL deletion. It could cause problems in 2PC recovery, where the prepare and commit entries are placed in two separate WALs: a gap in the WALs could result in not processing the WAL with the commit entry, breaking the 2PC recovery logic.
      Closes https://github.com/facebook/rocksdb/pull/3488
      
      Differential Revision: D6967893
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1
      73f21a7b
  12. 17 Mar, 2018 1 commit
  13. 08 Mar, 2018 1 commit
  14. 06 Mar, 2018 2 commits
  15. 02 Mar, 2018 1 commit
    • Y
      Add "rocksdb.live-sst-files-size" DB property · bf937cf1
      Yi Wu committed
      Summary:
      Add the "rocksdb.live-sst-files-size" DB property, which only includes files of the latest version. The existing "rocksdb.total-sst-files-size" includes files from all versions and thus includes files that are obsolete but not yet deleted. I'm going to use this new property to cap the blob db SST + blob files size.
      Closes https://github.com/facebook/rocksdb/pull/3548
      
      Differential Revision: D7116939
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
      bf937cf1
  16. 23 Feb, 2018 2 commits
  17. 13 Feb, 2018 1 commit
    • S
      Customized BlockBasedTableIterator and LevelIterator · b555ed30
      Siying Dong committed
      Summary:
      Use a customized BlockBasedTableIterator and LevelIterator to replace the current implementations built on the two-level iterator. Hopefully the customized logic will make the code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object and avoids the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function calls to the dummy iterator iterating the file metadata. It also enables further optimization.
      
      The upper bound check is also moved from the index block to the data block. This implementation fits this iterator better. After the change, the forward iterator is slightly optimized to ensure we trim those iterators.
      
      The two-level-iterator now is only used by partitioned index, so it is simplified.
      Closes https://github.com/facebook/rocksdb/pull/3406
      
      Differential Revision: D6809041
      
      Pulled By: siying
      
      fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa
      b555ed30
  18. 03 Feb, 2018 1 commit
  19. 13 Dec, 2017 1 commit
    • Z
      Reduce heavy hitter for Get operation · 51c2ea0f
      Zhongyi Xie committed
      Summary:
      This PR addresses the following heavy hitters in `Get` operation by moving calls to `StatisticsImpl::recordTick` from `BlockBasedTable` to `Version::Get`
      
      - rocksdb.block.cache.bytes.write
      - rocksdb.block.cache.add
      - rocksdb.block.cache.data.miss
      - rocksdb.block.cache.data.bytes.insert
      - rocksdb.block.cache.data.add
      - rocksdb.block.cache.hit
      - rocksdb.block.cache.data.hit
      - rocksdb.block.cache.bytes.read
      
      The db_bench statistics before and after the change are:
      
      |1GB block read|Children|Self|Command|Shared Object|Symbol|
      |---|---|---|---|---|---|
      |master|4.22%|1.31%|db_bench|db_bench|[.] rocksdb::StatisticsImpl::recordTick|
      |updated|0.51%|0.21%|db_bench|db_bench|[.] rocksdb::StatisticsImpl::recordTick|
      | |0.14%|0.14%|db_bench|db_bench|[.] rocksdb::GetContext::record_counters|
      
      |1MB block read|Children|Self|Command|Shared Object|Symbol|
      |---|---|---|---|---|---|
      |master|3.48%|1.08%|db_bench|db_bench|[.] rocksdb::StatisticsImpl::recordTick|
      |updated|0.80%|0.31%|db_bench|db_bench|[.] rocksdb::StatisticsImpl::recordTick|
      | |0.35%|0.35%|db_bench|db_bench|[.] rocksdb::GetContext::record_counters|
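The general idea behind the saving is to accumulate counters in a cheap per-operation structure and publish them to the shared statistics object once. The sketch below illustrates that pattern only; `SharedStats`, `LocalCounters`, and the `Ticker` values are hypothetical and do not reproduce the real `StatisticsImpl` or `GetContext` code.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum Ticker : std::size_t {
  kBlockCacheHit,
  kBlockCacheMiss,
  kBlockCacheBytesRead,
  kNumTickers
};

// Shared statistics: every RecordTick is an atomic read-modify-write.
struct SharedStats {
  std::array<std::atomic<uint64_t>, kNumTickers> counters{};
  uint64_t record_tick_calls = 0;  // bookkeeping for this illustration only
  void RecordTick(Ticker t, uint64_t n) {
    ++record_tick_calls;
    counters[t].fetch_add(n, std::memory_order_relaxed);
  }
};

// Per-operation counters: plain integers, no atomics on the hot path.
struct LocalCounters {
  std::array<uint64_t, kNumTickers> pending{};
  void Add(Ticker t, uint64_t n) { pending[t] += n; }

  // Publish each non-zero counter once, then reset it.
  void Flush(SharedStats* stats) {
    for (std::size_t i = 0; i < kNumTickers; ++i) {
      if (pending[i] != 0) {
        stats->RecordTick(static_cast<Ticker>(i), pending[i]);
        pending[i] = 0;
      }
    }
  }
};
```

However many blocks a single lookup touches, the shared object sees at most one update per ticker when the operation finishes.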
      Closes https://github.com/facebook/rocksdb/pull/3172
      
      Differential Revision: D6330532
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 2b492959e00a3db29e9437ecdcc5e48ca4ec5741
      51c2ea0f
  20. 08 Dec, 2017 1 commit
    • P
      Fix coverity issues version, write_batch · 34aa245d
      Prashant D committed
      Summary:
      db/version_builder.cc:
      117        base_vstorage_->InternalComparator();
      
      CID 1351713 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
      2. uninit_member: Non-static class member field level_zero_cmp_.internal_comparator is not initialized in this constructor nor in any functions that it calls.
      
      db/version_edit.h:
      145  FdWithKeyRange()
      146      : fd(),
      147        smallest_key(),
      148        largest_key() {
      
      CID 1418254 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
      2. uninit_member: Non-static class member file_metadata is not initialized in this constructor nor in any functions that it calls.
      149  }
      
      db/version_set.cc:
      120    }
      
      CID 1322789 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
      4. uninit_member: Non-static class member curr_file_level_ is not initialized in this constructor nor in any functions that it calls.
      121  }
      
      db/write_batch.cc:
       939    assert(cf_mems_);
      
      CID 1419862 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
      3. uninit_member: Non-static class member rebuilding_trx_seq_ is not initialized in this constructor nor in any functions that it calls.
       940  }
      Closes https://github.com/facebook/rocksdb/pull/3092
      
      Differential Revision: D6505666
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: fd2c68948a0280772691a419d72ac7e190951d86
      34aa245d
  21. 07 Dec, 2017 1 commit
    • A
      Preserve overlapping file endpoint invariant · 78d1a5ec
      Andrew Kryczka committed
      Summary:
      Fix for #2833.
      
      - In `DeleteFilesInRange`, use `GetCleanInputsWithinInterval` instead of `GetOverlappingInputs` to make sure we get a clean cut set of files to delete.
      - In `GetCleanInputsWithinInterval`, support nullptr as `begin_key` or `end_key`.
      - In `GetOverlappingInputsRangeBinarySearch`, move the assertion for non-empty range away from `ExtendFileRangeWithinInterval`, which should be allowed to return an empty range (via `end_index < begin_index`).
      Closes https://github.com/facebook/rocksdb/pull/2843
      
      Differential Revision: D5772387
      
      Pulled By: ajkr
      
      fbshipit-source-id: e554e8461823c6be82b21a9262a2da02b3957881
      78d1a5ec
  22. 01 Dec, 2017 1 commit
    • M
      WritePrepared Txn: PreReleaseCallback · 18dcf7f9
      Maysam Yabandeh committed
      Summary:
      Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePreparedTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
      These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default), the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
      Closes https://github.com/facebook/rocksdb/pull/3205
      
      Differential Revision: D6438959
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
      18dcf7f9
  23. 29 Nov, 2017 1 commit
    • A
      optimize file ingestion checks for range deletion overlap · 1bdb44de
      Andrew Kryczka committed
      Summary:
      Previously we were checking every file in the level, which was unnecessary. We can piggyback onto the code for checking point-key overlap, which already opens all the files that could possibly contain overlapping range deletions. This PR makes us check just the range deletions from those files, so no extra ones will be opened.
      Closes https://github.com/facebook/rocksdb/pull/3179
      
      Differential Revision: D6358125
      
      Pulled By: ajkr
      
      fbshipit-source-id: 00e200770fdb8f3cc6b1b2da232b755e4ba36279
      1bdb44de
  24. 17 Nov, 2017 1 commit
  25. 11 Nov, 2017 1 commit
  26. 01 Nov, 2017 1 commit
  27. 26 Oct, 2017 1 commit
    • A
      single-file bottom-level compaction when snapshot released · 9b18cc23
      Andrew Kryczka committed
      Summary:
      When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.
      
      - Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
      - Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
      - Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
      - Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
      Closes https://github.com/facebook/rocksdb/pull/3009
      
      Differential Revision: D6062044
      
      Pulled By: ajkr
      
      fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
      9b18cc23
  28. 24 Oct, 2017 1 commit
  29. 20 Oct, 2017 1 commit
    • S
      Make FIFO compaction options dynamically configurable · f0804db7
      Sagar Vemuri committed
      Summary:
      ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now.
      
      Some of the ways in which the fifo compaction options can be set are:
      - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})`
      - `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})`
      - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})`
      - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})`
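The nested option strings above follow a `{name=value;name=value;}` shape. The sketch below parses that shape into a small struct to show how such a value could be decoded. It is not the `ParseStruct` implementation from this commit; `FifoOptions` and `ParseFifoOptions` are hypothetical, and only the three field names that appear in the examples are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Local stand-in holding the three fields used in the examples above.
struct FifoOptions {
  uint64_t max_table_files_size = 0;
  uint64_t ttl = 0;
  bool allow_compaction = false;
};

// Parse "{name=value;name=value;}" into *out. Fields not mentioned keep
// their current values. Returns false on malformed input or unknown names.
// Note: std::stoull throws on non-numeric values; a production parser
// would catch that and return an error instead.
inline bool ParseFifoOptions(const std::string& text, FifoOptions* out) {
  if (text.size() < 2 || text.front() != '{' || text.back() != '}') return false;
  const std::string body = text.substr(1, text.size() - 2);
  std::size_t pos = 0;
  while (pos < body.size()) {
    std::size_t end = body.find(';', pos);
    if (end == std::string::npos) end = body.size();
    const std::string pair = body.substr(pos, end - pos);
    pos = end + 1;
    if (pair.empty()) continue;
    const std::size_t eq = pair.find('=');
    if (eq == std::string::npos) return false;
    const std::string name = pair.substr(0, eq);
    const std::string value = pair.substr(eq + 1);
    if (name == "max_table_files_size") {
      out->max_table_files_size = std::stoull(value);
    } else if (name == "ttl") {
      out->ttl = std::stoull(value);
    } else if (name == "allow_compaction") {
      if (value != "true" && value != "false") return false;
      out->allow_compaction = (value == "true");
    } else {
      return false;
    }
  }
  return true;
}
```

Both the form with a trailing semicolon and the form without one are accepted, matching the variety in the examples.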
      
      Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes.
      Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`.
      The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now.
      Closes https://github.com/facebook/rocksdb/pull/3006
      
      Differential Revision: D6058619
      
      Pulled By: sagar0
      
      fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92
      f0804db7
  30. 11 Oct, 2017 1 commit
    • A
      fix file numbers after repair · 70aa9421
      Andrew Kryczka committed
      Summary:
      The file numbers assigned post-repair were sometimes smaller than older files' numbers due to `LogAndApply` saving the wrong next file number in the manifest.
      
      - Mark the highest file seen during repair as used before `LogAndApply` so the correct next file number will be stored.
      - Renamed `MarkFileNumberUsedDuringRecovery` to `MarkFileNumberUsed` since now it's used during repair in addition to during recovery
      - Added `TEST_Current_Next_FileNo` to expose the next file number for the unit test.
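The invariant being restored is that the next file number must exceed every file number already seen. A minimal sketch of that bookkeeping follows; `FileNumberAllocator` is a hypothetical struct and the real logic lives in the version set.

```cpp
#include <cassert>
#include <cstdint>

// Minimal model of "next file number" bookkeeping.
struct FileNumberAllocator {
  uint64_t next_file_number = 1;

  // Ensure later allocations never reuse or fall below `number`.
  void MarkFileNumberUsed(uint64_t number) {
    if (next_file_number <= number) next_file_number = number + 1;
  }

  uint64_t NewFileNumber() { return next_file_number++; }
};
```

Marking the highest file seen during repair before the next number is persisted means a newly created file always gets a larger number than any existing file.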
      Closes https://github.com/facebook/rocksdb/pull/2988
      
      Differential Revision: D6018083
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3f25cbf74439cb8f16dd12af90b67f9f9f75e718
      70aa9421
  31. 04 Oct, 2017 1 commit
    • Y
      Add ValueType::kTypeBlobIndex · d1cab2b6
      Yi Wu committed
      Summary:
      Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to
      1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex.
      2. Make rocksdb able to detect if the db contains value written by blob db, if so return error.
      3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type).
      
      The root db (DBImpl) basically pretends that kTypeBlobIndex values are normal values on write. On Get, if is_blob is provided, it returns whether the value read is of kTypeBlobIndex type, or returns a Status::NotSupported() status if is_blob is not provided. On scan, an allow_blob flag is passed, and if the flag is true, whether the value is of kTypeBlobIndex type is returned via iter->IsBlob().
      
      Changes on blob db side will be in a separate patch.
      Closes https://github.com/facebook/rocksdb/pull/2886
      
      Differential Revision: D5838431
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca
      d1cab2b6
  32. 29 Sep, 2017 1 commit
  33. 13 Sep, 2017 1 commit
    • A
      Fix naming in InternalKey · 5785b1fc
      Amy Xu committed
      Summary:
      - Switched all instances of SetMinPossibleForUserKey and SetMaxPossibleForUserKey in accordance to InternalKeyComparator's comparison logic
      Closes https://github.com/facebook/rocksdb/pull/2868
      
      Differential Revision: D5804152
      
      Pulled By: axxufb
      
      fbshipit-source-id: 80be35e04f2e8abc35cc64abe1fecb03af24e183
      5785b1fc
  34. 12 Sep, 2017 1 commit
    • M
      write-prepared txn: call IsInSnapshot · f46464d3
      Maysam Yabandeh committed
      Summary:
      This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
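The read-side behavior can be sketched as a loop over the versions of a key, newest first, that skips any version the callback rejects. This is a simplified model: `VersionedValue` and `ReadVisible` are hypothetical, and the callback is modeled as a plain `std::function` rather than the ReadCallback class.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct VersionedValue {
  uint64_t seq;
  std::string value;
};

// `newest_first` holds the versions of one key, highest sequence first.
// `is_visible` plays the role of the optional read callback: if it rejects
// a sequence number, the reader moves on to the next (older) version.
inline bool ReadVisible(const std::vector<VersionedValue>& newest_first,
                        const std::function<bool(uint64_t)>& is_visible,
                        std::string* out) {
  for (const VersionedValue& v : newest_first) {
    if (is_visible && !is_visible(v.seq)) continue;
    *out = v.value;
    return true;
  }
  return false;
}
```

Passing an empty callback reproduces the unfiltered behavior, where the newest version is always returned.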
      Closes https://github.com/facebook/rocksdb/pull/2850
      
      Differential Revision: D5787375
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
      f46464d3
  35. 25 Aug, 2017 1 commit
    • Y
      Allow DB reopen with reduced options.num_levels · 3c840d1a
      Yi Wu committed
      Summary:
      Allow the user to reduce the number of levels in the LSM by issuing a full CompactRange() that puts the result in a lower level, and then reopening the DB with reduced options.num_levels. Previously this would fail on reopen, when recovery replayed the previous MANIFEST and found that a historical file was on a higher level than the new options.num_levels. The workaround was, after CompactRange(), to reopen the DB with the old num_levels, which creates a new MANIFEST, and then reopen the DB again with the new num_levels.
      
      This patch relaxes the check of levels during recovery. It allows the DB to open if there was a historical file on level > options.num_levels, as long as that file was also deleted.
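The relaxed check can be modeled as: replay all edits first, then validate only the files that are still live. The sketch below assumes 0-indexed levels, so a live file on level >= num_levels is rejected; `ManifestEdit` and `CanOpenWithNumLevels` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct ManifestEdit {
  bool is_add;  // true: file added, false: file deleted
  uint64_t file_number;
  int level;
};

// Replay the edits, then check only the files that survive.
// A file that once lived on a too-high level is fine if it was deleted later.
inline bool CanOpenWithNumLevels(const std::vector<ManifestEdit>& edits,
                                 int num_levels) {
  std::map<uint64_t, int> live;  // file number -> level
  for (const ManifestEdit& e : edits) {
    if (e.is_add) {
      live[e.file_number] = e.level;
    } else {
      live.erase(e.file_number);
    }
  }
  for (const auto& kv : live) {
    if (kv.second >= num_levels) return false;
  }
  return true;
}
```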
      Closes https://github.com/facebook/rocksdb/pull/2740
      
      Differential Revision: D5629354
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 545903f6b36b6083e8cbaf777176aef2f488021d
      3c840d1a
  36. 04 Aug, 2017 1 commit
    • A
      Introduce bottom-pri thread pool for large universal compactions · cc01985d
      Andrew Kryczka committed
      Summary:
      When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.
      
      This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.
      
      - Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
      - Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
      - Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
      - Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
      - Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
      Closes https://github.com/facebook/rocksdb/pull/2580
      
      Differential Revision: D5422916
      
      Pulled By: ajkr
      
      fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
      cc01985d
  37. 28 Jul, 2017 2 commits
    • A
      fix asan/valgrind for TableCache cleanup · 710411ae
      Andrew Kryczka committed
      Summary:
      Breaking commit: d12691b8
      
      In the above commit, I moved the `TableCache` cleanup logic from `Version` destructor into `PurgeObsoleteFiles`. I missed cleaning up `TableCache` entries for the current `Version` during DB destruction.
      
      This PR adds that logic to `VersionSet` destructor. One unfortunate side effect is now we're potentially deleting `TableReader`s after `column_family_set_.reset()`, which means we can't call `BlockBasedTableReader::Close` a second time as the block cache might already be destroyed.
      Closes https://github.com/facebook/rocksdb/pull/2662
      
      Differential Revision: D5515108
      
      Pulled By: ajkr
      
      fbshipit-source-id: 2cb820e19aa813e0d258d17f76b2d7b6b7ee0b18
      710411ae
    • A
      move TableCache::EraseHandle outside of db mutex · d12691b8
      Andrew Kryczka committed
      Summary:
      Post-compaction work holds onto db mutex for the longest time (found by tracing lock acquires/releases with LTTng and correlating timestamps with our info log). Further experimentation showed `TableCache::EraseHandle` is responsible for ~86% of time mutex is held. We can just release the handle outside the db mutex.
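The pattern is to do only the cheap bookkeeping under the mutex and perform the slow release afterwards. The sketch below shows that shape with standard C++ primitives; `Db`, `EvictFromCache`, and `PostCompaction` are hypothetical and only stand in for the real post-compaction code.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

struct Db {
  std::mutex mu;
  std::vector<uint64_t> obsolete_files;  // guarded by mu
  int evictions = 0;                     // bookkeeping for this illustration

  // Stands in for the expensive per-file cache handle release.
  void EvictFromCache(uint64_t /*file_number*/) { ++evictions; }

  void PostCompaction() {
    std::vector<uint64_t> to_evict;
    {
      // Under the mutex: only swap the list out, which is cheap.
      std::lock_guard<std::mutex> lock(mu);
      to_evict.swap(obsolete_files);
    }
    // Outside the mutex: the slow work no longer blocks other threads.
    for (uint64_t f : to_evict) EvictFromCache(f);
  }
};
```

Other threads waiting on the mutex are then delayed only by the swap, not by the per-file release work.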
      Closes https://github.com/facebook/rocksdb/pull/2654
      
      Differential Revision: D5507126
      
      Pulled By: ajkr
      
      fbshipit-source-id: 703c01ddf2aea16bc0f9e33c08935d78aa6b781d
      d12691b8