1. 10 May, 2019 (2 commits)
    • Merging iterator to avoid child iterator reseek for some cases (#5286) · 9fad3e21
      Siying Dong committed
      Summary:
      When reseek happens in merging iterator, reseeking a child iterator can be avoided if:
      (1) the iterator represents immutable data
      (2) reseek() to a larger key than the current key
      (3) the current key of the child iterator is larger than the seek key
      because it is guaranteed that the result will fall into the same position.
      
      This optimization will be useful for use cases where users keep seeking to keys nearby in ascending order.
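      A minimal sketch of the skip condition, assuming a hypothetical `ChildIter` that models one immutable child as a sorted `std::vector` (not RocksDB's actual classes). It relies on the invariant that each child sits at its first key >= the merging iterator's current key: when the target is larger than that key and the child's key is larger than the target, `Seek(target)` would land exactly where the child already is.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one child iterator over sorted, immutable keys.
struct ChildIter {
  explicit ChildIter(const std::vector<std::string>* k) : keys(k) {}

  bool Valid() const { return pos < keys->size(); }
  const std::string& key() const { return (*keys)[pos]; }

  // Position at the first key >= target, as a real Seek() would.
  void Seek(const std::string& target) {
    ++seeks;
    pos = static_cast<size_t>(
        std::lower_bound(keys->begin(), keys->end(), target) - keys->begin());
  }

  const std::vector<std::string>* keys;
  size_t pos = 0;
  int seeks = 0;  // how many real seeks were performed
};

// Reseek one child to `target`. `current_key` is the merging iterator's
// current key; the child is assumed to sit at its first key >= current_key.
void ReseekChild(ChildIter& child, bool immutable,
                 const std::string& current_key, const std::string& target) {
  bool can_skip = immutable &&             // (1) immutable data
                  target > current_key &&  // (2) reseek to a larger key
                  child.Valid() &&
                  child.key() > target;    // (3) child key already past target
  if (!can_skip) child.Seek(target);
}
```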
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5286
      
      Differential Revision: D15283635
      
      Pulled By: siying
      
      fbshipit-source-id: 35f79ffd5ce3609146faa8cd55f2bfd733502f83
    • DBIter::Next() can skip user key checking if previous entry's seqnum is 0 (#5244) · 25d81e45
      Siying Dong committed
      Summary:
      Right now, DBIter::Next() always checks whether an entry is for the same user key as the previous entry to see whether the key should be hidden from the user. However, if the previous entry's sequence number is 0, the check is not needed because 0 is the oldest possible sequence number.
      
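      We could extend this from the seqnum 0 case to the general case prev_seqno <= current_seqno (versions of the same user key are ordered by descending seqnum, so a seqnum that does not decrease means a new user key). However, that is less robust against bugs or unexpected situations, while the gain is relatively low. We can always extend it later when needed.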
      
      In a readseq benchmark with a fully formed LSM-tree, the number of key comparisons called is reduced from 2.981 to 2.165. In readseq against a fully compacted DB, no key comparison is called at all. Performance in this benchmark didn't show an obvious improvement, which is expected because key comparisons take only a small percentage of CPU. The change may be more effective for users with an expensive custom comparator.
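      The check can be modeled as below. This is a simplified sketch, not DBIter itself: entries are hypothetical (user key, seqnum) pairs in internal-key order (user key ascending, seqnum descending), and snapshots, deletions, and merges are ignored. The user-key comparison is skipped whenever the previous entry's seqnum is 0.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Entry {
  std::string user_key;
  uint64_t seq;
};

// Returns the newest version of each user key. `*comparisons` counts how many
// user-key comparisons were needed.
std::vector<std::string> VisibleKeys(const std::vector<Entry>& entries,
                                     int* comparisons) {
  std::vector<std::string> out;
  *comparisons = 0;
  const Entry* prev = nullptr;  // previous entry from the internal iterator
  for (const Entry& e : entries) {
    bool same_as_prev = false;
    if (prev != nullptr && prev->seq != 0) {
      // seqnum 0 is the oldest possible, so no older version of the same
      // user key can follow it; only compare when prev->seq is non-zero.
      ++*comparisons;
      same_as_prev = (e.user_key == prev->user_key);
    }
    if (!same_as_prev) out.push_back(e.user_key);  // older versions are hidden
    prev = &e;
  }
  return out;
}
```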
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5244
      
      Differential Revision: D15067257
      
      Pulled By: siying
      
      fbshipit-source-id: b7e1ef3ec4fa928cba509683d2b3246e35d270d9
  2. 04 May, 2019 (1 commit)
    • Refresh snapshot list during long compactions (2nd attempt) (#5278) · 6a40ee5e
      Maysam Yabandeh committed
      Summary:
      Part of compaction CPU goes to processing the snapshot list: the larger the list, the bigger the overhead. Although the lifetime of most snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list.
      For simplicity, the feature is disabled in two cases: i) when more than one sub-compaction shares the same snapshot list, and ii) when Range Delete is used, since the range delete aggregator has its own copy of the snapshot list.
      This fixes the issue with range deletes in the reverted https://github.com/facebook/rocksdb/pull/5099.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278
      
      Differential Revision: D15203291
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258
  3. 02 May, 2019 (3 commits)
    • Reduce binary search when reseek into the same data block (#5256) · 4479dff2
      Siying Dong committed
      Summary:
      Right now, when Seek() is called again, RocksDB always does a binary search against the files and index blocks, even if it ends up in the same file/block. Improve it as follows:
      1. In LevelIterator, reseek first checks the boundary of the current file. If the target falls into the same file, skip the binary search to find the file.
      2. In the block-based table iterator, reseek skips the index binary search and stays in the current data block if the seek key is larger than the current key and lower than the index key (the boundary between the current block and the next block).
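      Step 1 can be sketched as follows, assuming a hypothetical `FileMeta` holding each file's smallest and largest key, with non-overlapping files sorted by key (this is not the actual LevelIterator code). If the target still lies inside the current file's key range, the binary search over the level's files is skipped; because the files do not overlap, the fast path returns the same file the binary search would.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct FileMeta {
  std::string smallest;
  std::string largest;
};

// Binary search: index of the first file whose largest key >= target.
size_t FindFile(const std::vector<FileMeta>& files, const std::string& target) {
  auto it = std::lower_bound(
      files.begin(), files.end(), target,
      [](const FileMeta& f, const std::string& t) { return f.largest < t; });
  return static_cast<size_t>(it - files.begin());
}

// Reseek: reuse `current` when target is within its [smallest, largest] range,
// otherwise fall back to the binary search (counted in *binary_searches).
size_t ReseekFile(const std::vector<FileMeta>& files, size_t current,
                  const std::string& target, int* binary_searches) {
  if (current < files.size() && target >= files[current].smallest &&
      target <= files[current].largest) {
    return current;
  }
  ++*binary_searches;
  return FindFile(files, target);
}
```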
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5256
      
      Differential Revision: D15105072
      
      Pulled By: siying
      
      fbshipit-source-id: 39634bdb4a881082451fa39cecd7ecf12160bf80
    • DB::Close() to fail when there are unreleased snapshots (#5272) · 4e0f2aad
      Siying Dong committed
      Summary:
      Sometimes users make the mistake of not releasing snapshots before closing the DB. This is an undocumented use of RocksDB and the behavior is unknown. We use the return status of DB::Close() to give users a way to check for it: Status::Aborted() is returned when they call DB::Close() while snapshots are still unreleased.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5272
      
      Differential Revision: D15159713
      
      Pulled By: siying
      
      fbshipit-source-id: 39369def612398d9f239d83d396b5a28e5af65cd
    • Revert snap_refresh_nanos feature (#5269) · 521d234b
      Maysam Yabandeh committed
      Summary:
      Our daily stress tests are failing after this feature. Reverting temporarily until we figure out the reason for the test failures.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269
      
      Differential Revision: D15151285
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e
  4. 01 May, 2019 (1 commit)
  5. 27 April, 2019 (1 commit)
    • Improve explicit user readahead performance (#5246) · 3548e422
      Sagar Vemuri committed
      Summary:
      Improve iterator performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`.
      
      1. Stop creating new table readers when the user explicitly sets readahead size.
      2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
      3. Add `readahead_size` to db_bench.
      
      **Benchmarks:**
      https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737
      
      For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
      For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.
      
      **Test Plan:**
      Updated `DBIteratorTest.ReadAhead` test to make sure that:
      - no new table readers are created for iterators on setting ReadOptions.readahead_size
      - At least "readahead" number of bytes are actually getting read on each iterator read.
      
      TODO later:
      - Use similar logic for compactions as well.
      - This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAccessFile later.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246
      
      Differential Revision: D15107946
      
      Pulled By: sagar0
      
      fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
  6. 26 April, 2019 (3 commits)
    • Refresh snapshot list during long compactions (#5099) · 506e8448
      Maysam Yabandeh committed
      Summary:
      Part of compaction CPU goes to processing the snapshot list: the larger the list, the bigger the overhead. Although the lifetime of most snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099
      
      Differential Revision: D15086710
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12
    • Option string/map/file can set env from object registry (#5237) · 6eb317bb
      Andrew Kryczka committed
      Summary:
      - By providing the "env" field in any text-based options (i.e., string, map, or file), we can use `NewCustomObject` to deserialize the text value into an actual `Env` object.
      - Currently, factory functions for `Env` registered with the object registry should only return pointers to static `Env` objects. That's because `DBOptions::env` is a raw pointer, so we cannot easily delegate cleanup.
      - Note I did not add `env` to `db_option_type_info`. It wasn't needed for (de)serialization, and I believe we don't want to do verification on `env`, even by checking name. That's because the user should be able to copy their DB from Linux to Windows, change envs, and not see an option verification error.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5237
      
      Differential Revision: D15056360
      
      Pulled By: siying
      
      fbshipit-source-id: 4b5f0b83297a5058f8949ec955dbf27d98d73d7e
    • Close WAL files before deletion (#5233) · da96f2fe
      Yanqin Jin committed
      Summary:
      Currently one thread in RocksDB keeps a WAL file open while another thread
      deletes it. Although the first thread never writes to the WAL again, it still
      tries to close it in the end. This is fine on POSIX, but can be problematic on
      other platforms, e.g. HDFS. It will either cause a lot of warning messages or
      throw exceptions. The solution is to let the second thread close the WAL before deleting it.
      
      RocksDB keeps the writers of the logs to delete in `logs_to_free_`, which is passed to `job_context` during `FindObsoleteFiles` (holding mutex). Then in `PurgeObsoleteFiles` (without mutex), these writers should close the logs.
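      A minimal sketch of that hand-off, assuming hypothetical `LogWriter`, `JobContext`, and `Db` types (not RocksDB's own classes): the writers move from `logs_to_free_` into the job context while the mutex is held, and each one is closed before its file is deleted outside the mutex. Deletion is only recorded here, not performed.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical WAL writer that owns an open log file.
struct LogWriter {
  std::string path;
  bool closed = false;
  void Close() { closed = true; }  // a real writer would close its file here
};

struct JobContext {
  std::vector<std::unique_ptr<LogWriter>> logs_to_free;
};

class Db {
 public:
  std::mutex mutex_;
  std::vector<std::unique_ptr<LogWriter>> logs_to_free_;

  // Holding the mutex: hand obsolete writers over to the job context.
  void FindObsoleteFiles(JobContext* job) {
    std::lock_guard<std::mutex> lock(mutex_);
    for (auto& w : logs_to_free_) job->logs_to_free.push_back(std::move(w));
    logs_to_free_.clear();
  }

  // Without the mutex: close each WAL first, then delete it (recorded only).
  static std::vector<std::string> PurgeObsoleteFiles(JobContext* job) {
    std::vector<std::string> deleted;
    for (auto& w : job->logs_to_free) {
      w->Close();
      deleted.push_back(w->path);
    }
    return deleted;
  }
};
```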
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5233
      
      Differential Revision: D15032670
      
      Pulled By: riversand963
      
      fbshipit-source-id: c55e8a612db8cc2306644001a5e6d53842a8f754
  7. 25 April, 2019 (1 commit)
  8. 23 April, 2019 (1 commit)
    • Optionally wait on bytes_per_sync to smooth I/O (#5183) · 8272a6de
      Andrew Kryczka committed
      Summary:
      The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyway.
      
      My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.
      
      Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.
      
      There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).
      
      The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.
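      A Linux-oriented sketch of the strict call pattern described above, written for illustration rather than taken from RocksDB: `RangeSyncStrict` is a hypothetical helper that widens the range to start at 0, adds `SYNC_FILE_RANGE_WAIT_BEFORE`, and falls back to `fdatasync()` when `sync_file_range()` fails. Non-Linux builds simply use `fsync()` in this sketch.

```cpp
#include <cassert>
#include <cstdlib>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Call after appending `nbytes` at `offset`. Waits for writeback already in
// flight for [0, offset + nbytes), then submits the newly written bytes for
// asynchronous writeback. Returns 0 on success, -1 on failure.
int RangeSyncStrict(int fd, off_t offset, off_t nbytes) {
#if defined(__linux__)
  if (sync_file_range(fd, 0, offset + nbytes,
                      SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE) == 0) {
    return 0;
  }
  // sync_file_range() unavailable or unsupported here: sync the data fully.
  return fdatasync(fd);
#else
  (void)offset;
  (void)nbytes;
  return fsync(fd);
#endif
}
```

      If processing outpaces the disk, the WAIT_BEFORE flag makes the next call block on the previous range, which bounds how far writes can run ahead of I/O.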
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183
      
      Differential Revision: D14953553
      
      Pulled By: siying
      
      fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9
  9. 22 April, 2019 (1 commit)
    • Add BlockBasedTableOptions::index_shortening (#5174) · df38c1ce
      Mike Kolupaev committed
      Summary:
      Introduce BlockBasedTableOptions::index_shortening to give users control over which key shortening techniques are used in building index blocks. Before this patch, both separators and successor keys were shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have a negative impact on index block size. However, it should prevent many unnecessary block loads where, due to the approximation introduced by the shortened successor, a seek would land in the previous block and then have to fix it by moving to the next one.
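      To make the two techniques concrete, here is a simplified bytewise sketch in the style of a bytewise comparator's separator/successor shortening (written for illustration, not copied from RocksDB). A separator only has to sit between the last key of one block and the first key of the next, while a successor only has to be >= the last key, so it can be much larger than that key: the successor of "abc" becomes "b".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Returns a short key s with start <= s < limit, or start itself if no
// shorter key is found. Assumes start < limit.
std::string ShortestSeparator(const std::string& start, const std::string& limit) {
  size_t n = std::min(start.size(), limit.size());
  size_t i = 0;
  while (i < n && start[i] == limit[i]) ++i;
  if (i >= n) return start;  // one key is a prefix of the other
  unsigned char s = static_cast<unsigned char>(start[i]);
  unsigned char l = static_cast<unsigned char>(limit[i]);
  if (s < 0xff && s + 1 < l) {
    std::string result = start.substr(0, i + 1);
    result[i] = static_cast<char>(s + 1);
    return result;
  }
  return start;
}

// Returns a short key s >= key by bumping the first byte that is not 0xff.
std::string ShortSuccessor(const std::string& key) {
  for (size_t i = 0; i < key.size(); ++i) {
    unsigned char c = static_cast<unsigned char>(key[i]);
    if (c != 0xff) {
      std::string result = key.substr(0, i + 1);
      result[i] = static_cast<char>(c + 1);
      return result;
    }
  }
  return key;  // all bytes are 0xff: cannot shorten
}
```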
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174
      
      Differential Revision: D14884185
      
      Pulled By: al13n321
      
      fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534
  10. 20 April, 2019 (1 commit)
  11. 19 April, 2019 (1 commit)
  12. 17 April, 2019 (3 commits)
    • Avoid double-compacting data in bottom level in manual compactions (#5138) · baa53024
      Zhongyi Xie committed
      Summary:
      Depending on the config, manual compaction (leveled compaction style) does the following compactions:
      L0->L1
      L1->L2
      ...
      Ln-1 -> Ln
      Ln -> Ln
      The final Ln -> Ln compaction is partly unnecessary, as it recompacts all the files that were just generated by the Ln-1 -> Ln compaction. We should avoid recompacting such files. This rule should be applied to Lmax only.
      Resolves issue https://github.com/facebook/rocksdb/issues/4995
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138
      
      Differential Revision: D14940106
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
    • WriteBufferManager's dummy entry size to block cache 1MB -> 256KB (#5175) · beb44ec3
      Siying Dong committed
      Summary:
      A dummy cache entry size of 1MB is too large for small block cache sizes. Our GetDefaultCacheShardBits() uses min_shard_size = 512L * 1024L to determine the number of shards, so a 1MB entry can exceed the size of a whole shard and make the cache exceed its budget.
      Change it to 256KB accordingly.
      There shouldn't be an obvious performance impact, since inserting a cache entry every 256KB of memtable inserts is still infrequent enough.
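      The arithmetic can be checked with a small function modeled on `GetDefaultCacheShardBits()` as described above (512KB minimum shard size; the cap of 6 shard bits is recalled from the RocksDB source and should be treated as an assumption). An 8MB cache gets 16 shards of 512KB each, so a 1MB dummy entry is larger than a whole shard, while a 256KB entry fits.

```cpp
#include <cassert>
#include <cstddef>

// Shard bits for a cache of `capacity` bytes: keep every shard at least
// 512KB, and use at most 2^6 shards (assumed cap).
int DefaultCacheShardBits(size_t capacity) {
  int num_shard_bits = 0;
  const size_t min_shard_size = 512 * 1024;
  size_t num_shards = capacity / min_shard_size;
  while (num_shards >>= 1) {
    if (++num_shard_bits >= 6) return num_shard_bits;
  }
  return num_shard_bits;
}

// Capacity of each shard, rounding up.
size_t PerShardCapacity(size_t capacity) {
  size_t shards = size_t{1} << DefaultCacheShardBits(capacity);
  return (capacity + shards - 1) / shards;
}
```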
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175
      
      Differential Revision: D14954289
      
      Pulled By: siying
      
      fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94
    • Avoid per-key upper bound check in BlockBasedTableIterator (#5142) · f1239d5f
      yiwu-arbug committed
      Summary:
      This is second attempt for #5101. Original commit message:
      `BlockBasedTableIterator` avoids reading the next block on `Next()` if it detects that the iterator will be out of bound, by checking against the index key. The optimization was added in #2239, and at that time it only checked the bound per block. It seems a later change made it a per-key check, which introduces unnecessary key comparisons.
      
      This patch comes with two fixes:
      
      Fix 1: To optimize checking for bounds, we need to compare the bounds with the index key as well. However, BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden by IndexBlockIter.
      
      Fix 2: In #5101 we return `IsOutOfBound()=true` when block index key is out of bound. But the index key can be larger than smallest key of the next file on the level. That file can be within upper bound and should not be filtered out.
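      The per-block idea can be sketched with a hypothetical `Block` type holding sorted user keys plus an index key that is >= every key in the block (a simplification; real index keys may be internal keys, which is what Fix 1 addresses). Keys are compared against the upper bound only inside a block whose index key is not below the bound; the cross-file case from Fix 2 is not modeled.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Block {
  std::string index_key;          // >= every key in this block
  std::vector<std::string> keys;  // sorted ascending
};

// Collects keys < upper_bound. If a block's index key is below the bound,
// every key in that block is too, so no per-key check is needed there.
// `*cmps` counts comparisons against the upper bound.
std::vector<std::string> ScanBelow(const std::vector<Block>& blocks,
                                   const std::string& upper_bound, int* cmps) {
  std::vector<std::string> out;
  *cmps = 0;
  for (const Block& b : blocks) {
    ++*cmps;
    bool block_in_bound = b.index_key < upper_bound;
    for (const std::string& k : b.keys) {
      if (!block_in_bound) {
        ++*cmps;
        if (!(k < upper_bound)) return out;  // reached the bound: stop
      }
      out.push_back(k);
    }
  }
  return out;
}
```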
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142
      
      Differential Revision: D14907113
      
      Pulled By: siying
      
      fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb
  13. 16 April, 2019 (1 commit)
  14. 13 April, 2019 (2 commits)
    • Fix crash with memtable prefix bloom and key out of prefix extractor domain (#5190) · cca141ec
      yiwu-arbug committed
      Summary:
      Before using a prefix extractor, `InDomain()` should be checked. None of the uses in memtable.cc checked `InDomain()`.
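      A minimal sketch of the rule, assuming a hypothetical fixed-length prefix extractor and a `std::set` standing in for the memtable prefix bloom filter: `Transform()` is only valid for keys where `InDomain()` is true, so both the insert path and the lookup path check it first, and an out-of-domain lookup must fall back to "may contain".

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical extractor: the prefix is the first `len` bytes of the key.
struct FixedPrefixExtractor {
  size_t len;
  bool InDomain(const std::string& key) const { return key.size() >= len; }
  std::string Transform(const std::string& key) const {
    assert(InDomain(key));  // callers must check InDomain() first
    return key.substr(0, len);
  }
};

// Insert path: only in-domain keys contribute a prefix to the filter.
void AddToFilter(const FixedPrefixExtractor& pe, const std::string& key,
                 std::set<std::string>* filter) {
  if (pe.InDomain(key)) filter->insert(pe.Transform(key));
}

// Lookup path: an out-of-domain key cannot be filtered, so report "maybe".
bool MayContain(const FixedPrefixExtractor& pe, const std::string& key,
                const std::set<std::string>& filter) {
  if (!pe.InDomain(key)) return true;
  return filter.count(pe.Transform(key)) > 0;
}
```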
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5190
      
      Differential Revision: D14923773
      
      Pulled By: miasantreble
      
      fbshipit-source-id: b3ad60bcca5f3a1a2b929a6eb34b0b7ba6326f04
    • WritePrepared: fix race condition in reading batch with duplicate keys (#5147) · fe642cbe
      Maysam Yabandeh committed
      Summary:
      When ReadOption doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of duplicate key is inserted with a different sequence number and depending on the ordering the ::Get might skip the newer one and read the older one that is obsolete.
      The patch fixes that by using the last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that max_evicted_seq has not advanced past the aforementioned seq, which is a very unlikely event. If it did, the read is not valid, since the seq is not backed by an actual snapshot that would let IsInSnapshot handle that properly when an overlapping commit is evicted from the commit cache.
      A unit test is added to reproduce the race condition with duplicate keys.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147
      
      Differential Revision: D14758815
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719
  15. 12 April, 2019 (1 commit)
    • Change OptimizeForPointLookup() and OptimizeForSmallDb() (#5165) · ed9f5e21
      Siying Dong committed
      Summary:
      Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory.
      Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165
      
      Differential Revision: D14880709
      
      Pulled By: siying
      
      fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d
  16. 11 April, 2019 (1 commit)
    • Periodic Compactions (#5166) · d3d20dcd
      Sagar Vemuri committed
      Summary:
      Introducing Periodic Compactions.
      
      This feature allows all the files in a CF to be periodically compacted. It could help proactively catch any corruption that creeps into the DB, as every file is constantly getting re-compacted. And, of course, it also helps clean up data older than a certain threshold.
      
      - Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
      - This works across all levels.
      - The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
      - Compaction filters, if any, are invoked as usual.
      - A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
      
      This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
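      The file-selection rule can be sketched like this, assuming a hypothetical `SstFile` record carrying the new `file_creation_time` property in seconds. Treating a creation time of 0 as "unknown" and skipping such files is an assumption of this sketch, not something stated above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct SstFile {
  std::string name;
  uint64_t file_creation_time;  // seconds, as reported by the Env/OS
};

// Picks files that have gone at least `periodic_compaction_time` seconds
// without being rewritten.
std::vector<std::string> PickPeriodic(const std::vector<SstFile>& files,
                                      uint64_t now,
                                      uint64_t periodic_compaction_time) {
  std::vector<std::string> picked;
  for (const SstFile& f : files) {
    // Unknown (0) or future timestamps are skipped; this also avoids underflow.
    if (f.file_creation_time == 0 || f.file_creation_time > now) continue;
    if (now - f.file_creation_time >= periodic_compaction_time) {
      picked.push_back(f.name);
    }
  }
  return picked;
}
```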
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
      
      Differential Revision: D14884441
      
      Pulled By: sagar0
      
      fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
  17. 09 April, 2019 (1 commit)
    • fix reading encrypted files beyond file boundaries (#5160) · 313e8772
      jsteemann committed
      Summary:
      This fix should help reading from encrypted files if the file-to-be-read
      is smaller than expected. For example, when using the encrypted env and
      making it read a journal file of exactly 0 bytes size, the encrypted env
      code crashes with SIGSEGV in its Decrypt function, as there is no check
      if the read attempts to read over the file's boundaries (as specified
      originally by the `dataSize` parameter).
      
      The most important problem this patch addresses is however that there is
      no size underflow check in `CTREncryptionProvider::CreateCipherStream`:
      
      The stream to be read will be initialized to a size of always
      `prefix.size() - (2 * blockSize)`. If the prefix however is smaller than
      twice the block size, this will obviously assume a _very_ large stream
      and read over the bounds. The patch adds a check here as follows:
      
          // If the prefix is smaller than twice the block size, we would below read a
          // very large chunk of the file (and very likely read over the bounds)
          assert(prefix.size() >= 2 * blockSize);
          if (prefix.size() < 2 * blockSize) {
            return Status::Corruption("Unable to read from file " + fname + ": read attempt would read beyond file bounds");
          }
      
      so embedders can catch the error in their release builds.
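      Why the check matters can be shown with plain unsigned arithmetic, using hypothetical helpers rather than the RocksDB functions: `size_t` subtraction wraps around, so a prefix shorter than two blocks produces an enormous stream size instead of a negative one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Unchecked: wraps around when prefix_size < 2 * block_size.
size_t UncheckedStreamSize(size_t prefix_size, size_t block_size) {
  return prefix_size - 2 * block_size;
}

// Checked: refuses instead of wrapping, like the early return in the patch.
bool CheckedStreamSize(size_t prefix_size, size_t block_size, size_t* out) {
  if (prefix_size < 2 * block_size) return false;
  *out = prefix_size - 2 * block_size;
  return true;
}
```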
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5160
      
      Differential Revision: D14834633
      
      Pulled By: sagar0
      
      fbshipit-source-id: 47aa39a6db8977252cede054c7eb9a663b9a3484
  18. 03 April, 2019 (1 commit)
    • Mark logs with prepare in PreReleaseCallback (#5121) · 5234fc1b
      Maysam Yabandeh committed
      Summary:
      In the prepare phase of 2PC, the DB promises to remember the prepared data for possible future commits. To fulfill the promise, the prepared data must be persisted in the WAL so that it can be recovered after a crash. The log that contains a prepare batch that is not committed yet is marked so that it is not garbage collected before the transaction commits or rolls back. The bug was that the write to the log file and the marking of the file were not atomic, and WAL GC could have happened before the WAL was actually marked. This patch moves the marking logic to PreReleaseCallback so that the WAL GC logic, which joins both write threads, sees the WAL write and the WAL mark atomically.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121
      
      Differential Revision: D14665210
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534
  19. 02 April, 2019 (1 commit)
    • Add DBOptions::avoid_unnecessary_blocking_io to defer file deletions (#5043) · 120bc471
      Mike Kolupaev committed
      Summary:
      Just like ReadOptions::background_purge_on_iterator_cleanup but for ColumnFamilyHandle instead of Iterator.
      
      In our use case we sometimes call ColumnFamilyHandle's destructor from low-latency threads, and sometimes it blocks the thread for a few seconds deleting the files. To avoid that, we can either offload ColumnFamilyHandle's destruction to a background thread on our side, or add this option on rocksdb side. This PR does the latter, to be consistent with how we solve exactly the same problem for iterators using background_purge_on_iterator_cleanup option.
      
      (EDIT: It's avoid_unnecessary_blocking_io now, and affects both CF drops and iterator destructors.)
      I'm not quite comfortable with having two separate options (background_purge_on_iterator_cleanup and background_purge_on_cf_cleanup) for such a rarely used thing. Maybe we should merge them? Rename background_purge_on_cf_cleanup to something like delete_files_on_background_threads_only or avoid_blocking_io_in_unexpected_places, and make iterators use it instead of the one in ReadOptions? I can do that here if you guys think it's better.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5043
      
      Differential Revision: D14339233
      
      Pulled By: al13n321
      
      fbshipit-source-id: ccf7efa11c85c9a5b91d969bb55627d0fb01e7b8
  20. 29 March, 2019 (1 commit)
  21. 28 March, 2019 (2 commits)
  22. 27 March, 2019 (3 commits)
    • Support for single-primary, multi-secondary instances (#4899) · 9358178e
      Yanqin Jin committed
      Summary:
      This PR allows RocksDB to run in single-primary, multi-secondary process mode.
      The writer is a regular RocksDB (e.g. a `DBImpl`) instance playing the role of a primary.
      Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.
      
      This PR has several components:
      1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.
      
      2. (Similar to #4602). Add `FragmentBufferedReader` to handle a partially read, trailing record at the end of a log, from where future reads can continue.
      
      3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
      3.1 Tail the primary's MANIFEST during recovery.
      3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
      3.3 Tailing WAL will be in a future PR.
      
      4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899
      
      Differential Revision: D14510945
      
      Pulled By: riversand963
      
      fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
    • remove bundled but unused fbson library (#5108) · 2a5463ae
      jsteemann committed
      Summary:
      fbson library is still included in `third-party` directory, but is not needed by RocksDB anymore.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5108
      
      Differential Revision: D14622272
      
      Pulled By: siying
      
      fbshipit-source-id: 52b24ed17d8d870a71364f85e5bac4eafb192df5
    • Fix SstFileReader not able to open ingested file (#5097) · 75133b1b
      Yi Wu committed
      Summary:
      Since `SstFileReader` doesn't know the largest seqno of a file, it fails this check when it opens a file with a global seqno: https://github.com/facebook/rocksdb/blob/ca89ac2ba997dfa0e135bd75d4ccf6f5774a7eff/table/block_based_table_reader.cc#L730
      Changes:
      * Pass largest_seqno=kMaxSequenceNumber from `SstFileReader` and allow it to bypass the above check.
      * `BlockBasedTable::VerifyChecksum` also double check if checksum will match when excluding global seqno (this is to make the new test in sst_table_reader_test pass).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5097
      
      Differential Revision: D14607434
      
      Pulled By: riversand963
      
      fbshipit-source-id: 9008599227c5fccbf9b73fee46b3bf4a1523f023
  23. 20 March, 2019 (1 commit)
  24. 19 March, 2019 (1 commit)
    • Feature for sampling and reporting compressibility (#4842) · b45b1cde
      Shobhit Dayal committed
      Summary:
      This is a feature to sample data-block compressibility and report it as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms:
      1. lz4 or snappy for fast compression
      2. zstd or zlib for slow but higher compression.
      
      The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType.
      
      db_bench_tool now has a command line option for specifying the sampling rate. Its default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool while varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842
      
      Differential Revision: D13629011
      
      Pulled By: shobhitdayal
      
      fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa
  25. 09 March, 2019 (1 commit)
  26. 02 March, 2019 (2 commits)
  27. 01 March, 2019 (1 commit)
    • Add two more StatsLevel (#5027) · 5e298f86
      Siying Dong committed
      Summary:
      Statistics cost too much CPU for some use cases. Add two stats levels
      so that people can choose to skip two types of expensive stats, timers and
      histograms.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027
      
      Differential Revision: D14252765
      
      Pulled By: siying
      
      fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592
  28. 21 February, 2019 (1 commit)
    • add GetStatsHistory to retrieve stats snapshots (#4748) · c4f5d0aa
      Zhongyi Xie committed
      Summary:
      This PR adds a public `GetStatsHistory` API to retrieve stats history in the form of a std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken; the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the interval between two snapshots; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0.
      
      (This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535)
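      The in-memory structure described above can be sketched as a map keyed by timestamp, with the oldest snapshots evicted once `max_stats_history_count` is exceeded. `StatsHistory` is a hypothetical class for illustration, not the RocksDB API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

class StatsHistory {
 public:
  explicit StatsHistory(size_t max_stats_history_count)
      : max_count_(max_stats_history_count) {}

  // Records one snapshot (timestamp in microseconds -> stat name -> value),
  // then evicts the oldest snapshots beyond the configured limit.
  void AddSnapshot(uint64_t time_us, std::map<std::string, std::string> stats) {
    history_[time_us] = std::move(stats);
    while (history_.size() > max_count_) history_.erase(history_.begin());
  }

  const std::map<uint64_t, std::map<std::string, std::string>>& history() const {
    return history_;
  }

 private:
  size_t max_count_;
  std::map<uint64_t, std::map<std::string, std::string>> history_;
};
```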
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748
      
      Differential Revision: D13961471
      
      Pulled By: miasantreble
      
      fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04