1. 24 1月, 2018 1 次提交
  2. 19 1月, 2018 1 次提交
    • Y
      Fix Flush() keep waiting after flush finish · f1cb83fc
      Yi Wu 提交于
      Summary:
      Flush() call could be waiting indefinitely if min_write_buffer_number_to_merge is used. Consider the sequence:
      1. User call Flush() with flush_options.wait = true
      2. The manual flush started in the background
      3. New memtable become immutable because of writes. The new memtable will not trigger flush if min_write_buffer_number_to_merge is not reached.
      4. The manual flush finish.
      
      Because of the new memtable created at step 3 not being flush, previous logic of WaitForFlushMemTable() keep waiting, despite the memtables it intent to flush has been flushed.
      
      Here instead of checking if there are any more memtables to flush, WaitForFlushMemTable() also check the id of the earliest memtable. If the id is larger than that of latest memtable at the time flush was initiated, it means all the memtable at the time of flush start has all been flush.
      Closes https://github.com/facebook/rocksdb/pull/3378
      
      Differential Revision: D6746789
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 35e698f71c7f90b06337a93e6825f4ea3b619bfa
      f1cb83fc
  3. 18 1月, 2018 1 次提交
    • A
      fix live WALs purged while file deletions disabled · 46e599fc
      Andrew Kryczka 提交于
      Summary:
      When calling `DisableFileDeletions` followed by `GetSortedWalFiles`, we guarantee the files returned by the latter call won't be deleted until after file deletions are re-enabled. However, `GetSortedWalFiles` didn't omit files already planned for deletion via `PurgeObsoleteFiles`, so the guarantee could be broken.
      
      We fix it by making `GetSortedWalFiles` wait for the number of pending purges to hit zero if file deletions are disabled. This condition is eventually met since `PurgeObsoleteFiles` is guaranteed to be called for the existing pending purges, and new purges cannot be scheduled while file deletions are disabled. Once the condition is met, `GetSortedWalFiles` simply returns the content of DB and archive directories, which nobody can delete (except for deletion scheduler, for which I plan to fix this bug later) until deletions are re-enabled.
      Closes https://github.com/facebook/rocksdb/pull/3341
      
      Differential Revision: D6681131
      
      Pulled By: ajkr
      
      fbshipit-source-id: 90b1e2f2362ea9ef715623841c0826611a817634
      46e599fc
  4. 20 12月, 2017 1 次提交
    • Y
      Port 3 way SSE4.2 crc32c implementation from Folly · f54d7f5f
      yingsu00 提交于
      Summary:
      **# Summary**
      
      RocksDB uses SSE crc32 intrinsics to calculate the crc32 values but it does it in single way fashion (not pipelined on single CPU core). Intel's whitepaper () published an algorithm that uses 3-way pipelining for the crc32 intrinsics, then use pclmulqdq intrinsic to combine the values. Because pclmulqdq has overhead on its own, this algorithm will show perf gains on buffers larger than 216 bytes, which makes RocksDB a perfect user, since most of the buffers RocksDB call crc32c on is over 4KB. Initial db_bench show tremendous CPU gain.
      
      This change uses the 3-way SSE algorithm by default. The old SSE algorithm is now behind a compiler tag NO_THREEWAY_CRC32C. If user compiles the code with NO_THREEWAY_CRC32C=1 then the old SSE Crc32c algorithm would be used. If the server does not have SSE4.2 at the run time the slow way (Non SSE) will be used.
      
      **# Performance Test Results**
      We ran the FillRandom and ReadRandom benchmarks in db_bench. ReadRandom is the point of interest here since it calculates the CRC32 for the in-mem buffers. We did 3 runs for each algorithm.
      
      Before this change the CRC32 value computation takes about 11.5% of total CPU cost, and with the new 3-way algorithm it reduced to around 4.5%. The overall throughput also improved from 25.53MB/s to 27.63MB/s.
      
      1) ReadRandom in db_bench overall metrics
      
          PER RUN
          Algorithm | run | micros/op | ops/sec |Throughput (MB/s)
          3-way      |  1   | 4.143   | 241387 | 26.7
          3-way      |  2   | 3.775   | 264872 | 29.3
          3-way      | 3    | 4.116   | 242929 | 26.9
          FastCrc32c|1  | 4.037   | 247727 | 27.4
          FastCrc32c|2  | 4.648   | 215166 | 23.8
          FastCrc32c|3  | 4.352   | 229799 | 25.4
      
           AVG
          Algorithm     |    Average of micros/op |   Average of ops/sec |    Average of Throughput (MB/s)
          3-way           |     4.01                               |      249,729                 |      27.63
          FastCrc32c  |     4.35                              |     230,897                  |      25.53
      
       2)   Crc32c computation CPU cost (inclusive samples percentage)
          PER RUN
          Implementation | run |  TotalSamples   | Crc32c percentage
          3-way                 |  1    |  4,572,250,000 | 4.37%
          3-way                 |  2    |  3,779,250,000 | 4.62%
          3-way                 |  3    |  4,129,500,000 | 4.48%
          FastCrc32c       |  1    |  4,663,500,000 | 11.24%
          FastCrc32c       |  2    |  4,047,500,000 | 12.34%
          FastCrc32c       |  3    |  4,366,750,000 | 11.68%
      
       **# Test Plan**
           make -j64 corruption_test && ./corruption_test
            By default it uses 3-way SSE algorithm
      
           NO_THREEWAY_CRC32C=1 make -j64 corruption_test && ./corruption_test
      
          make clean && DEBUG_LEVEL=0 make -j64 db_bench
          make clean && DEBUG_LEVEL=0 NO_THREEWAY_CRC32C=1 make -j64 db_bench
      Closes https://github.com/facebook/rocksdb/pull/3173
      
      Differential Revision: D6330882
      
      Pulled By: yingsu00
      
      fbshipit-source-id: 8ec3d89719533b63b536a736663ca6f0dd4482e9
      f54d7f5f
  5. 12 12月, 2017 1 次提交
  6. 08 12月, 2017 1 次提交
  7. 07 12月, 2017 2 次提交
  8. 29 11月, 2017 1 次提交
    • Y
      Fix IOError on WAL write doesn't propagate to write group follower · 3cf562be
      Yi Wu 提交于
      Summary:
      This is a simpler version of #3097 by removing all unrelated changes.
      
      Fixing the bug where concurrent writes may get Status::OK while it actually gets IOError on WAL write. This happens when multiple writes form a write batch group, and the leader get an IOError while writing to WAL. The leader failed to pass the error to followers in the group, and the followers end up returning Status::OK() while actually writing nothing. The bug only affect writes in a batch group. Future writes after the batch group will correctly return immediately with the IOError.
      Closes https://github.com/facebook/rocksdb/pull/3201
      
      Differential Revision: D6421644
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 1c2a455c5b73f6842423785eb8a9dbfbb191dc0e
      3cf562be
  9. 18 11月, 2017 1 次提交
  10. 17 11月, 2017 1 次提交
  11. 02 11月, 2017 2 次提交
    • A
      release 5.9 · cd124215
      Andrew Kryczka 提交于
      Summary:
      updated HISTORY.md and version.h for the release.
      Closes https://github.com/facebook/rocksdb/pull/3110
      
      Differential Revision: D6218645
      
      Pulled By: ajkr
      
      fbshipit-source-id: 99ab8473e9088b02d7596e92351cce7a60a99e93
      cd124215
    • M
      Added support for differential snapshots · 7fe3b328
      Mikhail Antonov 提交于
      Summary:
      The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
      
      This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
      
      From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
      
      This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
      
      For now, what's done here according to initial discussions:
      
      Preserving deletes:
       - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
       - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
       - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
      
      Iterator changes:
       - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
      
      TableCache changes:
       - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
      
      What's left:
      
       - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
      Closes https://github.com/facebook/rocksdb/pull/2999
      
      Differential Revision: D6175602
      
      Pulled By: mikhail-antonov
      
      fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
      7fe3b328
  12. 01 11月, 2017 1 次提交
  13. 29 10月, 2017 1 次提交
  14. 28 10月, 2017 1 次提交
  15. 27 10月, 2017 1 次提交
    • A
      implement lower bound for iterators · 95667383
      Andrew Kryczka 提交于
      Summary:
      - for `SeekToFirst()`, just convert it to a regular `Seek()` if lower bound is specified
      - for operations that iterate backwards over user keys (`SeekForPrev`, `SeekToLast`, `Prev`), change `PrevInternal` to check whether user key went below lower bound every time the user key changes -- same approach we use to ensure we stay within a prefix when `prefix_same_as_start=true`.
      Closes https://github.com/facebook/rocksdb/pull/3074
      
      Differential Revision: D6158654
      
      Pulled By: ajkr
      
      fbshipit-source-id: cb0e3a922e2650d2cd4d1c6e1c0f1e8b729ff518
      95667383
  16. 26 10月, 2017 1 次提交
    • A
      single-file bottom-level compaction when snapshot released · 9b18cc23
      Andrew Kryczka 提交于
      Summary:
      When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.
      
      - Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
      - Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
      - Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
      - Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
      Closes https://github.com/facebook/rocksdb/pull/3009
      
      Differential Revision: D6062044
      
      Pulled By: ajkr
      
      fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
      9b18cc23
  17. 24 10月, 2017 1 次提交
    • Y
      Add DB::Properties::kEstimateOldestKeyTime · 66a2c44e
      Yi Wu 提交于
      Summary:
      With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.
      
      My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
      Closes https://github.com/facebook/rocksdb/pull/2842
      
      Differential Revision: D5770600
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
      66a2c44e
  18. 21 10月, 2017 1 次提交
  19. 20 10月, 2017 1 次提交
    • S
      Make FIFO compaction options dynamically configurable · f0804db7
      Sagar Vemuri 提交于
      Summary:
      ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now.
      
      Some of the ways in which the fifo compaction options can be set are:
      - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})`
      - `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})`
      - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})`
      - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})`
      
      Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes.
      Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`.
      The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now.
      Closes https://github.com/facebook/rocksdb/pull/3006
      
      Differential Revision: D6058619
      
      Pulled By: sagar0
      
      fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92
      f0804db7
  20. 05 10月, 2017 1 次提交
    • A
      rate limit auto-tuning · 1026e794
      Andrew Kryczka 提交于
      Summary:
      Dynamic adjustment of rate limit according to demand for background I/O. It increases by a factor when limiter is drained too frequently, and decreases by the same factor when limiter is not drained frequently enough. The parameters for this behavior are fixed in `GenericRateLimiter::Tune`. Other changes:
      
      - make rate limiter's `Env*` configurable for testing
      - track num drain intervals in RateLimiter so we don't have to rely on stats, which may be shared across different DB instances from the ones that share the RateLimiter.
      Closes https://github.com/facebook/rocksdb/pull/2899
      
      Differential Revision: D5858704
      
      Pulled By: ajkr
      
      fbshipit-source-id: cc2bac30f85e7f6fd63655d0a6732ef9ed7403b1
      1026e794
  21. 28 9月, 2017 1 次提交
    • Q
      Make bytes_per_sync and wal_bytes_per_sync mutable · 6a541afc
      Quinn Jarrell 提交于
      Summary:
      SUMMARY
      Moves the bytes_per_sync and wal_bytes_per_sync options from immutableoptions to mutable options. Also if wal_bytes_per_sync is changed, the wal file and memtables are flushed.
      TEST PLAN
      ran make check
      all passed
      
      Two new tests SetBytesPerSync, SetWalBytesPerSync check that after issuing setoptions with a new value for the var, the db options have the new value.
      Closes https://github.com/facebook/rocksdb/pull/2893
      
      Reviewed By: yiwu-arbug
      
      Differential Revision: D5845814
      
      Pulled By: TheRushingWookie
      
      fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd
      6a541afc
  22. 23 9月, 2017 1 次提交
    • Z
      Add test kPointInTimeRecoveryCFConsistency · 1d6700f9
      Zhongyi Xie 提交于
      Summary:
      Context/problem:
      
      - CFs may be flushed at different times
      - A WAL can only be deleted after all CFs have flushed beyond end of that WAL.
      - Point-in-time recovery might stop upon reaching the first corruption.
      - Some CFs may have already flushed beyond that point, while others haven't. We should fail the Open() instead of proceeding with inconsistent CFs.
      Closes https://github.com/facebook/rocksdb/pull/2900
      
      Differential Revision: D5863281
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 180dbaf83d96c804cff49b3c406312a4ae61313e
      1d6700f9
  23. 13 9月, 2017 1 次提交
    • A
      support opening zero backups during engine init · f5148ade
      Andrew Kryczka 提交于
      Summary:
      There are internal users who open BackupEngine for writing new backups only, and they don't care whether old backups can be read or not. The condition `BackupableDBOptions::max_valid_backups_to_open == 0` should be supported (previously in df74b775 I made the mistake of choosing 0 as a special value to disable the limit).
      Closes https://github.com/facebook/rocksdb/pull/2819
      
      Differential Revision: D5751599
      
      Pulled By: ajkr
      
      fbshipit-source-id: e73ac19eb5d756d6b68601eae8e43407ee4f2752
      f5148ade
  24. 31 8月, 2017 1 次提交
  25. 30 8月, 2017 1 次提交
  26. 24 8月, 2017 2 次提交
  27. 22 8月, 2017 1 次提交
  28. 19 8月, 2017 3 次提交
  29. 17 8月, 2017 2 次提交
    • S
      Allow merge operator to be called even with a single operand · 9a44b4c3
      Sagar Vemuri 提交于
      Summary:
      Added a function `MergeOperator::DoesAllowSingleMergeOperand()` to allow invoking a merge operator even with a single merge operand, if overriden.
      
      This is needed for Cassandra-on-RocksDB work. All Cassandra writes are through merges and this will allow a single merge-value to be updated in the merge-operator invoked via a compaction, if needed, due to an expired TTL.
      Closes https://github.com/facebook/rocksdb/pull/2721
      
      Differential Revision: D5608706
      
      Pulled By: sagar0
      
      fbshipit-source-id: f299f9f91c4d1ac26e48bd5906e122c1c5e5f3fc
      9a44b4c3
    • A
      fix deleterange with memtable prefix bloom · af012c0f
      Andrew Kryczka 提交于
      Summary:
      the range delete tombstones in memtable should be added to the aggregator even when the memtable's prefix bloom filter tells us the lookup key's not there. This bug could cause data to temporarily reappear until the memtable containing range deletions is flushed.
      
      Reported in #2743.
      Closes https://github.com/facebook/rocksdb/pull/2745
      
      Differential Revision: D5639007
      
      Pulled By: ajkr
      
      fbshipit-source-id: 04fc6facb6f978340a3f639536f4ca7c0d73dfc9
      af012c0f
  30. 12 8月, 2017 1 次提交
    • A
      fix deletion dropping in intra-L0 · acf935e4
      Andrew Kryczka 提交于
      Summary:
      `KeyNotExistsBeyondOutputLevel` didn't consider L0 files' key-ranges. So if a key only was covered by older L0 files' key-ranges, we would incorrectly drop deletions of that key. This PR just skips the deletion-dropping optimization when output level is L0.
      Closes https://github.com/facebook/rocksdb/pull/2726
      
      Differential Revision: D5617286
      
      Pulled By: ajkr
      
      fbshipit-source-id: 4bff1396b06d49a828ba4542f249191052915bce
      acf935e4
  31. 04 8月, 2017 1 次提交
    • A
      Introduce bottom-pri thread pool for large universal compactions · cc01985d
      Andrew Kryczka 提交于
      Summary:
      When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.
      
      This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.
      
      - Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
      - Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
      - Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
      - Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
      - Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
      Closes https://github.com/facebook/rocksdb/pull/2580
      
      Differential Revision: D5422916
      
      Pulled By: ajkr
      
      fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
      cc01985d
  32. 01 8月, 2017 1 次提交
    • A
      fix db get/write stats · 6a36b3a7
      Andrew Kryczka 提交于
      Summary:
      we were passing `record_read_stats` (a bool) as the `hist_type` argument, which meant we were updating either `rocksdb.db.get.micros` (`hist_type == 0`) or `rocksdb.db.write.micros` (`hist_type == 1`) with wrong data.
      Closes https://github.com/facebook/rocksdb/pull/2666
      
      Differential Revision: D5520384
      
      Pulled By: ajkr
      
      fbshipit-source-id: 2f7c956aec32f8b58c5c18845ac478e0230c9516
      6a36b3a7
  33. 29 7月, 2017 1 次提交
    • S
      Replace dynamic_cast<> · 21696ba5
      Siying Dong 提交于
      Summary:
      Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
      Some nontrivial changes:
      1. Add Comparator::GetRootComparator() to get around the internal comparator hack
      2. Add the two experiemental functions to DB
      3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
      4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
      Closes https://github.com/facebook/rocksdb/pull/2645
      
      Differential Revision: D5502723
      
      Pulled By: siying
      
      fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
      21696ba5
  34. 26 7月, 2017 1 次提交