1. 26 Apr, 2019 3 commits
    • Refresh snapshot list during long compactions (#5099) · 506e8448
      Maysam Yabandeh committed
      Summary:
      Part of compaction CPU goes to processing the snapshot list; the larger the list, the bigger the overhead. Although most snapshots live much shorter than compactions do, a compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list.
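The idea of refreshing the snapshot list through a callback can be sketched as follows. This is a toy model, not RocksDB's implementation: `SnapshotList`, `FetchSnapshots`, `RunCompaction` and the "refresh every N keys" policy are hypothetical names and choices made for illustration, and the per-key cost is simply assumed to grow with the list size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical types for illustration; these are not RocksDB's classes.
using SequenceNumber = uint64_t;
using SnapshotList = std::vector<SequenceNumber>;
using FetchSnapshots = std::function<SnapshotList()>;

// Walks `num_keys` keys and re-fetches the snapshot list through `fetch`
// every `refresh_every` keys (0 disables refreshing). Returns how many
// snapshot entries were examined in total, a rough stand-in for CPU cost.
size_t RunCompaction(size_t num_keys, size_t refresh_every,
                     SnapshotList snapshots, const FetchSnapshots& fetch) {
  assert(static_cast<bool>(fetch));  // the callback must be callable
  size_t examined = 0;
  for (size_t i = 0; i < num_keys; ++i) {
    if (refresh_every != 0 && i != 0 && i % refresh_every == 0) {
      snapshots = fetch();  // pick up the current, usually shorter, list
    }
    examined += snapshots.size();  // per-key work grows with the list
  }
  return examined;
}
```

With an initial list of 100 snapshots that shrinks to 1 by the time of the first refresh, a 10-key run that refreshes every 5 keys examines 505 entries instead of 1000.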
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099
      
      Differential Revision: D15086710
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12
    • Option string/map/file can set env from object registry (#5237) · 6eb317bb
      Andrew Kryczka committed
      Summary:
      - By providing the "env" field in any text-based options (i.e., string, map, or file), we can use `NewCustomObject` to deserialize the text value into an actual `Env` object.
      - Currently, factory functions for `Env` registered with the object registry should only return pointers to static `Env` objects. That's because `DBOptions::env` is a raw pointer, so we cannot easily delegate cleanup.
      - Note I did not add `env` to `db_option_type_info`. It wasn't needed for (de)serialization, and I believe we don't want to do verification on `env`, even by checking name. That's because the user should be able to copy their DB from Linux to Windows, change envs, and not see an option verification error.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5237
      
      Differential Revision: D15056360
      
      Pulled By: siying
      
      fbshipit-source-id: 4b5f0b83297a5058f8949ec955dbf27d98d73d7e
    • Close WAL files before deletion (#5233) · da96f2fe
      Yanqin Jin committed
      Summary:
      Currently one thread in RocksDB keeps a WAL file open while another thread
      deletes it. Although the first thread never writes to the WAL again, it still
      tries to close it in the end. This is fine on POSIX, but can be problematic on
      other platforms, e.g. HDFS. It will either cause a lot of warning messages or
      throw exceptions. The solution is to let the second thread close the WAL before deleting it.
      
      RocksDB keeps the writers of the logs to delete in `logs_to_free_`, which is passed to `job_context` during `FindObsoleteFiles` (holding mutex). Then in `PurgeObsoleteFiles` (without mutex), these writers should close the logs.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5233
      
      Differential Revision: D15032670
      
      Pulled By: riversand963
      
      fbshipit-source-id: c55e8a612db8cc2306644001a5e6d53842a8f754
  2. 25 Apr, 2019 1 commit
  3. 23 Apr, 2019 1 commit
    • Optionally wait on bytes_per_sync to smooth I/O (#5183) · 8272a6de
      Andrew Kryczka committed
      Summary:
      The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyway.
      
      My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.
      
      Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.
      
      There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).
      
      The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.
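The two changes described above can be sketched together in a few lines. This is a Linux-only illustration under stated assumptions, not the code from the PR: the helper name `StrictRangeSync` is hypothetical, and the fallback here is triggered by any `sync_file_range` failure, whereas the PR detects availability statically and filesystem problems dynamically.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes sync_file_range() on glibc
#endif
#include <cassert>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper. Waits for writeback submitted earlier anywhere in
// [0, offset + nbytes), then submits the newest bytes for asynchronous
// writeback. Falls back to fdatasync() if sync_file_range() fails.
// Returns 0 on success and -1 on error (errno is set by the failing call).
int StrictRangeSync(int fd, off_t offset, off_t nbytes) {
  assert(offset >= 0 && nbytes >= 0);
  int rc = sync_file_range(fd, 0, offset + nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE);
  if (rc != 0) {
    rc = fdatasync(fd);  // slower, but always bounds the amount of dirty data
  }
  return rc;
}
```

Calling such a helper once per `bytes_per_sync` bytes written keeps at most one range of dirty data in flight, which is the bound the new option is meant to provide.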
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183
      
      Differential Revision: D14953553
      
      Pulled By: siying
      
      fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9
  4. 22 Apr, 2019 1 commit
    • Add BlockBasedTableOptions::index_shortening (#5174) · df38c1ce
      Mike Kolupaev committed
      Summary:
      Introduce BlockBasedTableOptions::index_shortening to give users control over which key shortening techniques are used in building index blocks. Before this patch, both separators and successor keys were shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have a negative impact on index block size. However, it should prevent many unnecessary block loads where, due to the approximation introduced by the shortened successor, a seek would land on the previous block and then fix it by moving to the next one.
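To make the two kinds of shortening concrete, here is a simplified bytewise sketch. The helper names `ShortestSeparator` and `ShortSuccessor` are chosen for illustration and the logic is a reduced version of the usual bytewise approach, not the code used by RocksDB's comparator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Returns a short key k with start <= k < limit when one is easy to find,
// otherwise returns `start` unchanged. Requires start < limit.
std::string ShortestSeparator(const std::string& start,
                              const std::string& limit) {
  assert(start < limit);
  const size_t min_len = std::min(start.size(), limit.size());
  size_t diff = 0;
  while (diff < min_len && start[diff] == limit[diff]) ++diff;
  if (diff == min_len) return start;  // start is a prefix of limit
  const unsigned char s = static_cast<unsigned char>(start[diff]);
  const unsigned char l = static_cast<unsigned char>(limit[diff]);
  if (s + 1 < l) {
    std::string out = start.substr(0, diff + 1);
    out[diff] = static_cast<char>(s + 1);  // bump the first differing byte
    return out;
  }
  return start;
}

// Returns a short key k >= key by bumping the first byte that is not 0xff
// and dropping everything after it.
std::string ShortSuccessor(const std::string& key) {
  for (size_t i = 0; i < key.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(key[i]);
    if (c != 0xff) {
      std::string out = key.substr(0, i + 1);
      out[i] = static_cast<char>(c + 1);
      return out;
    }
  }
  return key;  // all bytes are 0xff; nothing shorter exists
}
```

In this sketch a separator stays tight (`"abcdefg"` and `"abzzz"` give `"abd"`), while a successor can overshoot by a lot (`"abcdefg"` gives `"b"`). A seek for `"az"` then appears to fall inside a block whose real last key is `"abcdefg"`, which is the kind of unnecessary block load the commit message describes.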
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174
      
      Differential Revision: D14884185
      
      Pulled By: al13n321
      
      fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534
  5. 20 Apr, 2019 1 commit
  6. 19 Apr, 2019 1 commit
  7. 17 Apr, 2019 3 commits
    • Avoid double-compacting data in bottom level in manual compactions (#5138) · baa53024
      Zhongyi Xie committed
      Summary:
      Depending on the config, manual compaction (leveled compaction style) does the following compactions:
      L0->L1
      L1->L2
      ...
      Ln-1 -> Ln
      Ln -> Ln
      The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln compaction. We should avoid recompacting such files. This rule should be applied to Lmax only.
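One way to express "skip the files this manual compaction just wrote" is to filter by file number. The sketch below is only an illustration: `FileMeta`, `PickBottomLevelInputs` and `first_new_file_number` are hypothetical, and it assumes file numbers grow monotonically so that every output of the current manual compaction has a number at or above the value recorded when it started.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, minimal file description for illustration.
struct FileMeta {
  uint64_t number;  // assumed to increase with creation order
  int level;
};

// Selects inputs for the final Lmax -> Lmax step, leaving out files that
// were created after the manual compaction started.
std::vector<uint64_t> PickBottomLevelInputs(const std::vector<FileMeta>& files,
                                            int bottom_level,
                                            uint64_t first_new_file_number) {
  std::vector<uint64_t> inputs;
  for (const FileMeta& f : files) {
    if (f.level == bottom_level && f.number < first_new_file_number) {
      inputs.push_back(f.number);  // old file: still needs recompaction
    }
  }
  return inputs;
}
```

Files in other levels are never considered here, which mirrors the note that the rule applies to Lmax only.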
      Resolves issue https://github.com/facebook/rocksdb/issues/4995
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138
      
      Differential Revision: D14940106
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
    • WriteBufferManager's dummy entry size to block cache 1MB -> 256KB (#5175) · beb44ec3
      Siying Dong committed
      Summary:
      A dummy cache entry size of 1MB is too large for small block sizes. Our GetDefaultCacheShardBits() uses min_shard_size = 512L * 1024L to determine the number of shards, so 1MB will exceed the size of a whole shard and make the cache exceed its budget.
      Change it to 256KB accordingly.
      There shouldn't be an obvious performance impact, since inserting a cache entry every 256KB of memtable inserts is still infrequent enough.
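The arithmetic behind this change can be worked through with a small model. Only `min_shard_size = 512L * 1024L` comes from the commit message; the cap of 6 shard bits, the function names, and the even split of capacity across shards are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Assumed upper limit on shard bits, chosen for illustration.
constexpr int kMaxShardBits = 6;

// Picks a shard count so that every shard holds at least 512KB.
int ShardBitsFor(size_t capacity) {
  const size_t min_shard_size = 512L * 1024L;
  int bits = 0;
  size_t num_shards = capacity / min_shard_size;
  while (num_shards >>= 1) {
    if (++bits >= kMaxShardBits) return bits;
  }
  return bits;
}

// Capacity of one shard, assuming an even split across shards.
size_t PerShardCapacity(size_t capacity) {
  return capacity >> ShardBitsFor(capacity);
}

// True if a single entry of `entry_size` bytes fits within one shard.
bool EntryFitsInShard(size_t capacity, size_t entry_size) {
  assert(entry_size > 0);
  return entry_size <= PerShardCapacity(capacity);
}
```

For an 8MB cache this model yields 16 shards of 512KB each, so a 1MB dummy entry is larger than a whole shard while a 256KB entry fits.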
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175
      
      Differential Revision: D14954289
      
      Pulled By: siying
      
      fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94
    • Avoid per-key upper bound check in BlockBasedTableIterator (#5142) · f1239d5f
      yiwu-arbug committed
      Summary:
      This is the second attempt at #5101. Original commit message:
      `BlockBasedTableIterator` avoids reading the next block on `Next()` if it detects the iterator will be out of bound, by checking against the index key. The optimization was added in #2239, and at the time it only checked the bound per block. It seems a later change made it a per-key check, which introduces unnecessary key comparisons.
      
      This patch comes with two fixes:
      
      Fix 1: To optimize checking for bounds, we need to compare the bounds with the index key as well. However, BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden by IndexBlockIter.
      
      Fix 2: In #5101 we returned `IsOutOfBound()=true` when the block index key is out of bound. But the index key can be larger than the smallest key of the next file on the level. That file can be within the upper bound and should not be filtered out.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142
      
      Differential Revision: D14907113
      
      Pulled By: siying
      
      fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb
  8. 16 Apr, 2019 1 commit
  9. 13 Apr, 2019 2 commits
    • Fix crash with memtable prefix bloom and key out of prefix extractor domain (#5190) · cca141ec
      yiwu-arbug committed
      Summary:
      Before using the prefix extractor, `InDomain()` should be checked. None of the uses in memtable.cc checked `InDomain()`.
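The pattern this fix enforces can be shown with a stand-in extractor. The class below only mimics the `InDomain()`/`Transform()` contract for a fixed-length prefix; `FixedPrefixExtractor`, `MaybeAddPrefix` and the use of `std::set` in place of a bloom filter are illustrative assumptions, not RocksDB code.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Stand-in for a fixed-length prefix extractor.
class FixedPrefixExtractor {
 public:
  explicit FixedPrefixExtractor(size_t len) : len_(len) {}

  // A key is in the domain only if it is long enough to have a prefix.
  bool InDomain(const std::string& key) const { return key.size() >= len_; }

  // Must only be called for keys that are in the domain.
  std::string Transform(const std::string& key) const {
    assert(InDomain(key));
    return key.substr(0, len_);
  }

 private:
  size_t len_;
};

// Adds the key's prefix to `filter` only when the key is in the domain.
// Returns true if a prefix was added.
bool MaybeAddPrefix(const FixedPrefixExtractor& extractor,
                    const std::string& key, std::set<std::string>* filter) {
  if (!extractor.InDomain(key)) return false;  // the check that was missing
  filter->insert(extractor.Transform(key));
  return true;
}
```

A key shorter than the prefix length is simply skipped instead of being passed to `Transform()`.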
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5190
      
      Differential Revision: D14923773
      
      Pulled By: miasantreble
      
      fbshipit-source-id: b3ad60bcca5f3a1a2b929a6eb34b0b7ba6326f04
    • WritePrepared: fix race condition in reading batch with duplicate keys (#5147) · fe642cbe
      Maysam Yabandeh committed
      Summary:
      When ReadOptions doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of a duplicate key is inserted with a different sequence number, and depending on the ordering, ::Get might skip the newer one and read the older one, which is obsolete.
      The patch fixes that by using the last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that max_evicted_seq has not advanced past the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid, since the seq is not backed by an actual snapshot that would let IsInSnapshot handle this properly when an overlapping commit is evicted from the commit cache.
      A unit test is added to reproduce the race condition with duplicate keys.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147
      
      Differential Revision: D14758815
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719
  10. 12 Apr, 2019 1 commit
    • Change OptimizeForPointLookup() and OptimizeForSmallDb() (#5165) · ed9f5e21
      Siying Dong committed
      Summary:
      Change the behavior of OptimizeForSmallDb() so that it is less likely to run out of memory.
      Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165
      
      Differential Revision: D14880709
      
      Pulled By: siying
      
      fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d
  11. 11 Apr, 2019 1 commit
    • Periodic Compactions (#5166) · d3d20dcd
      Sagar Vemuri committed
      Summary:
      Introducing Periodic Compactions.
      
      This feature allows all the files in a CF to be periodically compacted. It could help proactively catch any corruptions that could creep into the DB, as every file is constantly getting re-compacted. And also, of course, it helps to clean up data older than a certain threshold.
      
      - Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
      - This works across all levels.
      - The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
      - Compaction filters, if any, are invoked as usual.
      - A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
      
      This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
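The interaction between `ttl` and `periodic_compaction_time` described above can be modeled with a small decision function. This is a deliberately simplified sketch: it uses a single age per file, treats 0 as "disabled", and the names `PickReason` and `ReasonToCompact` are hypothetical. The real picker works from table properties and is more involved.

```cpp
#include <cassert>
#include <cstdint>

enum class PickReason { kNone, kTtl, kPeriodic };

// Decides why a file would be picked, given its age in seconds.
// `ttl_sec` is ignored for bottom-level files; `periodic_sec` applies to
// every level. A value of 0 disables the corresponding check (assumption).
PickReason ReasonToCompact(uint64_t file_age_sec, bool in_bottom_level,
                           uint64_t ttl_sec, uint64_t periodic_sec) {
  if (ttl_sec != 0 && !in_bottom_level && file_age_sec > ttl_sec) {
    return PickReason::kTtl;
  }
  if (periodic_sec != 0 && file_age_sec > periodic_sec) {
    return PickReason::kPeriodic;
  }
  return PickReason::kNone;
}
```

With `ttl` at 1 day and the periodic threshold at 7 days, a 2-day-old file above the bottom level is picked because of ttl, a 2-day-old bottom-level file is left alone, and an 8-day-old bottom-level file is picked by the periodic rule.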
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
      
      Differential Revision: D14884441
      
      Pulled By: sagar0
      
      fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
  12. 09 Apr, 2019 1 commit
    • fix reading encrypted files beyond file boundaries (#5160) · 313e8772
      jsteemann committed
      Summary:
      This fix should help reading from encrypted files if the file-to-be-read
      is smaller than expected. For example, when using the encrypted env and
      making it read a journal file of exactly 0 bytes size, the encrypted env
      code crashes with SIGSEGV in its Decrypt function, as there is no check
      if the read attempts to read over the file's boundaries (as specified
      originally by the `dataSize` parameter).
      
      The most important problem this patch addresses is however that there is
      no size underflow check in `CTREncryptionProvider::CreateCipherStream`:
      
      The stream to be read will be initialized to a size of always
      `prefix.size() - (2 * blockSize)`. If the prefix however is smaller than
      twice the block size, this will obviously assume a _very_ large stream
      and read over the bounds. The patch adds a check here as follows:
      
          // If the prefix is smaller than twice the block size, we would below read a
          // very large chunk of the file (and very likely read over the bounds)
          assert(prefix.size() >= 2 * blockSize);
          if (prefix.size() < 2 * blockSize) {
            return Status::Corruption("Unable to read from file " + fname + ": read attempt would read beyond file bounds");
          }
      
      so embedders can catch the error in their release builds.
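The underflow itself is ordinary unsigned arithmetic and can be reproduced in isolation. The function names below are hypothetical and only model the size computation `prefix.size() - (2 * blockSize)`; they are not part of the encrypted env.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Unchecked version: size_t is unsigned, so the subtraction wraps around
// to a huge value when prefix_size < 2 * block_size.
size_t UncheckedStreamSize(size_t prefix_size, size_t block_size) {
  return prefix_size - 2 * block_size;
}

// Checked version: reports failure instead of wrapping.
bool CheckedStreamSize(size_t prefix_size, size_t block_size, size_t* out) {
  assert(out != nullptr);
  if (prefix_size < 2 * block_size) {
    return false;  // caller should surface this as a corruption error
  }
  *out = prefix_size - 2 * block_size;
  return true;
}
```

For an empty prefix and a 16-byte block, the unchecked result is 32 below zero modulo the width of `size_t`, which is why the read runs far past the end of the file.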
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5160
      
      Differential Revision: D14834633
      
      Pulled By: sagar0
      
      fbshipit-source-id: 47aa39a6db8977252cede054c7eb9a663b9a3484
  13. 03 Apr, 2019 1 commit
    • Mark logs with prepare in PreReleaseCallback (#5121) · 5234fc1b
      Maysam Yabandeh committed
      Summary:
      In the prepare phase of 2PC, the db promises to remember the prepared data for possible future commits. To fulfill the promise, the prepared data must be persisted in the WAL so that it can be recovered after a crash. The log that contains a prepare batch that is not committed yet is marked so that it is not garbage collected before the transaction commits or rolls back. The bug was that writing to the log file and marking the file were not atomic, so WAL gc could happen before the WAL is actually marked. This patch moves the marking logic to PreReleaseCallback so that the WAL gc logic that joins both write threads sees the WAL write and WAL mark atomically.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121
      
      Differential Revision: D14665210
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534
  14. 02 Apr, 2019 1 commit
    • Add DBOptions::avoid_unnecessary_blocking_io to defer file deletions (#5043) · 120bc471
      Mike Kolupaev committed
      Summary:
      Just like ReadOptions::background_purge_on_iterator_cleanup but for ColumnFamilyHandle instead of Iterator.
      
      In our use case we sometimes call ColumnFamilyHandle's destructor from low-latency threads, and sometimes it blocks the thread for a few seconds deleting the files. To avoid that, we can either offload ColumnFamilyHandle's destruction to a background thread on our side, or add this option on rocksdb side. This PR does the latter, to be consistent with how we solve exactly the same problem for iterators using background_purge_on_iterator_cleanup option.
      
      (EDIT: It's avoid_unnecessary_blocking_io now, and affects both CF drops and iterator destructors.)
      I'm not quite comfortable with having two separate options (background_purge_on_iterator_cleanup and background_purge_on_cf_cleanup) for such a rarely used thing. Maybe we should merge them? Rename background_purge_on_cf_cleanup to something like delete_files_on_background_threads_only or avoid_blocking_io_in_unexpected_places, and make iterators use it instead of the one in ReadOptions? I can do that here if you guys think it's better.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5043
      
      Differential Revision: D14339233
      
      Pulled By: al13n321
      
      fbshipit-source-id: ccf7efa11c85c9a5b91d969bb55627d0fb01e7b8
  15. 29 Mar, 2019 1 commit
  16. 28 Mar, 2019 2 commits
  17. 27 Mar, 2019 3 commits
    • Support for single-primary, multi-secondary instances (#4899) · 9358178e
      Yanqin Jin committed
      Summary:
      This PR allows RocksDB to run in single-primary, multi-secondary process mode.
      The writer is a regular RocksDB (e.g. a `DBImpl`) instance playing the role of a primary.
      Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.
      
      This PR has several components:
      1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.
      
      2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue.
      
      3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
      3.1 Tail the primary's MANIFEST during recovery.
      3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
      3.3 Tailing WAL will be in a future PR.
      
      4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899
      
      Differential Revision: D14510945
      
      Pulled By: riversand963
      
      fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
    • remove bundled but unused fbson library (#5108) · 2a5463ae
      jsteemann committed
      Summary:
      fbson library is still included in `third-party` directory, but is not needed by RocksDB anymore.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5108
      
      Differential Revision: D14622272
      
      Pulled By: siying
      
      fbshipit-source-id: 52b24ed17d8d870a71364f85e5bac4eafb192df5
    • Fix SstFileReader not able to open ingested file (#5097) · 75133b1b
      Yi Wu committed
      Summary:
      Since `SstFileReader` doesn't know the largest seqno of a file, it fails this check when it opens a file with a global seqno: https://github.com/facebook/rocksdb/blob/ca89ac2ba997dfa0e135bd75d4ccf6f5774a7eff/table/block_based_table_reader.cc#L730
      Changes:
      * Pass largest_seqno=kMaxSequenceNumber from `SstFileReader` and allow it to bypass the above check.
      * `BlockBasedTable::VerifyChecksum` also double-checks whether the checksum matches when excluding the global seqno (this is to make the new test in sst_table_reader_test pass).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5097
      
      Differential Revision: D14607434
      
      Pulled By: riversand963
      
      fbshipit-source-id: 9008599227c5fccbf9b73fee46b3bf4a1523f023
  18. 20 Mar, 2019 1 commit
  19. 19 Mar, 2019 1 commit
    • Feature for sampling and reporting compressibility (#4842) · b45b1cde
      Shobhit Dayal committed
      Summary:
      This is a feature to sample data-block compressibility and report it as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms:
      1. lz4 or snappy for fast compression
      2. zstd or zlib for slow but higher compression.
      
      The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType.
      
      The db_bench_tool now has a command line option for specifying the sampling rate. Its default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842
      
      Differential Revision: D13629011
      
      Pulled By: shobhitdayal
      
      fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa
  20. 09 Mar, 2019 1 commit
  21. 02 Mar, 2019 2 commits
  22. 01 Mar, 2019 1 commit
    • Add two more StatsLevel (#5027) · 5e298f86
      Siying Dong committed
      Summary:
      Statistics cost too much CPU for some use cases. Add two stats levels
      so that people can choose to skip two types of expensive stats, timers and
      histograms.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027
      
      Differential Revision: D14252765
      
      Pulled By: siying
      
      fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592
  23. 21 Feb, 2019 2 commits
    • add GetStatsHistory to retrieve stats snapshots (#4748) · c4f5d0aa
      Zhongyi Xie committed
      Summary:
      This PR adds a public `GetStatsHistory` API to retrieve stats history in the form of a std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken; the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the interval between two snapshots; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0.
      
      (This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535)
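The data shape described in this summary, a map from timestamp to a map of stat name to value with a bounded number of snapshots, can be sketched with standard containers. `InMemoryStatsHistory` and its methods are hypothetical names for illustration and do not reflect the actual `GetStatsHistory` signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

using StatsMap = std::map<std::string, std::string>;   // stat name -> value
using StatsHistoryMap = std::map<uint64_t, StatsMap>;  // micros -> snapshot

// Keeps at most `max_count` snapshots, dropping the oldest ones first.
class InMemoryStatsHistory {
 public:
  explicit InMemoryStatsHistory(size_t max_count) : max_count_(max_count) {
    assert(max_count_ > 0);
  }

  // Stores one snapshot taken at `now_micros`.
  void Record(uint64_t now_micros, StatsMap stats) {
    history_[now_micros] = std::move(stats);
    while (history_.size() > max_count_) {
      history_.erase(history_.begin());  // smallest timestamp is the oldest
    }
  }

  const StatsHistoryMap& Get() const { return history_; }

 private:
  size_t max_count_;
  StatsHistoryMap history_;
};
```

Because `std::map` orders its keys, the oldest snapshot is always at `begin()`, which keeps the eviction step trivial.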
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748
      
      Differential Revision: D13961471
      
      Pulled By: miasantreble
      
      fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04
    • Update version and history for 6.0 · 48c8d844
      Fosco Marotto committed
  24. 20 Feb, 2019 1 commit
  25. 16 Feb, 2019 1 commit
    • Deprecate ttl option from CompactionOptionsFIFO (#4965) · 3231a2e5
      Aubin Sanyal committed
      Summary:
      We introduced ttl option in CompactionOptionsFIFO when ttl-based file
      deletion (compaction) was supported only as part of FIFO Compaction. But
      with the extension of ttl semantics even to Level compaction,
      CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start
      using ColumnFamilyOptions.ttl for FIFO compaction as well.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965
      
      Differential Revision: D14072960
      
      Pulled By: sagar0
      
      fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e
  26. 15 Feb, 2019 1 commit
    • Dictionary compression for files written by SstFileWriter (#4978) · c8c8104d
      Andrew Kryczka committed
      Summary:
      If `CompressionOptions::max_dict_bytes` and/or `CompressionOptions::zstd_max_train_bytes` are set, `SstFileWriter` will now generate files respecting those options.
      
      I refactored the logic a bit for deciding when to use dictionary compression. Previously we plumbed `is_bottommost_level` down to the table builder and used that. However it was kind of confusing in `SstFileWriter`'s context since we don't know what level the file will be ingested to. Instead, now the higher-level callers (e.g., flush, compaction, file writer) are responsible for building the right `CompressionOptions` to give the table builder.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4978
      
      Differential Revision: D14060763
      
      Pulled By: ajkr
      
      fbshipit-source-id: dc802c327896df2b319dc162d6acc82b9cdb452a
  27. 14 Feb, 2019 1 commit
  28. 13 Feb, 2019 1 commit
    • Atomic ingest (#4895) · a69d4dee
      Yanqin Jin committed
      Summary:
      Make file ingestion atomic.
      
      Ingesting external SST files into multiple column families should be atomic. If
      a crash occurs and db reopens, either all column families have successfully
      ingested the files before the crash, or none of the ingestions have any effect
      on the state of the db.
      
      Also add unit tests for atomic ingestion.
      
      Note that the unit test here does not cover the case of incomplete atomic group
      in the MANIFEST, which is covered in VersionSetTest already.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4895
      
      Differential Revision: D13718245
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7df97cc483af73ad44dd6993008f99b083852198
  29. 12 Feb, 2019 2 commits
    • Reduce scope of compression dictionary to single SST (#4952) · 62f70f6d
      Andrew Kryczka committed
      Summary:
      Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
      
      So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
      
      - The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
      - After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
      - Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
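The two-state behavior listed above can be sketched as a tiny state machine. This is a toy model: `BufferingBuilder` is a hypothetical class, blocks are plain strings, and dictionary training is reduced to a flag. Only the state names `kBuffered` and `kUnbuffered` and the `EnterUnbuffered()` transition are taken from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class BufferingBuilder {
 public:
  enum class State { kBuffered, kUnbuffered };

  explicit BufferingBuilder(size_t buffer_limit_bytes)
      : limit_(buffer_limit_bytes) {}

  // Buffers blocks until the limit is reached, then writes them directly.
  void Add(const std::string& block) {
    if (state_ == State::kBuffered) {
      buffered_.push_back(block);
      buffered_bytes_ += block.size();
      if (buffered_bytes_ >= limit_) EnterUnbuffered();
    } else {
      written_.push_back(block);  // compressed and written as soon as full
    }
  }

  // Flushes anything still buffered.
  void Finish() {
    if (state_ == State::kBuffered) EnterUnbuffered();
  }

  State state() const { return state_; }
  size_t num_written() const { return written_.size(); }
  bool dictionary_trained() const { return dictionary_trained_; }

 private:
  // Placeholder for: sample buffered blocks, train a dictionary, then
  // compress and write out everything that was buffered.
  void EnterUnbuffered() {
    assert(state_ == State::kBuffered);
    dictionary_trained_ = true;
    for (const std::string& b : buffered_) written_.push_back(b);
    buffered_.clear();
    state_ = State::kUnbuffered;
  }

  const size_t limit_;
  State state_ = State::kBuffered;
  size_t buffered_bytes_ = 0;
  bool dictionary_trained_ = false;
  std::vector<std::string> buffered_;
  std::vector<std::string> written_;
};
```

The memory cost the commit message accepts is visible here: nothing reaches `written_` until the buffer limit is hit or `Finish()` is called.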
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
      
      Differential Revision: D13967980
      
      Pulled By: ajkr
      
      fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
    • Increment NUMBER_BLOCK_NOT_COMPRESSED when !GoodCompressionRatio (#4929) · 79496d71
      Peter (Stig) Edwards committed
      Summary:
      See https://github.com/facebook/rocksdb/issues/4884
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4929
      
      Differential Revision: D14028333
      
      Pulled By: sagar0
      
      fbshipit-source-id: eed12bceae85385a34aaa6dd303bf0f53c4c7b06