1. 14 5月, 2019 1 次提交
    • M
      Unordered Writes (#5218) · f383641a
      Maysam Yabandeh 提交于
      Summary:
      Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
      The patch is prepared based on an original by siying
      Existing unit tests are extended to include unordered_write option.
      
      Benchmark Results:
      ```
      TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
      ```
      With WAL
      - Vanilla RocksDB: 78.6 MB/s
      - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
      - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
      
      Without WAL
      - Vanilla RocksDB: 111.3 MB/s
      - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
      - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
      
      - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)
      
      Limitations:
      - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
      
      Differential Revision: D15219029
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
      f383641a
  2. 04 5月, 2019 1 次提交
    • M
      Refresh snapshot list during long compactions (2nd attempt) (#5278) · 6a40ee5e
      Maysam Yabandeh 提交于
      Summary:
      Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
      For simplicity, to avoid the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list.
      This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278
      
      Differential Revision: D15203291
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258
      6a40ee5e
  3. 02 5月, 2019 1 次提交
  4. 26 4月, 2019 2 次提交
    • M
      Refresh snapshot list during long compactions (#5099) · 506e8448
      Maysam Yabandeh 提交于
      Summary:
      Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099
      
      Differential Revision: D15086710
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12
      506e8448
    • A
      Option string/map/file can set env from object registry (#5237) · 6eb317bb
      Andrew Kryczka 提交于
      Summary:
      - By providing the "env" field in any text-based options (i.e., string, map, or file), we can use `NewCustomObject` to deserialize the text value into an actual `Env` object.
      - Currently factory functions for `Env` registered with object registry should only return pointer to static `Env` objects. That's because `DBOptions::env` is a raw pointer so we cannot easily delegate cleanup.
      - Note I did not add `env` to `db_option_type_info`. It wasn't needed for (de)serialization, and I believe we don't want to do verification on `env`, even by checking name. That's because the user should be able to copy their DB from Linux to Windows, change envs, and not see an option verification error.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5237
      
      Differential Revision: D15056360
      
      Pulled By: siying
      
      fbshipit-source-id: 4b5f0b83297a5058f8949ec955dbf27d98d73d7e
      6eb317bb
  5. 23 4月, 2019 1 次提交
    • A
      Optionally wait on bytes_per_sync to smooth I/O (#5183) · 8272a6de
      Andrew Kryczka 提交于
      Summary:
      The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.
      
      My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.
      
      Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.
      
      There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).
      
      The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183
      
      Differential Revision: D14953553
      
      Pulled By: siying
      
      fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9
      8272a6de
  6. 22 4月, 2019 1 次提交
    • M
      Add BlockBasedTableOptions::index_shortening (#5174) · df38c1ce
      Mike Kolupaev 提交于
      Summary:
      Introduce BlockBasedTableOptions::index_shortening to give users control on which key shortening techniques to be used in building index blocks. Before this patch, both separators and successor keys where shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have negative impact on index block size. However it should prevent many unnecessary block loads where due to approximation introduced by shorted successor, seek would land us to the previous block and then fix it by moving to the next one.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174
      
      Differential Revision: D14884185
      
      Pulled By: al13n321
      
      fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534
      df38c1ce
  7. 11 4月, 2019 1 次提交
    • S
      Periodic Compactions (#5166) · d3d20dcd
      Sagar Vemuri 提交于
      Summary:
      Introducing Periodic Compactions.
      
      This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted.  And also, of course, it helps to cleanup data older than certain threshold.
      
      - Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
      - This works across all levels.
      - The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
      - Compaction filters, if any, are invoked as usual.
      - A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
      
      This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
      
      Differential Revision: D14884441
      
      Pulled By: sagar0
      
      fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
      d3d20dcd
  8. 02 4月, 2019 1 次提交
    • M
      Add DBOptions. avoid_unnecessary_blocking_io to defer file deletions (#5043) · 120bc471
      Mike Kolupaev 提交于
      Summary:
      Just like ReadOptions::background_purge_on_iterator_cleanup but for ColumnFamilyHandle instead of Iterator.
      
      In our use case we sometimes call ColumnFamilyHandle's destructor from low-latency threads, and sometimes it blocks the thread for a few seconds deleting the files. To avoid that, we can either offload ColumnFamilyHandle's destruction to a background thread on our side, or add this option on rocksdb side. This PR does the latter, to be consistent with how we solve exactly the same problem for iterators using background_purge_on_iterator_cleanup option.
      
      (EDIT: It's avoid_unnecessary_blocking_io now, and affects both CF drops and iterator destructors.)
      I'm not quite comfortable with having two separate options (background_purge_on_iterator_cleanup and background_purge_on_cf_cleanup) for such a rarely used thing. Maybe we should merge them? Rename background_purge_on_cf_cleanup to something like delete_files_on_background_threads_only or avoid_blocking_io_in_unexpected_places, and make iterators use it instead of the one in ReadOptions? I can do that here if you guys think it's better.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5043
      
      Differential Revision: D14339233
      
      Pulled By: al13n321
      
      fbshipit-source-id: ccf7efa11c85c9a5b91d969bb55627d0fb01e7b8
      120bc471
  9. 29 3月, 2019 1 次提交
  10. 27 3月, 2019 1 次提交
  11. 19 3月, 2019 1 次提交
    • S
      Feature for sampling and reporting compressibility (#4842) · b45b1cde
      Shobhit Dayal 提交于
      Summary:
      This is a feature to sample data-block compressibility and and report them as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms:
      1. lz4 or snappy for fast compression
      2. zstd or zlib for slow but higher compression.
      
      The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType.
      
      The db_bench_tool how has a command line option for specifying the sampling rate. It's default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842
      
      Differential Revision: D13629011
      
      Pulled By: shobhitdayal
      
      fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa
      b45b1cde
  12. 13 3月, 2019 1 次提交
  13. 21 2月, 2019 1 次提交
    • Z
      add GetStatsHistory to retrieve stats snapshots (#4748) · c4f5d0aa
      Zhongyi Xie 提交于
      Summary:
      This PR adds public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken, the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the intervals between two snapshots are taken; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0.
      
      (This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748
      
      Differential Revision: D13961471
      
      Pulled By: miasantreble
      
      fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04
      c4f5d0aa
  14. 20 2月, 2019 1 次提交
  15. 16 2月, 2019 1 次提交
    • A
      Deprecate ttl option from CompactionOptionsFIFO (#4965) · 3231a2e5
      Aubin Sanyal 提交于
      Summary:
      We introduced ttl option in CompactionOptionsFIFO when ttl-based file
      deletion (compaction) was supported only as part of FIFO Compaction. But
      with the extension of ttl semantics even to Level compaction,
      CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start
      using ColumnFamilyOptions.ttl for FIFO compaction as well.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965
      
      Differential Revision: D14072960
      
      Pulled By: sagar0
      
      fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e
      3231a2e5
  16. 13 11月, 2018 1 次提交
    • Y
      Remove redundant member var and set options (#4631) · 05dec0c7
      Yanqin Jin 提交于
      Summary:
      In the past, both `DBImpl::atomic_flush_` and
      `DBImpl::immutable_db_options_.atomic_flush` exist. However, we fail to set
      `immutable_db_options_.atomic_flush`, but use `DBImpl::atomic_flush_` which is
      set correctly. This does not lead to incorrect behavior, but is a duplicate of
      information.
      
      Since `immutable_db_options_` is always there and has `atomic_flush`, we should
      use it as source of truth and remove `DBImpl::atomic_flush_`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4631
      
      Differential Revision: D12928371
      
      Pulled By: riversand963
      
      fbshipit-source-id: f85a811959d3828aad4a3a1b05f71facf19c636d
      05dec0c7
  17. 02 11月, 2018 1 次提交
  18. 27 10月, 2018 1 次提交
  19. 16 8月, 2018 1 次提交
    • F
      Improve point-lookup performance using a data block hash index (#4174) · 19ec44fd
      Fenggang Wu 提交于
      Summary:
      Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.`
      
      The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174
      
      Differential Revision: D8965914
      
      Pulled By: fgwu
      
      fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
      19ec44fd
  20. 28 7月, 2018 1 次提交
    • F
      Add DataBlockIndexType option in BlockBasedTableOptions (#4150) · a11df583
      Fenggang Wu 提交于
      Summary:
      Added DataBlockIndexType option in BlockBasedTableOptions.
      ```
      enum DataBlockIndexType : char {
          kDataBlockBinarySearch = 0, // traditional block type
          kDataBlockHashIndex = 1, // additional hash index appended to the end.
      };
      ```
      The default type is the traditional binary seek option: `kDataBlockBinarySearch`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4150
      
      Differential Revision: D8895958
      
      Pulled By: fgwu
      
      fbshipit-source-id: 480adef48104cf11d30db3bad9a73f98b4a80c10
      a11df583
  21. 17 7月, 2018 1 次提交
  22. 28 6月, 2018 1 次提交
  23. 27 6月, 2018 1 次提交
  24. 22 5月, 2018 1 次提交
    • Z
      Move prefix_extractor to MutableCFOptions · c3ebc758
      Zhongyi Xie 提交于
      Summary:
      Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
      This PR aims to make it possible to dynamically change bloom filter config.
      Closes https://github.com/facebook/rocksdb/pull/3601
      
      Differential Revision: D7253114
      
      Pulled By: miasantreble
      
      fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
      c3ebc758
  25. 15 5月, 2018 1 次提交
  26. 09 5月, 2018 1 次提交
  27. 11 4月, 2018 1 次提交
    • A
      fix calling SetOptions on deprecated options · 019d7894
      Andrew Kryczka 提交于
      Summary:
      In `cf_options_type_info`, the deprecated options are all considered to have offset zero in the `MutableCFOptions` struct. Previously we weren't checking in `GetMutableOptionsFromStrings` whether the provided option was deprecated or not and simply writing the provided value to the offset specified by `cf_options_type_info`. That meant setting any deprecated option would overwrite the first element in the struct, which is `write_buffer_size`. `db_stress` hit this often since it calls `SetOptions` with `soft_rate_limit=0` and `hard_rate_limit=0`, which are both deprecated so cause `write_buffer_size` to be set to zero, which causes it to crash on the following assertion:
      
      ```
      db_stress: db/memtable.cc:106: rocksdb::MemTable::MemTable(const rocksdb::InternalKeyComparator&, const rocksdb::ImmutableCFOptions&, const rocksdb::MutableCFOptions&, rocksdb::WriteBufferManager*, rocksdb::SequenceNumber, uint32_t): Assertion `!ShouldScheduleFlush()' failed.
      ```
      
      We fix it by skipping deprecated options (and logging a warning) when users provide them to `SetOptions`. I didn't want to fail the call for compatibility reasons.
      Closes https://github.com/facebook/rocksdb/pull/3700
      
      Differential Revision: D7572596
      
      Pulled By: ajkr
      
      fbshipit-source-id: bd5d84e14c0c39f30c5d4c6df7c1503d2c28ecf1
      019d7894
  28. 06 4月, 2018 1 次提交
    • P
      Support for Column family specific paths. · 446b32cf
      Phani Shekhar Mantripragada 提交于
      Summary:
      In this change, an option to set different paths for different column families is added.
      This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
      To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
      
      Changes :
      1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions.  This member is used to identify the path information whenever files are accessed.
      2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
      3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
      4) Unit tests are added appropriately.
      Closes https://github.com/facebook/rocksdb/pull/3102
      
      Differential Revision: D6951697
      
      Pulled By: ajkr
      
      fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
      446b32cf
  29. 03 4月, 2018 1 次提交
    • S
      Level Compaction with TTL · 04c11b86
      Sagar Vemuri 提交于
      Summary:
      Level Compaction with TTL.
      
      As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.
      
      Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
      This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.
      
      This functionality can be controlled by the newly introduced column family option -- ttl.
      
      TODO for later:
      - Make ttl mutable
      - Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
      - Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
      Closes https://github.com/facebook/rocksdb/pull/3591
      
      Differential Revision: D7275442
      
      Pulled By: sagar0
      
      fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
      04c11b86
  30. 23 3月, 2018 1 次提交
  31. 14 3月, 2018 1 次提交
  32. 24 1月, 2018 2 次提交
  33. 12 12月, 2017 1 次提交
  34. 29 11月, 2017 1 次提交
    • P
      Support for block_cache num_shards and other config via option string. · 4b65cfc7
      Phani Shekhar Mantripragada 提交于
      Summary:
      Problem: Option string accepts only cache_size as parameter for block_cache which is specified as "block_cache=1M".
      It doesn't accept other parameters like num_shards etc.
      
      Changes :
      1) ParseBlockBasedTableOption in block_based_table_factory is edited to accept cache options in the format "block_cache=<cache_size>:<num_shard_bits>:<strict_capacity_limit>:<high_pri_pool_ratio>".
      Options other than cache_size are optional to maintain backward compatibility. The changes are valid for block_cache_compressed as well.
      For example, "block_cache=1M:6:true:0.5", "block_cache=1M:6:true", "block_cache=1M:6" and "block_cache=1M" are all valid option strings.
      
      2) Corresponding unit tests are added.
      Closes https://github.com/facebook/rocksdb/pull/3108
      
      Differential Revision: D6420997
      
      Pulled By: sagar0
      
      fbshipit-source-id: cdea8b785688d2802907974af27225ccc1c0cd43
      4b65cfc7
  35. 18 11月, 2017 1 次提交
  36. 17 11月, 2017 1 次提交
  37. 02 11月, 2017 1 次提交
    • M
      Added support for differential snapshots · 7fe3b328
      Mikhail Antonov 提交于
      Summary:
      The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
      
      This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
      
      From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
      
      This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
      
      For now, what's done here according to initial discussions:
      
      Preserving deletes:
       - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
       - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
       - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
      
      Iterator changes:
       - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
      
      TableCache changes:
       - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
      
      What's left:
      
       - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
      Closes https://github.com/facebook/rocksdb/pull/2999
      
      Differential Revision: D6175602
      
      Pulled By: mikhail-antonov
      
      fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
      7fe3b328
  38. 01 11月, 2017 1 次提交