1. 04 May, 2018 1 commit
    • S
      Skip deleted WALs during recovery · d5954929
      Siying Dong committed
      Summary:
      This patch records the min log number to keep in the manifest while flushing SST files, so that recovery ignores deleted WALs and any WAL older than them. This avoids scenarios in which there is a gap between the WAL files fed to the recovery procedure. Such a gap could arise from, for example, out-of-order WAL deletion. It could cause problems in 2PC recovery, where the prepare and commit entries are placed in two separate WALs: a gap in the WALs could result in the WAL with the commit entry not being processed, breaking the 2PC recovery logic.
      
      Before this commit, in the 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or with prepare entries whose respective commit or abort is in the memtable. With this commit, the same calculation is done when we apply the SST flush. Just before installing the flushed file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for the newly flushed SST file. This precomputed value is also remembered in memory, and is later used to determine whether a log file can be deleted. The value is unlikely to change until the next flush because the commit entry will stay in the memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. That is not done yet. Even if we did it, the only thing we would lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway because the obsolete-file clean-up function is only executed after a flush or compaction.)
      
      This min log number to keep is stored in the manifest using the safely-ignorable customized field of the AddFile entry, which guarantees that a DB generated by a newer release can be opened by previous releases no older than 4.2.
      Closes https://github.com/facebook/rocksdb/pull/3765
      
      Differential Revision: D7747618
      
      Pulled By: siying
      
      fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
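The recovery-side effect of this change can be illustrated with a minimal sketch. It assumes the manifest yields a single "min log number to keep" value and that WALs are identified by plain integers; the helper name `WalsToReplay` is hypothetical and is not RocksDB's recovery code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: given the WAL numbers found on disk and the
// min-log-number-to-keep read back from the manifest, return the WALs
// that recovery should replay, in ascending order.
std::vector<uint64_t> WalsToReplay(std::vector<uint64_t> wals_on_disk,
                                   uint64_t min_log_number_to_keep) {
  std::sort(wals_on_disk.begin(), wals_on_disk.end());
  std::vector<uint64_t> result;
  for (uint64_t wal : wals_on_disk) {
    // Anything older than the recorded minimum is ignored, even if the
    // file still exists because deletion happened out of order.
    if (wal >= min_log_number_to_keep) {
      result.push_back(wal);
    }
  }
  // Postcondition of this sketch: replay order is ascending.
  assert(std::is_sorted(result.begin(), result.end()));
  return result;
}
```

With WALs 3, 5, 7 and 9 on disk and a recorded minimum of 5, the sketch replays 5, 7 and 9 and never looks at 3, so a stale WAL below the minimum cannot reintroduce a gap.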
  2. 02 May, 2018 1 commit
  3. 28 April, 2018 3 commits
  4. 27 April, 2018 2 commits
  5. 26 April, 2018 1 commit
  6. 24 April, 2018 2 commits
    • M
      Improve write time breakdown stats · affe01b0
      Mike Kolupaev committed
      Summary:
      There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes.
      
      These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing.
      Closes https://github.com/facebook/rocksdb/pull/3602
      
      Differential Revision: D7251562
      
      Pulled By: al13n321
      
      fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d
    • S
      Revert "Skip deleted WALs during recovery" · d5afa737
      Siying Dong committed
      Summary:
      This reverts commit 73f21a7b.
      
      It breaks compatibility. When a DB is created using a build with this new change, opening the DB and reading the data fails with this error:
      
      "Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory"
      
      This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file.
      Closes https://github.com/facebook/rocksdb/pull/3762
      
      Differential Revision: D7730035
      
      Pulled By: siying
      
      fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77
  7. 21 April, 2018 3 commits
    • A
      Add a stat for MultiGet keys found, update memtable hit/miss stats · dbdaa466
      Anand Ananthabhotla committed
      Summary:
      1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the
      number of keys successfully read
      2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(). It was being done in
      DBImpl::GetImpl(), but not MultiGet
      Closes https://github.com/facebook/rocksdb/pull/3730
      
      Differential Revision: D7677364
      
      Pulled By: anand1976
      
      fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5
    • M
      WritePrepared Txn: enable TryAgain for duplicates at the end of the batch · c3d1e36c
      Maysam Yabandeh committed
      Summary:
      WriteBatch::Iterate retries with a larger sequence number if the memtable reports a duplicate. This condition is signalled with the TryAgain status. So far the assumption was that the last entry in the batch never returns TryAgain, which is correct when the WAL is created via WritePrepared, since it always appends a batch separator if a natural one does not exist. However, when reading a WAL generated by WriteCommitted, this batch separator might not exist. Although WritePrepared is not supposed to be able to read a WAL generated by WriteCommitted, we should avoid confusing scenarios in which the behavior becomes unpredictable. The patch fixes that by allowing TryAgain even for the last entry of the write batch.
      Closes https://github.com/facebook/rocksdb/pull/3747
      
      Differential Revision: D7708391
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: bfaddaa9b14a4cdaff6977f6f63c789a6ab1ee0d
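A minimal sketch of the retry behavior described in this commit, under stated assumptions: `ToyMemTable`, `InsertStatus` and `ApplyBatch` are hypothetical stand-ins, not the RocksDB classes, and a duplicate is modeled simply as a repeated (key, sequence number) pair.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <utility>
#include <vector>

enum class InsertStatus { kOk, kTryAgain };

// Hypothetical stand-in for a memtable that rejects a second insert of
// the same (key, sequence number) pair by asking the caller to retry.
class ToyMemTable {
 public:
  InsertStatus Add(const std::string& key, uint64_t seq) {
    const bool inserted = entries_.emplace(key, seq).second;
    return inserted ? InsertStatus::kOk : InsertStatus::kTryAgain;
  }

 private:
  std::set<std::pair<std::string, uint64_t>> entries_;
};

// Applies the keys of one batch starting at first_seq and returns the
// last sequence number used.
uint64_t ApplyBatch(const std::vector<std::string>& keys, uint64_t first_seq,
                    ToyMemTable* mem) {
  assert(mem != nullptr);
  uint64_t seq = first_seq;
  for (const std::string& key : keys) {
    // The retry is allowed for every entry, including the last one.
    while (mem->Add(key, seq) == InsertStatus::kTryAgain) {
      ++seq;
    }
  }
  return seq;
}
```

For the batch {"a", "b", "a"} starting at sequence 10, the trailing duplicate "a" is the last entry; the sketch bumps the sequence to 11 for it rather than assuming a separator precedes it.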
    • P
      Fix GitHub issue #3716: gcc-8 warnings · dee95a1a
      przemyslaw.skibinski@percona.com committed
      Summary:
      Fix the following gcc-8 warnings:
      - conflicting C language linkage declaration [-Werror]
      - writing to an object with no trivial copy-assignment [-Werror=class-memaccess]
      - array subscript -1 is below array bounds [-Werror=array-bounds]
      
      Solves https://github.com/facebook/rocksdb/issues/3716
      Closes https://github.com/facebook/rocksdb/pull/3736
      
      Differential Revision: D7684161
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 47c0423d26b74add251f1d3595211eee1e41e54a
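As an illustration of the `-Wclass-memaccess` category listed above (not the actual change made in this PR), the following sketch resets an object with a non-trivial member by assignment instead of `memset`. The `Entry` type and `Reset` function are hypothetical.

```cpp
#include <cassert>
#include <string>

struct Entry {
  int count = 0;
  std::string name;  // non-trivial member: memset on Entry would be flagged
};

// Reset by assigning a value-initialized temporary. Writing
// std::memset(e, 0, sizeof(*e)) here is what gcc-8 reports under
// -Wclass-memaccess, because Entry has no trivial copy-assignment.
void Reset(Entry* e) {
  assert(e != nullptr);
  *e = Entry();
}
```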
  8. 20 April, 2018 1 commit
  9. 19 April, 2018 1 commit
  10. 17 April, 2018 1 commit
  11. 16 April, 2018 1 commit
  12. 14 April, 2018 2 commits
    • Z
      add kEntryRangeDeletion · 31ee4bf2
      zhangjinpeng1987 committed
      Summary:
      When there are many range deletions in a range, we want to trigger manual compaction on this range to reclaim disk space as soon as possible and speed up reads.
      After this change, we can collect information about range deletions and store it in user properties, which can guide our manual compaction.
      Closes https://github.com/facebook/rocksdb/pull/3695
      
      Differential Revision: D7570322
      
      Pulled By: ajkr
      
      fbshipit-source-id: c358fa43b0aac6cc954d2eadc7d3bd8015373369
    • Y
      Improve accuracy of I/O stats collection of external SST ingestion. · c81b0abe
      Yanqin Jin committed
      Summary:
      RocksDB supports ingestion of external SSTs. If ingestion_options.move_files is true, RocksDB first tries to link the external SSTs when performing ingestion. If an external SST file resides on a different FS, or the underlying FS does not support hard links, then RocksDB performs an actual file copy. However, no matter which choice is made, the current code increases bytes-written when updating compaction stats, which is inaccurate when RocksDB does NOT copy the file.
      
      Rename a sync point.
      Closes https://github.com/facebook/rocksdb/pull/3713
      
      Differential Revision: D7604151
      
      Pulled By: riversand963
      
      fbshipit-source-id: dd0c0d9b9a69c7d9ffceafc3d9c23371aa413586
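The accounting rule described in this commit can be sketched as follows. The types `IngestMethod` and `IngestedFile` and the function `BytesWrittenByIngestion` are hypothetical names for illustration, not the RocksDB implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class IngestMethod { kHardLinked, kCopied };

// Hypothetical record of how one external SST entered the DB.
struct IngestedFile {
  IngestMethod method;
  uint64_t file_size;
};

// Only files that were actually copied contribute to bytes-written;
// a hard link creates no new data on disk.
uint64_t BytesWrittenByIngestion(const std::vector<IngestedFile>& files) {
  uint64_t bytes_written = 0;
  for (const IngestedFile& f : files) {
    if (f.method == IngestMethod::kCopied) {
      bytes_written += f.file_size;
    }
  }
  return bytes_written;
}
```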
  13. 13 April, 2018 1 commit
  14. 12 April, 2018 1 commit
  15. 11 April, 2018 3 commits
    • A
      fix calling SetOptions on deprecated options · 019d7894
      Andrew Kryczka committed
      Summary:
      In `cf_options_type_info`, the deprecated options are all considered to have offset zero in the `MutableCFOptions` struct. Previously we weren't checking in `GetMutableOptionsFromStrings` whether the provided option was deprecated or not and simply writing the provided value to the offset specified by `cf_options_type_info`. That meant setting any deprecated option would overwrite the first element in the struct, which is `write_buffer_size`. `db_stress` hit this often since it calls `SetOptions` with `soft_rate_limit=0` and `hard_rate_limit=0`, which are both deprecated so cause `write_buffer_size` to be set to zero, which causes it to crash on the following assertion:
      
      ```
      db_stress: db/memtable.cc:106: rocksdb::MemTable::MemTable(const rocksdb::InternalKeyComparator&, const rocksdb::ImmutableCFOptions&, const rocksdb::MutableCFOptions&, rocksdb::WriteBufferManager*, rocksdb::SequenceNumber, uint32_t): Assertion `!ShouldScheduleFlush()' failed.
      ```
      
      We fix it by skipping deprecated options (and logging a warning) when users provide them to `SetOptions`. I didn't want to fail the call for compatibility reasons.
      Closes https://github.com/facebook/rocksdb/pull/3700
      
      Differential Revision: D7572596
      
      Pulled By: ajkr
      
      fbshipit-source-id: bd5d84e14c0c39f30c5d4c6df7c1503d2c28ecf1
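The offset-zero bug and its fix can be reproduced in miniature. This is a sketch under assumptions: `ToyMutableOptions`, `OptionInfo`, `OptionTable` and `SetOption` are hypothetical, every option is modeled as a `uint64_t`, and the warning log is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>

// Toy options struct; write_buffer_size is the first member (offset 0).
struct ToyMutableOptions {
  uint64_t write_buffer_size = 64ULL << 20;
  uint64_t arena_block_size = 0;
};

enum class Verification { kNormal, kDeprecated };

struct OptionInfo {
  size_t offset;
  Verification verification;
};

// Hypothetical type-info table: deprecated options carry offset 0.
const std::map<std::string, OptionInfo>& OptionTable() {
  static const std::map<std::string, OptionInfo> table = {
      {"write_buffer_size",
       {offsetof(ToyMutableOptions, write_buffer_size), Verification::kNormal}},
      {"arena_block_size",
       {offsetof(ToyMutableOptions, arena_block_size), Verification::kNormal}},
      {"soft_rate_limit", {0, Verification::kDeprecated}},
      {"hard_rate_limit", {0, Verification::kDeprecated}},
  };
  return table;
}

// Returns true if the value was written, false if the option is unknown
// or deprecated. Deprecated options are skipped rather than failing hard.
bool SetOption(ToyMutableOptions* opts, const std::string& name,
               uint64_t value) {
  assert(opts != nullptr);
  const auto it = OptionTable().find(name);
  if (it == OptionTable().end()) {
    return false;
  }
  if (it->second.verification == Verification::kDeprecated) {
    // Without this check the write below would land on offset 0 and
    // clobber write_buffer_size.
    return false;
  }
  std::memcpy(reinterpret_cast<char*>(opts) + it->second.offset, &value,
              sizeof(value));
  return true;
}
```

Removing the deprecated check in `SetOption` makes `SetOption(&opts, "soft_rate_limit", 0)` zero out `write_buffer_size`, which mirrors the failure `db_stress` hit.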
    • Y
      fix some text in comments. · d95014b9
      Yanqin Jin committed
      Summary:
      1. Remove redundant text.
      2. Make terminology consistent across all comments and doc of RocksDB. Also do
         our best to conform to conventions. Specifically, use 'callback' instead of
         'call-back' [wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming)).
      Closes https://github.com/facebook/rocksdb/pull/3693
      
      Differential Revision: D7560396
      
      Pulled By: riversand963
      
      fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8
    • Z
      make MockTimeEnv::current_time_ atomic to fix data race · 2770a94c
      Zhongyi Xie committed
      Summary:
      fix a new TSAN failure
      https://gist.github.com/miasantreble/7599c33f4e17da1024c67d4540dbe397
      Closes https://github.com/facebook/rocksdb/pull/3694
      
      Differential Revision: D7565310
      
      Pulled By: miasantreble
      
      fbshipit-source-id: f672c96e925797b34dec6e20b59527e8eebaa825
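A minimal sketch of the kind of fix named in the commit title: a mock clock whose time field is a `std::atomic`, so that one thread can advance it while another reads it without a data race. `ToyMockClock` is a hypothetical class, not `MockTimeEnv` itself.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Toy mock clock for tests. With a plain uint64_t member, a concurrent
// set_current_time() and Now() would be a data race that TSAN reports.
class ToyMockClock {
 public:
  explicit ToyMockClock(uint64_t start) : current_time_(start) {}

  uint64_t Now() const { return current_time_.load(); }

  void set_current_time(uint64_t time) {
    // The mock time is only expected to move forward.
    assert(time >= current_time_.load());
    current_time_.store(time);
  }

 private:
  std::atomic<uint64_t> current_time_;
};
```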
  16. 10 April, 2018 3 commits
    • G
      Change a comment · 65fe8d6c
      Gihwan Oh committed
      Summary:
      In this case, we add input files of compaction, not outputs.
      Closes https://github.com/facebook/rocksdb/pull/3686
      
      Differential Revision: D7556781
      
      Pulled By: ajkr
      
      fbshipit-source-id: ae135bb6eda60db8f275a9ba2d21c18aaadef5b7
    • A
      fix intra-L0 FIFO for uncompressed use case · 1c27cbfb
      Andrew Kryczka committed
      Summary:
      - inflate the argument passed as `max_compact_bytes_per_del_file` by a bit (10%). The intent of this argument is to prevent L0 files from being intra-L0 compacted multiple times. Without compression, some intra-L0 compactions exceed this limit (and thus aren't executed), even though none of their files have gone through intra-L0 before.
      - fix `FindIntraL0Compaction` as it was rejecting some valid intra-L0 compactions. In particular, `compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len), whereas `new_compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len+1). The former is more correct for checking whether we've found an eligible span.
      Closes https://github.com/facebook/rocksdb/pull/3684
      
      Differential Revision: D7530396
      
      Pulled By: ajkr
      
      fbshipit-source-id: cad4f50902bdc428ac9ff6fffb13eb288648d85e
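The second bullet above can be made concrete with a simplified sketch of the span search. It works on a plain vector of file sizes and returns the span length; the function name `FindIntraL0Span` and this reduced signature are assumptions for illustration and do not reproduce the real `FindIntraL0Compaction`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Returns the number of leading L0 files to compact together, or 0 if no
// eligible span exists. Compacting k files into one deletes k - 1 files.
size_t FindIntraL0Span(const std::vector<uint64_t>& file_sizes,
                       size_t min_files_to_compact,
                       uint64_t max_compact_bytes_per_del_file) {
  // A span of fewer than two files deletes nothing.
  assert(min_files_to_compact >= 2);
  if (file_sizes.empty()) {
    return 0;
  }
  uint64_t compact_bytes = file_sizes[0];
  // Work per deleted file for the accepted span [0, span_len).
  uint64_t compact_bytes_per_del_file = std::numeric_limits<uint64_t>::max();
  size_t span_len;
  for (span_len = 1; span_len < file_sizes.size(); ++span_len) {
    compact_bytes += file_sizes[span_len];
    // Work per deleted file if the span grew to [0, span_len + 1).
    const uint64_t new_compact_bytes_per_del_file = compact_bytes / span_len;
    if (new_compact_bytes_per_del_file > compact_bytes_per_del_file) {
      break;  // extending the span makes each deleted file more expensive
    }
    compact_bytes_per_del_file = new_compact_bytes_per_del_file;
  }
  // Eligibility is checked against the accepted span, not against the
  // extension that was just rejected.
  if (span_len >= min_files_to_compact &&
      compact_bytes_per_del_file < max_compact_bytes_per_del_file) {
    return span_len;
  }
  return 0;
}
```

For sizes {10, 10, 10, 100} with a limit of 20, the accepted span is the first three files at 15 bytes per deleted file. Checking the rejected extension instead (130 / 3 = 43) would wrongly discard this compaction.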
    • Z
      fix data race · f3a1d9e0
      Zhongyi Xie committed
      Summary:
      Fix a TSAN failure in `DBRangeDelTest.ValidLevelSubcompactionBoundaries`:
      https://gist.github.com/miasantreble/712e04b4de2ff7f193c98b1acf07e899
      Closes https://github.com/facebook/rocksdb/pull/3691
      
      Differential Revision: D7541400
      
      Pulled By: miasantreble
      
      fbshipit-source-id: b0b4538980bce7febd0385e61d6e046580bcaefb
  17. 08 April, 2018 1 commit
    • M
      WritePrepared Txn: add stats · bde1c1a7
      Maysam Yabandeh committed
      Summary:
      Add some stats that help monitor whether the DB has entered unlikely states that would hurt performance. These are mostly cases where we end up needing to acquire a mutex.
      Closes https://github.com/facebook/rocksdb/pull/3683
      
      Differential Revision: D7529393
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9
  18. 07 April, 2018 1 commit
  19. 06 April, 2018 3 commits
    • P
      Support for Column family specific paths. · 446b32cf
      Phani Shekhar Mantripragada committed
      Summary:
      In this change, an option to set different paths for different column families is added.
      This option is set via the cf_paths setting of ColumnFamilyOptions. It works in a similar fashion to the db_paths setting. cf_paths is a vector of DbPath values, each of which contains a pair of the absolute path and target size. Multiple levels in a column family can go to different paths if cf_paths has more than one path.
      To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
      
      Changes :
      1) A new member "cf_paths" is added to ImmutableCFOptions. This is set based on the cf_paths setting of ColumnFamilyOptions and the db_paths setting of ImmutableDBOptions. This member is used to identify the path information whenever files are accessed.
      2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
      3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
      4) Unit tests are added appropriately.
      Closes https://github.com/facebook/rocksdb/pull/3102
      
      Differential Revision: D6951697
      
      Pulled By: ajkr
      
      fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
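The fallback order described in this commit (cf_paths, then db_paths, then db_name) can be sketched as below. `ToyDbPath` and `EffectivePaths` are hypothetical names, and the unlimited target size used for the db_name fallback is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Toy counterpart of a (path, target size) pair.
struct ToyDbPath {
  std::string path;
  uint64_t target_size;
};

// Picks the paths a column family should use for its files.
std::vector<ToyDbPath> EffectivePaths(const std::vector<ToyDbPath>& cf_paths,
                                      const std::vector<ToyDbPath>& db_paths,
                                      const std::string& db_name) {
  assert(!db_name.empty());
  if (!cf_paths.empty()) {
    return cf_paths;  // column-family-specific setting wins
  }
  if (!db_paths.empty()) {
    return db_paths;  // backward-compatible fallback
  }
  // Neither is set: the DB directory is the only path.
  return {ToyDbPath{db_name, std::numeric_limits<uint64_t>::max()}};
}
```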
    • D
      Fix pre_release callback argument list. · 147dfc7b
      Dmitri Smirnov committed
      Summary:
      Top-level constness of a primitive-type parameter does not affect the
        signature of the method and has no influence on whether the overriding
        method would actually have that const bool instead of just bool. In
        addition, it is rarely useful but does produce compatibility warnings
        in the VS 2015 compiler.
      Closes https://github.com/facebook/rocksdb/pull/3663
      
      Differential Revision: D7475739
      
      Pulled By: ajkr
      
      fbshipit-source-id: fb275378b5acc397399420ae6abb4b6bfe5bd32f
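The language rule behind this commit can be checked directly. The sketch below uses hypothetical `Callback` and `MyCallback` types rather than the actual pre-release callback interface.

```cpp
#include <cassert>
#include <type_traits>

// Top-level const on a by-value parameter is not part of the function type.
static_assert(std::is_same<void(bool), void(const bool)>::value,
              "void(bool) and void(const bool) are the same type");

struct Callback {
  virtual ~Callback() = default;
  virtual int Invoke(bool flag) = 0;
};

struct MyCallback : Callback {
  // Declared with const bool, yet this still overrides Invoke(bool);
  // the const only applies to the local copy inside this definition.
  int Invoke(const bool flag) override { return flag ? 1 : 0; }
};
```

Because the two declarations name the same function type, dropping `const` from the interface changes nothing for overriders.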
    • Z
      fix build for rocksdb lite · c827b2dc
      Zhongyi Xie committed
      Summary:
      currently rocksdb lite build fails due to the following errors:
      > db/db_sst_test.cc:29:51: error: ‘FlushJobInfo’ does not name a type
         virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
                                                         ^
      db/db_sst_test.cc:29:16: error: ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’ marked ‘override’, but does not override
         virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
                      ^
      db/db_sst_test.cc:24:7: error: ‘class rocksdb::FlushedFileCollector’ has virtual functions and accessible non-virtual destructor [-Werror=non-virtual-dtor]
       class FlushedFileCollector : public EventListener {
             ^
      db/db_sst_test.cc: In member function ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’:
      db/db_sst_test.cc:31:35: error: request for member ‘file_path’ in ‘info’, which is of non-class type ‘const int’
           flushed_files_.push_back(info.file_path);
                                         ^
      cc1plus: all warnings being treated as errors
      make: *** [db/db_sst_test.o] Error 1
      Closes https://github.com/facebook/rocksdb/pull/3676
      
      Differential Revision: D7493006
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 77dff0a5b23e27db51be9b9798e3744e6fdec64f
  20. 05 April, 2018 1 commit
  21. 03 April, 2018 4 commits
    • S
      Level Compaction with TTL · 04c11b86
      Sagar Vemuri committed
      Summary:
      Level Compaction with TTL.
      
      As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.
      
      Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
      This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.
      
      This functionality can be controlled by the newly introduced column family option -- ttl.
      
      TODO for later:
      - Make ttl mutable
      - Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
      - Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
      Closes https://github.com/facebook/rocksdb/pull/3591
      
      Differential Revision: D7275442
      
      Pulled By: sagar0
      
      fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
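A minimal sketch of the selection rule this commit describes: files older than the TTL become compaction candidates. It assumes each file carries a creation time in the same unit as `now` and that a TTL of 0 disables the feature; `FilesPastTtl` is a hypothetical helper, not the RocksDB picker.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the indices of files whose creation time is older than the TTL.
std::vector<size_t> FilesPastTtl(const std::vector<uint64_t>& creation_times,
                                 uint64_t now, uint64_t ttl) {
  std::vector<size_t> expired;
  // Assumption of this sketch: ttl == 0 means the feature is off. The
  // second condition avoids unsigned underflow in now - ttl.
  if (ttl == 0 || now <= ttl) {
    return expired;
  }
  const uint64_t cutoff = now - ttl;
  for (size_t i = 0; i < creation_times.size(); ++i) {
    // Sanity check: a file cannot have been created in the future.
    assert(creation_times[i] <= now);
    if (creation_times[i] < cutoff) {
      expired.push_back(i);
    }
  }
  return expired;
}
```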
    • M
      WritePrepared Txn: smallest_prepare optimization · b225de7e
      Maysam Yabandeh committed
      Summary:
      This is an optimization to reduce lookups in the CommitCache when querying IsInSnapshot. The optimization takes the smallest uncommitted data at the time the snapshot was taken, and if the sequence number of the read data is lower than that number, it assumes the data is committed.
      To implement this optimization two changes are required: i) The AddPrepared function must be called sequentially to avoid out of order insertion in the PrepareHeap (otherwise the top of the heap does not indicate the smallest prepare in future too), ii) non-2PC transactions also call AddPrepared if they do not commit in one step.
      Closes https://github.com/facebook/rocksdb/pull/3649
      
      Differential Revision: D7388630
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: b79506238c17467d590763582960d4d90181c600
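The fast path described in this commit can be sketched as follows. The commit cache is modeled as a plain map from prepare sequence to commit sequence, and `ToySnapshot`, `IsInSnapshot` with this signature, and the `lookups` counter are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy snapshot: its sequence number plus the smallest uncommitted
// prepare sequence observed when the snapshot was taken.
struct ToySnapshot {
  uint64_t seq;
  uint64_t min_uncommitted;
};

// commit_map stands in for the commit cache: prepare seq -> commit seq.
// lookups counts how often the map had to be consulted.
bool IsInSnapshot(uint64_t data_seq, const ToySnapshot& snap,
                  const std::map<uint64_t, uint64_t>& commit_map,
                  int* lookups) {
  assert(lookups != nullptr);
  if (data_seq > snap.seq) {
    return false;  // written after the snapshot was taken
  }
  if (data_seq < snap.min_uncommitted) {
    // Fast path: everything below the smallest uncommitted prepare was
    // already committed at snapshot time, so no lookup is needed.
    return true;
  }
  ++*lookups;
  const auto it = commit_map.find(data_seq);
  return it != commit_map.end() && it->second <= snap.seq;
}
```

The fast path is only sound if the recorded minimum really is the smallest uncommitted prepare, which is why the commit requires prepares to be added in order.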
    • A
      Enable cancelling manual compactions if they hit the sfm size limit · 1579626d
      Amy Tai committed
      Summary:
      Manual compactions should be cancelled, just like scheduled compactions are cancelled, if sfm->EnoughRoomForCompaction is not true.
      Closes https://github.com/facebook/rocksdb/pull/3670
      
      Differential Revision: D7457683
      
      Pulled By: amytai
      
      fbshipit-source-id: 669b02fdb707f75db576d03d2c818fb98d1876f5
    • Z
      Revert "Avoid adding tombstones of the same file to RangeDelAggregato… · 44653c7b
      Zhongyi Xie committed
      Summary:
      …r multiple times"
      
      This reverts commit e80709a3.
      
      lingbin PR https://github.com/facebook/rocksdb/pull/3635 is causing some performance regression for seekrandom workloads
      I'm reverting the commit for now but feel free to submit new patches 😃
      
      To reproduce the regression, you can run the following db_bench command
      > ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none
      
      write stats printed by db_bench:
      
      | Build | P50 | P75 | P99 | P99.9 | P99.99 |
      | --- | --- | --- | --- | --- | --- |
      | revert commit | 80.77 | 102.94 | 1786.44 | 1892.39 | 2645.10 |
      | keep commit | 221.72 | 686.62 | 1842.57 | 1899.70 | 2814.29 |
      Closes https://github.com/facebook/rocksdb/pull/3672
      
      Differential Revision: D7463315
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 8e779c87591127f2c3694b91a56d9b459011959d
  22. 31 March, 2018 2 commits
    • F
      Throw NoSpace instead of IOError when out of space. · d12112d0
      Fosco Marotto committed
      Summary:
      Replaces #1702 and is updated from feedback.
      Closes https://github.com/facebook/rocksdb/pull/3531
      
      Differential Revision: D7457395
      
      Pulled By: gfosco
      
      fbshipit-source-id: 25a21dd8cfa5a6e42e024208b444d9379d920c82
    • M
      Skip deleted WALs during recovery · 73f21a7b
      Maysam Yabandeh committed
      Summary:
      This patch records the deleted WAL numbers in the manifest so that recovery ignores them and any WAL older than them. This avoids scenarios in which there is a gap between the WAL files fed to the recovery procedure. Such a gap could arise from, for example, out-of-order WAL deletion. It could cause problems in 2PC recovery, where the prepare and commit entries are placed in two separate WALs: a gap in the WALs could result in the WAL with the commit entry not being processed, breaking the 2PC recovery logic.
      Closes https://github.com/facebook/rocksdb/pull/3488
      
      Differential Revision: D6967893
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1
  23. 30 March, 2018 1 commit
    • M
      WritePrepared Txn: fix a bug in publishing recoverable state seq · 89d989ed
      Maysam Yabandeh committed
      Summary:
      When using two_write_queue, the published seq and the last allocated sequence could be ahead of the LastSequence, even if both write queues are stopped as in WriteRecoverableState. The patch fixes a bug in WriteRecoverableState in which LastSequence was used as a reference but the result was applied to last fetched sequence and last published seq.
      Closes https://github.com/facebook/rocksdb/pull/3665
      
      Differential Revision: D7446099
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 1449bed9aed8e9db6af85946efd347cb8efd3c0b