1. Apr 12, 2018 (2 commits)
    • WritePrepared Txn: fix smallest_prep atomicity issue · 6f5e6445
      Committed by Maysam Yabandeh
      Summary:
      We introduced the smallest_prep optimization in commit b225de7e, which enables storing the smallest uncommitted sequence number along with the snapshot. This enables readers that read from the snapshot to skip further checks and safely assume the data is committed if its sequence number is less than the smallest uncommitted sequence number at the time the snapshot was taken. The problem was that the smallest uncommitted number and the snapshot must be taken atomically; the lack of atomicity had led to readers using a smallest uncommitted value obtained after the snapshot was taken, and hence mistakenly skipping the commit checks for some data.
      This patch fixes the problem by i) separating the removal of prepare entries from the AddCommitted function, ii) removing the prepare entries AFTER the committed sequence number is published, and iii) getting the smallest uncommitted sequence number (from the prepare list) BEFORE taking a snapshot. This guarantees that the smallest uncommitted value accompanying a snapshot is less than or equal to the value that would have been obtained atomically.
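      For illustration, a minimal sketch (not the actual WritePreparedTxnDB code; all names here are made up) of the ordering the patch enforces: the smallest uncommitted sequence number is read before the published sequence number, so the value captured with the snapshot can only be smaller than or equal to what an atomic read would have produced.
      ```cpp
      #include <atomic>
      #include <cstdint>
      #include <iostream>

      // Stand-ins for the prepare heap top and the last published sequence.
      std::atomic<uint64_t> smallest_uncommitted{10};
      std::atomic<uint64_t> last_published_seq{20};

      struct SnapshotData {
        uint64_t snapshot_seq;
        uint64_t min_uncommitted;
      };

      SnapshotData TakeSnapshot() {
        SnapshotData s;
        // Order matters: read the smallest uncommitted value first, then the
        // snapshot sequence. Since smallest_uncommitted only grows over time,
        // reading it earlier can only yield a smaller-or-equal value.
        s.min_uncommitted = smallest_uncommitted.load(std::memory_order_acquire);
        s.snapshot_seq = last_published_seq.load(std::memory_order_acquire);
        return s;
      }

      int main() {
        SnapshotData s = TakeSnapshot();
        std::cout << "snapshot_seq=" << s.snapshot_seq
                  << " min_uncommitted=" << s.min_uncommitted << "\n";
        return 0;
      }
      ```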
      
      Tested by running MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest that was failing sporadically.
      Closes https://github.com/facebook/rocksdb/pull/3703
      
      Differential Revision: D7581934
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: dc9d6f4fb477eba75d4d5927326905b548a96a32
    • Improve visibility into the reasons for compaction. · d42bd041
      Committed by Yanqin Jin
      Summary:
      Add `compaction_reason` as part of event log for event `compaction started`.
      Add counters for each `CompactionReason`.
      Closes https://github.com/facebook/rocksdb/pull/3679
      
      Differential Revision: D7550348
      
      Pulled By: riversand963
      
      fbshipit-source-id: a19cff3a678c785aa5ef41aac78b9a5968fcc34d
  2. Apr 11, 2018 (3 commits)
    • fix calling SetOptions on deprecated options · 019d7894
      Committed by Andrew Kryczka
      Summary:
      In `cf_options_type_info`, the deprecated options are all considered to have offset zero in the `MutableCFOptions` struct. Previously we weren't checking in `GetMutableOptionsFromStrings` whether the provided option was deprecated or not, and simply wrote the provided value to the offset specified by `cf_options_type_info`. That meant setting any deprecated option would overwrite the first element in the struct, which is `write_buffer_size`. `db_stress` hit this often since it calls `SetOptions` with `soft_rate_limit=0` and `hard_rate_limit=0`, which are both deprecated, so they caused `write_buffer_size` to be set to zero, which in turn caused it to crash on the following assertion:
      
      ```
      db_stress: db/memtable.cc:106: rocksdb::MemTable::MemTable(const rocksdb::InternalKeyComparator&, const rocksdb::ImmutableCFOptions&, const rocksdb::MutableCFOptions&, rocksdb::WriteBufferManager*, rocksdb::SequenceNumber, uint32_t): Assertion `!ShouldScheduleFlush()' failed.
      ```
      
      We fix it by skipping deprecated options (and logging a warning) when users provide them to `SetOptions`. I didn't want to fail the call for compatibility reasons.
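      A rough sketch of the idea (illustrative only, not the actual RocksDB patch; the types and names below are invented for the example): entries marked as deprecated are skipped with a warning instead of being written through their bogus offset.
      ```cpp
      #include <cstddef>
      #include <cstdio>
      #include <map>
      #include <string>

      enum class OptionVerification { kNormal, kDeprecated };

      struct OptionTypeInfo {
        std::size_t offset;            // offset into the mutable options struct
        OptionVerification verification;
      };

      bool ApplyOptions(const std::map<std::string, std::string>& user_opts,
                        const std::map<std::string, OptionTypeInfo>& type_info) {
        for (const auto& kv : user_opts) {
          auto it = type_info.find(kv.first);
          if (it == type_info.end()) {
            return false;  // unknown option
          }
          if (it->second.verification == OptionVerification::kDeprecated) {
            // Deprecated options carry no real offset; writing through it would
            // clobber the first member (write_buffer_size). Skip and warn instead.
            std::fprintf(stderr, "Warning: ignoring deprecated option %s\n",
                         kv.first.c_str());
            continue;
          }
          // ... parse kv.second and write it at it->second.offset ...
        }
        return true;
      }

      int main() {
        std::map<std::string, OptionTypeInfo> info = {
            {"write_buffer_size", {0, OptionVerification::kNormal}},
            {"soft_rate_limit", {0, OptionVerification::kDeprecated}}};
        return ApplyOptions({{"soft_rate_limit", "0"}}, info) ? 0 : 1;
      }
      ```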
      Closes https://github.com/facebook/rocksdb/pull/3700
      
      Differential Revision: D7572596
      
      Pulled By: ajkr
      
      fbshipit-source-id: bd5d84e14c0c39f30c5d4c6df7c1503d2c28ecf1
    • fix some text in comments. · d95014b9
      Committed by Yanqin Jin
      Summary:
      1. Remove redundant text.
      2. Make terminology consistent across all comments and docs of RocksDB. Also do our best to conform to conventions. Specifically, use 'callback' instead of 'call-back' ([Wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming))).
      Closes https://github.com/facebook/rocksdb/pull/3693
      
      Differential Revision: D7560396
      
      Pulled By: riversand963
      
      fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8
    • make MockTimeEnv::current_time_ atomic to fix data race · 2770a94c
      Committed by Zhongyi Xie
      Summary:
      fix a new TSAN failure
      https://gist.github.com/miasantreble/7599c33f4e17da1024c67d4540dbe397
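      A minimal sketch of the kind of change the title describes (illustrative, not the actual MockTimeEnv code): a time counter read and written from multiple threads becomes std::atomic so TSAN no longer flags the accesses.
      ```cpp
      #include <atomic>
      #include <cstdint>

      class FakeTimeEnv {
       public:
        void SleepForMicroseconds(int micros) {
          // Relaxed ordering is enough: only the counter itself must be race-free.
          current_time_.fetch_add(micros / 1000000, std::memory_order_relaxed);
        }
        uint64_t NowSeconds() const {
          return current_time_.load(std::memory_order_relaxed);
        }

       private:
        std::atomic<uint64_t> current_time_{0};  // was a plain integer before
      };

      int main() {
        FakeTimeEnv env;
        env.SleepForMicroseconds(2000000);
        return env.NowSeconds() == 2 ? 0 : 1;
      }
      ```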
      Closes https://github.com/facebook/rocksdb/pull/3694
      
      Differential Revision: D7565310
      
      Pulled By: miasantreble
      
      fbshipit-source-id: f672c96e925797b34dec6e20b59527e8eebaa825
  3. Apr 10, 2018 (5 commits)
  4. Apr 08, 2018 (2 commits)
    • WritePrepared Txn: add stats · bde1c1a7
      Committed by Maysam Yabandeh
      Summary:
      Adding some stats that would be helpful for monitoring whether the DB has gone into unlikely states that would hurt performance. These are mostly cases where we end up needing to acquire a mutex.
      Closes https://github.com/facebook/rocksdb/pull/3683
      
      Differential Revision: D7529393
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9
    • WritePrepared Txn: add write_committed option to dump_wal · eb5a2954
      Committed by Maysam Yabandeh
      Summary:
      Currently dump_wal cannot print the prepared records from a WAL generated by the WRITE_PREPARED write policy, since the default reaction of the handler is to return NotSupported when markers of WRITE_PREPARED are encountered. This patch enables the admin to pass the --write_committed=false option, which will be passed along to the handler. Note that DBFileDumperCommand and DBDumperCommand are still not updated by this patch, but firstly they are not urgent, and secondly we need to revise this approach later when we also add WRITE_UNPREPARED markers, so I leave it for future work.
      
      Tested by running it on a WAL generated by WRITE_PREPARED:
      ```
      $ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log  | grep BEGIN_PREARE | head -1
      1,2,70,0,BEGIN_PREARE
      $ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log --write_committed=false | grep BEGIN_PREARE | head -1
      1,2,70,0,BEGIN_PREARE PUT(0) : 0x30303031313330313938 PUT(0) : 0x30303032353732313935 END_PREPARE(0x74786E31313535383434323738303738363938313335312D30)
      ```
      Closes https://github.com/facebook/rocksdb/pull/3682
      
      Differential Revision: D7522090
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a0332207261c61e18b2f9dfbe9feecd9a1339aca
  5. Apr 07, 2018 (2 commits)
  6. Apr 06, 2018 (8 commits)
    • protect valid backup files when max_valid_backups_to_open is set · faba3fb5
      Committed by Andrew Kryczka
      Summary:
      When `max_valid_backups_to_open` is set, the `BackupEngine` doesn't know about the files referenced by existing backups. This PR prevents us from deleting valid files when that option is set, in cases where we are unable to accurately determine refcount. There are warnings logged when we may miss deleting unreferenced files, and a recommendation in the header for users to periodically unset this option and run a full `GarbageCollect`.
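      A hedged usage sketch of the maintenance pattern the header comment recommends, assuming the BackupEngine API of this era (`BackupableDBOptions`, `BackupEngine::Open`, `GarbageCollect`); the paths and limits below are made up for illustration: keep the limit set for fast day-to-day opens, then periodically reopen without it and run a full garbage collection.
      ```cpp
      #include <cassert>
      #include <limits>
      #include "rocksdb/env.h"
      #include "rocksdb/utilities/backupable_db.h"

      int main() {
        rocksdb::BackupableDBOptions opts("/tmp/rocksdb_backup_dir");
        opts.max_valid_backups_to_open = 1;  // fast open; file refcounts unknown

        rocksdb::BackupEngine* engine = nullptr;
        rocksdb::Status s =
            rocksdb::BackupEngine::Open(rocksdb::Env::Default(), opts, &engine);
        assert(s.ok());
        // ... create/restore backups; deletions now refuse to drop files whose
        // refcount could not be determined ...
        delete engine;

        // Periodic maintenance: open with full metadata and garbage collect.
        opts.max_valid_backups_to_open = std::numeric_limits<int>::max();
        s = rocksdb::BackupEngine::Open(rocksdb::Env::Default(), opts, &engine);
        assert(s.ok());
        s = engine->GarbageCollect();
        assert(s.ok());
        delete engine;
        return 0;
      }
      ```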
      Closes https://github.com/facebook/rocksdb/pull/3518
      
      Differential Revision: D7008331
      
      Pulled By: ajkr
      
      fbshipit-source-id: 87907f964dc9716e229d08636a895d2fc7b72305
    • fix shared library compile on ppc · 65717700
      Committed by zhsj
      Summary:
      shared-ppc-objects is missing from the $(SHARED4) target
      Closes https://github.com/facebook/rocksdb/pull/3619
      
      Differential Revision: D7475767
      
      Pulled By: ajkr
      
      fbshipit-source-id: d957ac7290bab3cd542af504405fb5ff912bfbf1
    • Support for Column family specific paths. · 446b32cf
      Committed by Phani Shekhar Mantripragada
      Summary:
      In this change, an option to set different paths for different column families is added.
      This option is set via the cf_paths setting of ColumnFamilyOptions and works in a similar fashion to the db_paths setting. cf_paths is a vector of DbPath values, each containing a pair of an absolute path and a target size. Multiple levels in a column family can go to different paths if cf_paths has more than one path.
      To maintain backward compatibility, if cf_paths is not specified for a column family, the db_paths setting will be used. Note that if db_paths is also not specified, RocksDB already has code to use db_name as the only path.
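      A short usage sketch of the option described above (ColumnFamilyOptions::cf_paths holding rocksdb::DbPath entries); the paths and target sizes are invented for illustration.
      ```cpp
      #include <cassert>
      #include <string>
      #include <vector>
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.create_missing_column_families = true;

        rocksdb::ColumnFamilyOptions cf_opts;
        // Lower levels of this column family go to the fast device, the rest to
        // the larger, slower one, mirroring how db_paths behaves at the DB level.
        cf_opts.cf_paths.emplace_back("/fast_ssd/cf_hot", 10ull << 30 /* ~10 GB */);
        cf_opts.cf_paths.emplace_back("/big_hdd/cf_hot", 500ull << 30 /* ~500 GB */);

        std::vector<rocksdb::ColumnFamilyDescriptor> cf_descs = {
            {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
            {"hot", cf_opts}};
        std::vector<rocksdb::ColumnFamilyHandle*> handles;
        rocksdb::DB* db = nullptr;
        rocksdb::Status s =
            rocksdb::DB::Open(options, "/tmp/cf_paths_demo", cf_descs, &handles, &db);
        assert(s.ok());
        for (auto* h : handles) delete h;
        delete db;
        return 0;
      }
      ```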
      
      Changes :
      1) A new member "cf_paths" is added to ImmutableCfOptions. This is set based on the cf_paths setting of ColumnFamilyOptions and the db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed.
      2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
      3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
      4) Unit tests are added appropriately.
      Closes https://github.com/facebook/rocksdb/pull/3102
      
      Differential Revision: D6951697
      
      Pulled By: ajkr
      
      fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
    • Stats for false positive rate of full filters · 67182678
      Committed by Maysam Yabandeh
      Summary:
      Adds two stats to allow us measuring the false positive rate of full filters:
      - The total count of positives: rocksdb.bloom.filter.full.positive
      - The total count of true positives: rocksdb.bloom.filter.full.true.positive
      Note the term "full" in the stat names, which indicates that they are meaningful only for full filters. Block-based filters are to be deprecated soon, and supporting them is not worth the additional cost of if-then-else branches.
      
      Closes #3680
      
      Tested by:
      ```
      $ ./db_bench -benchmarks=fillrandom  -db /dev/shm/rocksdb-tmpdb --num=1000000 -bloom_bits=10
      $ ./db_bench -benchmarks="readwhilewriting"  -db /dev/shm/rocksdb-tmpdb --statistics -bloom_bits=10 --duration=60 --num=2000000 --use_existing_db 2>&1 > /tmp/full.log
      $ grep filter.full /tmp/full.log
      rocksdb.bloom.filter.full.positive COUNT : 3628593
      rocksdb.bloom.filter.full.true.positive COUNT : 3536026
      ```
      which gives a false positive rate of about 2.5%.
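      A worked version of that number, assuming the rate is taken over all full-filter positives: 3628593 - 3536026 = 92567 false positives, and 92567 / 3628593 ≈ 0.0255, i.e. roughly the 2.5% quoted above.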
      Closes https://github.com/facebook/rocksdb/pull/3681
      
      Differential Revision: D7517570
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 630ab1a473afdce404916d297035b6318de4c052
    • Clock cache should check if deleter is nullptr before calling it · 685912d0
      Committed by Yi Wu
      Summary:
      Clock cache should check if deleter is nullptr before calling it.
      Closes https://github.com/facebook/rocksdb/pull/3677
      
      Differential Revision: D7493602
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 4f94b188d2baf2cbc7c0d5da30fea1215a683de4
    • Fix pre_release callback argument list. · 147dfc7b
      Committed by Dmitri Smirnov
      Summary:
      Constness of primitive-type parameters does not affect the signature of the method and has no influence on whether the overriding method would actually have that const bool instead of just bool. In addition, it is rarely useful but does produce compatibility warnings in the VS 2015 compiler.
      Closes https://github.com/facebook/rocksdb/pull/3663
      
      Differential Revision: D7475739
      
      Pulled By: ajkr
      
      fbshipit-source-id: fb275378b5acc397399420ae6abb4b6bfe5bd32f
    • Blob DB: blob_dump to show uncompressed values · 36a9f229
      Committed by Yi Wu
      Summary:
      Make blob_dump tool able to show uncompressed values if the blob file is compressed. Also show total compressed vs. raw size at the end if --show_summary is provided.
      Closes https://github.com/facebook/rocksdb/pull/3633
      
      Differential Revision: D7348926
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: ca709cb4ed5cf6a550ff2987df8033df81516f8e
    • fix build for rocksdb lite · c827b2dc
      Committed by Zhongyi Xie
      Summary:
      currently the rocksdb lite build fails due to the following errors:
      ```
      db/db_sst_test.cc:29:51: error: ‘FlushJobInfo’ does not name a type
         virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
                                                         ^
      db/db_sst_test.cc:29:16: error: ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’ marked ‘override’, but does not override
         virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
                      ^
      db/db_sst_test.cc:24:7: error: ‘class rocksdb::FlushedFileCollector’ has virtual functions and accessible non-virtual destructor [-Werror=non-virtual-dtor]
       class FlushedFileCollector : public EventListener {
             ^
      db/db_sst_test.cc: In member function ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’:
      db/db_sst_test.cc:31:35: error: request for member ‘file_path’ in ‘info’, which is of non-class type ‘const int’
           flushed_files_.push_back(info.file_path);
                                         ^
      cc1plus: all warnings being treated as errors
      make: *** [db/db_sst_test.o] Error 1
      ```
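      The errors come from FlushJobInfo and the full EventListener interface not being available in LITE builds. An illustrative sketch (not necessarily the exact patch) of the kind of guard that resolves this, keeping the listener-based test helper out of LITE builds:
      ```cpp
      #ifndef ROCKSDB_LITE
      #include <string>
      #include <vector>
      #include "rocksdb/db.h"
      #include "rocksdb/listener.h"

      namespace rocksdb {
      // Only compiled when the full API that defines FlushJobInfo exists.
      class FlushedFileCollector : public EventListener {
       public:
        ~FlushedFileCollector() override {}
        void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
          flushed_files_.push_back(info.file_path);
        }
        std::vector<std::string> flushed_files_;
      };
      }  // namespace rocksdb
      #endif  // !ROCKSDB_LITE

      int main() { return 0; }
      ```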
      Closes https://github.com/facebook/rocksdb/pull/3676
      
      Differential Revision: D7493006
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 77dff0a5b23e27db51be9b9798e3744e6fdec64f
  7. Apr 05, 2018 (1 commit)
  8. Apr 04, 2018 (2 commits)
    • Make Optimistic Tx database stackable · 2a62ca17
      Committed by Dmitri Smirnov
      Summary:
      This change models the Optimistic Tx DB after the Pessimistic Tx DB. The motivation for this change is to make the pointer polymorphic so it can be held by the same raw or smart pointer.
      
      Currently, because the inheritance of the Optimistic Tx DB is not rooted in a single DB base class the way the Pessimistic Tx DB's is, it is more difficult to write clean code and have clear ownership of the database in cases where options dictate instantiation of a plain DB, a Pessimistic Tx DB, or an Optimistic Tx DB.
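      A sketch of the usage pattern this change enables, assuming the post-change hierarchy in which OptimisticTransactionDB, like TransactionDB, is ultimately a rocksdb::DB; the mode flag and paths are illustrative only.
      ```cpp
      #include <cassert>
      #include <memory>
      #include <string>
      #include "rocksdb/db.h"
      #include "rocksdb/utilities/optimistic_transaction_db.h"
      #include "rocksdb/utilities/transaction_db.h"

      std::unique_ptr<rocksdb::DB> OpenSomeDb(bool optimistic,
                                              const std::string& path) {
        rocksdb::Options options;
        options.create_if_missing = true;
        if (optimistic) {
          rocksdb::OptimisticTransactionDB* otxn_db = nullptr;
          rocksdb::Status s =
              rocksdb::OptimisticTransactionDB::Open(options, path, &otxn_db);
          assert(s.ok());
          return std::unique_ptr<rocksdb::DB>(otxn_db);  // held via the DB root
        }
        rocksdb::TransactionDB* txn_db = nullptr;
        rocksdb::Status s = rocksdb::TransactionDB::Open(
            options, rocksdb::TransactionDBOptions(), path, &txn_db);
        assert(s.ok());
        return std::unique_ptr<rocksdb::DB>(txn_db);
      }

      int main() {
        auto db = OpenSomeDb(true, "/tmp/opt_txn_demo");
        return db ? 0 : 1;
      }
      ```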
      Closes https://github.com/facebook/rocksdb/pull/3566
      
      Differential Revision: D7184502
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 31d06efafd79497bb0c230e971857dba3bd962c3
    • Reduce default --nooverwritepercent in black-box crash tests · b058a337
      Committed by Andrew Kryczka
      Summary:
      Previously `python tools/db_crashtest.py blackbox` would do no useful work as the crash interval (two minutes) was shorter than the preparation phase. The preparation phase is slow because of the ridiculously inefficient way it computes which keys should not be overwritten. It was doing this for 60M keys since default values were `FLAGS_nooverwritepercent == 60` and `FLAGS_max_key == 100000000`.
      
      Move the "nooverwritepercent" override from whitebox-specific to the general options so it also applies to blackbox test runs. Now the preparation phase takes a few seconds.
      Closes https://github.com/facebook/rocksdb/pull/3671
      
      Differential Revision: D7457732
      
      Pulled By: ajkr
      
      fbshipit-source-id: 601f4461a6a7e49e50449dcf15aebc9b8a98d6f0
  9. Apr 03, 2018 (6 commits)
    • Some small improvements to the build_tools · 12b400e8
      Committed by Adam Retter
      Summary: Closes https://github.com/facebook/rocksdb/pull/3664
      
      Differential Revision: D7459433
      
      Pulled By: sagar0
      
      fbshipit-source-id: 3817e5d45fc70e83cb26f9800eaa0f4566c8dc0e
    • Level Compaction with TTL · 04c11b86
      Committed by Sagar Vemuri
      Summary:
      Level Compaction with TTL.
      
      As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.
      
      Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
      This also has the (good) side effect that all data in the non-bottommost levels is newer than ttl, and all data in the bottommost level is older than ttl. It could lead to more writes while reducing space.
      
      This functionality can be controlled by the newly introduced column family option -- ttl.
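      A brief usage sketch of the new option described above (ColumnFamilyOptions::ttl, in seconds); the path and value are illustrative.
      ```cpp
      #include <cassert>
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.compaction_style = rocksdb::kCompactionStyleLevel;
        // Files whose data is older than ~7 days become compaction candidates
        // even if no other background work would have picked them up.
        options.ttl = 7 * 24 * 60 * 60;

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/ttl_demo", &db);
        assert(s.ok());
        delete db;
        return 0;
      }
      ```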
      
      TODO for later:
      - Make ttl mutable
      - Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
      - Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
      Closes https://github.com/facebook/rocksdb/pull/3591
      
      Differential Revision: D7275442
      
      Pulled By: sagar0
      
      fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
    • Fix 3-way SSE4.2 crc32c usage in MSVC with CMake · df144244
      Committed by Koby Kahane
      Summary:
      The introduction of the 3-way SSE4.2 optimized crc32c implementation in commit f54d7f5f added the `HAVE_PCLMUL` definition when the compiler supports intrinsics for that instruction, but did not modify CMakeLists.txt to set that definition on MSVC when appropriate. As a result, 3-way SSE4.2 is not used in MSVC builds with CMake although it could be.
      
      Since the existing test program in CMakeLists.txt for `HAVE_SSE42` already uses `_mm_clmulepi64_si128` which is a PCLMUL instruction, this PR sets `HAVE_PCLMUL` as well if that program builds successfully, fixing the problem.
      Closes https://github.com/facebook/rocksdb/pull/3673
      
      Differential Revision: D7473975
      
      Pulled By: miasantreble
      
      fbshipit-source-id: bc346b9eb38920e427aa1a253e6dd9811efa269e
    • WritePrepared Txn: smallest_prepare optimization · b225de7e
      Committed by Maysam Yabandeh
      Summary:
      This is an optimization to reduce lookups in the CommitCache when querying IsInSnapshot. The optimization records the smallest uncommitted sequence number at the time the snapshot was taken, and if the sequence number of the read data is lower than that number it assumes the data is committed.
      To implement this optimization two changes are required: i) the AddPrepared function must be called sequentially to avoid out-of-order insertion into the PrepareHeap (otherwise the top of the heap would not reliably indicate the smallest prepared entry), and ii) non-2PC transactions also call AddPrepared if they do not commit in one step.
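      A minimal sketch of the fast path described above (illustrative names, not the actual WritePreparedTxnDB code): reads below the snapshot's smallest uncommitted sequence number skip the CommitCache lookup entirely.
      ```cpp
      #include <cstdint>
      #include <iostream>

      struct ReadSnapshot {
        uint64_t snapshot_seq;
        uint64_t min_uncommitted;  // smallest uncommitted seq at snapshot time
      };

      // Stand-in for the expensive CommitCache probe.
      bool LookUpCommitCache(uint64_t seq, uint64_t snapshot_seq) {
        std::cout << "commit cache lookup for seq " << seq << "\n";
        return seq <= snapshot_seq;
      }

      bool IsInSnapshot(uint64_t data_seq, const ReadSnapshot& snap) {
        if (data_seq < snap.min_uncommitted) {
          // Everything below the smallest uncommitted seq was already committed
          // when the snapshot was taken, so it is visible without further checks.
          return true;
        }
        return LookUpCommitCache(data_seq, snap.snapshot_seq);
      }

      int main() {
        ReadSnapshot snap{100, 90};
        bool a = IsInSnapshot(50, snap);  // fast path, no cache lookup
        bool b = IsInSnapshot(95, snap);  // falls back to the commit cache
        return (a && b) ? 0 : 1;
      }
      ```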
      Closes https://github.com/facebook/rocksdb/pull/3649
      
      Differential Revision: D7388630
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: b79506238c17467d590763582960d4d90181c600
    • Enable cancelling manual compactions if they hit the sfm size limit · 1579626d
      Committed by Amy Tai
      Summary:
      Manual compactions should be cancelled, just like scheduled compactions are cancelled, if sfm->EnoughRoomForCompaction is not true.
      Closes https://github.com/facebook/rocksdb/pull/3670
      
      Differential Revision: D7457683
      
      Pulled By: amytai
      
      fbshipit-source-id: 669b02fdb707f75db576d03d2c818fb98d1876f5
    • Revert "Avoid adding tombstones of the same file to RangeDelAggregato… · 44653c7b
      Committed by Zhongyi Xie
      Summary:
      …r multiple times"
      
      This reverts commit e80709a3.
      
      lingbin's PR https://github.com/facebook/rocksdb/pull/3635 is causing a performance regression for seekrandom workloads.
      I'm reverting the commit for now, but feel free to submit new patches 😃
      
      To reproduce the regression, you can run the following db_bench command
      > ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none
      
      write stats printed by db_bench:
      
      |               | P50    | P75    | P99     | P99.9   | P99.99  |
      | ---           | ---    | ---    | ---     | ---     | ---     |
      | revert commit | 80.77  | 102.94 | 1786.44 | 1892.39 | 2645.10 |
      | keep commit   | 221.72 | 686.62 | 1842.57 | 1899.70 | 2814.29 |
      Closes https://github.com/facebook/rocksdb/pull/3672
      
      Differential Revision: D7463315
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 8e779c87591127f2c3694b91a56d9b459011959d
  10. Apr 02, 2018 (1 commit)
  11. Mar 31, 2018 (4 commits)
    • Throw NoSpace instead of IOError when out of space. · d12112d0
      Committed by Fosco Marotto
      Summary:
      Replaces #1702 and is updated from feedback.
      Closes https://github.com/facebook/rocksdb/pull/3531
      
      Differential Revision: D7457395
      
      Pulled By: gfosco
      
      fbshipit-source-id: 25a21dd8cfa5a6e42e024208b444d9379d920c82
    • Update buckifier and TARGETS · d9bfb35d
      Committed by Fosco Marotto
      Summary:
      Some flags used via make were not applied in the buckifier/TARGETS file, causing some failures to be missed by the testing infra (i.e., the one fixed by #3434).
      Closes https://github.com/facebook/rocksdb/pull/3452
      
      Differential Revision: D7457419
      
      Pulled By: gfosco
      
      fbshipit-source-id: e4aed2915ca3038c1485bbdeebedfc33d5704a49
    • Update 64-bit shift in compression.h · c3eb762b
      Committed by Fosco Marotto
      Summary:
      This was failing the build on Windows with zstd: a warning, treated as an error, about a 32-bit shift being implicitly converted to 64-bit.
      Closes https://github.com/facebook/rocksdb/pull/3624
      
      Differential Revision: D7307883
      
      Pulled By: gfosco
      
      fbshipit-source-id: 68110e9b5b1b59b668dec6cf86b67556402574e7
    • Skip deleted WALs during recovery · 73f21a7b
      Committed by Maysam Yabandeh
      Summary:
      This patch records the deleted WAL numbers in the manifest so that they, and any WAL older than them, are ignored during recovery. This avoids scenarios where there is a gap among the WAL files fed to the recovery procedure. Such a gap could arise from, for example, out-of-order WAL deletion, and it could break 2PC recovery: the prepare and commit entries may be placed in two separate WALs, and a gap could result in not processing the WAL with the commit entry.
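      A simplified sketch of the recovery-side filtering described above (illustrative; the real logic lives in the DBImpl recovery path): WALs whose number is at or below the highest deleted WAL recorded in the manifest are skipped rather than replayed.
      ```cpp
      #include <algorithm>
      #include <cstdint>
      #include <iostream>
      #include <vector>

      std::vector<uint64_t> WalsToRecover(std::vector<uint64_t> wal_numbers,
                                          uint64_t max_deleted_wal_number) {
        std::vector<uint64_t> result;
        for (uint64_t num : wal_numbers) {
          if (num <= max_deleted_wal_number) {
            continue;  // already deleted (or older than a deleted WAL): ignore
          }
          result.push_back(num);
        }
        std::sort(result.begin(), result.end());
        return result;
      }

      int main() {
        // WAL 7 was deleted out of order and recorded in the manifest, so WALs 5
        // and 6 must not be replayed either, avoiding a gap in the recovery.
        auto wals = WalsToRecover({5, 6, 8, 9}, /*max_deleted_wal_number=*/7);
        for (uint64_t n : wals) std::cout << n << " ";  // prints "8 9"
        std::cout << "\n";
        return 0;
      }
      ```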
      Closes https://github.com/facebook/rocksdb/pull/3488
      
      Differential Revision: D6967893
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1
  12. Mar 30, 2018 (1 commit)
    • WritePrepared Txn: fix a bug in publishing recoverable state seq · 89d989ed
      Committed by Maysam Yabandeh
      Summary:
      When using two_write_queue, the published seq and the last allocated sequence could be ahead of the LastSequence, even if both write queues are stopped as in WriteRecoverableState. The patch fixes a bug in WriteRecoverableState in which LastSequence was used as a reference but the result was applied to last fetched sequence and last published seq.
      Closes https://github.com/facebook/rocksdb/pull/3665
      
      Differential Revision: D7446099
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 1449bed9aed8e9db6af85946efd347cb8efd3c0b
  13. Mar 29, 2018 (3 commits)
    • Allow rocksdbjavastatic to also be built as debug build · 3cb59195
      Committed by Adam Retter
      Summary: Closes https://github.com/facebook/rocksdb/pull/3654
      
      Differential Revision: D7417948
      
      Pulled By: sagar0
      
      fbshipit-source-id: 9514df9328181e54a6384764444c0c7ce66e7f5f
    • WritePrepared Txn: make recoverable state visible after flush · 0377ff9d
      Committed by Maysam Yabandeh
      Summary:
      Currently, if the CommitTimeWriteBatch is set to be used only as state required for recovery, the user cannot see it in the DB until the DB is restarted, even though the state is already inserted into the DB after the memtable flush. It would be useful for debugging to make this state visible to the user after the flush by committing it. The patch does so by invoking a callback that commits the recoverable state.
      Closes https://github.com/facebook/rocksdb/pull/3661
      
      Differential Revision: D7424577
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 137f9408662f0853938b33fa440f27f04c1bbf5c
    • Fix race condition causing double deletion of ssts · 1f5def16
      Committed by Yanqin Jin
      Summary:
      Possible interleaved execution of a background compaction thread calling `FindObsoleteFiles (no full scan) / PurgeObsoleteFiles` and a user thread calling `FindObsoleteFiles (full scan) / PurgeObsoleteFiles` can lead to a race condition in which RocksDB attempts to delete a file twice. The second attempt fails and returns `IO error`. This may occur with other files too, but this PR targets SST files.
      Also add a unit test to verify that this PR fixes the issue.
      
      The newly added unit test `obsolete_files_test` has a test case for this scenario, implemented in `ObsoleteFilesTest#RaceForObsoleteFileDeletion`. `TestSyncPoint`s are used to coordinate the interleaving the `user_thread` and background compaction thread. They execute as follows
      ```
      timeline              user_thread                background_compaction thread
      t1   |                                          FindObsoleteFiles(full_scan=false)
      t2   |     FindObsoleteFiles(full_scan=true)
      t3   |                                          PurgeObsoleteFiles
      t4   |     PurgeObsoleteFiles
           V
      ```
      When `user_thread` invokes `FindObsoleteFiles` with full scan, it collects ALL files in the RocksDB directory, including the ones that the background compaction thread has collected in its job context. Then `user_thread` will see an IO error when trying to delete these files in `PurgeObsoleteFiles` because the background compaction thread has already deleted them in its own `PurgeObsoleteFiles`.
      To fix this, we make RocksDB remember which (SST) files have been found by threads after calling `FindObsoleteFiles` (see `DBImpl#files_grabbed_for_purge_`). Therefore, when another thread calls `FindObsoleteFiles` with full scan, it will not collect such files.
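      An illustrative sketch (not the actual DBImpl code) of the bookkeeping described above: file numbers grabbed for purge are remembered under a mutex, so a concurrent full-scan `FindObsoleteFiles` will not pick them up again and attempt a second deletion.
      ```cpp
      #include <cstdint>
      #include <mutex>
      #include <unordered_set>

      class ObsoleteFileTracker {
       public:
        // Called while collecting obsolete files; returns false if some other
        // thread already grabbed this file for purging.
        bool TryGrabForPurge(uint64_t file_number) {
          std::lock_guard<std::mutex> lock(mu_);
          return files_grabbed_for_purge_.insert(file_number).second;
        }

        // Called once the file has actually been deleted.
        void MarkPurged(uint64_t file_number) {
          std::lock_guard<std::mutex> lock(mu_);
          files_grabbed_for_purge_.erase(file_number);
        }

       private:
        std::mutex mu_;
        std::unordered_set<uint64_t> files_grabbed_for_purge_;
      };

      int main() {
        ObsoleteFileTracker tracker;
        bool first = tracker.TryGrabForPurge(42);   // background thread wins
        bool second = tracker.TryGrabForPurge(42);  // full-scan thread skips it
        tracker.MarkPurged(42);
        return (first && !second) ? 0 : 1;
      }
      ```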
      
      ajkr could you take a look and comment? Thanks!
      Closes https://github.com/facebook/rocksdb/pull/3638
      
      Differential Revision: D7384372
      
      Pulled By: riversand963
      
      fbshipit-source-id: 01489516d60012e722ee65a80e1449e589ce26d3