1. 22 Jul, 2015 1 commit
    • M
      [wal changes 2/3] write with sync=true syncs previous unsynced wals to prevent illegal data loss · fe09a6da
      Mike Kolupaev committed
      Summary:
      I'll just copy internal task summary here:
      
      "
      This sequence will cause data loss in the middle after a sync write:
      
      non-sync write key 1
      flush triggered, not yet scheduled
      sync write key 2
      system crash
      
      After rebooting, users might see key 2 but not key 1, which violates the API of sync write.
      
      This can be reproduced using unit test FaultInjectionTest::DISABLED_WriteOptionSyncTest.
      
      One way to fix it is, for a sync write, to sync any outstanding unsynced log files as well.
      "
      
      This diff should be considered together with the next diff D40905; in isolation this fix probably could be a little simpler.
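
The failure sequence and the fix quoted above can be pictured with a toy model. This is a minimal sketch, not RocksDB's implementation: `ToyWal`, `roll_log`, `write`, and `crash` are hypothetical names, each log file is modeled as a list of keys with a synced prefix, and the sketch assumes that a triggered flush makes later writes go to a new log file.

```python
class ToyWal:
    """Toy model of a sequence of WAL files (hypothetical, for illustration)."""

    def __init__(self, sync_previous_logs):
        # sync_previous_logs=True models the fix; False models the old behavior.
        self.sync_previous_logs = sync_previous_logs
        self.logs = [{"keys": [], "synced_upto": 0}]

    def roll_log(self):
        # Assumption: a triggered flush switches writes to a fresh log file.
        self.logs.append({"keys": [], "synced_upto": 0})

    def write(self, key, sync):
        current = self.logs[-1]
        current["keys"].append(key)
        if not sync:
            return
        # Old behavior: only the current log is synced.
        # Fixed behavior: every log with unsynced data is synced too.
        to_sync = self.logs if self.sync_previous_logs else [current]
        for log in to_sync:
            log["synced_upto"] = len(log["keys"])

    def crash(self):
        # Only the synced prefix of each log survives a system crash.
        return [k for log in self.logs for k in log["keys"][: log["synced_upto"]]]


def run(sync_previous_logs):
    wal = ToyWal(sync_previous_logs)
    wal.write("key1", sync=False)  # non-sync write key 1
    wal.roll_log()                 # flush triggered, not yet scheduled
    wal.write("key2", sync=True)   # sync write key 2
    return wal.crash()             # system crash
```

In this model, `run(False)` recovers only `key2`, which is the hole described in the summary, while `run(True)` recovers both keys.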
      
      Test Plan: `make check`; added a test for that (DBTest.SyncingPreviousLogs) before noticing FaultInjectionTest.WriteOptionSyncTest (keeping both since mine asserts a bit more); both tests fail without this diff; for D40905 stacked on top of this diff, ran tests with ASAN, TSAN and valgrind
      
      Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D40899
      fe09a6da
  2. 21 Jul, 2015 1 commit
    • A
      Improved FileExists API · 06429408
      agiardullo committed
      Summary: Add new CheckFileExists method.  Considered changing the FileExists api but didn't want to break anyone's builds.
      
      Test Plan: unit tests
      
      Reviewers: yhchiang, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D42003
      06429408
  3. 18 Jul, 2015 3 commits
    • S
      Move rate_limiter, write buffering, most perf context instrumentation and most... · 6e9fbeb2
      sdong committed
      Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
      
      Summary: We want to keep Env a thin layer for better portability. Code that is less platform-dependent should be moved out of Env. In this patch, I create a wrapper for file readers and writers, and move rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. This will make it easier to maintain multiple Env implementations in the future.
      
      Test Plan: Run all existing unit tests.
      
      Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D42321
      6e9fbeb2
    • I
      Don't let flushes preempt compactions · 35ca5936
      Igor Canadi committed
      Summary:
      When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler.
      
      We have a special case that when you set max_background_flushes to 0, we
      schedule the flush to execute on the compaction thread.
      
      Test Plan: make check (there might be some unit tests that depend on this behavior)
      
      Reviewers: IslamAbdelRahman, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D41931
      35ca5936
    • I
      Deprecate CompactionFilterV2 · a96fcd09
      Igor Canadi committed
      Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature.
      
      Test Plan: make check
      
      Reviewers: noetzli, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D42405
      a96fcd09
  4. 17 Jul, 2015 1 commit
    • S
      Fix data loss after DB recovery by not allowing flush/compaction to be scheduled until DB opened · 6c0c8dee
      sdong committed
      Summary:
      A previous run may leave some SST files with higher file numbers than the manifest indicates.
      Compaction or flush may start to run while DB::Open() is still going on. SST file garbage collection may then interleave with the compaction or flush and overwrite files generated by them after they are generated. This might cause data loss. This possibility of interleaving was introduced recently.
      Fix it by not allowing compaction or flush to be scheduled before DB::Open() finishes.
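
The gating idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual change: `ToyScheduler`, `maybe_schedule`, and `finish_open` are hypothetical names, and remembering early requests so they can be scheduled once the open finishes is an assumption of the sketch (the summary only says scheduling is not allowed before DB::Open() finishes).

```python
class ToyScheduler:
    """Toy background-job scheduler that refuses work until the DB is open."""

    def __init__(self):
        self.opened = False
        self.pending = []    # jobs requested while the open was still running
        self.scheduled = []  # jobs actually handed to background threads

    def maybe_schedule(self, job):
        if not self.opened:
            # Recovery and obsolete-file cleanup may still be running,
            # so nothing is allowed to start yet.
            self.pending.append(job)
            return False
        self.scheduled.append(job)
        return True

    def finish_open(self):
        # From here on background work is safe; release anything deferred.
        self.opened = True
        deferred, self.pending = self.pending, []
        for job in deferred:
            self.maybe_schedule(job)
```

A flush requested during the open stays in `pending` and only moves to `scheduled` after `finish_open()` is called.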
      
      Test Plan: Add a unit test. The test has a chance of failing without the fix but does not fail with the fix.
      
      Reviewers: kradhakrishnan, anthony, yhchiang, IslamAbdelRahman, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D42399
      6c0c8dee
  5. 16 Jul, 2015 1 commit
    • P
      Fixing delete files in Trivial move of universal compaction · beb19ad0
      Poornima Chozhiyath Raman committed
      Summary:
      Trivial move in universal compaction was failing when trying to move files from levels other than 0.
      This was because DeleteFile, during a trivial move, only deleted files from level 0, which left the same file duplicated in different levels.
      This is fixed by passing the right level as the argument to DeleteFile when doing a trivial move.
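
The duplication can be reproduced in a toy model of the level layout. This is a sketch only: `trivial_move` and `levels_containing` are hypothetical helpers, and levels are modeled as a plain dict from level number to a set of file names rather than RocksDB's version edits.

```python
def trivial_move(levels, file_name, dst_level, delete_level):
    """Return a new layout where file_name is removed from delete_level
    and added to dst_level. `levels` maps level number -> set of file names."""
    new_levels = {lvl: set(files) for lvl, files in levels.items()}
    # Removing from a level that does not hold the file is a silent no-op,
    # which is how the wrong level goes unnoticed.
    new_levels.setdefault(delete_level, set()).discard(file_name)
    new_levels.setdefault(dst_level, set()).add(file_name)
    return new_levels


def levels_containing(levels, file_name):
    """List every level that currently references file_name."""
    return sorted(lvl for lvl, files in levels.items() if file_name in files)


layout = {0: set(), 1: {"a.sst"}, 2: set()}
# Bug: the delete always targets level 0, so the file stays in level 1 as well.
buggy = trivial_move(layout, "a.sst", dst_level=2, delete_level=0)
# Fix: the delete targets the level the file actually came from.
fixed = trivial_move(layout, "a.sst", dst_level=2, delete_level=1)
```

With the wrong level, `a.sst` ends up referenced by both level 1 and level 2; with the source level passed in, only level 2 holds it.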
      
      Test Plan: ./db_test ran successfully with the new test cases.
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D42135
      beb19ad0
  6. 14 Jul, 2015 2 commits
    • I
      Deprecate WriteOptions::timeout_hint_us · 5aea98dd
      Igor Canadi committed
      Summary:
      In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments:
      * Less code == better icache hit rate, smaller builds, simpler code
      * The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future.
      * Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users.
      
      Test Plan: make check
      
      Reviewers: anthony, kradhakrishnan, sdong, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: sdong, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D41937
      5aea98dd
    • S
      "make format" against last 10 commits · f9728640
      sdong committed
      Summary: This helps the Windows port format their changes, as discussed. It might have formatted some other code too, because the last 10 commits include more.
      
      Test Plan: Build it.
      
      Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D41961
      f9728640
  7. 11 Jul, 2015 1 commit
  8. 09 Jul, 2015 1 commit
  9. 08 Jul, 2015 2 commits
  10. 03 Jul, 2015 1 commit
    • M
      [wal changes 1/3] fixed unbounded wal growth in some workloads · 218487d8
      Mike Kolupaev committed
      Summary:
      This fixes the following scenario we've hit:
       - we reached max_total_wal_size, created a new wal and scheduled flushing all memtables corresponding to the old one,
       - before the last of these flushes started its column family was dropped; the last background flush call was a no-op; no one removed the old wal from alive_logs_,
       - hours have passed and no flushes happened even though lots of data was written; data is written to different column families, compactions are disabled; old column families are dropped before memtable grows big enough to trigger a flush; the old wal still sits in alive_logs_ preventing max_total_wal_size limit from kicking in,
       - a few more hours pass and we run out of disk space because of one huge .log file.
      
      Test Plan: `make check`; backported the new test, checked that it fails without this diff
      
      Reviewers: igor
      
      Reviewed By: igor
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D40893
      218487d8
  11. 02 Jul, 2015 1 commit
    • D
      Windows Port from Microsoft · 18285c1e
      Dmitri Smirnov committed
       Summary: Make RocksDB build and run on Windows to be functionally
       complete and performant. All existing test cases run with no
       regressions. Performance numbers are in the pull-request.
      
       Test plan: make all of the existing unit tests pass, obtain perf numbers.
      
       Co-authored-by: Praveen Rao praveensinghrao@outlook.com
       Co-authored-by: Sherlock Huang baihan.huang@gmail.com
       Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com
       Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com
      18285c1e
  12. 26 Jun, 2015 1 commit
  13. 24 Jun, 2015 2 commits
    • I
      Bottommost level compaction option · 674b1181
      Islam AbdelRahman committed
      Summary: Replace force_bottommost_level_compaction in CompactRangeOptions with an option that lets the user choose how the bottommost level is handled in level-based compaction: always skip, always compact, or compact only if a compaction filter is present.
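
The three choices listed in the summary amount to a small decision function. The sketch below is illustrative only: `BottommostMode` and `should_compact_bottommost` are hypothetical names and do not claim to match the identifiers used in the RocksDB headers.

```python
from enum import Enum


class BottommostMode(Enum):
    """The three behaviors named in the summary (hypothetical names)."""
    SKIP = "always skip"
    FORCE = "always compact"
    IF_HAVE_COMPACTION_FILTER = "compact if compaction filter is present"


def should_compact_bottommost(mode, has_compaction_filter):
    """Decide whether a manual range compaction rewrites the bottommost level."""
    if mode is BottommostMode.SKIP:
        return False
    if mode is BottommostMode.FORCE:
        return True
    # Remaining mode: rewrite only when a filter could change the data.
    return has_compaction_filter
```

A three-valued option expresses "never" explicitly, which the earlier boolean (force or not) could not.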
      
      Test Plan: make check
      
      Reviewers: sdong, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D40527
      674b1181
    • G
      Implement a table-level row cache · 782a1590
      Giuseppe Ottaviano committed
      Summary:
      Implementation of a table-level row cache.
      It only caches point queries done through the `DB::Get` interface; queries done through the `Iterator` interface skip the cache completely.
      
      Supports snapshots and merge operations.
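
The split between cached point lookups and uncached iteration can be sketched with a toy table. This is not the RocksDB row cache: `ToyTable`, `get`, and `iterate` are hypothetical, the cache is an unbounded dict, and snapshots and merge operations are deliberately left out.

```python
class ToyTable:
    """Toy table with a row cache that only serves point lookups."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.row_cache = {}
        self.storage_reads = 0  # counts simulated reads from the table file

    def _read_storage(self, key):
        self.storage_reads += 1
        return self.rows.get(key)

    def get(self, key):
        # Point query: consult the row cache first, fill it on a miss.
        if key in self.row_cache:
            return self.row_cache[key]
        value = self._read_storage(key)
        if value is not None:
            self.row_cache[key] = value
        return value

    def iterate(self):
        # Iterator path: never reads from or writes to the row cache.
        for key in sorted(self.rows):
            yield key, self._read_storage(key)
```

Repeated `get` calls for the same key hit storage once, while a full scan through `iterate` reads every row from storage and leaves the cache unchanged.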
      
      Test Plan: Ran `make valgrind_check commit-prereq`
      
      Reviewers: igor, philipp, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D39849
      782a1590
  14. 23 Jun, 2015 2 commits
    • K
      Introduce WAL recovery consistency levels · de85e4ca
      krad committed
      Summary:
      The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL.
      
      1. RecoverAfterRestart (kTolerateCorruptedTailRecords)
      
      This mimics the current recovery mode.
      
      2. RecoverAfterCleanShutdown (kAbsoluteConsistency)
      
      This is ideal for unit tests and cases where the store is shut down cleanly. We tolerate no corruption or incomplete writes.
      
      3. RecoverPointInTime (kPointInTimeRecovery)
      
      This is ideal when using devices with a controller cache or file systems which can lose data on restart. We recover up to the point where there is no corruption or incomplete write.
      
      4. RecoverAfterDisaster (kSkipAnyCorruptRecord)
      
      This is the ideal mode for recovering data. We tolerate corruption and incomplete writes, and we hop over the sections that we cannot make sense of, salvaging as many records as possible.
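
The four levels can be contrasted on a toy log. The sketch below is a simplification and not the RocksDB log reader: `RecoveryMode` and `replay` are hypothetical names, a record is just a `(payload, is_intact)` pair, and treating "tail" as only the final record of the log is an assumption of this sketch.

```python
from enum import Enum


class RecoveryMode(Enum):
    TOLERATE_CORRUPTED_TAIL_RECORDS = 1
    ABSOLUTE_CONSISTENCY = 2
    POINT_IN_TIME = 3
    SKIP_ANY_CORRUPT_RECORD = 4


def replay(records, mode):
    """Replay (payload, is_intact) records in log order and return the
    payloads that are recovered under the given mode."""
    recovered = []
    last = len(records) - 1
    for index, (payload, is_intact) in enumerate(records):
        if is_intact:
            recovered.append(payload)
            continue
        if mode is RecoveryMode.ABSOLUTE_CONSISTENCY:
            # No corruption or incomplete write is tolerated.
            raise ValueError(f"corrupt record at index {index}")
        if mode is RecoveryMode.TOLERATE_CORRUPTED_TAIL_RECORDS:
            # Assumption: only a damaged final record is forgiven.
            if index != last:
                raise ValueError(f"corrupt record at index {index}")
            break
        if mode is RecoveryMode.POINT_IN_TIME:
            # Stop at the first damaged record; keep everything before it.
            break
        # SKIP_ANY_CORRUPT_RECORD: hop over the damaged record and continue.
    return recovered
```

For a log whose middle record is damaged, point-in-time recovery keeps only the records before the damage, while the skip mode also keeps the records after it.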
      
      Test Plan:
      (1) Run added unit test to cover all levels.
      (2) Run make check.
      
      Reviewers: leveldb, sdong, igor
      
      Subscribers: yoshinorim, dhruba
      
      Differential Revision: https://reviews.facebook.net/D38487
      de85e4ca
    • I
      Fix trivial move merge · 530534fc
      Islam AbdelRahman committed
      Summary: Fixing bad merge
      
      Test Plan: make -j64 check (this is not enough to verify the fix)
      
      Reviewers: igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D40521
      530534fc
  15. 19 Jun, 2015 2 commits
    • I
      Fail DB::Open() when the requested compression is not available · 760e9a94
      Igor Canadi committed
      Summary:
      Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users.
      
      This patch fails DB::Open() if we don't support the compression that is specified in the options.
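
The change from silently falling back to failing early can be pictured as a validation step at open time. This is a toy sketch: `open_db`, `UnsupportedCompression`, and the set of available libraries passed in by the caller are all hypothetical and stand in for whatever the build actually links.

```python
class UnsupportedCompression(Exception):
    """Raised when the requested compression is not available (hypothetical)."""


def open_db(requested_compression, available_compressions):
    """Refuse to open instead of quietly writing uncompressed data."""
    if requested_compression not in available_compressions:
        raise UnsupportedCompression(
            f"compression '{requested_compression}' is not available in this build"
        )
    # Only reached when the requested compression can really be applied.
    return {"compression": requested_compression}
```

Failing at open time surfaces the misconfiguration immediately, rather than leaving the user with uncompressed files and no indication why.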
      
      Test Plan: make check with LZ4 not present. If Snappy is not present, all tests will just fail because Snappy is our default library. We should make Snappy a requirement, since without it our default DB::Open() fails.
      
      Reviewers: sdong, MarkCallaghan, rven, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D39687
      760e9a94
    • I
      Skip bottommost level compaction if possible · 4eabbdb7
      Islam AbdelRahman committed
      Summary:
      This is https://reviews.facebook.net/D39999, but after introducing an option to force compaction of the bottommost level.
      
      Changes in this patch
      - Introduce force_bottommost_level_compaction in CompactRangeOptions, which forces compaction of the bottommost level
      - Skip bottommost level compaction if we don't have a compaction filter and the force_bottommost_level_compaction option is not set
      
      Although the tests pass on my machine, I suspect that there may be some tests I am not aware of that should use force_bottommost_level_compaction to pass in a deterministic way.
      
      Test Plan:
      make check
      adding new tests
      
      Reviewers: igor, sdong, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D40059
      4eabbdb7
  16. 18 Jun, 2015 3 commits
    • Y
      Fixed a bug of CompactionStats in multi-level universal compaction case · bb1c74ce
      Yueh-Hsuan Chiang committed
      Summary:
      Universal compaction can involve multiple levels.  However,
      the current implementation of bytes_readn and bytes_readnp1
      (and some other stats with the suffix `n` or `np1`) assumes a compaction
      can only have two levels.
      
      This patch fixes this bug and redefines bytes_readn and bytes_readnp1:
      * bytes_readnp1: the number of bytes read in the compaction output level.
      * bytes_readn: the total number of bytes read minus bytes_readnp1
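
The redefinition is plain arithmetic over the bytes read per input level. The helper below is a hypothetical illustration (`compaction_read_stats` and its dict input are not part of RocksDB) that works for any number of input levels.

```python
def compaction_read_stats(input_bytes_by_level, output_level):
    """Return (bytes_readn, bytes_readnp1) for a compaction.

    input_bytes_by_level maps level number -> bytes read from that level.
    """
    total = sum(input_bytes_by_level.values())
    # bytes_readnp1: bytes read in the compaction output level.
    bytes_readnp1 = input_bytes_by_level.get(output_level, 0)
    # bytes_readn: everything else that was read.
    bytes_readn = total - bytes_readnp1
    return bytes_readn, bytes_readnp1
```

For a universal compaction reading 100 bytes from L0, 50 from L2, and 300 from the output level L4, this gives `bytes_readn = 150` and `bytes_readnp1 = 300`.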
      
      Test Plan: Add a test in compaction_job_stats_test
      
      Reviewers: igor, sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D40239
      bb1c74ce
    • I
      Use CompactRangeOptions for CompactRange · 12e030a9
      Islam AbdelRahman committed
      Summary:
      This diff updates DB::CompactRange to use CompactRangeOptions instead of multiple parameters.
      The old CompactRange is still available but deprecated.
      
      Test Plan:
      make all check
      make rocksdbjava
      USE_CLANG=1 make all
      OPT=-DROCKSDB_LITE make release
      
      Reviewers: sdong, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D40209
      12e030a9
    • I
      Clean up InstallSuperVersion · 25d60056
      Igor Canadi committed
      Summary:
      We go to great lengths to make sure MaybeScheduleFlushOrCompaction() is called outside of the write thread. But it's still called while holding the mutex, so it's not that much cheaper.
      
      This diff removes the "optimization" and cleans up the code a bit.
      
      Test Plan: make check
      
      Reviewers: rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D40113
      25d60056
  17. 17 Jun, 2015 1 commit
  18. 13 Jun, 2015 2 commits
  19. 12 Jun, 2015 4 commits
    • S
      Slow down writes by bytes written · 7842920b
      sdong committed
      Summary:
      With this patch, we slow down writes into the database to the rate of options.delayed_write_rate (a new option).
      
      The thread synchronization approach I take is to still synchronize the write controller with the DB mutex, and GetDelay() is called inside the DB mutex. I try to minimize the frequency of getting the time in GetDelay(). I verified it through db_bench and it seems to work.
      
      hard_rate_limit is deprecated.
      
      options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up.
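
Pacing writes by bytes rather than by a fixed sleep can be sketched with a generic rate-pacing scheme. This is not the logic of the real GetDelay(), whose details are not given here: `ToyWriteController` is hypothetical, the caller passes in the current time, and the rate is assumed to be in bytes per second.

```python
class ToyWriteController:
    """Generic byte-rate pacer (an illustration, not RocksDB's controller)."""

    def __init__(self, delayed_write_rate):
        self.rate = delayed_write_rate  # assumed unit: bytes per second
        self.next_allowed = 0.0         # earliest time the next write may start

    def get_delay(self, now, num_bytes):
        """Return how many seconds the writer should wait before writing
        num_bytes so that the long-run rate stays at or below self.rate."""
        start = max(now, self.next_allowed)
        # Reserve the time this write "costs" at the configured rate.
        self.next_allowed = start + num_bytes / self.rate
        return start - now
```

At 1000 bytes per second, a first 500-byte write at time 0 is not delayed, a second 500-byte write at time 0 must wait 0.5 seconds, and a write arriving after the reserved window has passed is again not delayed.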
      
      Test Plan: Add new unit tests in db_test
      
      Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor
      
      Reviewed By: igor
      
      Subscribers: ikabiljo, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D36351
      7842920b
    • I
      Add largest sequence to FlushJobInfo · d6ce0f7c
      Islam AbdelRahman committed
      Summary:
      Adding largest sequence number to FlushJobInfo
      and passing flushed file metadata to NotifyOnFlushCompleted, which includes a lot of other values that we may want to expose in FlushJobInfo
      
      Test Plan: make check
      
      Reviewers: igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D39927
      d6ce0f7c
    • Y
      Add Env::GetThreadID(), which returns the ID of the current thread. · 3eddd1ab
      Yueh-Hsuan Chiang committed
      Summary:
      Add Env::GetThreadID(), which returns the ID of the current thread.
      
      In addition, make GetThreadList() and InfoLog use same unique ID for the same thread.
      
      Test Plan:
      db_test
      listener_test
      
      Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D39735
      3eddd1ab
    • I
      Handling edge cases for ReFitLevel · 73faa3d4
      Islam AbdelRahman committed
      Summary:
      Right now the level we pass to ReFitLevel is the maximum level with files (before compaction). There are multiple cases where this maximum level has changed after compaction:
      - all files were in L0 (now the maximum level is L1)
      - using kCompactionStyleUniversal (now the maximum level is the last level)
      - level_compaction_dynamic_level_bytes ??
      
      We can handle each of these cases individually, but I felt it's safer to calculate max_level_with_files again if we want to do a ReFitLevel
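
Recomputing the value after compaction, instead of reusing the one taken before it, can be shown with a small helper. `max_level_with_files` here is a hypothetical stand-in that takes a list of file counts indexed by level.

```python
def max_level_with_files(files_per_level):
    """Return the highest level index that holds at least one file
    (0 if the tree is empty)."""
    max_level = 0
    for level, count in enumerate(files_per_level):
        if count > 0:
            max_level = level
    return max_level


before_compaction = [3, 0, 0]  # all files in L0
after_compaction = [0, 2, 0]   # compaction wrote its output to L1

stale = max_level_with_files(before_compaction)
fresh = max_level_with_files(after_compaction)
```

Here `stale` is 0 while `fresh` is 1, so passing the pre-compaction value to a refit step would point at a level that no longer holds the data.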
      
      Test Plan:
      adding some tests
      make -j64 check
      
      Reviewers: igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: ott, dhruba
      
      Differential Revision: https://reviews.facebook.net/D39663
      73faa3d4
  20. 10 Jun, 2015 1 commit
    • V
      Fix hang when closing a DB after doing loads with WAL disabled. · 406a5682
      Venkatesh Radhakrishnan committed
      Summary:
      There is a hang during DB close in the following scenario:
      a) a load with WAL disabled was done,
      b) CancelAllBackgroundWork was called,
      c) DB Close was called
      This was because in that case we wait for a flush, but we cannot do a
      background flush because we have called CancelAllBackgroundWork, which
      marks the DB as shutting down.
      
      Test Plan: Added DBTest FlushOnDestroy
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: yoshinorim, hermanlee4, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D39747
      406a5682
  21. 09 Jun, 2015 1 commit
  22. 06 Jun, 2015 5 commits
  23. 05 Jun, 2015 1 commit
    • I
      Allowing L0 -> L1 trivial move on sorted data · 3ce3bb3d
      Islam AbdelRahman committed
      Summary:
      This diff updates the logic of how we do trivial move. Now trivial move can run on any number of files in the input level as long as they are not overlapping.
      
      The conditions for trivial move have been updated
      
      Introduced conditions:
        - Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
        - Input level files cannot be overlapping
      
      Removed conditions:
        - Trivial move only runs when the compaction is not manual
        - Input level can contain only 1 file
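
The two introduced conditions can be written down as a small predicate. This is a sketch under stated assumptions: `files_overlap` and `can_trivially_move` are hypothetical helpers, files are modeled as inclusive `(smallest_key, largest_key)` pairs compared as plain strings, and any other conditions the real compaction picker applies are ignored.

```python
def files_overlap(files):
    """True if any two (smallest_key, largest_key) ranges share a key."""
    ordered = sorted(files)  # sort by smallest key
    # After sorting, any overlap shows up between some adjacent pair.
    for (_, prev_largest), (next_smallest, _) in zip(ordered, ordered[1:]):
        if next_smallest <= prev_largest:
            return True
    return False


def can_trivially_move(input_files, has_compaction_filter, is_manual):
    """Apply the two introduced conditions from the list above."""
    # A compaction filter blocks trivial move, except when the
    # compaction is not manual.
    if has_compaction_filter and is_manual:
        return False
    # Input level files cannot be overlapping.
    return not files_overlap(input_files)
```

Sorted, non-overlapping input such as `[("a", "c"), ("d", "f")]` qualifies, while `[("a", "d"), ("c", "f")]` does not because the two ranges share keys.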
      
      More context on what tests failed because of Trivial move
      ```
      DBTest.CompactionsGenerateMultipleFiles
      This test expects compaction on a file in L0 to generate multiple files in L1. It will fail with trivial move because we end up with one file in L1
      ```
      
      ```
      DBTest.NoSpaceCompactRange
      This test expects compaction to fail when we force the environment to report running out of space. Of course this is not valid in a trivial move situation
      because trivial move does not need any extra space, and does not check for that
      ```
      
      ```
      DBTest.DropWrites
      Similar to DBTest.NoSpaceCompactRange
      ```
      
      ```
      DBTest.DeleteObsoleteFilesPendingOutputs
      This test expects that a file in L2 is deleted after it's moved to L3. This is not valid with trivial move because, although the file was moved, it is now used by L3
      ```
      
      ```
      CuckooTableDBTest.CompactionIntoMultipleFiles
      Same as DBTest.CompactionsGenerateMultipleFiles
      ```
      
      This diff is based on work by @sdong https://reviews.facebook.net/D34149
      
      Test Plan: make -j64 check
      
      Reviewers: rven, sdong, igor
      
      Reviewed By: igor
      
      Subscribers: yhchiang, ott, march, dhruba, sdong
      
      Differential Revision: https://reviews.facebook.net/D34797
      3ce3bb3d