1. 19 3月, 2014 1 次提交
  2. 18 3月, 2014 2 次提交
    • I
      Optimize fallocation · f26cb0f0
      Igor Canadi 提交于
      Summary:
      Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads.
      
      This diff provides an option for user to not use KEEP_SIZE flag, thus optimizing his sync performance by up to 2x-3x.
      
      At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported.
      
      Test Plan: make check
      
      Reviewers: dhruba, sdong, haobo, ljin
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16761
      f26cb0f0
    • I
      Fix race condition in manifest roll · ae25742a
      Igor Canadi 提交于
      Summary:
      When the manifest is getting rolled the following happens:
      1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current)
      2) mutex is unlocked
      3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp
      4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT
      5) mutex is locked
      
      If FindObsoleteFiles happens between (3) and (4) it will:
      1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_)
      2) Delete old manifest (because the manifest_file_number_ already points to a new one)
      
      I introduce the concept of prev_manifest_file_number_ that will avoid the race condition.
      
      However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more are there. Nobody is using it in production because we don't trust the implementation. Should we even support it?
      
      Test Plan: make check
      
      Reviewers: ljin, dhruba, haobo, sdong
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16929
      ae25742a
  3. 14 3月, 2014 1 次提交
    • S
      Fix extra compaction tasks scheduled after D16767 in some cases · 5aa81f04
      sdong 提交于
      Summary:
      With D16767, there is a case compaction tasks are scheduled infinitely:
      (1) no flush thread is configured and more than 1 compaction threads
      (2) a flush is going on by one compaction hread
      (3) the state of SST files is in the state that versions_->current()->NeedsCompaction() will generate a false positive (return true actually there is no work to be done)
      In that case, a infinite loop will be formed.
      
      This patch would fix it.
      
      Test Plan: make all check
      
      Reviewers: haobo, igor, ljin
      
      Reviewed By: igor
      
      CC: dhruba, yhchiang, leveldb
      
      Differential Revision: https://reviews.facebook.net/D16863
      5aa81f04
  4. 13 3月, 2014 2 次提交
  5. 12 3月, 2014 3 次提交
    • S
      Fix data race against logging data structure because of LogBuffer · bd45633b
      sdong 提交于
      Summary:
      @igor pointed out that there is a potential data race because of the way we use the newly introduced LogBuffer. After "bg_compaction_scheduled_--" or "bg_flush_scheduled_--", they can both become 0. As soon as the lock is released after that, DBImpl's deconstructor can go ahead and deconstruct all the states inside DB, including the info_log object hold in a shared pointer of the options object it keeps. At that point it is not safe anymore to continue using the info logger to write the delayed logs.
      
      With the patch, lock is released temporarily for log buffer to be flushed before "bg_compaction_scheduled_--" or "bg_flush_scheduled_--". In order to make sure we don't miss any pending flush or compaction, a new flag bg_schedule_needed_ is added, which is set to be true if there is a pending flush or compaction but not scheduled because of the max thread limit. If the flag is set to be true, the scheduling function will be called before compaction or flush thread finishes.
      
      Thanks @igor for this finding!
      
      Test Plan: make all check
      
      Reviewers: haobo, igor
      
      Reviewed By: haobo
      
      CC: dhruba, ljin, yhchiang, igor, leveldb
      
      Differential Revision: https://reviews.facebook.net/D16767
      bd45633b
    • S
      Temp Fix of LogBuffer flushing · 6c66bc08
      sdong 提交于
      Summary: To temp fix the log buffer flushing. Flush the buffer inside the lock. Clean the trunk before we find an eventual fix.
      
      Test Plan: make all check
      
      Reviewers: haobo, igor
      
      Reviewed By: igor
      
      CC: ljin, leveldb, yhchiang
      
      Differential Revision: https://reviews.facebook.net/D16791
      6c66bc08
    • I
      Add a comment after SignalAll() · cb980216
      Igor Canadi 提交于
      Summary: Having code after SignalAll has already caused 2 bugs. Let's make sure this doesn't happen again.
      
      Test Plan: no test
      
      Reviewers: sdong, dhruba, haobo
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16785
      cb980216
  6. 11 3月, 2014 5 次提交
  7. 08 3月, 2014 1 次提交
  8. 07 3月, 2014 1 次提交
  9. 06 3月, 2014 1 次提交
    • S
      Buffer info logs when picking compactions and write them out after releasing the mutex · ecb1ffa2
      sdong 提交于
      Summary: Now while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction, which introduces a chance of waiting log writing in mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.
      
      Test Plan:
      make all check
      check the log lines while running some tests that trigger compactions.
      
      Reviewers: haobo, igor, dhruba
      
      Reviewed By: dhruba
      
      CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-
      
      Differential Revision: https://reviews.facebook.net/D16515
      ecb1ffa2
  10. 05 3月, 2014 1 次提交
  11. 01 3月, 2014 2 次提交
    • I
      Make Log::Reader more robust · 58ca641d
      Igor Canadi 提交于
      Summary:
      This diff does two things:
      (1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#
      (2) Turn off mmap writes for all writes to log and manifest files
      
      (2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing.
      
      Test Plan:
      Added unit tests from LevelDB
      Actually recovered a "corrupted" MANIFEST file.
      
      Reviewers: dhruba, haobo
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16119
      58ca641d
    • Y
      Add ReadOptions to TransactionLogIterator. · a77527f2
      Yueh-Hsuan Chiang 提交于
      Summary:
      Add an optional input parameter ReadOptions to DB::GetUpdateSince(),
      which allows the verification of checksums to be disabled by setting
      ReadOptions::verify_checksums to false.
      
      Test Plan: Tests are done off-line and will not be included in the regular unit test.
      
      Reviewers: igor
      
      Reviewed By: igor
      
      CC: leveldb, xjin, dhruba
      
      Differential Revision: https://reviews.facebook.net/D16305
      a77527f2
  12. 28 2月, 2014 1 次提交
  13. 26 2月, 2014 1 次提交
    • I
      Schedule flush when waiting on flush · 42095163
      Igor Canadi 提交于
      Summary:
      This will also help with avoiding the deadlock. If a flush failed and we're waiting for a memtable to be flushed, we should schedule a new flush and hope a new one succeedes.
      
      If paranoid_checks = false, Wait() will still hang on ENOSPC, but at least it will automatically continue when the space frees up. Current behavior both hangs and deadlocks.
      
      Also, I renamed some 'compaction' to 'flush'. 'compaction' was leveldb way of saying things.
      
      Test Plan: make check
      
      Reviewers: dhruba, haobo, ljin
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16281
      42095163
  14. 25 2月, 2014 1 次提交
  15. 20 2月, 2014 1 次提交
  16. 14 2月, 2014 1 次提交
  17. 13 2月, 2014 4 次提交
  18. 11 2月, 2014 1 次提交
  19. 04 2月, 2014 2 次提交
  20. 03 2月, 2014 1 次提交
  21. 01 2月, 2014 2 次提交
    • I
      Mark the log_number file number used · dbbffbd7
      Igor Canadi 提交于
      Summary:
      VersionSet::next_file_number_ is always assumed to be strictly greater than VersionSet::log_number_. In our new recovery code, we artificially set log_number_  to be (log_number + 1), so that once we flush, we don't recover from the same log file again (this is important because of merge operator non-idempotence)
      
      When we set VersionSet::log_number_ to (log_number + 1), we also have to mark that file number used, such that next_file_number_ is increased to a legal level. Otherwise, VersionSet might assert.
      
      This has not be a problem so far because here's what happens:
      1. assume next_file_number is 5, we're recovering log_number 10
      2. in DBImpl::Recover() we call MarkFileNumberUsed with 10. This will set VersionSet::next_file_number_ to 11.
      3. If there are some updates, we will call WriteTable0ForRecovery(), which will use file number 11 as a new table file and advance VersionSet::next_file_number_ to 12.
      4. When we LogAndApply() with log_number 11, assertion is true: assert(11 <= 12);
      
      However, this was a lucky occurrence. Even though this diff doesn't cause a bug, I think the issue is important to fix.
      
      Test Plan: In column families I have different recovery logic and this code path asserted. When adding MarkFileNumberUsed(log_number + 1) assert is gone.
      
      Reviewers: dhruba, kailiu
      
      Reviewed By: kailiu
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15783
      dbbffbd7
    • S
      When using Universal Compaction, Zero out seqID in the last file too · 56bea9f8
      Siying Dong 提交于
      Summary: I didn't figure out the reason why the feature of zeroing out earlier sequence ID is disabled in universal compaction. I do see bottommost_level is set correctly. It should simply work if we remove the constraint of universal compaction.
      
      Test Plan: make all check
      
      Reviewers: haobo, dhruba, kailiu, igor
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15423
      56bea9f8
  22. 30 1月, 2014 2 次提交
    • I
      InternalStatistics · 3c0dcf0e
      Igor Canadi 提交于
      Summary:
      In DBImpl we keep track of some statistics internally and expose them via GetProperty(). This diff encapsulates all the internal statistics into a class InternalStatisics. Most of it is copy/paste.
      
      Apart from cleaning up db_impl.cc, this diff is also necessary for Column families, since every column family should have its own CompactionStats, MakeRoomForWrite-stall stats, etc. It's much easier to keep track of it in every column family if it's nicely encapsulated in its own class.
      
      Test Plan: make check
      
      Reviewers: dhruba, kailiu, haobo, sdong, emayanke
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15273
      3c0dcf0e
    • L
      set bg_error_ when background flush goes wrong · d118707f
      Lei Jin 提交于
      Summary: as title
      
      Test Plan: unit test
      
      Reviewers: haobo, igor, sdong, kailiu, dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15435
      d118707f
  23. 28 1月, 2014 3 次提交
    • M
      Update monitoring to include average time per compaction and stall · 90f29ccb
      Mark Callaghan 提交于
      Summary:
      The new columns are msComp and msStall that provide average time per compaction and stall for that level in milliseconds.
      Level  Files Size(MB) Score Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count   msComp   msStall  Ln-stall Stall-cnt
      ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        0        8       15   1.5         2         0        30         0         0        30        0.0       0.0        15.5        0        0        0        0       16      112       0.2       1.3      7568
        1        8       16   1.6         1        26        26        15        11        16        3.5      17.6        18.1        8        6       13        7        3      362       0.0       0.0         0
        2        1        2   0.0         0         0         2         0         0         2        0.0       0.0        18.4        0        0        0        0        1       50       0.0       0.0         0
      
      Task ID: #
      
      Blame Rev:
      
      Test Plan:
      run db_bench
      
      Revert Plan:
      
      Database Impact:
      
      Memcache Impact:
      
      Other Notes:
      
      EImportant:
      
      - begin *PUBLIC* platform impact section -
      Bugzilla: #
      - end platform impact -
      
      Reviewers: haobo
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15345
      90f29ccb
    • I
      Fsync directory after we create a new file · 832158e7
      Igor Canadi 提交于
      Summary:
      @dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.
      
      Should I also add FsyncDir() to new files that get created by a compaction?
      
      Test Plan: Confirmed that FsyncDir is returning Status::OK()
      
      Reviewers: dhruba, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D14751
      832158e7
    • I
      Move NeedsCompaction() from VersionSet to Version · 6c2ca1d3
      Igor Canadi 提交于
      Summary: There is no reason to have functions NeedCompaction(), MaxCompactionScore() and MaxCompactionScoreLevel() in VersionSet, since they don't access any data in VersionSet.
      
      Test Plan: make check
      
      Reviewers: kailiu, haobo, sdong
      
      Reviewed By: kailiu
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15333
      6c2ca1d3