  1. 19 Mar, 2014 2 commits
    • Don't Finalize in CompactionPicker · 758fa8c3
      Igor Canadi committed
      Summary:
      Finalize re-sorts (read: mutates) the files_ in Version* and it is called by CompactionPicker during normal runtime. At the same time, this same Version* lives in the SuperVersion* and is accessed without the mutex in GetImpl() code path.
      
      Mutating the files_ in one thread and reading the same files_ in another thread is a bad idea. It caused this issue: http://ci-builds.fb.com/job/rocksdb_crashtest/285/console
      
      Long-term, we need to be more careful with method contracts and clearly document what state can be mutated when. Now that we are much faster because we don't lock in GetImpl(), we keep running into data races that were not a problem before when we were slower. db_stress has been very helpful in detecting those.
      
      Short-term, I removed Finalize() from CompactionPicker.
      
      Note: I believe this is an issue in the current 2.7 version running in production.
      
      Test Plan:
      make check
      Will also run db_stress to see if issue is gone
      
      Reviewers: sdong, ljin, dhruba, haobo
      
      Reviewed By: sdong
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16983
    • disable the log_number check in Recover() · 63cef900
      Lei Jin committed
      Summary:
      There is a chance that an old MANIFEST was corrupted in 2.7 without anyone noticing.
      This check would fail such databases. Change it to log the problem instead of returning a
      Corruption status.
      
      Test Plan: make
      
      Reviewers: haobo, igor
      
      Reviewed By: igor
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16923
  2. 18 Mar, 2014 4 commits
    • Optimize fallocation · f26cb0f0
      Igor Canadi committed
      Summary:
      Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads.
      
      This diff provides an option for the user to not use the KEEP_SIZE flag, thus improving sync performance by up to 2x-3x.
      
      At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported.
      
      Test Plan: make check
      
      Reviewers: dhruba, sdong, haobo, ljin
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16761
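
The fallback order described in this commit (try fallocate first, use posix_fallocate only when fallocate is not supported) might look roughly like the sketch below. This is a Linux/glibc-only illustration, not the code from the diff: `Preallocate`, `SizeAfterPreallocate`, and the `keep_size` parameter are hypothetical names chosen here.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>         // fallocate, posix_fallocate (Linux/glibc)
#include <linux/falloc.h>  // FALLOC_FL_KEEP_SIZE
#include <stdio.h>         // tmpfile, fileno, fclose
#include <sys/stat.h>      // fstat

// Hypothetical helper (an assumption, not RocksDB's API).
// Returns 0 on success or an errno-style error code on failure.
int Preallocate(int fd, off_t offset, off_t len, bool keep_size) {
  assert(len > 0);
  // With KEEP_SIZE the reported file size is left unchanged; without it the
  // file size grows to cover the preallocated range.
  const int mode = keep_size ? FALLOC_FL_KEEP_SIZE : 0;
  if (fallocate(fd, mode, offset, len) == 0) return 0;
  if (errno != EOPNOTSUPP && errno != ENOSYS) return errno;
  // fallocate is not supported here. posix_fallocate may emulate the
  // allocation by writing to the file, and it always extends the file size.
  // It returns the error number directly instead of setting errno.
  return posix_fallocate(fd, offset, len);
}

// Usage sketch: preallocate 4 KiB on an anonymous temp file without
// KEEP_SIZE and report the resulting file size (-1 on any failure).
long long SizeAfterPreallocate() {
  FILE* f = tmpfile();
  if (f == nullptr) return -1;
  struct stat st;
  long long size = -1;
  if (Preallocate(fileno(f), 0, 4096, /*keep_size=*/false) == 0 &&
      fstat(fileno(f), &st) == 0) {
    size = static_cast<long long>(st.st_size);
  }
  fclose(f);
  return size;
}
```

Without KEEP_SIZE the file size already covers the preallocated range, so later appends inside that range do not have to change the file size. That is one plausible reading of why fdatasync() gets cheaper in append-only workloads, but the commit message itself does not spell out the mechanism.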
    • Fix race condition in manifest roll · ae25742a
      Igor Canadi committed
      Summary:
      When the manifest is getting rolled the following happens:
      1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current)
      2) mutex is unlocked
      3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp
      4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT
      5) mutex is locked
      
      If FindObsoleteFiles happens between (3) and (4) it will:
      1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_)
      2) Delete old manifest (because the manifest_file_number_ already points to a new one)
      
      I introduce the concept of prev_manifest_file_number_ that will avoid the race condition.
      
      However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more there are. Nobody is using it in production because we don't trust the implementation. Should we even support it?
      
      Test Plan: make check
      
      Reviewers: ljin, dhruba, haobo, sdong
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16929
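
One way to picture the race and the fix is as a "which MANIFEST files must survive deletion" predicate. The sketch below is a guess at the idea only: the two field names come from the commit message, while `ManifestState`, `KeepManifestOld`, and `KeepManifest` are hypothetical and do not reflect the actual FindObsoleteFiles code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified state. Only the two field names are taken from
// the commit message; everything else is an assumption for illustration.
struct ManifestState {
  uint64_t manifest_file_number_ = 0;       // may already name the new manifest
  uint64_t prev_manifest_file_number_ = 0;  // manifest CURRENT may still point to
};

// Behavior before the fix, as described above: only the (possibly not yet
// installed) new manifest is protected, so the old one can be deleted while
// CURRENT still references it.
bool KeepManifestOld(const ManifestState& s, uint64_t number) {
  return number == s.manifest_file_number_;
}

// With the extra field, the old manifest stays protected during the roll.
bool KeepManifest(const ManifestState& s, uint64_t number) {
  // Assumption: file numbers only grow, so the previous manifest is older.
  assert(s.prev_manifest_file_number_ <= s.manifest_file_number_);
  return number == s.manifest_file_number_ ||
         number == s.prev_manifest_file_number_;
}
```

In the window between steps (3) and (4), a state such as `{12, 11}` would keep both MANIFEST-11 and MANIFEST-12, while an unrelated older manifest remains eligible for deletion.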
    • Correct the logic of MemTable::ShouldFlushNow(). · a5fafd4f
      Yueh-Hsuan Chiang committed
      Summary:
      Memtable will now be forced to flush if one of the following
      conditions is met:
      1. Already allocated more than write_buffer_size + 60% arena block size.
         (the overflowing condition)
      2. Unable to safely allocate one more arena block without hitting the
         overflowing condition AND the unused allocated memory < 25% arena
         block size.
      
      Test Plan: make all check
      
      Reviewers: sdong, haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16893
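
The two conditions above can be written as a small pure function. This is a minimal sketch under stated assumptions, not the MemTable implementation: the free-function signature and parameter names are hypothetical, and integer arithmetic stands in for the 60% and 25% ratios.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the flush decision described in the commit message.
// All parameter names are assumptions made for this example.
bool ShouldFlushNow(std::size_t allocated, std::size_t unused,
                    std::size_t write_buffer_size,
                    std::size_t arena_block_size) {
  assert(unused <= allocated);
  // Overflow limit: write_buffer_size + 60% of one arena block.
  const std::size_t limit = write_buffer_size + arena_block_size * 3 / 5;
  // Condition 1: already past the overflow limit.
  if (allocated > limit) return true;
  // One more block still fits under the limit, so keep filling.
  if (allocated + arena_block_size <= limit) return false;
  // Condition 2: another block would overflow AND the already allocated
  // but unused memory is below 25% of one arena block.
  return unused < arena_block_size / 4;
}
```

For example, with a 1000-byte write buffer and 100-byte arena blocks the limit is 1060 bytes: 1100 allocated bytes force a flush, 900 do not, and 1000 allocated bytes force a flush only once fewer than 25 of them remain unused.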
    • Fix a bug that Prev() can hang. · c61c9830
      sdong committed
      Summary: Prev() can currently hang when a key appears internally more than max_skipped times and all of those entries are newer than the sequence ID being sought. Add unit tests to confirm the bug and fix it.
      
      Test Plan: make all check
      
      Reviewers: igor, haobo
      
      Reviewed By: igor
      
      CC: ljin, yhchiang, leveldb
      
      Differential Revision: https://reviews.facebook.net/D16899
  3. 16 Mar, 2014 1 commit
  4. 15 Mar, 2014 5 commits
    • journal log_number correctly in MANIFEST · 453ec52c
      Lei Jin committed
      Summary:
      Here is how the problem can arise:
      There is one memtable flush and one compaction, and both call LogAndApply(). If both edits are applied in the same batch, with the flush edit first and the compaction edit second, LogAndApplyHelper() assigns the compaction edit the current VersionSet's log number (which should be smaller than the log number from the flush edit). This makes the log_numbers in MANIFEST non-monotonic, which violates the assumption Recover() makes. What is more, after committing to the MANIFEST file, log_number_ in VersionSet is updated to the log_number from the last edit, which is the compaction one, so the log_number ends up not being updated.
      
      Test Plan:
      make whitebox_crash_test
      got another assertion about iter->valid(), not sure if that is related
      to this.
      
      Reviewers: igor, haobo
      
      Reviewed By: igor
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16875
    • Make it compile on Debian/GCC 4.7 · b9c78d2d
      Caio SBA committed
    • Missing includes · 2bad3cb0
      Igor Canadi committed
    • Fix HashSkipList and HashLinkedList SIGSEGV · 3c75cc15
      Igor Canadi committed
      Summary:
      Original Summary:
      Yesterday, @ljin and I were debugging various db_stress issues. We suspected one of them happens when we concurrently call NewIterator without prefix_seek on HashSkipList. This test demonstrates it.
      
      Update:
      Arena is not thread-safe!! When creating a new full iterator, we *have* to create a new arena, otherwise we're doomed.
      
      Test Plan: SIGSEGV and assertion-throwing test now works!
      
      Reviewers: ljin, haobo, sdong
      
      Reviewed By: sdong
      
      CC: leveldb, ljin
      
      Differential Revision: https://reviews.facebook.net/D16857
    • Fix warning on Mac OS · 6c72079d
      Igor Canadi committed
  5. 14 Mar, 2014 1 commit
    • Fix extra compaction tasks scheduled after D16767 in some cases · 5aa81f04
      sdong committed
      Summary:
      With D16767, there is a case in which compaction tasks are scheduled infinitely:
      (1) no flush thread is configured and there is more than one compaction thread
      (2) a flush is being run by one compaction thread
      (3) the SST files are in a state where versions_->current()->NeedsCompaction() generates a false positive (returns true when there is actually no work to be done)
      In that case, an infinite loop is formed.
      
      This patch would fix it.
      
      Test Plan: make all check
      
      Reviewers: haobo, igor, ljin
      
      Reviewed By: igor
      
      CC: dhruba, yhchiang, leveldb
      
      Differential Revision: https://reviews.facebook.net/D16863
  6. 13 Mar, 2014 3 commits
  7. 12 Mar, 2014 4 commits
    • Fix bad merge of D16791 and D16767 · 839c8ecf
      sdong committed
      Summary: A bad auto-merge caused the log buffer to be flushed twice. Remove the unintended flush.
      
      Test Plan: Should already be tested (the code looks the same as when I ran unit tests).
      
      Reviewers: haobo, igor
      
      Reviewed By: haobo
      
      CC: ljin, yhchiang, leveldb
      
      Differential Revision: https://reviews.facebook.net/D16821
    • Fix data race against logging data structure because of LogBuffer · bd45633b
      sdong committed
      Summary:
      @igor pointed out that there is a potential data race because of the way we use the newly introduced LogBuffer. After "bg_compaction_scheduled_--" or "bg_flush_scheduled_--", they can both become 0. As soon as the lock is released after that, DBImpl's destructor can go ahead and destroy all the state inside DB, including the info_log object held in a shared pointer by the options object it keeps. At that point it is no longer safe to continue using the info logger to write the delayed logs.

      With the patch, the lock is released temporarily so that the log buffer can be flushed before "bg_compaction_scheduled_--" or "bg_flush_scheduled_--". To make sure we don't miss any pending flush or compaction, a new flag bg_schedule_needed_ is added, which is set to true if there is a pending flush or compaction that was not scheduled because of the max thread limit. If the flag is set, the scheduling function will be called before the compaction or flush thread finishes.
      
      Thanks @igor for this finding!
      
      Test Plan: make all check
      
      Reviewers: haobo, igor
      
      Reviewed By: haobo
      
      CC: dhruba, ljin, yhchiang, igor, leveldb
      
      Differential Revision: https://reviews.facebook.net/D16767
    • Temp Fix of LogBuffer flushing · 6c66bc08
      sdong committed
      Summary: Temporary fix for the log buffer flushing: flush the buffer inside the lock. This cleans up trunk until we find an eventual fix.
      
      Test Plan: make all check
      
      Reviewers: haobo, igor
      
      Reviewed By: igor
      
      CC: ljin, leveldb, yhchiang
      
      Differential Revision: https://reviews.facebook.net/D16791
    • Add a comment after SignalAll() · cb980216
      Igor Canadi committed
      Summary: Having code after SignalAll has already caused 2 bugs. Let's make sure this doesn't happen again.
      
      Test Plan: no test
      
      Reviewers: sdong, dhruba, haobo
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16785
  8. 11 Mar, 2014 7 commits
  9. 08 Mar, 2014 1 commit
  10. 07 Mar, 2014 1 commit
  11. 06 Mar, 2014 5 commits
    • Buffer info logs when picking compactions and write them out after releasing the mutex · ecb1ffa2
      sdong committed
      Summary: Currently, while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction. This introduces a chance of waiting on log writes while holding the mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.
      
      Test Plan:
      make all check
      check the log lines while running some tests that trigger compactions.
      
      Reviewers: haobo, igor, dhruba
      
      Reviewed By: dhruba
      
      CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-
      
      Differential Revision: https://reviews.facebook.net/D16515
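
The buffering pattern this commit describes could be sketched as follows. The `LogBuffer` class here is a hypothetical stand-in written for this example (the real class and its methods may differ), and a `std::vector<std::string>` plays the role of the info log.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a log buffer: collects lines in memory and
// writes them out later. Not the actual RocksDB class.
class LogBuffer {
 public:
  void Add(std::string line) { lines_.push_back(std::move(line)); }

  // Moves all buffered lines into the sink, which stands in for the
  // (potentially slow) info log, then empties the buffer.
  void FlushTo(std::vector<std::string>* sink) {
    assert(sink != nullptr);
    for (std::string& line : lines_) sink->push_back(std::move(line));
    lines_.clear();
  }

 private:
  std::vector<std::string> lines_;
};

// Picks a "compaction" while holding the mutex, but only buffers the log
// lines there. The write to the info log happens after the mutex is released.
void PickCompactionAndLog(std::mutex* db_mutex,
                          std::vector<std::string>* info_log) {
  LogBuffer buffer;
  {
    std::lock_guard<std::mutex> lock(*db_mutex);
    buffer.Add("picking compaction");
    buffer.Add("picked 2 files");  // cheap in-memory append, no I/O
  }
  buffer.FlushTo(info_log);  // slow writes happen outside the critical section
}
```

The design choice is to trade a little memory for a shorter critical section: threads waiting on the mutex never wait for log I/O.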
    • output perf_context in db_bench readrandom · 04298f8c
      Lei Jin committed
      Summary:
      Add helper function to print perf context data in db_bench if enabled.
      I didn't find any code that actually exports perf context data. Not sure
      if I missed anything
      
      Test Plan: ran db_bench
      
      Reviewers: haobo, sdong, igor
      
      Reviewed By: igor
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16575
    • fix db_bench to use HashSkipList for real · 64138b5d
      Lei Jin committed
      Summary:
      For the HashSkipList case, DBImpl has a sanity check to see if prefix_extractor in
      options is the same as the one in the memtable factory. If not, it falls
      back to SkipList. As a result, I was actually experimenting with SkipList
      performance. No wonder it is much worse than LinkedList.
      
      Test Plan: ran benchmark
      
      Reviewers: haobo, sdong, igor
      
      Reviewed By: igor
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16569
    • config max_background_flushes in db_bench · 51560ba7
      Lei Jin committed
      Summary: as title
      
      Test Plan: make release
      
      Reviewers: haobo, sdong, igor
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16437
    • MergingIterator assertion · c0ccf436
      Igor Canadi committed
      Summary: I wrote a test that triggers an assertion in MergingIterator. I have never touched that code, so I'm looking for somebody with a good understanding of the MergingIterator code to fix this. The solution is probably a one-liner. Let me know if you're willing to take a look.
      
      Test Plan: This test fails with an assertion `use_heap_ == false`
      
      Reviewers: dhruba, haobo, sdong, kailiu
      
      Reviewed By: sdong
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16521
  12. 05 Mar, 2014 2 commits
  13. 04 Mar, 2014 2 commits
    • Add a hash-index component for block · 906f3dca
      kailiu committed
      Summary:
      This is the key component extracted from diff: https://reviews.facebook.net/D14271
      I separated it into a dedicated patch to make the review easier.
      
      Test Plan: added a unit test and passed it.
      
      Reviewers: haobo, sdong, dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16245
    • Fix a group commit bug in LogAndApply · 5142b370
      Igor Canadi committed
      Summary:
      EncodeTo(&record) does not overwrite, it appends to it.
      
      This means that group commit log and apply will look something like:
      record1
      record1record2
      record1record2record3
      
      I'm surprised this didn't show up in production, but I think the reason is that MANIFEST group commit almost never happens.
      
      This bug turned up in column family work, where opening a database failed with "adding a same column family twice".
      
      Test Plan: Tested the change in column family branch and observed that the problem is gone (with db_stress)
      
      Reviewers: dhruba, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16461
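
The append-versus-overwrite behavior behind this bug is easy to reproduce in isolation. In the sketch below, `Edit` and `EncodeBatch` are hypothetical stand-ins invented for illustration. Only the fact that `EncodeTo` appends to its output string comes from the commit message.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an edit whose EncodeTo() appends to *dst
// rather than overwriting it.
struct Edit {
  std::string payload;
  void EncodeTo(std::string* dst) const {
    assert(dst != nullptr);
    dst->append(payload);
  }
};

// Encodes a batch of edits and returns what would be written per record.
// With clear_between == false the buffer is reused as-is, reproducing the
// bug. With clear_between == true each record contains only its own edit.
std::vector<std::string> EncodeBatch(const std::vector<Edit>& edits,
                                     bool clear_between) {
  std::vector<std::string> written;
  std::string record;
  for (const Edit& e : edits) {
    if (clear_between) record.clear();
    e.EncodeTo(&record);
    written.push_back(record);
  }
  return written;
}
```

With three edits and the buffer reused, the second and third records carry copies of the earlier edits, which matches the "record1record2record3" pattern above. Clearing the buffer (or using a fresh string per edit) yields one edit per record.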
  14. 01 Mar, 2014 2 commits
    • Remove the terrible hack for flush_block_policy_factory · bf86af51
      kailiu committed
      Summary:
      The previous code was too convoluted, and I must have been drunk to let
      such code be written without a second thought.

      Thanks to the discussion with @sdong, I added the `Options` when
      generating the flusher, thus avoiding the tricks.

      Just FYI: I resisted adding Options to flush_block_policy.h because I
      wanted to avoid cyclic dependencies: FlushBlockPolicy depends on Options
      and Options also depends on FlushBlockPolicy... While I appreciate my
      effort to prevent that, the old design turned out to create more trouble than
      it avoided.
      
      Test Plan: ran ./table_test
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      CC: sdong, leveldb
      
      Differential Revision: https://reviews.facebook.net/D16503
    • Make Log::Reader more robust · 58ca641d
      Igor Canadi committed
      Summary:
      This diff does two things:
      (1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#
      (2) Turn off mmap writes for all writes to log and manifest files
      
      (2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing.
      
      Test Plan:
      Added unit tests from LevelDB
      Actually recovered a "corrupted" MANIFEST file.
      
      Reviewers: dhruba, haobo
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16119
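
Point (1) can be illustrated with a toy record format. The sketch below is not the real log format or the Log::Reader API: the layout (a 1-byte length, a 1-byte additive checksum, then the payload) and the names `Encode`, `ReadLog`, and `ReadOutcome` are assumptions made up for this example. It shows a reader that treats a cut-off final record as end of file while still reporting a checksum mismatch as corruption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy checksum: sum of payload bytes modulo 256 (assumption for this sketch).
unsigned char Checksum(const std::string& payload) {
  unsigned char sum = 0;
  for (char c : payload) sum = static_cast<unsigned char>(sum + static_cast<unsigned char>(c));
  return sum;
}

// Toy record layout: [length][checksum][payload].
std::string Encode(const std::string& payload) {
  assert(payload.size() <= 255);
  std::string rec;
  rec.push_back(static_cast<char>(payload.size()));
  rec.push_back(static_cast<char>(Checksum(payload)));
  rec += payload;
  return rec;
}

struct ReadOutcome {
  std::vector<std::string> records;
  bool truncated_tail = false;  // last record was cut short: treated as EOF
  bool corruption = false;      // complete record with a bad checksum
};

ReadOutcome ReadLog(const std::string& data) {
  ReadOutcome out;
  std::size_t pos = 0;
  while (pos < data.size()) {
    const std::size_t remaining = data.size() - pos;
    if (remaining < 2) { out.truncated_tail = true; break; }  // partial header
    const std::size_t len = static_cast<unsigned char>(data[pos]);
    const unsigned char want = static_cast<unsigned char>(data[pos + 1]);
    if (remaining < 2 + len) { out.truncated_tail = true; break; }  // partial payload
    const std::string payload = data.substr(pos + 2, len);
    if (Checksum(payload) != want) { out.corruption = true; break; }
    out.records.push_back(payload);
    pos += 2 + len;
  }
  return out;
}
```

A file whose writer died mid-record yields the complete records plus `truncated_tail`, and the caller can simply stop there. A flipped byte inside a complete record is still reported as corruption.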