1. 24 Jan, 2014 2 commits
    • CompactRange() to return status · aba2acb5
      Lei Jin committed
      Summary: as title
      
      Test Plan:
      make all check
      What other tests should I cover?
      
      Reviewers: igor, haobo
      
      CC:
      
      Differential Revision: https://reviews.facebook.net/D15339
    • Tailing iterator · 81c9cc9b
      Tomislav Novak committed
      Summary:
      This diff implements a special type of iterator that doesn't create a snapshot
      (can be used to read newly inserted data) and is optimized for doing sequential
      reads.
      
      TailingIterator uses current superversion number to determine whether to
      invalidate its internal iterators. If the version hasn't changed, it can often
      avoid doing expensive seeks over immutable structures (sst files and immutable
      memtables).
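
As a rough illustration of the invalidation check described above (not the code from this diff), the sketch below rebuilds its expensive internal iterators only when a version counter has moved. `VersionSource`, `TailingIteratorSketch` and the `rebuild_immutable` callback are hypothetical names introduced for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical stand-in for the DB's current superversion number.
struct VersionSource {
  uint64_t superversion_number = 0;
};

// Sketch of the invalidation check only: iterators over immutable
// structures are rebuilt only when the superversion number has changed.
class TailingIteratorSketch {
 public:
  TailingIteratorSketch(const VersionSource* source,
                        std::function<void()> rebuild_immutable)
      : source_(source), rebuild_immutable_(std::move(rebuild_immutable)) {
    assert(source_ != nullptr);
  }

  // Returns true if the immutable iterators had to be rebuilt.
  bool Refresh() {
    if (initialized_ && seen_version_ == source_->superversion_number) {
      return false;  // nothing changed; reuse the existing iterators
    }
    rebuild_immutable_();
    seen_version_ = source_->superversion_number;
    initialized_ = true;
    return true;
  }

 private:
  const VersionSource* source_;
  std::function<void()> rebuild_immutable_;
  uint64_t seen_version_ = 0;
  bool initialized_ = false;
};
```

Calling `Refresh()` twice in a row triggers one rebuild; a second rebuild happens only after `superversion_number` is bumped.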
      
      Test Plan:
      * new unit tests
      * running LD with this patch
      
      Reviewers: igor, dhruba, haobo, sdong, kailiu
      
      Reviewed By: sdong
      
      CC: leveldb, lovro, march
      
      Differential Revision: https://reviews.facebook.net/D15285
  2. 23 Jan, 2014 2 commits
    • Unfriending classes · fb01755a
      Igor Canadi committed
      Summary:
      In this diff I made some effort to reduce the use of friend declarations. To do that, I had to expose Compaction::inputs_ through a method inputs(). I'm not sure this is a good idea, since there is a trade-off, but I think it's less confusing than having lots of friends.

      I also thought about other friendship relationships, but they are too tangled at this point. Once you friend two classes, it's very hard to unfriend them :)
      
      Test Plan: make check
      
      Reviewers: haobo, kailiu, sdong, dhruba
      
      Reviewed By: kailiu
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15267
    • Refactor Recover() code · 6fe9b577
      Igor Canadi committed
      Summary:
      This diff does two things:
      * Rethinks how we call Recover() with the read_only option. Before, we called it with a pointer to the memtable to which we'd like to apply the changes. This memtable is set in db_impl_readonly.cc and it's actually DBImpl::mem_. Why don't we just apply updates to mem_ right away? It seems more intuitive.
      * Changes when we apply updates to the manifest. Before, the process was to recover all the logs, flush them to sst files and then do one giant commit that atomically adds all recovered sst files and sets the next log number. This works well enough, but causes some small troubles for my column family approach, since I can't have one VersionEdit apply to more than a single column family[1]. The change here is to commit the files recovered from logs right away. Here is the state of the world before the change:
      1. Recover log 5, add new sst files to edit
      2. Recover log 7, add new sst files to edit
      3. Recover log 8, add new sst files to edit
      4. Commit all added sst files to manifest and mark log files 5, 7 and 8 as recovered (via SetLogNumber(9) function)
      After the change, we'll do:
      1. Recover log 5, commit the new sst files and set log 5 as recovered
      2. Recover log 7, commit the new sst files and set log 7 as recovered
      3. Recover log 8, commit the new sst files and set log 8 as recovered
      
      The added (small) benefit is that if we fail after (2), the new recovery will only have to recover log 8. In the previous case, we'd have to restart the recovery from the beginning. The bigger benefit will be to enable easier integration of multiple column families in the Recovery code path.
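
The per-log commit flow described above can be sketched roughly as follows. This is a simplified model, not the code from this diff: `ManifestSketch`, `RecoverPerLog`, the `fail_at` knob and the "one sst file per log" numbering are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified manifest: the committed sst files and the
// smallest log number that still needs recovery.
struct ManifestSketch {
  std::vector<uint64_t> sst_files;
  uint64_t next_log_number = 0;
  int commits = 0;

  void Commit(const std::vector<uint64_t>& new_files, uint64_t next_log) {
    sst_files.insert(sst_files.end(), new_files.begin(), new_files.end());
    next_log_number = next_log;
    ++commits;
  }
};

// After the change: every recovered log is committed on its own, so a
// failure part-way through leaves only the later logs to recover.
// `fail_at` is a hypothetical knob that simulates a crash before that log.
bool RecoverPerLog(const std::vector<uint64_t>& logs, uint64_t fail_at,
                   ManifestSketch* manifest) {
  assert(manifest != nullptr);
  for (uint64_t log : logs) {
    if (log == fail_at) return false;
    // Pretend each log flushes to a single sst file numbered 100 + log.
    manifest->Commit({100 + log}, log + 1);
  }
  return true;
}
```

With logs 5, 7 and 8 and a simulated failure at log 8, the manifest already holds two commits, so a retry only needs to start from log 8.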
      
      [1] I'm happy to discuss this decision, but I believe this is the cleanest way to go. It also makes backward compatibility much easier. We don't have a requirement of adding multiple column families atomically.
      
      Test Plan: make check
      
      Reviewers: dhruba, haobo, kailiu, sdong
      
      Reviewed By: kailiu
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15237
  3. 18 Jan, 2014 4 commits
    • Boost access before mutex is unlocked · 4e8321bf
      Mark Callaghan committed
      Summary:
      This moves the use of versions_ to before the mutex is unlocked
      to avoid a possible race.
      
      Test Plan:
      make check
      
      Reviewers: haobo, dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15279
    • Statistics code cleanup · 83681bf9
      Igor Canadi committed
      Summary: I'm separating out the code-cleanup part of https://reviews.facebook.net/D14517. This will make D14517 easier to understand and this diff easier to review.
      
      Test Plan: make check
      
      Reviewers: haobo, kailiu, sdong, dhruba, tnovak
      
      Reviewed By: tnovak
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15099
    • Fix SIGSEGV in compaction picker · 0f4a75b7
      Igor Canadi committed
      Summary:
      The SIGSEGV was introduced by https://reviews.facebook.net/D15171
      
      I also fixed ExpandWhileOverlapping(), which reported failure by setting its own stack variable to nullptr (!). This bug is present in the 2.6 release, so I guess ExpandWhileOverlapping never fails :)
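
To show why setting a stack variable to nullptr cannot signal failure, here is a small sketch of the bug pattern and one possible way around it. The `Compaction` placeholder, both function names and the `conflict` flag are hypothetical; the actual fix in this diff may look different.

```cpp
struct Compaction {};  // hypothetical placeholder type

// Buggy pattern: `c` is a copy of the caller's pointer, so setting it to
// nullptr changes only the local variable and the caller never notices.
void ExpandBuggy(Compaction* c, bool conflict) {
  if (conflict) {
    c = nullptr;  // has no effect outside this function
  }
  (void)c;
}

// One possible alternative: report failure through the return value.
bool ExpandFixed(Compaction* c, bool conflict) {
  if (c == nullptr || conflict) {
    return false;
  }
  return true;
}
```

After `ExpandBuggy(p, true)` the caller's `p` is still non-null, which is why the failure was silently lost.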
      
      Test Plan: `make check`. Also MarkCallaghan confirmed it fixed the SIGSEGV he reported.
      
      Reviewers: MarkCallaghan, kailiu, sdong, dhruba, haobo
      
      Reviewed By: kailiu
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15261
    • Fix SlowdownAmount · 439e36db
      Mark Callaghan committed
      Summary:
      This had a few bugs.
      1) bottom and top were reversed. top is for the max value but the callers were passing the max
      value to bottom. The result is that the max sleep is used when n >= bottom.
      2) one of the callers passed values with type double and these values are frequently between
      1.0 and 2.0 so rounding will do some bad things
      3) sometimes the function returned 0 when there should be a stall
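
A minimal sketch of a delay function that avoids the three problems listed above. It is an illustration only: the 1000 maximum, the 100 floor and the quadratic ramp are assumptions, and `SlowdownAmountSketch` is a hypothetical name rather than the function in the diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// `bottom` is where slowdown starts and `top` is where the maximum delay
// applies. Both are doubles so fractional inputs are not rounded away.
uint64_t SlowdownAmountSketch(int n, double bottom, double top) {
  assert(bottom <= top);
  const double kMaxDelay = 1000.0;  // assumed maximum
  const double kMinDelay = 100.0;   // assumed floor while a stall is due
  if (n >= top) return static_cast<uint64_t>(kMaxDelay);
  if (n < bottom) return 0;
  // Here bottom <= n < top: ramp up, but never return 0.
  const double how_much = (n - bottom) / (top - bottom);
  return static_cast<uint64_t>(
      std::max(how_much * how_much * kMaxDelay, kMinDelay));
}
```

With `bottom = 8` and `top = 12`, `n = 7` gives no delay, `n = 8` gives the floor, `n = 10` lands partway up the ramp and `n = 12` gives the maximum.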
      
      With this change and one other diff (out for review soon) there are slightly fewer stalls on one workload.
      
      With the fix.
      Stalls(secs): 160.166 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 58.495 leveln_slowdown
      Stalls(count): 910261 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 54526 leveln_slowdown
      
      Without the fix.
      Stalls(secs): 172.227 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 56.538 leveln_slowdown
      Stalls(count): 160831 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 52845 leveln_slowdown
      
      Test Plan:
      run db_bench for --benchmarks=overwrite with IO-bound database
      
      Reviewers: haobo
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15243
  4. 17 Jan, 2014 2 commits
    • Remove compaction pointers · 6d6fb709
      Igor Canadi committed
      Summary: The only thing we do with compaction pointers is set them to some values; we never actually read them. I don't know what we used them for, but it doesn't look like we use them anymore.
      
      Test Plan: make check
      
      Reviewers: dhruba, haobo, kailiu, sdong
      
      Reviewed By: kailiu
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15225
    • CompactionPicker · c699c84a
      Igor Canadi committed
      Summary:
      This is a big one. This diff moves all the code related to picking compactions from VersionSet to new class CompactionPicker. Column families' compactions will be completely separate processes, so we need to have multiple CompactionPickers.
      
      To make this easier to review, most of the code change is just copy/paste. There is also a small change not to use VersionSet::current_, but rather to take `Version* version` as a parameter. Most of the other code is exactly the same.
      
      In future diffs, I will also make some improvements to CompactionPickers. I think the most important part will be encapsulating it better. Currently Version, VersionSet, Compaction and CompactionPicker are all friend classes, which makes it harder to change the implementation.
      
      This diff depends on D15171, D15183, D15189 and D15201
      
      Test Plan: `make check`
      
      Reviewers: kailiu, sdong, dhruba, haobo
      
      Reviewed By: kailiu
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15207
  5. 16 Jan, 2014 5 commits
    • Remove the unnecessary use of shared_ptr · eae1804f
      kailiu committed
      Summary:
      shared_ptr is slower than unique_ptr (which comes with essentially no performance cost compared with raw pointers).
      In memtable and memtable rep, we use shared_ptr when we should actually use unique_ptr.
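
As a small illustration of the ownership model argued for above: when one object is the sole owner of another, a unique_ptr member expresses that without the reference count a shared_ptr maintains. `MemTableSketch` and `MemTableRepSketch` are hypothetical stand-ins, not the real classes.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct MemTableRepSketch {  // hypothetical placeholder
  int entries = 0;
};

// Sole ownership: the memtable owns its rep, so no reference count is
// needed and the rep is destroyed together with the memtable.
class MemTableSketch {
 public:
  explicit MemTableSketch(std::unique_ptr<MemTableRepSketch> rep)
      : rep_(std::move(rep)) {
    assert(rep_ != nullptr);
  }

  // Non-owning access for callers that only need to use the rep.
  MemTableRepSketch* rep() const { return rep_.get(); }

 private:
  std::unique_ptr<MemTableRepSketch> rep_;
};
```

Callers that only read or update the rep take the raw pointer from `rep()`; ownership never leaves the memtable.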
      
      According to igor's previous work, we are likely to see a noticeable performance gain from this diff.
      
      Test Plan: make check
      
      Reviewers: dhruba, igor, sdong, haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15213
    • Move more functions from VersionSet to Version · 787f11bb
      Igor Canadi committed
      Summary:
      This moves functions:
      * VersionSet::Finalize() -> Version::UpdateCompactionStats()
      * VersionSet::UpdateFilesBySize() -> Version::UpdateFilesBySize()
      
      The diff depends on D15189, D15183 and D15171
      
      Test Plan: make check
      
      Reviewers: kailiu, sdong, haobo, dhruba
      
      Reviewed By: sdong
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15201
    • Moving Compaction class to separate header file · 615d1ea2
      Igor Canadi committed
      Summary:
      I'm sure we'll all agree that version_set.cc needs simplifying. This diff moves Compaction class to a separate file.
      
      The diff depends on D15171 and D15183
      
      Test Plan: make check
      
      Reviewers: dhruba, haobo, kailiu, sdong
      
      Reviewed By: kailiu
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15189
    • Move functions from VersionSet to Version · 2f4eda78
      Igor Canadi committed
      Summary:
      There were some functions in VersionSet that had no reason to be there instead of Version. Moving them to Version will make column families implementation easier.
      
      The functions moved are:
      * NumLevelBytes
      * LevelSummary
      * LevelFileSummary
      * MaxNextLevelOverlappingBytes
      * AddLiveFiles (previously AddLiveFilesCurrentVersion())
      * NeedSlowdownForNumLevel0Files
      
      The diff continues on (and depends on) D15171
      
      Test Plan: make check
      
      Reviewers: dhruba, haobo, kailiu, sdong, emayanke
      
      Reviewed By: sdong
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15183
    • Decrease reliance on VersionSet::NumberLevels() · 65a8a52b
      Igor Canadi committed
      Summary:
      With column families, VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate calls to VersionSet::NumberLevels().
      
      This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet.
      
      I have also slightly changed how VersionSet keeps track of manifest size.
      
      This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor.
      
      In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_
      
      Test Plan: make check
      
      Reviewers: haobo, dhruba, kailiu, sdong
      
      Reviewed By: sdong
      
      Differential Revision: https://reviews.facebook.net/D15171
  6. 15 Jan, 2014 10 commits
    • [RocksDB Performance Branch] DBImpl.NewInternalIterator() to reduce works inside mutex · 9b51af5a
      Siying Dong committed
      Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), move all the iterator creation work in this function outside the mutex, leaving only the object ref and get inside it.
      
      Test Plan:
      make all check
      will run db_stress for a while too to make sure no problem.
      
      Reviewers: haobo, dhruba, kailiu
      
      Reviewed By: haobo
      
      CC: igor, leveldb
      
      Differential Revision: https://reviews.facebook.net/D14589
      
      Conflicts:
      	db/db_impl.cc
    • Fix CompactRange to apply filter to every key · d9cd7a06
      Igor Canadi committed
      Summary:
      When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`.
      
      This patch fixed the unit test.
      
      Test Plan: Added a failing unit test and a fix, so it's not failing anymore.
      
      Reviewers: dhruba, haobo, sdong
      
      Reviewed By: haobo
      
      CC: leveldb, xjin
      
      Differential Revision: https://reviews.facebook.net/D14421
    • 1ed2404f
    • Fix test · 62910202
      Igor Canadi committed
    • Fix memtable construction in tests · 7f3e417f
      Igor Canadi committed
    • VersionEdit not to take NumLevels() · 055e6df4
      Igor Canadi committed
      Summary:
      I will submit a sequence of diffs that are preparing master branch for column families. There are a lot of implicit assumptions in the code that are making column family implementation hard. If I make the change only in column family branch, it will make merging back to master impossible.
      
      Most of the diffs will be simple code refactorings, so I hope we can have fast turnaround time. Feel free to grab me in person to discuss any of them.
      
      This diff removes the number-of-levels check from VersionEdit. It is used only when a VersionEdit is read, not written, but has to be set when it is written. I believe it is the right thing to make VersionEdit dumb and check consistency on the caller side. This will also make it much easier to implement Column Families, since different column families can have different numbers of levels.
      
      Test Plan: make check
      
      Reviewers: dhruba, haobo, sdong, kailiu
      
      Reviewed By: kailiu
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15159
    • BuildBatchGroup -- memcpy outside of lock · 7d9f21cf
      Igor Canadi committed
      Summary: When building batch group, don't actually build a new batch since it requires heavy-weight mem copy and malloc. Only store references to the batches and build the batch group without lock held.
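
The idea of collecting only references under the lock and copying afterwards can be sketched as below. `WriteQueueSketch` is a hypothetical class that models batches as plain strings; it assumes callers keep their batches alive until the group is built, and it is not the RocksDB write path.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

// Under the lock only pointers to pending batches are collected; the byte
// copy into one combined batch happens after the lock is released.
class WriteQueueSketch {
 public:
  void Enqueue(const std::string* batch) {
    assert(batch != nullptr);
    std::lock_guard<std::mutex> guard(mu_);
    pending_.push_back(batch);
  }

  std::string BuildGroup() {
    std::vector<const std::string*> group;
    {
      std::lock_guard<std::mutex> guard(mu_);
      group.swap(pending_);  // cheap: moves pointers, not batch contents
    }
    // No lock held from here on: allocate once and copy the bytes.
    std::size_t total = 0;
    for (const std::string* b : group) total += b->size();
    std::string combined;
    combined.reserve(total);
    for (const std::string* b : group) combined.append(*b);
    return combined;
  }

 private:
  std::mutex mu_;
  std::vector<const std::string*> pending_;
};
```

The critical section is a vector swap, so its cost does not depend on how many bytes the batches contain.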
      
      Test Plan:
      `make check`
      
      I am also planning to run performance tests. The workload that will benefit from this change is readwhilewriting. I will post the results once I have them.
      
      Reviewers: dhruba, haobo, kailiu
      
      Reviewed By: haobo
      
      CC: leveldb, xjin
      
      Differential Revision: https://reviews.facebook.net/D15063
    • Use sanitized options while opening db · 1d9bac4d
      Naman Gupta committed
      Summary: We use SanitizeOptions() to set appropriate values for some options, based on other options. So we should use the sanitized options by default. Luckily it hasn't caused a bug yet, but it could result in a bug in the future.
      
      Test Plan: make check
      
      Reviewers: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D14103
    • Pre-calculate whether to slow down for too many level 0 files · fbbf0d14
      Siying Dong committed
      Summary: Currently in DBImpl::MakeRoomForWrite(), we do  "versions_->NumLevelFiles(0) >= options_.level0_slowdown_writes_trigger" to check whether the writer thread needs to slow down. However, versions_->NumLevelFiles(0) is slightly more expensive than we expected. By caching the result of the comparison when installing a new version, we can avoid this function call every time.
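
A minimal sketch of the caching idea, assuming a simplified version object whose level 0 is just a list of file numbers. `VersionSketch` is hypothetical; only the accessor name NeedSlowdownForNumLevel0Files is taken from this log.

```cpp
#include <utility>
#include <vector>

// The comparison against the slowdown trigger is evaluated once, when the
// version is created (installed), instead of on every write.
class VersionSketch {
 public:
  VersionSketch(std::vector<int> level0_files, int slowdown_trigger)
      : level0_files_(std::move(level0_files)),
        need_slowdown_(static_cast<int>(level0_files_.size()) >=
                       slowdown_trigger) {}

  // Cheap check for the write path: returns the cached result.
  bool NeedSlowdownForNumLevel0Files() const { return need_slowdown_; }

 private:
  std::vector<int> level0_files_;  // declared first, so initialized first
  bool need_slowdown_;
};
```

Because a version's file list does not change after it is installed, the cached flag stays valid for the lifetime of that version.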
      
      Test Plan:
      make all check
      Manually trigger this behavior by applying universal compaction style and make sure inserts are slowed down once there is a certain number of files.
      
      Reviewers: haobo, kailiu, igor
      
      Reviewed By: kailiu
      
      CC: nkg-, leveldb
      
      Differential Revision: https://reviews.facebook.net/D15141
    • DB::Put() to estimate write batch data size needed and pre-allocate buffer · 51dd2192
      Siying Dong committed
      Summary:
      In one of the CPU profiles, we see some CPU cost from string::reserve() inside Batch.Put(). This patch should reduce some of that cost by allocating a sufficient buffer beforehand.

      Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to the same application and do the same CPU profiling to make sure those CPU costs are reduced.
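
The pre-allocation idea can be sketched with a toy record encoder that reserves the buffer once before appending. The record layout, the one-byte length prefixes and `kAssumedRecordOverhead` are assumptions for illustration and do not reflect the real write batch encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical per-record overhead estimate (type tag plus length prefixes).
const std::size_t kAssumedRecordOverhead = 24;

// Encodes one put as: tag byte, key length, key, value length, value.
// Reserving up front avoids repeated reallocation while appending.
std::string EncodePutSketch(const std::string& key, const std::string& value) {
  // The toy one-byte length prefix only supports short keys and values.
  assert(key.size() < 256 && value.size() < 256);
  std::string rep;
  rep.reserve(key.size() + value.size() + kAssumedRecordOverhead);
  rep.push_back('\x01');  // hypothetical "put" type tag
  rep.push_back(static_cast<char>(key.size()));
  rep.append(key);
  rep.push_back(static_cast<char>(value.size()));
  rep.append(value);
  return rep;
}
```

The estimate only needs to be an upper bound that is cheap to compute; overshooting by a few bytes costs less than growing the buffer several times.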
      
      Test Plan: make all check
      
      Reviewers: haobo, kailiu, igor
      
      Reviewed By: haobo
      
      CC: leveldb, nkg-
      
      Differential Revision: https://reviews.facebook.net/D15135
  7. 12 Jan, 2014 1 commit
  8. 11 Jan, 2014 1 commit
    • Improve RocksDB "get" performance by computing merge result in memtable · a09ee106
      Schalk-Willem Kruger committed
      Summary:
      Added an option (max_successive_merges) that can be used to specify the
      maximum number of successive merge operations on a key in the memtable.
      This can be used to improve performance of the "get" operation. If many
      successive merge operations are performed on a key, the performance of "get"
      operations on the key deteriorates, as the value has to be computed for each
      "get" operation by applying all the successive merge operations.
      
      FB Task ID: #3428853
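
The effect of the max_successive_merges option described in the summary can be sketched for a single key with a hypothetical "add" merge operator. `CounterEntrySketch` is not RocksDB code, and treating 0 as "disabled" is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One key's state in a toy memtable: a base value plus stacked operands.
class CounterEntrySketch {
 public:
  explicit CounterEntrySketch(std::size_t max_successive_merges)
      : max_successive_merges_(max_successive_merges) {}

  void Merge(int64_t delta) {
    if (max_successive_merges_ > 0 &&
        pending_.size() >= max_successive_merges_) {
      // Too many operands stacked up: compute the result now and store it
      // as a plain value, so later reads do not have to replay them.
      base_ = Get() + delta;
      pending_.clear();
      return;
    }
    pending_.push_back(delta);
  }

  // A read applies every pending operand on top of the base value.
  int64_t Get() const {
    int64_t result = base_;
    for (int64_t d : pending_) result += d;
    return result;
  }

  std::size_t pending_operands() const { return pending_.size(); }

 private:
  std::size_t max_successive_merges_;
  int64_t base_ = 0;
  std::vector<int64_t> pending_;
};
```

With a limit of 2, the third merge folds everything into the base value, so the next read has no operands left to apply.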
      
      Test Plan:
      make all check
      db_bench --benchmarks=readrandommergerandom
      counter_stress_test
      
      Reviewers: haobo, vamsi, dhruba, sdong
      
      Reviewed By: haobo
      
      CC: zshao
      
      Differential Revision: https://reviews.facebook.net/D14991
  9. 09 Jan, 2014 1 commit
  10. 08 Jan, 2014 2 commits
    • Don't always compress L0 files written by memtable flush · 50994bf6
      Mark Callaghan committed
      Summary:
      Code was always compressing L0 files written by a memtable flush
      when compression was enabled. Now this is done when
      min_level_to_compress=0 for leveled compaction and when
      universal_compaction_size_percent=-1 for universal compaction.
      
      Task ID: #3416472
      
      Test Plan:
      ran db_bench with compression options
      
      Reviewers: dhruba, igor, sdong
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D14757
    • Fix a deadlock in CompactRange() · 9f690ec6
      Tomislav Novak committed
      Summary:
      The way DBImpl::TEST_CompactRange() throttles down the number of bg compactions
      can cause it to deadlock when CompactRange() is called concurrently from
      multiple threads. Imagine the following scenario with only two threads
      (max_background_compactions is 10 and bg_compaction_scheduled_ is initially 0):
      
         1. Thread #1 increments bg_compaction_scheduled_ (to LargeNumber), sets
            bg_compaction_scheduled_ to 9 (newvalue), schedules the compaction
            (bg_compaction_scheduled_ is now 10) and waits for it to complete.
         2. Thread #2 calls TEST_CompactRange(), increments bg_compaction_scheduled_
            (now LargeNumber + 10) and waits on a cv for bg_compaction_scheduled_ to
            drop to LargeNumber.
         3. BG thread completes the first manual compaction, decrements
            bg_compaction_scheduled_ and wakes up all threads waiting on bg_cv_.
            Thread #1 runs, increments bg_compaction_scheduled_ by LargeNumber again
            (now 2*LargeNumber + 9). Since that's more than LargeNumber + newvalue,
            thread #2 also goes to sleep (waiting on bg_cv_), without resetting
            bg_compaction_scheduled_.
      
      This diff attempts to address the problem by introducing a new counter
      bg_manual_only_ (when positive, MaybeScheduleFlushOrCompaction() will only
      schedule manual compactions).
      
      Test Plan:
      I could pretty much consistently reproduce the deadlock with a program that
      calls CompactRange(nullptr, nullptr) immediately after Write() from multiple
      threads. This no longer happens with this patch.
      
      Tests (make check) pass.
      
      Reviewers: dhruba, igor, sdong, haobo
      
      Reviewed By: igor
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D14799
  11. 03 Jan, 2014 1 commit
  12. 02 Jan, 2014 1 commit
    • Support multi-threaded DisableFileDeletions() and EnableFileDeletions() · b60c14f6
      Igor Canadi committed
      Summary:
      We don't want two threads to clash if they concurrently call DisableFileDeletions() and EnableFileDeletions(). I'm adding a counter that will enable file deletions only after all DisableFileDeletions() calls have been negated with EnableFileDeletions().
      
      However, we also don't want to break the old behavior, so I added a parameter force to EnableFileDeletions(). If force is true, we will still enable file deletions after every call to EnableFileDeletions(), which is what is happening now.
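
The counter and the force parameter described above can be sketched as follows. `FileDeletionGateSketch` and its boolean return value are hypothetical choices for the example; only the two method names and the `force` parameter come from the summary.

```cpp
#include <mutex>

// Deletions are enabled only when every DisableFileDeletions() call has
// been matched by an EnableFileDeletions() call, unless force is set.
class FileDeletionGateSketch {
 public:
  void DisableFileDeletions() {
    std::lock_guard<std::mutex> guard(mu_);
    ++disable_count_;
  }

  // Returns true if file deletions are enabled after the call.
  bool EnableFileDeletions(bool force) {
    std::lock_guard<std::mutex> guard(mu_);
    if (force) {
      disable_count_ = 0;  // old behavior: one call always re-enables
    } else if (disable_count_ > 0) {
      --disable_count_;
    }
    return disable_count_ == 0;
  }

 private:
  std::mutex mu_;
  int disable_count_ = 0;
};
```

Two threads that each disable and later enable deletions no longer interfere: deletions resume only after the second `EnableFileDeletions(false)`.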
      
      Test Plan: make check
      
      Reviewers: dhruba, haobo, sanketh
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D14781
  13. 01 Jan, 2014 1 commit
  14. 27 Dec, 2013 3 commits
  15. 21 Dec, 2013 1 commit
    • [RocksDB] Optimize locking for Get · 1fdb3f7d
      Igor Canadi committed
      Summary:
      Instead of locking and saving a DB state, we can cache a DB state and update it only when it changes. This change reduces lock contention and speeds up read operations on the DB.
      
      Performance improvements are substantial, although there is some cost in no-read workloads. I ran the regression tests on my devserver and here are the numbers:
      
        overwrite                    56345  ->   63001
        fillseq                      193730 ->  185296
        readrandom                   771301 -> 1219803 (58% improvement!)
        readrandom_smallblockcache   677609 ->  862850
        readrandom_memtable_sst      710440 -> 1109223
        readrandom_fillunique_random 221589 ->  247869
        memtablefillrandom           105286 ->   92643
        memtablereadrandom           763033 -> 1288862
      
      Test Plan:
      make asan_check
      I am also running db_stress
      
      Reviewers: dhruba, haobo, sdong, kailiu
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D14679
  16. 19 Dec, 2013 1 commit
    • Add 'readtocache' test · ca92068b
      Mark Callaghan committed
      Summary:
      For some tests I want to cache the database prior to running other tests on the same invocation
      of db_bench. The readtocache test ignores --threads and --reads so those can be used by other tests
      and it will still do a full read of --num rows with one thread. It might be invoked like:
        db_bench --benchmarks=readtocache,readrandom --reads 100 --num 10000 --threads 8
      
      Test Plan:
      run db_bench
      
      Reviewers: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D14739
  17. 18 Dec, 2013 2 commits