1. 07 4月, 2015 1 次提交
    • S
      A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge · 953a885e
      sdong 提交于
      Summary:
      Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
      
      Also refactor the codes so that
      (1) make table property collector and internal table property collector two separate data structures with the later one now exposed
      (2) table builders only receive internal table properties
      
      Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
      
      Reviewers: yhchiang, igor.sugak, rven, igor
      
      Reviewed By: rven, igor
      
      Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D35373
      953a885e
  2. 31 3月, 2015 1 次提交
    • S
      Universal Compactions with Small Files · b23bbaa8
      sdong 提交于
      Summary:
      With this change, we use L1 and up to store compaction outputs in universal compaction.
      The compaction pick logic stays the same. Outputs are stored in the largest "level" as possible.
      
      If options.num_levels=1, it behaves all the same as now.
      
      Test Plan:
      1) convert most of existing unit tests for universal comapaction to include the option of one level and multiple levels.
      2) add a unit test to cover parallel compaction in universal compaction and run it in one level and multiple levels
      3) add unit test to migrate from multiple level setting back to one level setting
      4) add a unit test to insert keys to trigger multiple rounds of compactions and verify results.
      
      Reviewers: rven, kradhakrishnan, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: meyering, leveldb, MarkCallaghan, dhruba
      
      Differential Revision: https://reviews.facebook.net/D34539
      b23bbaa8
  3. 15 3月, 2015 1 次提交
    • M
      Stop printing per-level stall times. · c8da6703
      Mark Callaghan 提交于
      Summary:
      Per-level stall times are the suggested stall time, not the actual stall time so this change stops printing them
      both in the per-level output lines and in the summary. Also changed output for total stall time to include units
      in all cases. The new output looks like:
      Level   Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(cnt)    RecordIn   RecordDrop
      ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        L0     4/1          7   0.8      0.0     0.0      0.0       0.6      0.6       0.0   0.0      0.0     12.9        50       352    0.141        882            0            0
        L1     5/0          9   0.9      0.0     0.0      0.0       0.0      0.0       0.6   0.0      0.0      0.0         0         0    0.000          0            0            0
        L2    54/0         99   1.0      0.0     0.0      0.0       0.0      0.0       0.6   0.0      0.0      0.0         0         0    0.000          0            0            0
        L3   289/0        527   0.5      0.0     0.0      0.0       0.0      0.0       0.5   0.0      0.0      0.0         0         0    0.000          0            0            0
       Sum   352/1        642   0.0      0.0     0.0      0.0       0.6      0.6       1.7   1.0      0.0     12.9        50       352    0.141        882            0            0
       Int     0/0          0   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     15.5         0         3    0.118          7            0            0
      Flush(GB): accumulative 0.627, interval 0.005
      Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 882 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard
      
      Task ID: #6493861
      
      Blame Rev:
      
      Test Plan:
      run db_bench, look at output
      
      Revert Plan:
      
      Database Impact:
      
      Memcache Impact:
      
      Other Notes:
      
      EImportant:
      
      - begin *PUBLIC* platform impact section -
      Bugzilla: #
      - end platform impact -
      
      Reviewers: igor
      
      Reviewed By: igor
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D35085
      c8da6703
  4. 03 3月, 2015 1 次提交
    • I
      options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size... · db037393
      Igor Canadi 提交于
      options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
      
      Summary:
      When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
      
      In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
      
      Test Plan: New unit tests and pass tests suites including valgrind.
      
      Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
      
      Reviewed By: ikabiljo
      
      Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D31437
      db037393
  5. 24 2月, 2015 1 次提交
  6. 20 2月, 2015 1 次提交
  7. 13 2月, 2015 1 次提交
    • I
      Introduce job_id for flush and compaction · e7ea51a8
      Igor Canadi 提交于
      Summary:
      It would be good to assing background job their IDs. Two benefits:
      1) makes LOGs more readable
      2) I might use it in my EventLogger, which will try to make our LOG easier to read/query/visualize
      
      Test Plan: ran rocksdb, read the LOG
      
      Reviewers: sdong, rven, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D31617
      e7ea51a8
  8. 05 2月, 2015 1 次提交
    • Y
      Add a counter for collecting the wait time on db mutex. · 181191a1
      Yueh-Hsuan Chiang 提交于
      Summary:
      Add a counter for collecting the wait time on db mutex.
      Also add MutexWrapper and CondVarWrapper for measuring wait time.
      
      Test Plan:
      ./db_test
      export ROCKSDB_TESTS=MutexWaitStats
      ./db_test
      
      verify stats output using db_bench
      make clean
      make release
      ./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10
      
      Sample output:
          rocksdb.db.mutex.wait.micros COUNT : 7546866
      
      Reviewers: MarkCallaghan, rven, sdong, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D32787
      181191a1
  9. 28 1月, 2015 2 次提交
  10. 27 1月, 2015 1 次提交
    • I
      Fix data race #1 · f1c88624
      Igor Canadi 提交于
      Summary:
      This is first in a series of diffs that fixes data races detected by thread sanitizer.
      
      Here the problem is that we call Ref() on a column family during a single-threaded write, without holding a mutex.
      
      Test Plan: TSAN is no longer complaining about LevelLimitReopen.
      
      Reviewers: yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D32121
      f1c88624
  11. 07 1月, 2015 1 次提交
    • I
      Simplify column family concurrency · 7731d51c
      Igor Canadi 提交于
      Summary:
      This patch changes concurrency guarantees around ColumnFamilySet::column_families_ and ColumnFamilySet::column_families_data_.
      
      Before:
      * When mutating: lock DB mutex and spin lock
      * When reading: lock DB mutex OR spin lock
      
      After:
      * When mutating: lock DB mutex and be in write thread
      * When reading: lock DB mutex or be in write thread
      
      That way, we eliminate the spin lock that protects these hash maps and  simplify concurrency. That means we don't need to lock the spin lock during writing, since writing is mutually exclusive with column family create/drop (the only operations that mutate those hash maps).
      
      With these new restrictions, I also needed to move column family create to the write thread (column family drop was already in the write thread).
      
      Even though we don't need to lock the spin lock during write, impact on performance should be minimal -- the spin lock is almost never busy, so locking it is almost free.
      
      This addresses task t5116919.
      
      Test Plan:
      make check
      
      Stress test with lots and lots of column family drop and create:
      
         time ./db_stress --threads=30 --ops_per_thread=5000000 --max_key=5000 --column_families=200 --clear_column_family_one_in=100000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress/
      
      Reviewers: yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30651
      7731d51c
  12. 20 12月, 2014 1 次提交
    • I
      Rewritten system for scheduling background work · fdb6be4e
      Igor Canadi 提交于
      Summary:
      When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
      
      The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
      
      Here are the performance results:
      
      Command:
      
          ./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000  --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
      
      Before the patch:
      
           fillrandom   :      26.950 micros/op 37105 ops/sec;    4.1 MB/s
      
      After the patch:
      
            fillrandom   :      17.404 micros/op 57456 ops/sec;    6.4 MB/s
      
      Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
      
            fillrandom   :       7.590 micros/op 131758 ops/sec;   14.6 MB/s
      
      Test Plan:
      make check
      
      two stress tests:
      
      Big number of compactions and flushes:
      
          ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
      
      max_background_flushes=0, to verify that this case also works correctly
      
          ./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
      
      Reviewers: ljin, rven, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30123
      fdb6be4e
  13. 03 12月, 2014 2 次提交
    • J
      Enforce write buffer memory limit across column families · a14b7873
      Jonah Cohen 提交于
      Summary:
      Introduces a new class for managing write buffer memory across column
      families.  We supplement ColumnFamilyOptions::write_buffer_size with
      ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer
      instance that enforces memory limits before flushing out to disk.
      
      Test Plan: Added SharedWriteBuffer unit test to db_test.cc
      
      Reviewers: sdong, rven, ljin, igor
      
      Reviewed By: igor
      
      Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim
      
      Differential Revision: https://reviews.facebook.net/D22581
      a14b7873
    • I
      Fix linters · 37d73d59
      Igor Canadi 提交于
      Summary:
      Two fixes:
      1. if cpplint is not present on the system, don't return a confusing error in the linter
      2. Add include_alpha, which means our includes should be sorted lexicographically
      
      Test Plan: Tried unsorting our includes, lint complained
      
      Reviewers: rven, ljin, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D28845
      37d73d59
  14. 27 11月, 2014 1 次提交
  15. 19 11月, 2014 1 次提交
  16. 14 11月, 2014 2 次提交
  17. 08 11月, 2014 1 次提交
    • Y
      CompactFiles, EventListener and GetDatabaseMetaData · 28c82ff1
      Yueh-Hsuan Chiang 提交于
      Summary:
      This diff adds three sets of APIs to RocksDB.
      
      = GetColumnFamilyMetaData =
      * This APIs allow users to obtain the current state of a RocksDB instance on one column family.
      * See GetColumnFamilyMetaData in include/rocksdb/db.h
      
      = EventListener =
      * A virtual class that allows users to implement a set of
        call-back functions which will be called when specific
        events of a RocksDB instance happens.
      * To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
      
      = CompactFiles =
      * CompactFiles API inputs a set of file numbers and an output level, and RocksDB
        will try to compact those files into the specified level.
      
      = Example =
      * Example code can be found in example/compact_files_example.cc, which implements
        a simple external compactor using EventListener, GetColumnFamilyMetaData, and
        CompactFiles API.
      
      Test Plan:
      listener_test
      compactor_test
      example/compact_files_example
      export ROCKSDB_TESTS=CompactFiles
      db_test
      export ROCKSDB_TESTS=MetaData
      db_test
      
      Reviewers: ljin, igor, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: MarkCallaghan, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D24705
      28c82ff1
  18. 07 11月, 2014 1 次提交
    • I
      Turn -Wshadow back on · 9f20395c
      Igor Canadi 提交于
      Summary: It turns out that -Wshadow has different rules for gcc than clang. Previous commit fixed clang. This commits fixes the rest of the warnings for gcc.
      
      Test Plan: compiles
      
      Reviewers: ljin, yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D28131
      9f20395c
  19. 05 11月, 2014 2 次提交
  20. 01 11月, 2014 2 次提交
    • I
      Turn on -Wshadow · 9f7fc3ac
      Igor Canadi 提交于
      Summary:
      ...and fix all the errors :)
      
      Jim suggested turning on -Wshadow because it helped him fix number of critical bugs in fbcode. I think it's a good idea to be -Wshadow clean.
      
      Test Plan: compiles
      
      Reviewers: yhchiang, rven, sdong, ljin
      
      Reviewed By: ljin
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D27711
      9f7fc3ac
    • S
      Make VersionBuilder unit testable · 4d2ba38b
      sdong 提交于
      Summary:
      Rename Version::Builder to VersionBuilder and expose its definition to a header.
      Make VerisonBuilder not reference Version or ColumnFamilyData, only working with VersionStorageInfo.
      Add version_builder_test which has a simple test.
      
      Test Plan: make all check
      
      Reviewers: rven, yhchiang, igor, ljin
      
      Reviewed By: igor
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D27969
      4d2ba38b
  21. 30 10月, 2014 2 次提交
    • S
      Make CompactionPicker more easily tested · 76d1c28e
      sdong 提交于
      Summary:
      Make compaction picker easier to test.
      The basic idea is to separate a minimum subcomponent of Version to VersionStorageInfo, which just responsible to LSM tree. A stub VersionStorageInfo can then be easily created and passed into compaction picker so that we can check the outputs.
      
      It now passes most tests. Still two things need to be done:
      (1) deal with the FIFO compaction's file size.
      (2) write an example test to make sure the interface can do the job.
      
      Add a compaction_picker_test to make sure compaction picker codes can be easily unit tested.
      
      Test Plan:
      Pass all unit tests and compaction_picker_test
      
      Reviewers: yhchiang, rven, igor, ljin
      
      Reviewed By: ljin
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D27639
      76d1c28e
    • Y
      Apply InfoLogLevel to the logs in db/column_family.cc · 34d436b7
      Yueh-Hsuan Chiang 提交于
      Summary: Apply InfoLogLevel to the logs in db/column_family.cc
      
      Test Plan: make
      
      Reviewers: ljin, sdong, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D27843
      34d436b7
  22. 29 10月, 2014 2 次提交
    • I
      FlushProcess · a39e931e
      Igor Canadi 提交于
      Summary:
      Abstract out FlushProcess and take it out of DBImpl.
      This also includes taking DeletionState outside of DBImpl.
      
      Currently this diff is only doing the refactoring. Future work includes:
      1. Decoupling flush_process.cc, make it depend on less state
      2. Write flush_process_test, which will mock out everything that FlushProcess depends on and test it in isolation
      
      Test Plan: make check
      
      Reviewers: rven, yhchiang, sdong, ljin
      
      Reviewed By: ljin
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D27561
      a39e931e
    • L
      unfriend ColumnFamilyData from VersionSet · f981e081
      Lei Jin 提交于
      Summary: as title
      
      Test Plan:
      make release
      will run full test on all stacked diffs before committing
      
      Reviewers: sdong, yhchiang, rven, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D27591
      f981e081
  23. 28 10月, 2014 1 次提交
    • L
      dynamic inplace_update options · f1841985
      Lei Jin 提交于
      Summary:
      Make inplace_update_support and inplace_update_num_locks dynamic.
      inplace_callback becomes immutable
      We are almost free of references to cfd->options() in db_impl
      
      Test Plan: unit test
      
      Reviewers: igor, yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D25293
      f1841985
  24. 17 10月, 2014 2 次提交
  25. 02 10月, 2014 1 次提交
    • L
      make compaction related options changeable · 5ec53f3e
      Lei Jin 提交于
      Summary:
      make compaction related options changeable. Most of changes are tedious,
      following the same convention: grabs MutableCFOptions at the beginning
      of compaction under mutex, then pass it throughout the job and register
      it in SuperVersion at the end.
      
      Test Plan: make all check
      
      Reviewers: igor, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D23349
      5ec53f3e
  26. 25 9月, 2014 1 次提交
  27. 23 9月, 2014 1 次提交
    • S
      WriteBatchWithIndex to allow different Comparators for different column families · d0de413f
      sdong 提交于
      Summary:
      Previously, one single column family is given to WriteBatchWithIndex to index keys for all column families. An extra map from column family ID to comparator is maintained which can override the default comparator given in the constructor. A WriteBatchWithIndex::SetComparatorForCF() is added for user to add comparators per column family.
      
      Also move more codes into anonymous namespace.
      
      Test Plan: Add a unit test
      
      Reviewers: ljin, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb, yhchiang
      
      Differential Revision: https://reviews.facebook.net/D23355
      d0de413f
  28. 18 9月, 2014 1 次提交
  29. 11 9月, 2014 1 次提交
    • I
      Push model for flushing memtables · 3d9e6f77
      Igor Canadi 提交于
      Summary:
      When memtable is full it calls the registered callback. That callback then registers column family as needing the flush. Every write checks if there are some column families that need to be flushed. This completely eliminates the need for MakeRoomForWrite() function and simplifies our Write code-path.
      
      There is some complexity with the concurrency when the column family is dropped. I made it a bit less complex by dropping the column family from the write thread in https://reviews.facebook.net/D22965. Let me know if you want to discuss this.
      
      Test Plan: make check works. I'll also run db_stress with creating and dropping column families for a while.
      
      Reviewers: yhchiang, sdong, ljin
      
      Reviewed By: ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D23067
      3d9e6f77
  30. 09 9月, 2014 3 次提交
    • L
      MemTableOptions · 52311463
      Lei Jin 提交于
      Summary: removed reference to options in WriteBatch and DBImpl::Get()
      
      Test Plan: make all check
      
      Reviewers: yhchiang, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D23049
      52311463
    • I
      Check stop level trigger-0 before slowdown level-0 trigger · 2d57828d
      Igor Canadi 提交于
      Summary: ...
      
      Test Plan: Can't repro the test failure, but let's see what jenkins says
      
      Reviewers: zagfox, sdong, ljin
      
      Reviewed By: sdong, ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D23061
      2d57828d
    • I
      Push- instead of pull-model for managing Write stalls · a2bb7c3c
      Igor Canadi 提交于
      Summary:
      Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
      * proceed with all writes without delay
      * delay all writes by fixed time
      * stop all writes
      
      The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
      
      When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
      
      This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
      
      Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
      
      Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
      
      Reviewed By: ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D22791
      a2bb7c3c