1. 08 Nov, 2014 3 commits
    • I
      Get rid of mutex in CompactionJob's state · e3d3567b
      Igor Canadi committed
      Summary: Based on @sdong's feedback in the diff, we shouldn't keep db_mutex in CompactionJob's state. This diff removes db_mutex from CompactionJob state, by making next_file_number_ atomic. That way we only need to pass the lock to InstallCompactionResults() because of LogAndApply()
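      
      A minimal sketch of the idea, assuming a hypothetical `FileNumberAllocator` class (not the actual VersionSet code): once the counter is a `std::atomic`, a job can reserve output file numbers without holding the DB mutex.
      
      ```cpp
      #include <atomic>
      #include <cassert>
      #include <cstdint>
      
      // Hypothetical stand-in for the owner of next_file_number_.
      class FileNumberAllocator {
       public:
        explicit FileNumberAllocator(uint64_t first) : next_file_number_(first) {}
      
        // Atomically reserves and returns the next file number; no mutex needed.
        uint64_t NewFileNumber() { return next_file_number_.fetch_add(1); }
      
        // Returns the number that the next call to NewFileNumber() would hand out.
        uint64_t current() const { return next_file_number_.load(); }
      
       private:
        std::atomic<uint64_t> next_file_number_;
      };
      ```
      
      Only the step that installs the results (LogAndApply() in the commit above) would still need the lock in this picture.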
      
      Test Plan: make check
      
      Reviewers: ljin, yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: sdong, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D28491
      e3d3567b
    • Y
      CompactFiles, EventListener and GetDatabaseMetaData · 28c82ff1
      Yueh-Hsuan Chiang committed
      Summary:
      This diff adds three sets of APIs to RocksDB.
      
      = GetColumnFamilyMetaData =
      * These APIs allow users to obtain the current state of a RocksDB instance on one column family.
      * See GetColumnFamilyMetaData in include/rocksdb/db.h
      
      = EventListener =
      * A virtual class that allows users to implement a set of
        call-back functions which will be called when specific
        events of a RocksDB instance happen.
      * To register an EventListener, simply insert an EventListener into ColumnFamilyOptions::listeners
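      
      The listener pattern described above can be sketched in standalone C++. The `Listener`, `Options`, `CountingListener` and `NotifyFlushCompleted` names below are simplified, hypothetical stand-ins and not the RocksDB declarations; see include/rocksdb for the real interface.
      
      ```cpp
      #include <cassert>
      #include <memory>
      #include <string>
      #include <vector>
      
      // Hypothetical base class: callbacks default to no-ops so that a
      // subclass overrides only the events it cares about.
      struct Listener {
        virtual ~Listener() = default;
        virtual void OnFlushCompleted(const std::string& /*file_name*/) {}
      };
      
      // Hypothetical options struct holding the registered listeners.
      struct Options {
        std::vector<std::shared_ptr<Listener>> listeners;
      };
      
      // Example listener that records how many flushes it has seen.
      struct CountingListener : Listener {
        int flushes = 0;
        std::string last_file;
        void OnFlushCompleted(const std::string& file_name) override {
          ++flushes;
          last_file = file_name;
        }
      };
      
      // Notifies every registered listener, in registration order.
      inline void NotifyFlushCompleted(const Options& options,
                                       const std::string& file_name) {
        for (const auto& listener : options.listeners) {
          listener->OnFlushCompleted(file_name);
        }
      }
      ```
      
      Registration is just a `push_back` of a `std::shared_ptr` into `listeners`; the engine side then calls the notify helper whenever the event occurs.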
      
      = CompactFiles =
      * CompactFiles API inputs a set of file numbers and an output level, and RocksDB
        will try to compact those files into the specified level.
      
      = Example =
      * Example code can be found in example/compact_files_example.cc, which implements
        a simple external compactor using EventListener, GetColumnFamilyMetaData, and
        CompactFiles API.
      
      Test Plan:
      listener_test
      compactor_test
      example/compact_files_example
      export ROCKSDB_TESTS=CompactFiles
      db_test
      export ROCKSDB_TESTS=MetaData
      db_test
      
      Reviewers: ljin, igor, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: MarkCallaghan, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D24705
      28c82ff1
    • I
      Redesign pending_outputs_ · 53af5d87
      Igor Canadi committed
      Summary:
      Here's a prototype of redesigning pending_outputs_. This way, we don't have to expose pending_outputs_ to other classes (CompactionJob, FlushJob, MemtableList). DBImpl takes care of it.
      
      Still have to write some comments, but should be good enough to start the discussion.
      
      Test Plan: make check, will also run stress test
      
      Reviewers: ljin, sdong, rven, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D28353
      53af5d87
  2. 05 Nov, 2014 1 commit
  3. 01 Nov, 2014 2 commits
  4. 30 Oct, 2014 1 commit
    • S
      Make CompactionPicker more easily tested · 76d1c28e
      sdong committed
      Summary:
      Make compaction picker easier to test.
      The basic idea is to separate a minimum subcomponent of Version into VersionStorageInfo, which is responsible only for the LSM tree. A stub VersionStorageInfo can then be easily created and passed into the compaction picker so that we can check the outputs.
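      
      To illustrate why a small storage-only struct makes the picker testable, here is a sketch with a hypothetical `StorageInfoStub` and a toy `PickCompactionLevel` function. The scoring rule (total level size divided by a target size) is an assumption for the example, not the real picker logic.
      
      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <cstdint>
      #include <vector>
      
      // Hypothetical stub: only the per-level file sizes a picker would need.
      struct StorageInfoStub {
        std::vector<std::vector<uint64_t>> level_file_sizes;
      };
      
      // Toy picker: returns the level whose total size exceeds its target by
      // the largest ratio, or -1 if no level is over its target.
      inline int PickCompactionLevel(const StorageInfoStub& info,
                                     const std::vector<uint64_t>& target_sizes) {
        int best_level = -1;
        double best_score = 1.0;
        for (size_t level = 0;
             level < info.level_file_sizes.size() && level < target_sizes.size();
             ++level) {
          if (target_sizes[level] == 0) continue;  // avoid dividing by zero
          uint64_t total = 0;
          for (uint64_t size : info.level_file_sizes[level]) total += size;
          const double score = static_cast<double>(total) /
                               static_cast<double>(target_sizes[level]);
          if (score > best_score) {
            best_score = score;
            best_level = static_cast<int>(level);
          }
        }
        return best_level;
      }
      ```
      
      A unit test can fill the stub by hand and check the picker's output without opening a database.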
      
      It now passes most tests. Still two things need to be done:
      (1) deal with the FIFO compaction's file size.
      (2) write an example test to make sure the interface can do the job.
      
      Add a compaction_picker_test to make sure compaction picker codes can be easily unit tested.
      
      Test Plan:
      Pass all unit tests and compaction_picker_test
      
      Reviewers: yhchiang, rven, igor, ljin
      
      Reviewed By: ljin
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D27639
      76d1c28e
  5. 29 Oct, 2014 6 commits
  6. 18 Oct, 2014 1 commit
    • Y
      Speed up DB::Open() and Version creation by limiting the number of FileMetaData initialization. · 6c669186
      Yueh-Hsuan Chiang committed
      Summary:
      This diff speeds up DB::Open() and Version creation by limiting the number of FileMetaData initialization. The behavior of Version::UpdateAccumulatedStats() is changed as follows:
      
      * It only initializes the first 20 uninitialized FileMetaData from file.  This guarantees the size of the latest 20 files will always be compensated when they have any deletion entries.  Previously it may initialize all FileMetaData by loading all files at DB::Open().
      * In case none of the first 20 files has any data entry, UpdateAccumulatedStats() will initialize the FileMetaData of the oldest file.
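      
      The two rules above can be sketched as follows. `FileMeta`, `entries_on_disk` and `UpdateStatsSketch` are hypothetical names, and the vector is assumed to be ordered newest file first; setting `initialized` stands in for loading the file's properties from disk.
      
      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <cstdint>
      #include <vector>
      
      // Hypothetical, simplified file metadata.
      struct FileMeta {
        bool initialized = false;
        uint64_t entries_on_disk = 0;  // only known once the file is loaded
      };
      
      // Initializes at most `max_init` uninitialized files (newest first).
      // If none of them has any data entry, also initializes the oldest file.
      // Returns the number of files initialized by this call.
      inline size_t UpdateStatsSketch(std::vector<FileMeta>& files,
                                      size_t max_init) {
        size_t initialized_now = 0;
        bool saw_entries = false;
        for (FileMeta& file : files) {
          if (initialized_now == max_init) break;
          if (file.initialized) continue;
          file.initialized = true;  // stands in for reading the file
          ++initialized_now;
          if (file.entries_on_disk > 0) saw_entries = true;
        }
        if (!saw_entries && !files.empty() && !files.back().initialized) {
          files.back().initialized = true;
          ++initialized_now;
        }
        return initialized_now;
      }
      ```
      
      With 25 non-empty files and a limit of 20, only the first 20 are touched, which is what bounds the cost at DB::Open().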
      
      Test Plan: db_test
      
      Reviewers: igor, sdong, ljin
      
      Reviewed By: ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D24255
      6c669186
  7. 02 Oct, 2014 1 commit
    • L
      make compaction related options changeable · 5ec53f3e
      Lei Jin committed
      Summary:
      Make compaction related options changeable. Most of the changes are tedious,
      following the same convention: grab MutableCFOptions at the beginning
      of compaction under mutex, then pass it throughout the job and register
      it in SuperVersion at the end.
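      
      A minimal sketch of that convention, assuming hypothetical `MutableOptionsSketch` and `ColumnFamilySketch` types: the job copies the options once under the mutex and then works from its private copy, so a concurrent change cannot affect a job that is already running.
      
      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <mutex>
      
      // Hypothetical subset of the options that can change at runtime.
      struct MutableOptionsSketch {
        int level0_file_num_compaction_trigger = 4;
        uint64_t target_file_size_base = 2097152;
      };
      
      // Hypothetical owner of the current options, guarded by a mutex.
      class ColumnFamilySketch {
       public:
        // Copies the current options while holding the mutex.
        MutableOptionsSketch Snapshot() const {
          std::lock_guard<std::mutex> lock(mutex_);
          return options_;
        }
      
        // Replaces the current options while holding the mutex.
        void SetOptions(const MutableOptionsSketch& options) {
          std::lock_guard<std::mutex> lock(mutex_);
          options_ = options;
        }
      
       private:
        mutable std::mutex mutex_;
        MutableOptionsSketch options_;
      };
      
      // A job reads only from the snapshot it was given at its start.
      inline uint64_t PlanOutputFileSize(const MutableOptionsSketch& snapshot) {
        return snapshot.target_file_size_base;
      }
      ```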
      
      Test Plan: make all check
      
      Reviewers: igor, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D23349
      5ec53f3e
  8. 30 Sep, 2014 1 commit
    • L
      use GetContext to replace callback function pointer · 2faf49d5
      Lei Jin committed
      Summary:
      Instead of passing a callback function pointer and its arg on the Table::Get()
      interface, pass a GetContext. This makes the interface cleaner and
      possibly gives better perf. Also add a fast path for SaveValue()
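      
      The shape of the change can be sketched like this. `GetContextSketch` is a hypothetical, much simplified class: it bundles the lookup key, the output slot and the result state into one object, replacing a `(callback, void* arg)` pair.
      
      ```cpp
      #include <cassert>
      #include <string>
      
      // Hypothetical context object passed down the Get() path.
      class GetContextSketch {
       public:
        enum State { kNotFound, kFound, kDeleted };
      
        GetContextSketch(const std::string& user_key, std::string* value)
            : user_key_(user_key), value_(value), state_(kNotFound) {}
      
        // Records the entry if it matches the lookup key.
        // Returns true when the entry matched, false otherwise.
        bool SaveValue(const std::string& key, bool is_deletion,
                       const std::string& value) {
          if (key != user_key_) return false;
          if (is_deletion) {
            state_ = kDeleted;
          } else {
            state_ = kFound;
            *value_ = value;
          }
          return true;
        }
      
        State state() const { return state_; }
      
       private:
        std::string user_key_;
        std::string* value_;
        State state_;
      };
      ```
      
      Because the table reader calls a concrete member function instead of an opaque function pointer, the compiler has more room to inline it.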
      
      Test Plan: make all check
      
      Reviewers: igor, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D24057
      2faf49d5
  9. 26 Sep, 2014 1 commit
    • L
      CompactedDBImpl · 3c680061
      Lei Jin committed
      Summary:
      Add a CompactedDBImpl that will be enabled when calling OpenForReadOnly()
      and the DB only has one level (>0) of files. As a performance comparison,
      CuckooTable performs 2.1M/s with CompactedDBImpl vs. 1.78M/s with
      ReadOnlyDBImpl.
      
      Test Plan: db_bench
      
      Reviewers: yhchiang, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D23553
      3c680061
  10. 24 Sep, 2014 1 commit
  11. 09 Sep, 2014 2 commits
    • L
      rename version_set options_ to db_options_ to avoid confusion · 9b0f7ffa
      Lei Jin committed
      Summary: as title
      
      Test Plan: make release
      
      Reviewers: sdong, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D23007
      9b0f7ffa
    • I
      Push- instead of pull-model for managing Write stalls · a2bb7c3c
      Igor Canadi committed
      Summary:
      Introducing WriteController, which is a source of truth about per-DB write delays. Let's define a DB epoch as a period where there are no flushes and compactions (i.e. a new epoch is started when a flush or compaction finishes). Each epoch can either:
      * proceed with all writes without delay
      * delay all writes by fixed time
      * stop all writes
      
      The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
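      
      A sketch of the push model, using a hypothetical `WriteControllerSketch`. The rule that picks the mode from the L0 file count and two triggers is an assumption for illustration; the point is that the mode is computed once per epoch change and writers only read it.
      
      ```cpp
      #include <atomic>
      #include <cassert>
      
      enum class WriteMode { kNormal, kDelayed, kStopped };
      
      // Hypothetical controller: the mode is pushed in on epoch changes.
      class WriteControllerSketch {
       public:
        // Called when a flush or compaction finishes (a new epoch starts).
        void OnEpochChange(int l0_files, int slowdown_trigger, int stop_trigger) {
          WriteMode mode = WriteMode::kNormal;
          if (l0_files >= stop_trigger) {
            mode = WriteMode::kStopped;
          } else if (l0_files >= slowdown_trigger) {
            mode = WriteMode::kDelayed;
          }
          mode_.store(mode);
        }
      
        // Called on every write: a single atomic load, no per-CF loop.
        WriteMode mode() const { return mode_.load(); }
      
       private:
        std::atomic<WriteMode> mode_{WriteMode::kNormal};
      };
      ```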
      
      When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With the new push model, overhead on the Write code-path is minimal.
      
      This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
      
      Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
      
      Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
      
      Reviewed By: ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D22791
      a2bb7c3c
  12. 05 Sep, 2014 1 commit
  13. 07 Aug, 2014 1 commit
    • S
      Add DB property "rocksdb.estimate-table-readers-mem" · 1242bfca
      sdong committed
      Summary:
      Add a DB property "rocksdb.estimate-table-readers-mem" to return the estimated memory used by all loaded table readers, excluding memory allocated from the block cache.
      
      Refactor the property code to allow getting a property from a version, with the DB mutex not acquired.
      
      Test Plan: Add several checks of this new property in existing code for various cases.
      
      Reviewers: yhchiang, ljin
      
      Reviewed By: ljin
      
      Subscribers: xjin, igor, leveldb
      
      Differential Revision: https://reviews.facebook.net/D20733
      1242bfca
  14. 29 Jul, 2014 1 commit
    • S
      Add DB property estimated number of keys · f6784766
      sdong committed
      Summary: Add a DB property for the estimated number of live keys, computed as the number of entries in all mem tables and all files, minus all deletions in all files.
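      
      The arithmetic is simple enough to write down directly. `EstimateLiveKeys` is a hypothetical helper; clamping the result at zero is an assumption added here so the unsigned subtraction cannot wrap.
      
      ```cpp
      #include <cassert>
      #include <cstdint>
      
      // Estimated live keys = memtable entries + file entries - file deletions.
      // Hypothetical helper; clamps at zero to avoid unsigned wrap-around.
      inline uint64_t EstimateLiveKeys(uint64_t memtable_entries,
                                       uint64_t file_entries,
                                       uint64_t file_deletions) {
        const uint64_t total = memtable_entries + file_entries;
        return total > file_deletions ? total - file_deletions : 0;
      }
      ```
      
      It is only an estimate: overwritten keys are counted once per version, for example.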
      
      Test Plan: Add the case in unit tests
      
      Reviewers: hobbymanyp, ljin
      
      Reviewed By: ljin
      
      Subscribers: MarkCallaghan, yoshinorim, leveldb, igor, dhruba
      
      Differential Revision: https://reviews.facebook.net/D20631
      f6784766
  15. 22 Jul, 2014 1 commit
    • L
      make internal stats independent of statistics · f6f1533c
      Lei Jin committed
      Summary:
      also make it aware of column family
      output from db_bench
      
      ```
      ** Compaction Stats [default] **
      Level Files Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s)  Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt)  Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms)
      ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        L0    14      956   0.9      0.0     0.0      0.0       2.7      2.7    0.0   0.0      0.0    111.6        0         0         0         0         24        40    0.612      75.20     492387    0.15
        L1    21     2001   2.0      5.7     2.0      3.7       5.3      1.6    5.4   2.6     71.2     65.7       31        43        55        12         82         2   41.242      43.72      41183    1.06
        L2   217    18974   1.9     16.5     2.0     14.4      15.1      0.7   15.6   7.4     70.1     64.3       17       182       185         3        241        16   15.052       0.00          0    0.00
        L3  1641   188245   1.8      9.1     1.1      8.0       8.5      0.5   15.4   7.4     61.3     57.2        9        75        76         1        152         9   16.887       0.00          0    0.00
        L4  4447   449025   0.4     13.4     4.8      8.6       9.1      0.5    4.7   1.9     77.8     52.7       38        79       100        21        176        38    4.639       0.00          0    0.00
       Sum  6340   659201   0.0     44.7    10.0     34.7      40.6      6.0   32.0  15.2     67.7     61.6       95       379       416        37        676       105    6.439     118.91     533570    0.22
       Int     0        0   0.0      1.2     0.4      0.8       1.3      0.5    5.2   2.7     59.1     65.6        3         7         9         2         20        10    2.003       0.00          0    0.00
      Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown
      Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown
      
      ** DB Stats **
      Uptime(secs): 202.1 total, 13.5 interval
      Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB
      Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written
      Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB
      Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written
      ```
      
      Test Plan: ran it
      
      Reviewers: sdong, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D19917
      f6f1533c
  16. 17 Jul, 2014 1 commit
    • F
      store file_indexer info in sequential memory · c11d604a
      Feng Zhu committed
      Summary:
        Use an arena to allocate space for next_level_index_ and level_rb_.
        This increases data locality and makes Version::Get faster.
      
      Benchmark detail
      Base version: commit d2a727c1
      
      command used:
      ./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=2097152 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=fillseq,readrandom,readrandom,readrandom --use_existing_db=0 --num=52428800 --threads=1
      
      Result:
      cpu running percentage:
      Version::Get, improved from 7.98% to 7.42%
      FileIndexer::GetNextLevelIndex, improved from 1.18% to 0.68%.
      
      Test Plan:
        make all check
      
      Reviewers: ljin, haobo, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, igor
      
      Differential Revision: https://reviews.facebook.net/D19845
      c11d604a
  17. 12 Jul, 2014 1 commit
    • F
      use FileLevel in LevelFileNumIterator · 178fd6f9
      Feng Zhu committed
      Summary:
        Use FileLevel in LevelFileNumIterator, and therefore the new version of findFile.
        The old version of the findFile function is deleted.
        Add a function in version_set.cc to generate FileLevel from files_.
        Add GenerateFileLevelTest in version_set_test.cc
      
      Test Plan:
        make all check
      
      Reviewers: ljin, haobo, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: igor, dhruba
      
      Differential Revision: https://reviews.facebook.net/D19659
      178fd6f9
  18. 10 Jul, 2014 2 commits
    • F
      create compressed_levels_ in Version, allocate its space using arena. Make... · f697cad1
      Feng Zhu committed
      create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
      
      Summary:
          Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version; its space is allocated using an arena.
          This increases file metadata locality and speeds up "Get" and "FindFile".
      
          Benchmarked with in-memory tmpfs: about 4% improvement under "random read" and 2% improvement under "read while writing".
      
      benchmark command:
      ./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 --writes_per_second=81920
      
      Read Random:
      From 1.8363 ms/op, improve to 1.7587 ms/op.
      Read while writing:
      From 2.985 ms/op, improve to 2.924 ms/op.
      
      Test Plan:
          make all check
      
      Reviewers: ljin, haobo, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, igor
      
      Differential Revision: https://reviews.facebook.net/D19419
      f697cad1
    • Y
      Some fixes on size compensation logic for deletion entry in compaction · 70828557
      Yueh-Hsuan Chiang committed
      Summary:
      This patch includes two fixes:
      1. A newly created Version will now take the aggregated stats for average-value-size from the latest Version.
      2. The compensated size of a file is now computed only for newly created / loaded files. This addresses the issue where files are already sorted by their compensated file size but might sometimes be observed out of order due to a later update of the compensated file size.
      
      Test Plan:
      export ROCKSDB_TESTS=CompactionDele
      ./db_test
      
      Reviewers: ljin, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D19557
      70828557
  19. 03 Jul, 2014 1 commit
  20. 01 Jul, 2014 1 commit
    • I
      No need for files_by_size_ in universal compaction · a2e0d890
      Igor Canadi committed
      Summary: files_by_size_ is sorted by time in case of universal compaction. However, Version::files_ is also sorted by time. So no need for files_by_size_
      
      Test Plan:
      1) make check with the change
      2) make check with `assert(last_index == c->input_version_->files_[level].size() - 1);` in compaction picker
      
      Reviewers: dhruba, haobo, yhchiang, sdong, ljin
      
      Reviewed By: ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D19125
      a2e0d890
  21. 25 Jun, 2014 1 commit
    • Y
      Allow compaction to reclaim storage more effectively. · e813f5b6
      Yueh-Hsuan Chiang committed
      Summary:
      This diff allows compaction to reclaim storage more effectively.
      In the current design, compactions are mainly triggered based on
      file sizes.  However, since deletion entries do not have
      values, files which have many deletion entries are less likely
      to be compacted.  As a result, it may take a while for
      deletion entries to be compacted.
      
      This diff addresses the issue by compensating the size of deletion
      entries during the compaction process: the size of each deletion
      entry in the compaction process is augmented by 2x the average
      value size.  The diff applies to both leveled and universal
      compactions.
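      
      As a worked form of the rule above, a hypothetical `CompensatedFileSize` helper adds twice the average value size for every deletion entry in the file. How the average value size itself is obtained is left out of this sketch.
      
      ```cpp
      #include <cassert>
      #include <cstdint>
      
      // Hypothetical helper: each deletion entry counts as 2x the average
      // value size on top of the file's real size.
      inline uint64_t CompensatedFileSize(uint64_t file_size,
                                          uint64_t num_deletions,
                                          uint64_t average_value_size) {
        return file_size + num_deletions * 2 * average_value_size;
      }
      ```
      
      For example, a 1000-byte file holding 10 deletions with an average value size of 100 bytes is treated as 1000 + 10 * 2 * 100 = 3000 bytes when compaction candidates are ranked by size.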
      
      Test Plan:
      develop CompactionDeletionTrigger
      make db_test
      ./db_test
      
      Reviewers: haobo, igor, ljin, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D19029
      e813f5b6
  22. 20 Jun, 2014 1 commit
    • I
      Remove seek compaction · d4a84233
      Igor Canadi committed
      Summary:
      As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.
      
      This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.
      
      There is one test case that relied on didIO information. I'll try to find another way to implement it.
      
      Test Plan: make check
      
      Reviewers: sdong, haobo, yhchiang, ljin, dhruba
      
      Reviewed By: ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D19161
      d4a84233
  23. 14 Jun, 2014 1 commit
  24. 13 Jun, 2014 1 commit
  25. 03 Jun, 2014 1 commit
    • S
      In DB::NewIterator(), try to allocate the whole iterator tree in an arena · df9069d2
      sdong committed
      Summary:
      In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
      1. ArenaWrappedDBIter is created to serve as the entry point of an iterator tree, with an arena in it.
      2. Add an option to create iterators from an arena for the following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
      3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to the mem table list and version set, which add iterators to it.
      
      Limitations:
      (1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
      (2) Two level iterator itself is allocated in arena, but not iterators inside it.
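      
      To show what "allocating the tree from an arena" means, here is a minimal bump-allocator sketch. `ArenaSketch`, `NewInArena`, `LeafIter` and `MergeIter` are hypothetical names, and the node types are kept trivially destructible so the sketch can skip destructor calls.
      
      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <memory>
      #include <new>
      #include <utility>
      #include <vector>
      
      constexpr size_t kArenaBlockSize = 4096;
      
      // Hypothetical bump allocator: memory is freed all at once when the
      // arena is destroyed.
      class ArenaSketch {
       public:
        void* Allocate(size_t bytes) {
          const size_t align = alignof(std::max_align_t);
          bytes = (bytes + align - 1) / align * align;  // keep results aligned
          if (bytes > remaining_) {
            const size_t block = bytes > kArenaBlockSize ? bytes : kArenaBlockSize;
            blocks_.emplace_back(new char[block]);
            cursor_ = blocks_.back().get();
            remaining_ = block;
          }
          void* result = cursor_;
          cursor_ += bytes;
          remaining_ -= bytes;
          return result;
        }
        size_t num_blocks() const { return blocks_.size(); }
      
       private:
        std::vector<std::unique_ptr<char[]>> blocks_;
        char* cursor_ = nullptr;
        size_t remaining_ = 0;
      };
      
      // Constructs a T inside the arena with placement new.
      template <typename T, typename... Args>
      T* NewInArena(ArenaSketch* arena, Args&&... args) {
        return new (arena->Allocate(sizeof(T))) T{std::forward<Args>(args)...};
      }
      
      // Toy iterator nodes standing in for the real iterator classes.
      struct LeafIter { int value; };
      struct MergeIter { LeafIter* left; LeafIter* right; };
      ```
      
      Building two leaves and one merging node this way places all three objects next to each other in a single block, which is the locality benefit the patch is after.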
      
      Test Plan: make all check
      
      Reviewers: ljin, haobo
      
      Reviewed By: haobo
      
      Subscribers: leveldb, dhruba, yhchiang, igor
      
      Differential Revision: https://reviews.facebook.net/D18513
      df9069d2
  26. 31 May, 2014 1 commit
    • L
      forward iterator · 388d2054
      Lei Jin committed
      Summary:
      Forward iterator puts everything together in a flat structure instead of
      a hierarchy of nested iterators. This should simplify the code and
      provide better performance. It also enables more optimization since all
      information is accessible in one place.
      Initial evaluation shows about 6% improvement
      
      Test Plan: db_test and db_bench
      
      Reviewers: dhruba, igor, tnovak, sdong, haobo
      
      Reviewed By: haobo
      
      Subscribers: sdong, leveldb
      
      Differential Revision: https://reviews.facebook.net/D18795
      388d2054
  27. 22 May, 2014 1 commit
    • I
      FIFO compaction style · 6de6a066
      Igor Canadi committed
      Summary:
      Introducing new compaction style -- FIFO.
      
      FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.
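      
      The deletion rule can be sketched as follows, assuming a hypothetical `SstFile` record and a `PickFifoDeletions` helper that receives the files ordered oldest first.
      
      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <deque>
      #include <vector>
      
      // Hypothetical file record.
      struct SstFile {
        uint64_t number;
        uint64_t size;
      };
      
      // Drops the oldest files (front of the deque) until the total size is
      // at most `max_total_size`. Returns the numbers of the dropped files.
      inline std::vector<uint64_t> PickFifoDeletions(std::deque<SstFile>& files,
                                                     uint64_t max_total_size) {
        uint64_t total = 0;
        for (const SstFile& file : files) total += file.size;
      
        std::vector<uint64_t> dropped;
        while (total > max_total_size && !files.empty()) {
          total -= files.front().size;
          dropped.push_back(files.front().number);
          files.pop_front();
        }
        return dropped;
      }
      ```
      
      No data is rewritten here: files are only ever deleted whole, which is why the write amplification stays at 1.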
      
      FIFO compaction style is suited for storing high-frequency event logs.
      
      Test Plan: Added a unit test
      
      Reviewers: dhruba, haobo, sdong
      
      Reviewed By: dhruba
      
      Subscribers: alberts, leveldb
      
      Differential Revision: https://reviews.facebook.net/D18765
      6de6a066
  28. 27 Apr, 2014 1 commit
  29. 26 Apr, 2014 1 commit
  30. 22 Apr, 2014 1 commit
    • L
      hints for narrowing down FindFile range and avoiding checking unrelevant L0 files · 0f2d7681
      Lei Jin committed
      Summary:
      The file tree structure in Version is prebuilt and the range of each file is known.
      On the Get() code path, we do binary search in FindFile() by comparing
      target key with each file's largest key and also check the range for each L0 file.
      With some pre-calculated knowledge, each key comparison that has been done can serve
      as a hint to narrow down further searches:
      (1) If a key falls within a L0 file's range, we can safely skip the next
      file if its range does not overlap with the current one.
      (2) If a key falls within a file's range in level L0 - Ln-1, we should only
      need to binary search in the next level for files that overlap with the current one.
      
      (1) will be able to skip some files depending on the key distribution.
      (2) can greatly reduce the range of binary search, especially for bottom
      levels, given that one file most likely only overlaps with N files from
      the level below (where N is max_bytes_for_level_multiplier). So on level
      L, we will only look at ~N files instead of N^L files.
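      
      Hint (2) can be sketched with integer keys. `FileRange`, `FindFile` and `OverlapBounds` below are hypothetical simplifications: the overlap bounds of an upper-level file are computed once, and later lookups binary search only inside them.
      
      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <utility>
      #include <vector>
      
      // Hypothetical key range of one file; files in a level are sorted and
      // do not overlap each other.
      struct FileRange {
        int smallest;
        int largest;
      };
      
      // Returns the index of the first file in files[lo, hi) whose largest
      // key is >= key, or hi if there is none.
      inline size_t FindFile(const std::vector<FileRange>& files, int key,
                             size_t lo, size_t hi) {
        while (lo < hi) {
          const size_t mid = lo + (hi - lo) / 2;
          if (files[mid].largest < key) {
            lo = mid + 1;
          } else {
            hi = mid;
          }
        }
        return lo;
      }
      
      // Precomputes the half-open index range of next-level files that
      // overlap the given upper-level file.
      inline std::pair<size_t, size_t> OverlapBounds(
          const FileRange& upper, const std::vector<FileRange>& next_level) {
        const size_t first =
            FindFile(next_level, upper.smallest, 0, next_level.size());
        size_t last = first;
        while (last < next_level.size() &&
               next_level[last].smallest <= upper.largest) {
          ++last;
        }
        return {first, last};
      }
      ```
      
      A key that fell inside the upper file's range can then be searched with `FindFile(next_level, key, bounds.first, bounds.second)` instead of over the whole level.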
      
      Some initial results: measured with 500M key DB, when write is light (10k/s = 1.2M/s), this
      improves QPS ~7% on top of blocked bloom. When write is heavier (80k/s =
      9.6M/s), it gives us ~13% improvement.
      
      Test Plan: make all check
      
      Reviewers: haobo, igor, dhruba, sdong, yhchiang
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D17205
      0f2d7681