1. 22 Jul, 2014 2 commits
    • S
      Allow user to specify DB path of output file of manual compaction · f6b7e1ed
      sdong committed
      Summary: Add a parameter path_id to DB::CompactRange() to indicate where the output file should be placed.
      
      Test Plan: add a unit test
      
      Reviewers: yhchiang, ljin
      
      Reviewed By: ljin
      
      Subscribers: xjin, igor, dhruba, MarkCallaghan, leveldb
      
      Differential Revision: https://reviews.facebook.net/D20085
      f6b7e1ed
    • L
      make internal stats independent of statistics · f6f1533c
      Lei Jin committed
      Summary:
      Also make the internal stats aware of column families.
      Output from db_bench:
      
      ```
      ** Compaction Stats [default] **
      Level Files Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s)  Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt)  Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms)
      ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        L0    14      956   0.9      0.0     0.0      0.0       2.7      2.7    0.0   0.0      0.0    111.6        0         0         0         0         24        40    0.612      75.20     492387    0.15
        L1    21     2001   2.0      5.7     2.0      3.7       5.3      1.6    5.4   2.6     71.2     65.7       31        43        55        12         82         2   41.242      43.72      41183    1.06
        L2   217    18974   1.9     16.5     2.0     14.4      15.1      0.7   15.6   7.4     70.1     64.3       17       182       185         3        241        16   15.052       0.00          0    0.00
        L3  1641   188245   1.8      9.1     1.1      8.0       8.5      0.5   15.4   7.4     61.3     57.2        9        75        76         1        152         9   16.887       0.00          0    0.00
        L4  4447   449025   0.4     13.4     4.8      8.6       9.1      0.5    4.7   1.9     77.8     52.7       38        79       100        21        176        38    4.639       0.00          0    0.00
       Sum  6340   659201   0.0     44.7    10.0     34.7      40.6      6.0   32.0  15.2     67.7     61.6       95       379       416        37        676       105    6.439     118.91     533570    0.22
       Int     0        0   0.0      1.2     0.4      0.8       1.3      0.5    5.2   2.7     59.1     65.6        3         7         9         2         20        10    2.003       0.00          0    0.00
      Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown
      Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown
      
      ** DB Stats **
      Uptime(secs): 202.1 total, 13.5 interval
      Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB
      Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written
      Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB
      Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written
      ```
      
      Test Plan: ran it
      
      Reviewers: sdong, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D19917
      f6f1533c
  2. 15 Jul, 2014 1 commit
    • I
      Remove stats logger · 20c05630
      Igor Canadi committed
      Summary: Browsing through the code, looks like StatsLogger is not used at all!
      
      Test Plan: compiles
      
      Reviewers: ljin, sdong, yhchiang, dhruba
      
      Reviewed By: dhruba
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D19827
      20c05630
  3. 04 Jul, 2014 2 commits
    • Y
      Finer report I/O stats about Flush and Compaction. · 90a6aca4
      Yueh-Hsuan Chiang committed
      Summary:
      This diff allows the I/O stats about Flush and Compaction to be reported
      in a more accurate way.  Instead of measuring the size of a file, it
      measures I/O cost on a per-read / per-write basis.
      
      Test Plan: make all check
      
      Reviewers: sdong, igor, ljin
      
      Reviewed By: ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D19383
      90a6aca4
    • Y
      Add timeout_hint_us to WriteOptions and introduce Status::TimeOut. · d4d338de
      Yueh-Hsuan Chiang committed
      Summary:
      This diff adds timeout_hint_us to WriteOptions.  If it's non-zero, then
      1) writes associated with these options MAY be aborted when they have been
        waiting for longer than the specified time.  If an abort happens,
        the associated writes will return Status::TimeOut.
      2) the stall time of the associated write caused by flush or compaction
        will be limited by timeout_hint_us.
      
      The default value of timeout_hint_us is 0 (i.e., OFF.)
      
      The statistics of timeout writes will be recorded in WRITE_TIMEDOUT.
      
      Test Plan:
      export ROCKSDB_TESTS=WriteTimeoutAndDelayTest
      make db_test
      ./db_test
      
      Reviewers: igor, ljin, haobo, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D18837
      d4d338de
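A minimal sketch of how a stalled write could honour a timeout hint, using only the standard library. `StallState`, `WaitForWriteSlot` and the `Status` enum are hypothetical stand-ins for illustration, not the code from this diff; the only behaviour taken from the summary above is that a hint of 0 disables the timeout and that an expired wait reports a timeout status.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for rocksdb::Status; only the two outcomes needed here.
enum class Status { kOk, kTimedOut };

// Hypothetical stall state: writers wait until `stalled` becomes false.
struct StallState {
  std::mutex mu;
  std::condition_variable cv;
  bool stalled = false;
};

// Waits for the stall to clear. A hint of 0 means "no timeout" (wait as long
// as needed), mirroring the default described in the commit summary.
Status WaitForWriteSlot(StallState& s, uint64_t timeout_hint_us) {
  std::unique_lock<std::mutex> lock(s.mu);
  auto ready = [&s] { return !s.stalled; };
  if (timeout_hint_us == 0) {
    s.cv.wait(lock, ready);
    return Status::kOk;
  }
  // wait_for returns false if the predicate is still false at the deadline.
  const bool ok =
      s.cv.wait_for(lock, std::chrono::microseconds(timeout_hint_us), ready);
  return ok ? Status::kOk : Status::kTimedOut;
}
```

A caller would treat `Status::kTimedOut` as an aborted write and could bump a counter comparable to WRITE_TIMEDOUT.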
  4. 03 Jul, 2014 2 commits
    • S
      Support Multiple DB paths (without having an interface to expose to users) · 2459f7ec
      sdong committed
      Summary:
      In this patch, we allow RocksDB to support multiple DB paths internally.
      No user interface is supported yet so this patch is silent to users.
      
      Test Plan: make all check
      
      Reviewers: igor, haobo, ljin, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D18921
      2459f7ec
    • I
      Centralize compression decision to compaction picker · f146cab2
      Igor Canadi committed
      Summary:
      Before this diff, we're deciding enable_compression in CompactionPicker and then we're deciding final compression type in DBImpl. This is kind of confusing.
      
      After the diff, the final compression type will be decided in CompactionPicker.
      
      The reason for this is that I want CompactFiles() to specify output compression type, so that people can mix and match compression styles in their compaction algorithms. This diff makes it much easier to do that.
      
      Test Plan: make check
      
      Reviewers: dhruba, haobo, sdong, yhchiang, ljin
      
      Reviewed By: ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D19137
      f146cab2
  5. 20 Jun, 2014 1 commit
  6. 07 Jun, 2014 1 commit
  7. 03 Jun, 2014 2 commits
    • S
      In DB::NewIterator(), try to allocate the whole iterator tree in an arena · df9069d2
      sdong committed
      Summary:
      In this patch, try to allocate the whole iterator tree, starting from DBIter, from an arena:
      1. ArenaWrappedDBIter is created to serve as the entry point of an iterator tree, with an arena in it.
      2. Add an option to create the following iterators from an arena: DBIter, MergingIterator, MemtableIterator, all mem table iterators, all table reader iterators and the two level iterator.
      3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to the mem table list and the version set, which add iterators to it.
      
      Limitations:
      (1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions, still allocate from malloc.
      (2) The two level iterator itself is allocated in the arena, but not the iterators inside it.
      
      Test Plan: make all check
      
      Reviewers: ljin, haobo
      
      Reviewed By: haobo
      
      Subscribers: leveldb, dhruba, yhchiang, igor
      
      Differential Revision: https://reviews.facebook.net/D18513
      df9069d2
    • I
      Only signal cond variable if need to · 91ddd587
      Igor Canadi committed
      Summary:
      At the end of BackgroundCallCompaction(), we call SignalAll(), even though we don't need to. If compaction hasn't done anything and there's another compaction running, there is no need to signal on the condition variable. Doing so creates a tight feedback loop which results in log files like:
      
         wait for memtable flush
         compaction nothing to do
         wait for memtable flush
         compaction nothing to do
      
      This change eliminates that
      
      Test Plan:
      make check
      Also:
      
          icanadi@dev1440 ~ $ grep "nothing to do" /fast-rocksdb-tmp/rocksdb_test/column_family_test/LOG | wc -l
          7435
          icanadi@dev1440 ~ $ grep "nothing to do" /fast-rocksdb-tmp/rocksdb_test/column_family_test/LOG | wc -l
          372
      
      First version is before the change, second version is after the change.
      
      Reviewers: dhruba, ljin, haobo, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D18855
      91ddd587
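The rule described in this commit (skip the wake-up when the compaction did nothing and another compaction is still running) can be sketched as follows. `BgState` and `FinishBackgroundCompaction` are hypothetical names, and `signal_count` exists only so the behaviour can be observed; this is not the actual BackgroundCallCompaction() code.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical, simplified background state; not the real DBImpl fields.
struct BgState {
  std::mutex mu;
  std::condition_variable cv;
  int bg_compaction_scheduled = 0;  // compactions currently scheduled/running
  int signal_count = 0;             // instrumentation for this sketch only
};

// Called when a background compaction finishes. Waiters are woken only when
// something changed that they might care about: work was done, or this was
// the last running compaction.
void FinishBackgroundCompaction(BgState& s, bool made_progress) {
  std::lock_guard<std::mutex> lock(s.mu);
  assert(s.bg_compaction_scheduled > 0);
  --s.bg_compaction_scheduled;
  if (made_progress || s.bg_compaction_scheduled == 0) {
    ++s.signal_count;
    s.cv.notify_all();
  }
}
```

With two compactions scheduled and neither making progress, only the second call signals, which removes the tight "wait / nothing to do" loop shown in the log excerpt.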
  8. 31 May, 2014 1 commit
    • L
      forward iterator · 388d2054
      Lei Jin committed
      Summary:
      Forward iterator puts everything together in a flat structure instead of
      a hierarchy of nested iterators. This should simplify the code and
      provide better performance. It also enables more optimization since all
      information is accessible in one place.
      Initial evaluation shows about 6% improvement.
      
      Test Plan: db_test and db_bench
      
      Reviewers: dhruba, igor, tnovak, sdong, haobo
      
      Reviewed By: haobo
      
      Subscribers: sdong, leveldb
      
      Differential Revision: https://reviews.facebook.net/D18795
      388d2054
  9. 01 May, 2014 1 commit
    • I
      Flush stale column families · df700476
      Igor Canadi committed
      Summary:
      Added a new option `max_total_wal_size`. Once the total WAL size goes over that, we make an attempt to flush all column families that still have data in the earliest WAL file.
      
      By default, I calculate `max_total_wal_size` dynamically, which should be good enough for non-advanced customers.
      
      Test Plan: Added a test
      
      Reviewers: dhruba, haobo, sdong, ljin, yhchiang
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D18345
      df700476
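A rough sketch of the selection step this commit describes: once the total WAL size exceeds the limit, pick every column family that still has unflushed data in the earliest WAL file. `ColumnFamilyInfo` and `PickColumnFamiliesToFlush` are hypothetical names, and representing "still has data in the earliest WAL" as a per-family log number is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical view of a column family: the oldest WAL file number that still
// holds unflushed data for it.
struct ColumnFamilyInfo {
  std::string name;
  uint64_t oldest_log_with_data;
};

// Returns the column families to flush so that the earliest live WAL file can
// be released. Returns an empty list while the total WAL size is within the
// limit.
std::vector<std::string> PickColumnFamiliesToFlush(
    const std::vector<ColumnFamilyInfo>& cfs, uint64_t earliest_wal_number,
    uint64_t total_wal_size, uint64_t max_total_wal_size) {
  std::vector<std::string> to_flush;
  if (total_wal_size <= max_total_wal_size) return to_flush;
  for (const auto& cf : cfs) {
    if (cf.oldest_log_with_data <= earliest_wal_number) {
      to_flush.push_back(cf.name);
    }
  }
  return to_flush;
}
```

Column families that were flushed recently reference only newer WAL files and are left alone, so a single rarely written family cannot pin old logs indefinitely.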
  10. 30 Apr, 2014 2 commits
    • Y
      Add a new mem-table representation based on cuckoo hash. · 9d9d2965
      Yueh-Hsuan Chiang committed
      Summary:
      = Major Changes =
      * Add a new mem-table representation, HashCuckooRep, which is based on cuckoo hashing.
        Cuckoo hashing uses multiple hash functions, so each key has multiple
        possible locations in the mem-table.
      
        - Put: When inserting a key, it first checks whether one of the key's
          possible locations is vacant and, if so, stores the key there.  If none
          of them is vacant, it kicks out a victim key and stores the new key at
          that location.  The kicked-out victim key is then stored at a vacant
          one of its own possible locations, or kicks out another victim.  In
          this diff, the kick-out path (known as the cuckoo path) is found using
          BFS, which guarantees that it is the shortest.
      
        - Get: Simply tries all possible locations of a key, which guarantees
          worst-case constant time complexity.
      
        - Time complexity: O(1) for Get, and average O(1) for Put if the
          fullness of the mem-table is below 80%.
      
        - Two hash functions are used by default; the number of hash functions
          used by the cuckoo hash may increase dynamically if it fails to find a
          short-enough kick-out path.
      
        - Currently, HashCuckooRep does not support iteration or snapshots,
          as its main purpose for now is to optimize point access.
      
      = Minor Changes =
      * Add IsSnapshotSupported() to DB to indicate whether the current DB
        supports snapshots.  If it returns false, then DB::GetSnapshot() will
        always return nullptr.
      
      Test Plan:
      Run existing tests.  Will develop a test specifically for cuckoo hash in
      the next diff.
      
      Reviewers: sdong, haobo
      
      Reviewed By: sdong
      
      CC: leveldb, dhruba, igor
      
      Differential Revision: https://reviews.facebook.net/D16155
      9d9d2965
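To make the Put/Get description concrete, here is a small, self-contained cuckoo set with a BFS search for the kick-out path. It is only a sketch under simplifying assumptions: `CuckooSet` is a hypothetical class, it stores bare 64-bit keys, it always uses exactly two illustrative hash functions (no dynamic increase), and it is not thread-safe. It does not reflect HashCuckooRep's actual layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

class CuckooSet {
 public:
  explicit CuckooSet(size_t capacity)
      : keys_(capacity, 0), used_(capacity, false) {
    assert(capacity > 0);
  }

  // Get: probe every candidate slot of the key (constant number of probes).
  bool Contains(uint64_t key) const {
    for (int h = 0; h < kNumHashes; ++h) {
      const size_t s = Slot(key, h);
      if (used_[s] && keys_[s] == key) return true;
    }
    return false;
  }

  // Put: BFS from the key's candidate slots to the nearest vacant slot, then
  // shift keys along that path. Returns false if no such path exists; the
  // table is left unchanged in that case.
  bool Insert(uint64_t key) {
    if (Contains(key)) return true;
    const size_t n = keys_.size();
    const size_t kRoot = n + 1;  // marks slots the new key itself may occupy
    std::vector<size_t> parent(n, n);  // n means "not visited yet"
    std::queue<size_t> frontier;
    for (int h = 0; h < kNumHashes; ++h) {
      Visit(Slot(key, h), kRoot, &parent, &frontier);
    }
    while (!frontier.empty()) {
      size_t cur = frontier.front();
      frontier.pop();
      if (!used_[cur]) {
        // Walk back towards the root, moving each victim one step forward.
        while (parent[cur] != kRoot) {
          keys_[cur] = keys_[parent[cur]];
          used_[cur] = true;
          cur = parent[cur];
        }
        keys_[cur] = key;
        used_[cur] = true;
        return true;
      }
      // The occupant of `cur` could be kicked to any of its own slots.
      for (int h = 0; h < kNumHashes; ++h) {
        Visit(Slot(keys_[cur], h), cur, &parent, &frontier);
      }
    }
    return false;
  }

 private:
  static constexpr int kNumHashes = 2;

  static void Visit(size_t slot, size_t from, std::vector<size_t>* parent,
                    std::queue<size_t>* frontier) {
    if ((*parent)[slot] != parent->size()) return;  // already visited
    (*parent)[slot] = from;
    frontier->push(slot);
  }

  // Two cheap multiplicative hashes, chosen only for illustration.
  size_t Slot(uint64_t key, int h) const {
    const uint64_t mixed =
        (h == 0) ? key * 0x9E3779B97F4A7C15ULL
                 : (key ^ (key >> 7)) * 0xC2B2AE3D27D4EB4FULL + 1;
    return static_cast<size_t>((mixed >> 32) % keys_.size());
  }

  std::vector<uint64_t> keys_;
  std::vector<bool> used_;
};
```

Because BFS explores slots in order of distance from the new key's candidate locations, the first vacant slot it reaches gives a shortest kick-out path, which matches the property claimed in the summary.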
    • I
      Cache result of ReadFirstRecord() · dd9eb7a7
      Igor Canadi committed
      Summary:
      ReadFirstRecord() reads the actual log file from disk on every call. This diff introduces a cache layer on top of ReadFirstRecord(), which should significantly speed up repeated calls to GetUpdatesSince().
      
      I also cleaned up some stuff, but the whole TransactionLogIterator could use some refactoring, especially if we see increased usage.
      
      Test Plan: make check
      
      Reviewers: haobo, sdong, dhruba
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D18387
      dd9eb7a7
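The caching layer can be pictured as a memoizing wrapper around the expensive read. This is a sketch only: `FirstRecordCache` is a hypothetical class, the `reader` callback stands in for the on-disk read, and the sketch assumes the first record of a log file never changes once it has been read.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>
#include <utility>

using SequenceNumber = uint64_t;

// Hypothetical cache: log file number -> sequence number of its first record.
class FirstRecordCache {
 public:
  // `reader` stands in for the expensive read of the log file from disk.
  explicit FirstRecordCache(std::function<SequenceNumber(uint64_t)> reader)
      : reader_(std::move(reader)) {}

  SequenceNumber ReadFirstRecord(uint64_t log_number) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      auto it = cache_.find(log_number);
      if (it != cache_.end()) return it->second;  // served from memory
    }
    // Slow path: read from "disk" without holding the cache lock.
    const SequenceNumber seq = reader_(log_number);
    std::lock_guard<std::mutex> lock(mu_);
    cache_[log_number] = seq;
    return seq;
  }

 private:
  std::function<SequenceNumber(uint64_t)> reader_;
  std::mutex mu_;
  std::unordered_map<uint64_t, SequenceNumber> cache_;
};
```

Repeated lookups for the same log file, as a loop over GetUpdatesSince() would cause, then touch the disk only once per file.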
  11. 26 Apr, 2014 1 commit
  12. 16 Apr, 2014 3 commits
    • I
      Fix compile issues when doing make release · 7d838856
      Igor Canadi committed
      7d838856
    • I
      RocksDBLite · 588bca20
      Igor Canadi committed
      Summary:
      Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile.
      
      Binary size when compiling for iOS (`TARGET_OS=IOS m static_lib`) is down from 15MB to 9MB (without stripping).
      
      Test Plan: compiles :)
      
      Reviewers: dhruba, haobo, ljin, sdong, yhchiang
      
      Reviewed By: yhchiang
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D17835
      588bca20
    • I
      Don't roll empty logs · e6acb874
      Igor Canadi committed
      Summary:
      With multiple column families, especially when manual Flush is executed, we might roll the log file, although the current log file is empty (no data has been written to the log).
      
      After the diff, we won't create a new log file if the current one is empty.
      
      Next, I will write an algorithm that will flush column families that reference old log files (i.e., that weren't flushed in a while)
      
      Test Plan: Added a unit test. Confirmed that the unit test fails in master
      
      Reviewers: dhruba, haobo, ljin, sdong
      
      Reviewed By: ljin
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D17631
      e6acb874
  13. 15 Apr, 2014 1 commit
  14. 08 Apr, 2014 1 commit
  15. 05 Apr, 2014 1 commit
    • S
      Create log::Writer out of DB Mutex · ea0198fe
      sdong committed
      Summary: Our measurements show that the constructor of a new log::Writer can sometimes take hundreds of milliseconds. It is unclear why, so simply move it out of the DB mutex.
      
      Test Plan: make all check
      
      Reviewers: haobo, ljin, igor
      
      Reviewed By: haobo
      
      CC: nkg-, yhchiang, leveldb
      
      Differential Revision: https://reviews.facebook.net/D17487
      ea0198fe
  16. 04 Apr, 2014 1 commit
  17. 03 Apr, 2014 1 commit
    • H
      [RocksDB] Fix a race condition in GetSortedWalFiles · 48bc0c6a
      Haobo Xu committed
      Summary: This patch fixes a race condition where a log file is moved to the archive directory in the middle of GetSortedWalFiles. Without the fix, the log file would be missing from the result, which leads to a gap in the transaction log iterator. A test utility, SyncPoint, is added to help reproduce the race condition.
      
      Test Plan: TransactionLogIteratorRace; make check
      
      Reviewers: dhruba, ljin
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D17121
      48bc0c6a
  18. 28 Mar, 2014 1 commit
  19. 25 Mar, 2014 1 commit
    • D
      [rocksdb] new CompactionFilterV2 API · b47812fb
      Danny Guo committed
      Summary:
      This diff adds a new CompactionFilterV2 API that rolls up the
      decisions on kv pairs during compactions. These kv pairs must share the
      same key prefix. They are buffered inside the db.
      
          typedef std::vector<Slice> SliceVector;
          virtual std::vector<bool> Filter(int level,
                                       const SliceVector& keys,
                                       const SliceVector& existing_values,
                                       std::vector<std::string>* new_values,
                                       std::vector<bool>* values_changed
                                       ) const = 0;
      
      Application can override the Filter() function to operate
      on the buffered kv pairs. More details in the inline documentation.
      
      Test Plan:
      make check. Added unit tests to make sure Keep, Delete,
      and Change all work.
      
      Reviewers: haobo
      
      CCs: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15087
      b47812fb
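As an illustration of overriding the batched Filter() quoted in this commit, the sketch below defines the interface locally so it compiles without RocksDB. Several things are assumptions rather than facts from the diff: `Slice` is replaced by `std::string`, a returned `true` is taken to mean "remove this pair", `new_values` receives one entry per changed value in order, and `DropEmptyTruncateLong` is a hypothetical filter. The inline documentation mentioned in the summary defines the real contract.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for rocksdb::Slice so the sketch compiles without RocksDB.
using Slice = std::string;
using SliceVector = std::vector<Slice>;

// Local copy of the interface shape quoted in the commit message.
class CompactionFilterV2 {
 public:
  virtual ~CompactionFilterV2() = default;
  virtual std::vector<bool> Filter(int level, const SliceVector& keys,
                                   const SliceVector& existing_values,
                                   std::vector<std::string>* new_values,
                                   std::vector<bool>* values_changed) const = 0;
};

// Hypothetical filter: deletes pairs with an empty value, truncates values
// longer than four bytes, and keeps everything else unchanged.
class DropEmptyTruncateLong : public CompactionFilterV2 {
 public:
  std::vector<bool> Filter(int /*level*/, const SliceVector& keys,
                           const SliceVector& existing_values,
                           std::vector<std::string>* new_values,
                           std::vector<bool>* values_changed) const override {
    assert(keys.size() == existing_values.size());
    const size_t max_len = 4;
    std::vector<bool> to_delete(keys.size(), false);  // assumed: true = remove
    values_changed->assign(keys.size(), false);
    for (size_t i = 0; i < keys.size(); ++i) {
      const Slice& value = existing_values[i];
      if (value.empty()) {
        to_delete[i] = true;  // Delete
      } else if (value.size() > max_len) {
        (*values_changed)[i] = true;  // Change
        new_values->push_back(value.substr(0, max_len));
      }  // otherwise Keep
    }
    return to_delete;
  }
};
```

The three branches correspond to the Keep, Delete and Change cases that the test plan says are covered by unit tests.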
  20. 21 Mar, 2014 1 commit
  21. 19 Mar, 2014 1 commit
  22. 18 Mar, 2014 1 commit
    • I
      Fix race condition in manifest roll · ae25742a
      Igor Canadi committed
      Summary:
      When the manifest is getting rolled the following happens:
      1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current)
      2) mutex is unlocked
      3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp
      4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT
      5) mutex is locked
      
      If FindObsoleteFiles happens between (3) and (4) it will:
      1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_)
      2) Delete old manifest (because the manifest_file_number_ already points to a new one)
      
      I introduce the concept of prev_manifest_file_number_ that will avoid the race condition.
      
      However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more are there. Nobody is using it in production because we don't trust the implementation. Should we even support it?
      
      Test Plan: make check
      
      Reviewers: ljin, dhruba, haobo, sdong
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16929
      ae25742a
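One way to picture the role of prev_manifest_file_number_: while a roll is in flight, the obsolete-file check keeps both the new manifest and the one CURRENT still names. The sketch below is an assumption about how such a check could look, with hypothetical helper names; it covers only the old-manifest deletion (step 2 of the race), not the temporary .dbtmp file.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of the version-set state involved in the race.
struct ManifestState {
  uint64_t manifest_file_number = 0;       // may already name the new manifest
  uint64_t prev_manifest_file_number = 0;  // manifest that CURRENT still names
};

// Step (1) of the roll: remember the old number before switching.
void BeginManifestRoll(ManifestState& s, uint64_t new_number) {
  s.prev_manifest_file_number = s.manifest_file_number;
  s.manifest_file_number = new_number;
}

// After CURRENT has been renamed and the mutex re-acquired.
void FinishManifestRoll(ManifestState& s) {
  s.prev_manifest_file_number = s.manifest_file_number;
}

// A manifest may be deleted only if it is older than the newest one and is
// not the manifest that CURRENT may still point to.
bool IsManifestObsolete(const ManifestState& s, uint64_t number) {
  return number < s.manifest_file_number &&
         number != s.prev_manifest_file_number;
}
```

If FindObsoleteFiles runs between BeginManifestRoll and FinishManifestRoll, the old manifest is still protected; it only becomes deletable once the roll has completed.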
  23. 12 Mar, 2014 2 commits
    • S
      Fix data race against logging data structure because of LogBuffer · bd45633b
      sdong committed
      Summary:
      @igor pointed out that there is a potential data race because of the way we use the newly introduced LogBuffer. After "bg_compaction_scheduled_--" or "bg_flush_scheduled_--", both counters can become 0. As soon as the lock is released after that, DBImpl's destructor can go ahead and destroy all the state inside the DB, including the info_log object held in a shared pointer by the options object it keeps. At that point it is no longer safe to keep using the info logger to write the delayed logs.
      
      With the patch, the lock is released temporarily so that the log buffer can be flushed before "bg_compaction_scheduled_--" or "bg_flush_scheduled_--". In order to make sure we don't miss any pending flush or compaction, a new flag bg_schedule_needed_ is added, which is set to true if there is a pending flush or compaction that was not scheduled because of the max thread limit. If the flag is set to true, the scheduling function will be called before the compaction or flush thread finishes.
      
      Thanks @igor for this finding!
      
      Test Plan: make all check
      
      Reviewers: haobo, igor
      
      Reviewed By: haobo
      
      CC: dhruba, ljin, yhchiang, igor, leveldb
      
      Differential Revision: https://reviews.facebook.net/D16767
      bd45633b
    • I
      [CF] db_stress for column families · 457c78eb
      Igor Canadi committed
      Summary:
      I had this diff for a while to test the column families implementation. Last night, I ran it successfully for 10 hours with the command:
      
         time ./db_stress --threads=30 --ops_per_thread=200000000 --max_key=5000 --column_families=20 --clear_column_family_one_in=3000000 --verify_before_write=1  --reopen=50 --max_background_compactions=10 --max_background_flushes=10 --db=/tmp/db_stress
      
      It is ready to be committed :)
      
      Test Plan: Ran it for 10 hours
      
      Reviewers: dhruba, haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D16797
      457c78eb
  24. 11 Mar, 2014 2 commits
  25. 08 Mar, 2014 2 commits
  26. 06 Mar, 2014 1 commit
    • S
      Buffer info logs when picking compactions and write them out after releasing the mutex · ecb1ffa2
      sdong committed
      Summary: Currently, while the background thread is picking compactions, it writes out multiple info log lines, especially for universal compaction. This introduces a chance of waiting on log writes while holding the mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.
      
      Test Plan:
      make all check
      check the log lines while running some tests that trigger compactions.
      
      Reviewers: haobo, igor, dhruba
      
      Reviewed By: dhruba
      
      CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-
      
      Differential Revision: https://reviews.facebook.net/D16515
      ecb1ffa2
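The buffering idea can be sketched with a tiny in-memory buffer: messages are collected while the mutex is held and written out only after it is released. `LogBuffer` here is a hypothetical minimal class that shares only its name with the one introduced by the diff, and a `std::vector<std::string>` stands in for the info log.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal log buffer; the sink stands in for the real info log.
class LogBuffer {
 public:
  explicit LogBuffer(std::vector<std::string>* sink) : sink_(sink) {}

  // Cheap: only appends to memory, so it is safe to call under a mutex.
  void Add(std::string msg) { pending_.push_back(std::move(msg)); }

  // Potentially slow (real logging does I/O); call without holding the mutex.
  void FlushToSink() {
    for (auto& msg : pending_) sink_->push_back(std::move(msg));
    pending_.clear();
  }

  size_t pending() const { return pending_.size(); }

 private:
  std::vector<std::string>* sink_;
  std::vector<std::string> pending_;
};

// Sketch of the calling pattern: log into the buffer while holding the DB
// mutex, then flush once the mutex has been released.
void PickCompactionAndLog(std::mutex& db_mutex,
                          std::vector<std::string>* info_log) {
  LogBuffer buffer(info_log);
  {
    std::lock_guard<std::mutex> lock(db_mutex);
    buffer.Add("picking compaction candidates");
    buffer.Add("compaction picked");
  }  // mutex released here
  buffer.FlushToSink();
}
```

The messages still reach the log in their original order; only the moment of the actual write moves outside the critical section.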
  27. 01 Mar, 2014 1 commit
    • Y
      Add ReadOptions to TransactionLogIterator. · a77527f2
      Yueh-Hsuan Chiang committed
      Summary:
      Add an optional input parameter ReadOptions to DB::GetUpdatesSince(),
      which allows the verification of checksums to be disabled by setting
      ReadOptions::verify_checksums to false.
      
      Test Plan: Tests are done off-line and will not be included in the regular unit test.
      
      Reviewers: igor
      
      Reviewed By: igor
      
      CC: leveldb, xjin, dhruba
      
      Differential Revision: https://reviews.facebook.net/D16305
      a77527f2
  28. 28 Feb, 2014 1 commit
  29. 27 Feb, 2014 1 commit
  30. 26 Feb, 2014 1 commit