1. 23 Aug 2013 (1 commit)
    • Add three new MemTableRep's · 74781a0c
      Committed by Jim Paton
      Summary:
      This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep.
      
      UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that.
      
      VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector.
      
      PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets.
      
      I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed).
      
      Test Plan:
      make -j32 check
      ./db_stress --memtablerep=vector
      ./db_stress --memtablerep=unsorted
      ./db_stress --memtablerep=prefixhash --prefix_size=10
      
      Reviewers: dhruba, haobo, emayanke
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D12117
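
      A standalone sketch of the VectorRep idea described above (a hypothetical class, not the real MemTableRep interface): inserts are cheap unsorted appends, an iterator request sorts a copy, and MarkImmutable is the hook where in-place sorting could later happen.

      #include <algorithm>
      #include <memory>
      #include <string>
      #include <vector>

      class VectorRepSketch {
       public:
        void Insert(std::string key) { keys_.push_back(std::move(key)); }

        // Copy and sort on demand; the live vector stays unsorted so that
        // inserts remain cheap.
        std::shared_ptr<std::vector<std::string>> NewIteratorSnapshot() const {
          auto snapshot = std::make_shared<std::vector<std::string>>(keys_);
          std::sort(snapshot->begin(), snapshot->end());
          return snapshot;
        }

        // Analogue of MemTableRep::MarkImmutable(): once writes stop, a real
        // implementation could sort keys_ in place and elide the copy above.
        void MarkImmutable() {}

       private:
        std::vector<std::string> keys_;
      };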
  2. 21 Aug 2013 (1 commit)
  3. 20 Aug 2013 (1 commit)
  4. 15 Aug 2013 (1 commit)
    • Implement log blobs · 0307c5fe
      Committed by Jim Paton
      Summary:
      This patch adds the ability for the user to add sequences of arbitrary data (blobs) to write batches. These blobs are saved to the log along with everything else in the write batch. You can add multiple blobs per WriteBatch, and the ordering of blobs, puts, merges, and deletes is preserved.
      
      Blobs are not saved to SST files. RocksDB ignores blobs in every way except for writing them to the log.
      
      Before committing this patch, I need to add some test code. But I'm submitting it now so people can comment on the API.
      
      Test Plan: make -j32 check
      
      Reviewers: dhruba, haobo, vamsi
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D12195
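
      A minimal usage sketch of the blob API this describes, assuming the WriteBatch::PutLogData() call found in later RocksDB headers (at the time of this commit the headers still lived under include/leveldb/):

      #include "rocksdb/db.h"
      #include "rocksdb/write_batch.h"

      void WriteWithBlob(rocksdb::DB* db) {
        rocksdb::WriteBatch batch;
        batch.Put("key1", "value1");
        // The blob is written to the log, interleaved in order with the other
        // operations, but is ignored everywhere else (never reaches SST files).
        batch.PutLogData("arbitrary replication metadata");
        batch.Delete("key2");
        db->Write(rocksdb::WriteOptions(), &batch);
      }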
  5. 14 Aug 2013 (2 commits)
    • Prefix filters for scans (v4) · f5f18422
      Committed by Tyler Harter
      Summary: Similar to v2 (db and table code understands prefixes), but use ReadOptions as in v3.  Also, make the CreateFilter code faster and cleaner.
      
      Test Plan: make db_test; export LEVELDB_TESTS=PrefixScan; ./db_test
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: haobo, emayanke
      
      Differential Revision: https://reviews.facebook.net/D12027
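
      A sketch of the kind of scan this speeds up. It is written against the modern prefix_extractor idiom; the v4 patch itself threads the prefix through ReadOptions, so treat the exact API shape as illustrative rather than the 2013 one.

      #include <memory>
      #include "rocksdb/db.h"
      #include "rocksdb/iterator.h"

      void PrefixScan(rocksdb::DB* db, const rocksdb::Slice& prefix) {
        // With a prefix extractor configured on the table, the CreateFilter-built
        // prefix filters let readers skip SST files that cannot contain any key
        // carrying this prefix.
        std::unique_ptr<rocksdb::Iterator> it(
            db->NewIterator(rocksdb::ReadOptions()));
        for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix);
             it->Next()) {
          // consume it->key() / it->value()
        }
      }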
    • Separate compaction filter for each compaction · 3b81df34
      Committed by sumeet
      Summary:
      If we have the same compaction filter for each compaction, the
      application cannot know about the different compaction processes.
      Later on, we can put more details in the compaction filter for the
      application to consume and use according to its needs. For example, in
      universal compaction, we have a compaction process involving all the
      files while others don't involve all the files. Applications may want to
      collect some stats only during a full compaction.
      
      Test Plan: run existing unit tests
      
      Reviewers: haobo, dhruba
      
      Reviewed By: dhruba
      
      CC: xinyaohu, leveldb
      
      Differential Revision: https://reviews.facebook.net/D12057
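
      A sketch of the pattern this enables, modeled on the CompactionFilterFactory interface in today's rocksdb/compaction_filter.h (the exact 2013 signatures may differ): each compaction asks the factory for its own filter instance, which can therefore carry per-run state such as stats.

      #include <cstdint>
      #include <memory>
      #include <string>
      #include "rocksdb/compaction_filter.h"

      // One instance per compaction, so the counter below is a per-run stat.
      class CountingFilter : public rocksdb::CompactionFilter {
       public:
        bool Filter(int /*level*/, const rocksdb::Slice& /*key*/,
                    const rocksdb::Slice& /*value*/, std::string* /*new_value*/,
                    bool* /*value_changed*/) const override {
          ++keys_seen_;
          return false;  // keep every entry; we only observe
        }
        const char* Name() const override { return "CountingFilter"; }

       private:
        mutable uint64_t keys_seen_ = 0;
      };

      class CountingFilterFactory : public rocksdb::CompactionFilterFactory {
       public:
        std::unique_ptr<rocksdb::CompactionFilter> CreateCompactionFilter(
            const rocksdb::CompactionFilter::Context& /*context*/) override {
          // In today's API, context.is_full_compaction reports whether the run
          // involves all files: the stat-collection case mentioned above.
          return std::unique_ptr<rocksdb::CompactionFilter>(new CountingFilter);
        }
        const char* Name() const override { return "CountingFilterFactory"; }
      };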
  6. 10 Aug 2013 (2 commits)
    • Universal Compaction should keep DeleteMarkers unless it is the earliest file. · 93d77a27
      Committed by Dhruba Borthakur
      Summary:
      The pre-existing code was purging a DeleteMarker if that key did not
      exist in deeper levels.  But in the Universal Compaction Style, all
      files are in Level0. For compaction runs that did not include the
      earliest file, we were erroneously purging the DeleteMarkers.
      
      The fix is to purge DeleteMarkers only if the compaction includes
      the earliest file.
      
      Test Plan: DBTest.Randomized triggers this code path.
      
      Differential Revision: https://reviews.facebook.net/D12081
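
      The fixed rule, restated as an illustrative predicate (hypothetical names, not the actual compaction code):

      #include <cstdint>
      #include <vector>

      struct FileMeta {
        uint64_t smallest_seqno;  // oldest sequence number in the file
      };

      // Under universal compaction all files sit in L0, so a DeleteMarker may be
      // purged only when the run includes the earliest file; otherwise the key
      // could still exist in an older file outside this compaction.
      bool MayPurgeDeleteMarkers(const std::vector<FileMeta>& inputs,
                                 uint64_t earliest_seqno_in_db) {
        for (const auto& f : inputs) {
          if (f.smallest_seqno <= earliest_seqno_in_db) {
            return true;  // the oldest file is part of this compaction
          }
        }
        return false;
      }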
    • Fix unit tests for universal compaction (step 2) · 8ae905ed
      Committed by Xing Jin
      Summary:
      Continue fixing existing unit tests for universal compaction. I have
      tried to apply universal compaction to all unit tests that haven't
      called ChangeOptions(). I left out a few which are either apparently not
      applicable to universal compaction (because they check files/keys/values
      at level 1 or above), or apparently not related to compaction
      (e.g., open a file, open a db).
      
      I also added a new unit test for universal compaction.
      
      Good news is I didn't see any bugs during this round.
      
      Test Plan: Ran "make all check" yesterday. Have rebased and am rerunning.
      
      Reviewers: haobo, dhruba
      
      Differential Revision: https://reviews.facebook.net/D12135
  7. 08 Aug 2013 (1 commit)
    • Fix unit tests/bugs for universal compaction (first step) · 17b8f786
      Committed by Xing Jin
      Summary:
      This is the first step to fixing unit tests and bugs for universal
      compaction. I added the universal compaction option to ChangeOptions(), and
      fixed all unit tests calling ChangeOptions(). Some of these tests
      obviously assume more than 1 level and check file numbers/values in level
      1 or above, so I set kSkipUniversalCompaction for these tests.
      
      The major bug I found is that manual compaction with universal compaction
      never stops. I have put in a fix for it.
      
      I have also set universal compaction as the default compaction and found
      at least 20+ unit tests failing. I haven't looked into the details. The
      next step is to check all unit tests without calling ChangeOptions().
      
      Test Plan: make all check
      
      Reviewers: dhruba, haobo
      
      Differential Revision: https://reviews.facebook.net/D12051
  8. 06 Aug 2013 (1 commit)
    • Add soft and hard rate limit support · 1036537c
      Committed by Jim Paton
      Summary:
      This diff adds support for both soft and hard rate limiting. The following changes are included:
      
      1) Options.rate_limit is renamed to Options.hard_rate_limit.
      2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds.
      3) Options.soft_rate_limit is added.
      4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit.
      5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit.
      6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default.
      
      Test Plan:
      make -j32 check
      ./db_stress
      
      Reviewers: dhruba, haobo, MarkCallaghan
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D12003
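
      The knobs from the list above, shown together. These Options fields are the historical ones from this era of RocksDB (they were removed in much later releases), so treat the exact names and types as historical:

      #include "rocksdb/options.h"

      rocksdb::Options RateLimitedOptions() {
        rocksdb::Options options;
        // Above this compaction score, each write is delayed 0-1 ms, scaled by
        // how close the score is to the hard limit. 0 (the default) disables
        // soft limiting and preserves the old behavior.
        options.soft_rate_limit = 2.0;
        // Above this score, writes are delayed 1 ms at a time until the score
        // drops back down. Setting it to 0 disables hard limiting.
        options.hard_rate_limit = 4.0;
        // Caps the total time a single write can be stalled by the hard limit.
        options.rate_limit_delay_max_milliseconds = 1000;
        return options;
      }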
  9. 02 Aug 2013 (1 commit)
    • Expand KeyMayExist to return the proper value if it can be found in memory and... · 59d0b02f
      Committed by Mayank Agarwal
      Expand KeyMayExist to return the proper value if it can be found in memory and also check block_cache
      
      Summary: Removed KeyMayExistImpl because KeyMayExist now demands Get-like semantics. Removed no_io from memtable and imm because we need the proper value now and shouldn't just stop when we see a Merge in the memtable. Added checks to block_cache. Updated documentation and unit tests.
      
      Test Plan: make all check;db_stress for 1 hour
      
      Reviewers: dhruba, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11853
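
      A usage sketch of the expanded semantics: the value comes back for free when it is already in the memtable or block cache, and value_found distinguishes that case.

      #include <string>
      #include "rocksdb/db.h"

      // Returns true only when the value was obtained without any extra IO.
      bool CheapGet(rocksdb::DB* db, const rocksdb::Slice& key,
                    std::string* value) {
        bool value_found = false;
        bool may_exist =
            db->KeyMayExist(rocksdb::ReadOptions(), key, value, &value_found);
        // may_exist == false: filters prove the key is absent.
        // value_found == true: *value was filled from memory/block cache.
        return may_exist && value_found;
      }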
  10. 24 Jul 2013 (1 commit)
    • Use KeyMayExist for WriteBatch-Deletes · bf66c10b
      Committed by Mayank Agarwal
      Summary:
      Introduced KeyMayExist checking during writebatch-delete and removed it from the outer Delete API, since that API uses writebatch-delete.
      Added code to skip fetching the Table from disk if it is not already present in table_cache.
      Some renaming of variables.
      Introduced KeyMayExistImpl, which allows checking from a specified sequence number in GetImpl; this is useful for checking a partially written writebatch.
      Changed KeyMayExist to not be pure virtual and provided a default implementation.
      Expanded unit tests in db_test to check appropriately.
      Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1.
      
      Test Plan: db_stress;make check
      
      Reviewers: dhruba, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb, xjin
      
      Differential Revision: https://reviews.facebook.net/D11745
  11. 20 Jul 2013 (1 commit)
  12. 13 Jul 2013 (1 commit)
  13. 12 Jul 2013 (1 commit)
    • Make rocksdb-deletes faster using bloom filter · 2a986919
      Committed by Mayank Agarwal
      Summary:
      Wrote a new function in db_impl.cc, CheckKeyMayExist, that calls Get with a new parameter turned on which makes Get return false only if bloom filters can guarantee that the key is not in the database. Delete calls this function, and if the option deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete is dropped, saving:
      1. A Put of delete type
      2. Space in the db, and
      3. Compaction time
      
      Test Plan:
      make all check;
      will run db_stress and db_bench and enhance the unit tests once the basic design gets approved
      
      Reviewers: dhruba, haobo, vamsi
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11607
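
      How an application would opt in, per the option exercised in later stress-test runs (filter_deletes; the deletes_use_filter spelling above appears to be the same knob). Bloom filters must be configured for the check to prove absence; field placement reflects the Options layout of that era:

      #include "rocksdb/db.h"
      #include "rocksdb/filter_policy.h"
      #include "rocksdb/options.h"

      rocksdb::Options DeleteFilteringOptions() {
        rocksdb::Options options;
        // Bloom filters give CheckKeyMayExist its "definitely absent" answer.
        options.filter_policy = rocksdb::NewBloomFilterPolicy(10 /* bits/key */);
        // Drop Deletes for keys the filters guarantee are absent, saving the
        // delete-type Put, its space in the db, and compaction time.
        options.filter_deletes = true;
        return options;
      }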
  14. 01 Jul 2013 (2 commits)
    • Reduce write amplification by merging files in L0 back into L0 · 47c4191f
      Committed by Dhruba Borthakur
      Summary:
      There is a new option called hybrid_mode which, when switched on,
      causes HBase-style compactions.  Files from L0 are
      compacted back into L0. The meat of this compaction algorithm
      is in PickCompactionHybrid().
      
      All files reside in L0. That means all files have overlapping
      keys. Each file is time-bounded, i.e. each file contains a
      range of keys that were inserted around the same time. The
      start-seqno and the end-seqno refer to the timeframe when
      these keys were inserted.  Files that have contiguous seqnos
      are compacted together into a larger file. All files are
      ordered from most recent to oldest.
      
      The current compaction algorithm starts to look for
      candidate files starting from the most recent file. It continues to
      add more files to the same compaction run as long as the total size of
      the files chosen so far is smaller than the size of the next
      candidate file. This logic needs to be debated
      and validated.
      
      The above logic should reduce write amplification to a
      large extent... will publish numbers shortly.
      
      Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
      
      Differential Revision: https://reviews.facebook.net/D11289
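
      The candidate-selection rule, transcribed literally into an illustrative helper (not the real PickCompactionHybrid):

      #include <cstdint>
      #include <vector>

      struct L0File {
        uint64_t size;  // files ordered from most recent (index 0) to oldest
      };

      // Returns how many of the newest files to compact together into one file.
      size_t PickCompactionRun(const std::vector<L0File>& files) {
        if (files.empty()) return 0;
        uint64_t picked_bytes = files[0].size;  // always start at the newest file
        size_t n = 1;
        // Extend while the accumulated run is still smaller than the next file;
        // the summary above flags this exact rule as open for debate.
        while (n < files.size() && picked_bytes < files[n].size) {
          picked_bytes += files[n].size;
          ++n;
        }
        return n;  // compact files[0..n)
      }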
    • Reduce write amplification by merging files in L0 back into L0 · 554c06dd
      Committed by Dhruba Borthakur
      Summary:
      There is a new option called hybrid_mode which, when switched on,
      causes HBase-style compactions.  Files from L0 are
      compacted back into L0. The meat of this compaction algorithm
      is in PickCompactionHybrid().
      
      All files reside in L0. That means all files have overlapping
      keys. Each file is time-bounded, i.e. each file contains a
      range of keys that were inserted around the same time. The
      start-seqno and the end-seqno refer to the timeframe when
      these keys were inserted.  Files that have contiguous seqnos
      are compacted together into a larger file. All files are
      ordered from most recent to oldest.
      
      The current compaction algorithm starts to look for
      candidate files starting from the most recent file. It continues to
      add more files to the same compaction run as long as the total size of
      the files chosen so far is smaller than the size of the next
      candidate file. This logic needs to be debated
      and validated.
      
      The above logic should reduce write amplification to a
      large extent... will publish numbers shortly.
      
      Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
      
      Differential Revision: https://reviews.facebook.net/D11289
  15. 13 Jun 2013 (1 commit)
    • [RocksDB] cleanup EnvOptions · bdf10859
      Committed by Haobo Xu
      Summary:
      This diff simplifies EnvOptions by treating it as POD, similar to Options.
      - virtual functions are removed and member fields are accessed directly.
      - StorageOptions is removed.
      - Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
      - Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite
      
      Test Plan: make check; db_stress
      
      Reviewers: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11175
  16. 06 Jun 2013 (2 commits)
    • Fixed valgrind errors · db1f0cdd
      Committed by Deon Nicholas
    • Very basic Multiget and simple test cases. · d8c7c45e
      Committed by Deon Nicholas
      Summary:
      Implemented the MultiGet operator which takes in a list of keys
      and returns their associated values. Currently uses std::vector as its
      container data structure. Otherwise, it works identically to "Get".
      
      Test Plan:
       1. make db_test      ; compile it
       2. ./db_test         ; test it
       3. make all check    ; regress / run all tests
       4. make release      ; (optional) compile with release settings
      
      Reviewers: haobo, MarkCallaghan, dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D10875
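
      A usage sketch, assuming the classic MultiGet signature (one Status per key, values filled positionally):

      #include <string>
      #include <vector>
      #include "rocksdb/db.h"

      void ReadThree(rocksdb::DB* db) {
        std::vector<rocksdb::Slice> keys = {"k1", "k2", "k3"};
        std::vector<std::string> values;
        std::vector<rocksdb::Status> statuses =
            db->MultiGet(rocksdb::ReadOptions(), keys, &values);
        for (size_t i = 0; i < keys.size(); ++i) {
          if (statuses[i].ok()) {
            // values[i] is the value for keys[i]; otherwise statuses[i] says
            // NotFound (or an error), exactly as a single Get would.
          }
        }
      }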
  17. 24 May 2013 (1 commit)
  18. 14 May 2013 (2 commits)
    • [RocksDB] Cleanup compaction filter to use a class interface, instead of... · 4ca3c67b
      Committed by Haobo Xu
      [RocksDB] Cleanup compaction filter to use a class interface, instead of function pointer and additional context pointer.
      
      Summary:
      This diff replaces compaction_filter_args and CompactionFilter with a single compaction_filter parameter. It gives CompactionFilter better encapsulation and a similar look to Comparator and MergeOperator, which improves consistency of the overall interface.
      The change is not backward compatible. Nevertheless, the two references in fbcode are not in production yet.
      
      Test Plan: make check
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb, zshao
      
      Differential Revision: https://reviews.facebook.net/D10773
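
      What plugging in the class interface looks like, per today's rocksdb/compaction_filter.h (details may differ from the 2013 header). Like Comparator and MergeOperator, it is a single object handed to Options:

      #include <string>
      #include "rocksdb/compaction_filter.h"
      #include "rocksdb/options.h"

      class DropEmptyValues : public rocksdb::CompactionFilter {
       public:
        // Return true to purge the entry during compaction.
        bool Filter(int /*level*/, const rocksdb::Slice& /*key*/,
                    const rocksdb::Slice& existing_value,
                    std::string* /*new_value*/,
                    bool* /*value_changed*/) const override {
          return existing_value.empty();
        }
        const char* Name() const override { return "DropEmptyValues"; }
      };

      // Wiring it up, in place of the old compaction_filter_args + callback:
      //   rocksdb::Options options;
      //   DropEmptyValues filter;
      //   options.compaction_filter = &filter;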
    • [RocksDB] fix compaction filter trigger condition · 73c0a333
      Committed by Haobo Xu
      Summary:
      Currently, the compaction filter is run on internal keys older than the oldest snapshot, which is incorrect.
      The compaction filter should really be run on the most recent internal key when there is no external snapshot.
      
      Test Plan: make check; db_stress
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D10641
  19. 07 May 2013 (1 commit)
    • [RocksDB] Clear Archive WAL files · 988c20b9
      Committed by Abhishek Kona
      Summary:
      WAL files are moved to the archive directory and cleared only at DB::Open.
      This can lead to a lot of space consumption in a database. Added logic to periodically clear the archive directory too.
      
      Test Plan: make all check + add unit test
      
      Reviewers: dhruba, heyongqiang
      
      Reviewed By: heyongqiang
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D10617
  20. 04 May 2013 (1 commit)
    • [Rocksdb] Support Merge operation in rocksdb · 05e88540
      Committed by Haobo Xu
      Summary:
      This diff introduces a new Merge operation into rocksdb.
      The purpose of this review is mostly getting feedback from the team (everyone please) on the design.
      
      Please focus on the four files under include/leveldb/, as they spell out the client-visible interface change.
      include/leveldb/db.h
      include/leveldb/merge_operator.h
      include/leveldb/options.h
      include/leveldb/write_batch.h
      
      Please go over local/my_test.cc carefully, as it is a concrete use case.
      
      Please also review the implementation files to see if the straw man implementation makes sense.
      
      Note that the diff does pass all of make check, and it truly supports a forward iterator over the db and a version
      of Get that's based on the iterator.
      
      Future work:
      - Integration with compaction
      - A raw Get implementation
      
      I am working on a wiki that explains the design and implementation choices, but coding comes
      just naturally and I think it might be a good idea to share the code earlier. The code is
      heavily commented.
      
      Test Plan: run all local tests
      
      Reviewers: dhruba, heyongqiang
      
      Reviewed By: dhruba
      
      CC: leveldb, zshao, sheki, emayanke, MarkCallaghan
      
      Differential Revision: https://reviews.facebook.net/D9651
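
      A concrete use case in the spirit of local/my_test.cc: an additive counter. This sketch uses the AssociativeMergeOperator helper from later RocksDB headers (the diff's own headers lived under include/leveldb/):

      #include <cstdint>
      #include <cstring>
      #include <string>
      #include "rocksdb/merge_operator.h"

      class CounterMergeOperator : public rocksdb::AssociativeMergeOperator {
       public:
        bool Merge(const rocksdb::Slice& /*key*/, const rocksdb::Slice* existing,
                   const rocksdb::Slice& value, std::string* new_value,
                   rocksdb::Logger* /*logger*/) const override {
          uint64_t count = 0;
          if (existing != nullptr && existing->size() == sizeof(count)) {
            std::memcpy(&count, existing->data(), sizeof(count));
          }
          uint64_t delta = 0;
          if (value.size() == sizeof(delta)) {
            std::memcpy(&delta, value.data(), sizeof(delta));
          }
          count += delta;
          new_value->assign(reinterpret_cast<const char*>(&count), sizeof(count));
          return true;
        }
        const char* Name() const override { return "CounterMergeOperator"; }
      };

      // With options.merge_operator set, clients call
      //   db->Merge(rocksdb::WriteOptions(), key, encoded_delta);
      // instead of a read-modify-write Get/Put pair.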
  21. 23 Apr 2013 (1 commit)
  22. 21 Apr 2013 (1 commit)
    • [RocksDB] CompactionFilter cleanup · b4243e5a
      Committed by Haobo Xu
      Summary:
      - Removed compaction_filter_value from the callback interface, restricting the compaction filter to purging values.
      - Modified some comments to reflect the current status.
      
      Test Plan: make check
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D10335
  23. 09 Apr 2013 (1 commit)
    • [RocksDB][Bug] Look at all the files, not just the first file in... · 574b76f7
      Committed by Abhishek Kona
      [RocksDB][Bug] Look at all the files, not just the first file in TransactionLogIter as BatchWrites can leave it in Limbo
      
      Summary:
      The Transaction Log Iterator did not move to the next file in the series if there was a write batch at the end of the current file.
      The solution: if the last seq no. of the current file is < RequestedSeqNo, assume the first seq no. of the next file has to satisfy the request.
      Also did major refactoring around the code: moved opening the log reader to a separate function and got rid of a goto.
      
      Test Plan: added a unit test for it.
      
      Reviewers: dhruba, heyongqiang
      
      Reviewed By: heyongqiang
      
      CC: leveldb, emayanke
      
      Differential Revision: https://reviews.facebook.net/D10029
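
      For context, a usage sketch of the iterator this bug fix hardens (API shape per today's rocksdb/transaction_log.h):

      #include <memory>
      #include "rocksdb/db.h"
      #include "rocksdb/transaction_log.h"

      void TailLog(rocksdb::DB* db, rocksdb::SequenceNumber since) {
        std::unique_ptr<rocksdb::TransactionLogIterator> iter;
        rocksdb::Status s = db->GetUpdatesSince(since, &iter);
        if (!s.ok()) return;
        for (; iter->Valid(); iter->Next()) {
          rocksdb::BatchResult batch = iter->GetBatch();
          // batch.sequence is the first seq no. of batch.writeBatchPtr. With
          // the fix above, a batch at the tail of one log file no longer
          // strands the scan: the iterator moves on to the next file.
        }
      }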
  24. 03 Apr 2013 (1 commit)
  25. 29 Mar 2013 (2 commits)
    • [Rocksdb] Fix Crash on finding a db with no log files. Error out instead · 8e9c781a
      Committed by Abhishek Kona
      Summary:
      If the vector returned by GetUpdatesSince is empty, it is still returned to the
      user. This causes it to throw an std::range_error.
      The probable file list is now checked, and an IOError status is returned instead of OK.
      
      Test Plan: added a unit test.
      
      Reviewers: dhruba, heyongqiang
      
      Reviewed By: heyongqiang
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D9771
    • Use non-mmapped files for Write-Ahead Files · 7fdd5f5b
      Committed by Abhishek Kona
      Summary:
      Use non-mmapped files for the write-ahead log.
      The earlier use of mmapped files made the log iterator read ahead and miss records.
      Now the reader and writer will point to the same physical location.
      
      There is no perf regression:
      ./db_bench --benchmarks=fillseq --db=/dev/shm/mmap_test --num=$(million 20) --use_existing_db=0 --threads=2
      With this diff:
      fillseq      :      10.756 micros/op 185281 ops/sec;   20.5 MB/s
      Without this diff:
      fillseq      :      11.085 micros/op 179676 ops/sec;   19.9 MB/s
      
      Test Plan: unit test included
      
      Reviewers: dhruba, heyongqiang
      
      Reviewed By: heyongqiang
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D9741
  26. 22 Mar 2013 (2 commits)
  27. 21 Mar 2013 (1 commit)
    • Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database. · ad96563b
      Committed by Dhruba Borthakur
      Summary:
      This patch allows an application to specify whether to use bufferedio,
      reads-via-mmaps and writes-via-mmaps per database. Earlier, there
      was a global static variable that was used to configure this functionality.
      
      The default setting remains the same (and is backward compatible):
       1. use bufferedio
       2. do not use mmaps for reads
       3. use mmap for writes
       4. use readaheads for reads needed for compaction
      
      I also added a parameter to db_bench to be able to explicitly specify
      whether to do readaheads for compactions or not.
      
      Test Plan: make check
      
      Reviewers: sheki, heyongqiang, MarkCallaghan
      
      Reviewed By: sheki
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D9429
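
      The defaults listed above, written out as they would be set explicitly. Field names are the ones from that era's Options (allow_readahead and allow_readahead_compactions were deprecated a few months later by the EnvOptions cleanup higher up this log):

      #include "rocksdb/options.h"

      rocksdb::Options DefaultIoOptions() {
        rocksdb::Options options;
        options.allow_os_buffer = true;              // 1. use buffered IO
        options.allow_mmap_reads = false;            // 2. no mmaps for reads
        options.allow_mmap_writes = true;            // 3. mmaps for writes
        options.allow_readahead_compactions = true;  // 4. compaction readahead
        return options;
      }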
  28. 20 Mar 2013 (2 commits)
  29. 07 Mar 2013 (1 commit)
    • Do not allow Transaction Log Iterator to fall ahead when writer is writing the same file · d68880a1
      Committed by Abhishek Kona
      Summary:
      Store the last flushed seq no. in db_impl and check against it in the
      transaction log iterator. Do not attempt to read ahead if we do not know
      whether the data is flushed completely.
      Does not work if flush is disabled. Any ideas on fixing that?
      * Minor change: iter->Next is now called automatically for the first
      record.
      
      Test Plan:
      Existing tests pass.
      More ideas on testing this?
      Planning to run some stress tests.
      
      Reviewers: dhruba, heyongqiang
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D9087
  30. 04 Mar 2013 (2 commits)
    • Add rate_delay_limit_milliseconds · 993543d1
      Committed by Mark Callaghan
      Summary:
      This adds the rate_delay_limit_milliseconds option to make the delay
      configurable in MakeRoomForWrite when the max compaction score is too high.
      This delay is called the Ln slowdown. This change also counts the Ln slowdown
      per level to make it possible to see where the stalls occur.
      
      From IO-bound performance testing, the Level N stalls occur:
      * with compression -> at the largest uncompressed level. This makes sense
                            because compaction for compressed levels is much
                            slower. When Lx is uncompressed and Lx+1 is compressed
                            then files pile up at Lx because the (Lx,Lx+1)->Lx+1
                            compaction process is the first to be slowed by
                            compression.
      * without compression -> at level 1
      
      Task ID: #1832108
      
      Test Plan:
      run with real data, added test
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D9045
    • Ability for rocksdb to compact when flushing the in-memory memtable to a file in L0. · 806e2643
      Committed by Dhruba Borthakur
      Summary:
      Rocks accumulates recent writes and deletes in the in-memory memtable.
      When the memtable is full, it writes the contents of the memtable to
      a file in L0.
      
      This patch removes redundant records at the time of the flush. If there
      are multiple versions of the same key in the memtable, then only the
      most recent one is dumped into the output file. The purging of
      redundant records occurs only if the most recent snapshot is earlier
      than the earliest record in the memtable.
      
      Should we switch on this feature by default or should we keep this feature
      turned off in the default settings?
      
      Test Plan: Added test case to db_test.cc
      
      Reviewers: sheki, vamsi, emayanke, heyongqiang
      
      Reviewed By: sheki
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D8991
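
      The purge rule as an illustrative routine (hypothetical names, not the real flush code); entries are assumed sorted by user key with the newest version of each key first:

      #include <cstdint>
      #include <string>
      #include <vector>

      struct MemEntry {
        std::string user_key;
        uint64_t seqno;  // per key, newest version comes first in `sorted`
      };

      std::vector<MemEntry> PurgeRedundant(const std::vector<MemEntry>& sorted,
                                           uint64_t latest_snapshot_seqno,
                                           uint64_t earliest_memtable_seqno) {
        // Purging is allowed only if the most recent snapshot predates every
        // record in the memtable; otherwise a snapshot may need old versions.
        if (latest_snapshot_seqno >= earliest_memtable_seqno) return sorted;
        std::vector<MemEntry> out;
        for (const auto& e : sorted) {
          if (out.empty() || out.back().user_key != e.user_key) {
            out.push_back(e);  // keep only the newest version of each key
          }
        }
        return out;
      }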
  31. 01 Mar 2013 (1 commit)