1. 13 11月, 2013 1 次提交
  2. 07 11月, 2013 1 次提交
    • S
      WAL log retention policy based on archive size. · c2be2cba
      shamdor 提交于
      Summary:
      Archive cleaning will still happen every WAL_ttl seconds
      but archived logs will be deleted only if archive size
      is greater then a WAL_size_limit value.
      Empty archived logs will be deleted evety WAL_ttl.
      
      Test Plan:
      1. Unit tests pass.
      2. Benchmark.
      
      Reviewers: emayanke, dhruba, haobo, sdong, kailiu, igor
      
      Reviewed By: emayanke
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D13869
      c2be2cba
  3. 02 11月, 2013 1 次提交
    • D
      Implement a compressed block cache. · b4ad5e89
      Dhruba Borthakur 提交于
      Summary:
      Rocksdb can now support a uncompressed block cache, or a compressed
      block cache or both. Lookups first look for a block in the
      uncompressed cache, if it is not found only then it is looked up
      in the compressed cache. If it is found in the compressed cache,
      then it is uncompressed and inserted into the uncompressed cache.
      
      It is possible that the same block resides in the compressed cache
      as well as the uncompressed cache at the same time. Both caches
      have their own individual LRU policy.
      
      Test Plan: Unit test case attached.
      
      Reviewers: kailiu, sdong, haobo, leveldb
      
      Reviewed By: haobo
      
      CC: xjin, haobo
      
      Differential Revision: https://reviews.facebook.net/D12675
      b4ad5e89
  4. 24 10月, 2013 1 次提交
  5. 17 10月, 2013 1 次提交
  6. 12 10月, 2013 1 次提交
    • S
      LRUCache to try to clean entries not referenced first. · f8509653
      sdong 提交于
      Summary:
      With this patch, when LRUCache.Insert() is called and the cache is full, it will first try to free up entries whose reference counter is 1 (would become 0 after remo\
      ving from the cache). We do it in two passes, in the first pass, we only try to release those unreferenced entries. If we cannot free enough space after traversing t\
      he first remove_scan_cnt_ entries, we start from the beginning again and remove those entries being used.
      
      Test Plan: add two unit tests to cover the codes
      
      Reviewers: dhruba, haobo, emayanke
      
      Reviewed By: emayanke
      
      CC: leveldb, emayanke, xjin
      
      Differential Revision: https://reviews.facebook.net/D13377
      f8509653
  7. 06 10月, 2013 1 次提交
  8. 05 10月, 2013 1 次提交
  9. 20 9月, 2013 1 次提交
    • D
      Better locking in vectorrep that increases throughput to match speed of storage. · 5e9f3a9a
      Dhruba Borthakur 提交于
      Summary:
      There is a use-case where we want to insert data into rocksdb as
      fast as possible. Vector rep is used for this purpose.
      
      The background flush thread needs to flush the vectorrep to
      storage. It acquires the dblock then sorts the vector, releases
      the dblock and then writes the sorted vector to storage. This is
      suboptimal because the lock is held during the sort, which
      prevents new writes for occuring.
      
      This patch moves the sorting of the vector rep to outside the
      db mutex. Performance is now as fastas the underlying storage
      system. If you are doing buffered writes to rocksdb files, then
      you can observe throughput upwards of 200 MB/sec writes.
      
      This is an early draft and not yet ready to be reviewed.
      
      Test Plan:
      make check
      
      Task ID: #
      
      Blame Rev:
      
      Reviewers: haobo
      
      Reviewed By: haobo
      
      CC: leveldb, haobo
      
      Differential Revision: https://reviews.facebook.net/D12987
      5e9f3a9a
  10. 16 9月, 2013 2 次提交
  11. 14 9月, 2013 1 次提交
    • D
      Added a parameter to limit the maximum space amplification for universal compaction. · 4012ca1c
      Dhruba Borthakur 提交于
      Summary:
      Added a new field called max_size_amplification_ratio in the
      CompactionOptionsUniversal structure. This determines the maximum
      percentage overhead of space amplification.
      
      The size amplification is defined to be the ratio between the size of
      the oldest file to the sum of the sizes of all other files. If the
      size amplification exceeds the specified value, then min_merge_width
      and max_merge_width are ignored and a full compaction of all files is done.
      A value of 10 means that the size a database that stores 100 bytes
      of user data could occupy 110 bytes of physical storage.
      
      Test Plan: Unit test DBTest.UniversalCompactionSpaceAmplification added.
      
      Reviewers: haobo, emayanke, xjin
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D12825
      4012ca1c
  12. 13 9月, 2013 1 次提交
    • H
      [RocksDB] Remove Log file immediately after memtable flush · 0e422308
      Haobo Xu 提交于
      Summary: As title. The DB log file life cycle is tied up with the memtable it backs. Once the memtable is flushed to sst and committed, we should be able to delete the log file, without holding the mutex. This is part of the bigger change to avoid FindObsoleteFiles at runtime. It deals with log files. sst files will be dealt with later.
      
      Test Plan: make check; db_bench
      
      Reviewers: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11709
      0e422308
  13. 24 8月, 2013 2 次提交
  14. 23 8月, 2013 4 次提交
    • J
      Add three new MemTableRep's · 74781a0c
      Jim Paton 提交于
      Summary:
      This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep.
      
      UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that.
      
      VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector.
      
      PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets.
      
      I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed).
      
      Test Plan:
      make -j32 check
      ./db_stress --memtablerep=vector
      ./db_stress --memtablerep=unsorted
      ./db_stress --memtablerep=prefixhash --prefix_size=10
      
      Reviewers: dhruba, haobo, emayanke
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D12117
      74781a0c
    • X
      Pull from https://reviews.facebook.net/D10917 · 17dc1280
      Xing Jin 提交于
      Summary: Pull Mark's patch and slightly revise it. I revised another place in db_impl.cc with similar new formula.
      
      Test Plan:
      make all check. Also run "time ./db_bench --num=2500000000 --numdistinct=2200000000". It has run for 20+ hours and hasn't finished. Looks good so far:
      
      Installed stack trace handler for SIGILL SIGSEGV SIGBUS SIGABRT
      LevelDB:    version 2.0
      Date:       Tue Aug 20 23:11:55 2013
      CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
      CPUCache:   20480 KB
      Keys:       16 bytes each
      Values:     100 bytes each (50 bytes after compression)
      Entries:    2500000000
      RawSize:    276565.6 MB (estimated)
      FileSize:   157356.3 MB (estimated)
      Write rate limit: 0
      Compression: snappy
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      ------------------------------------------------
      DB path: [/tmp/leveldbtest-3088/dbbench]
      fillseq      :    7202.000 micros/op 138 ops/sec;
      DB path: [/tmp/leveldbtest-3088/dbbench]
      fillsync     :    7148.000 micros/op 139 ops/sec; (2500000 ops)
      DB path: [/tmp/leveldbtest-3088/dbbench]
      fillrandom   :    7105.000 micros/op 140 ops/sec;
      DB path: [/tmp/leveldbtest-3088/dbbench]
      overwrite    :    6930.000 micros/op 144 ops/sec;
      DB path: [/tmp/leveldbtest-3088/dbbench]
      readrandom   :       1.020 micros/op 980507 ops/sec; (0 of 2500000000 found)
      DB path: [/tmp/leveldbtest-3088/dbbench]
      readrandom   :       1.021 micros/op 979620 ops/sec; (0 of 2500000000 found)
      DB path: [/tmp/leveldbtest-3088/dbbench]
      readseq      :     113.000 micros/op 8849 ops/sec;
      DB path: [/tmp/leveldbtest-3088/dbbench]
      readreverse  :     102.000 micros/op 9803 ops/sec;
      DB path: [/tmp/leveldbtest-3088/dbbench]
      Created bg thread 0x7f0ac17f7700
      compact      :  111701.000 micros/op 8 ops/sec;
      DB path: [/tmp/leveldbtest-3088/dbbench]
      readrandom   :       1.020 micros/op 980376 ops/sec; (0 of 2500000000 found)
      DB path: [/tmp/leveldbtest-3088/dbbench]
      readseq      :     120.000 micros/op 8333 ops/sec;
      DB path: [/tmp/leveldbtest-3088/dbbench]
      readreverse  :      29.000 micros/op 34482 ops/sec;
      DB path: [/tmp/leveldbtest-3088/dbbench]
      ... finished 618100000 ops
      
      Reviewers: MarkCallaghan, haobo, dhruba, chip
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D12441
      17dc1280
    • T
      Revert "Prefix scan: db_bench and bug fixes" · 94cf2187
      Tyler Harter 提交于
      This reverts commit c2bd8f48.
      94cf2187
    • T
      Prefix scan: db_bench and bug fixes · c2bd8f48
      Tyler Harter 提交于
      Summary: If use_prefix_filters is set and read_range>1, then the random seeks will set a the prefix filter to be the prefix of the key which was randomly selected as the target.  Still need to add statistics (perhaps in a separate diff).
      
      Test Plan: ./db_bench --benchmarks=fillseq,prefixscanrandom --num=10000000 --statistics=1 --use_prefix_blooms=1 --use_prefix_api=1 --bloom_bits=10
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb, haobo
      
      Differential Revision: https://reviews.facebook.net/D12273
      c2bd8f48
  15. 21 8月, 2013 1 次提交
  16. 16 8月, 2013 2 次提交
    • D
      Tiny fix to db_bench for make release. · d1d3d15e
      Deon Nicholas 提交于
      Summary:
      In release, "found variable assigned but not used anywhere". Changed it to work with
      assert. Someone accept this :).
      
      Test Plan: make release -j 32
      
      Reviewers: haobo, dhruba, emayanke
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D12309
      d1d3d15e
    • D
      Benchmarking for Merge Operator · ad48c3c2
      Deon Nicholas 提交于
      Summary:
      Updated db_bench and utilities/merge_operators.h to allow for dynamic benchmarking
      of merge operators in db_bench. Added a new test (--benchmarks=mergerandom), which performs
      a bunch of random Merge() operations over random keys. Also added a "--merge_operator=" flag
      so that the tester can easily benchmark different merge operators. Currently supports
      the PutOperator and UInt64Add operator. Support for stringappend or list append may come later.
      
      Test Plan:
      	1. make db_bench
      	2. Test the PutOperator (simulating Put) as follows:
      ./db_bench --benchmarks=fillrandom,readrandom,updaterandom,readrandom,mergerandom,readrandom --merge_operator=put
      --threads=2
      
      3. Test the UInt64AddOperator (simulating numeric addition) similarly:
      ./db_bench --value_size=8 --benchmarks=fillrandom,readrandom,updaterandom,readrandom,mergerandom,readrandom
      --merge_operator=uint64add --threads=2
      
      Reviewers: haobo, dhruba, zshao, MarkCallaghan
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11535
      ad48c3c2
  17. 15 8月, 2013 1 次提交
    • X
      Minor fix to current codes · 0a5afd1a
      Xing Jin 提交于
      Summary:
      Minor fix to current codes, including: coding style, output format,
      comments. No major logic change. There are only 2 real changes, please see my inline comments.
      
      Test Plan: make all check
      
      Reviewers: haobo, dhruba, emayanke
      
      Differential Revision: https://reviews.facebook.net/D12297
      0a5afd1a
  18. 06 8月, 2013 1 次提交
    • J
      Add soft and hard rate limit support · 1036537c
      Jim Paton 提交于
      Summary:
      This diff adds support for both soft and hard rate limiting. The following changes are included:
      
      1) Options.rate_limit is renamed to Options.hard_rate_limit.
      2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds.
      3) Options.soft_rate_limit is added.
      4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit.
      5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit.
      6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default.
      
      Test Plan:
      make -j32 check
      ./db_stress
      
      Reviewers: dhruba, haobo, MarkCallaghan
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D12003
      1036537c
  19. 24 7月, 2013 1 次提交
    • M
      Use KeyMayExist for WriteBatch-Deletes · bf66c10b
      Mayank Agarwal 提交于
      Summary:
      Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete.
      Added code to skip getting Table from disk if not already present in table_cache.
      Some renaming of variables.
      Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch.
      Changed KeyMayExist to not be pure virtual and provided a default implementation.
      Expanded unit-tests in db_test to check appropriately.
      Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1.
      
      Test Plan: db_stress;make check
      
      Reviewers: dhruba, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb, xjin
      
      Differential Revision: https://reviews.facebook.net/D11745
      bf66c10b
  20. 12 7月, 2013 1 次提交
    • M
      Make rocksdb-deletes faster using bloom filter · 2a986919
      Mayank Agarwal 提交于
      Summary:
      Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving:
      1. Put of delete type
      2. Space in the db,and
      3. Compaction time
      
      Test Plan:
      make all check;
      will run db_stress and db_bench and enhance unit-test once the basic design gets approved
      
      Reviewers: dhruba, haobo, vamsi
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11607
      2a986919
  21. 10 7月, 2013 1 次提交
  22. 04 7月, 2013 1 次提交
  23. 01 7月, 2013 2 次提交
    • D
      Reduce write amplification by merging files in L0 back into L0 · 47c4191f
      Dhruba Borthakur 提交于
      Summary:
      There is a new option called hybrid_mode which, when switched on,
      causes HBase style compactions.  Files from L0 are
      compacted back into L0. This meat of this compaction algorithm
      is in PickCompactionHybrid().
      
      All files reside in L0. That means all files have overlapping
      keys. Each file has a time-bound, i.e. each file contains a
      range of keys that were inserted around the same time. The
      start-seqno and the end-seqno refers to the timeframe when
      these keys were inserted.  Files that have contiguous seqno
      are compacted together into a larger file. All files are
      ordered from most recent to the oldest.
      
      The current compaction algorithm starts to look for
      candidate files starting from the most recent file. It continues to
      add more files to the same compaction run as long as the
      sum of the files chosen till now is smaller than the next
      candidate file size. This logic needs to be debated
      and validated.
      
      The above logic should reduce write amplification to a
      large extent... will publish numbers shortly.
      
      Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
      
      Differential Revision: https://reviews.facebook.net/D11289
      47c4191f
    • D
      Reduce write amplification by merging files in L0 back into L0 · 554c06dd
      Dhruba Borthakur 提交于
      Summary:
      There is a new option called hybrid_mode which, when switched on,
      causes HBase style compactions.  Files from L0 are
      compacted back into L0. This meat of this compaction algorithm
      is in PickCompactionHybrid().
      
      All files reside in L0. That means all files have overlapping
      keys. Each file has a time-bound, i.e. each file contains a
      range of keys that were inserted around the same time. The
      start-seqno and the end-seqno refers to the timeframe when
      these keys were inserted.  Files that have contiguous seqno
      are compacted together into a larger file. All files are
      ordered from most recent to the oldest.
      
      The current compaction algorithm starts to look for
      candidate files starting from the most recent file. It continues to
      add more files to the same compaction run as long as the
      sum of the files chosen till now is smaller than the next
      candidate file size. This logic needs to be debated
      and validated.
      
      The above logic should reduce write amplification to a
      large extent... will publish numbers shortly.
      
      Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
      
      Differential Revision: https://reviews.facebook.net/D11289
      554c06dd
  24. 19 6月, 2013 4 次提交
  25. 15 6月, 2013 2 次提交
  26. 13 6月, 2013 2 次提交
    • D
      [Rocksdb] [Multiget] Introduced multiget into db_bench · 4985a9f7
      Deon Nicholas 提交于
      Summary:
      Preliminary! Introduced the --use_multiget=1 and --keys_per_multiget=n
      flags for db_bench. Also updated and tested the ReadRandom() method
      to include an option to use multiget. By default,
      keys_per_multiget=100.
      
      Preliminary tests imply that multiget is at least 1.25x faster per
      key than regular get.
      
      Will continue adding Multiget for ReadMissing, ReadHot,
      RandomWithVerify, ReadRandomWriteRandom; soon. Will also think
      about ways to better verify benchmarks.
      
      Test Plan:
      1. make db_bench
      2. ./db_bench --benchmarks=fillrandom
      3. ./db_bench --benchmarks=readrandom --use_existing_db=1
      	      --use_multiget=1 --threads=4 --keys_per_multiget=100
      4. ./db_bench --benchmarks=readrandom --use_existing_db=1
      	      --threads=4
      5. Verify ops/sec (and 1000000 of 1000000 keys found)
      
      Reviewers: haobo, MarkCallaghan, dhruba
      
      Reviewed By: MarkCallaghan
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11127
      4985a9f7
    • H
      [RocksDB] cleanup EnvOptions · bdf10859
      Haobo Xu 提交于
      Summary:
      This diff simplifies EnvOptions by treating it as POD, similar to Options.
      - virtual functions are removed and member fields are accessed directly.
      - StorageOptions is removed.
      - Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
      - Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite
      
      Test Plan: make check; db_stress
      
      Reviewers: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11175
      bdf10859
  27. 08 6月, 2013 1 次提交
  28. 06 6月, 2013 1 次提交