1. 04 Oct, 2013 1 commit
  2. 03 Oct, 2013 1 commit
  3. 29 Sep, 2013 1 commit
  4. 26 Sep, 2013 1 commit
  5. 24 Sep, 2013 1 commit
  6. 20 Sep, 2013 1 commit
    • D
      Better locking in vectorrep that increases throughput to match speed of storage. · 5e9f3a9a
      Committed by Dhruba Borthakur
      Summary:
      There is a use-case where we want to insert data into rocksdb as
      fast as possible. Vector rep is used for this purpose.
      
      The background flush thread needs to flush the vectorrep to
      storage. It acquires the dblock, then sorts the vector, releases
      the dblock, and then writes the sorted vector to storage. This is
      suboptimal because the lock is held during the sort, which
      prevents new writes from occurring.
      
      This patch moves the sorting of the vector rep outside the
      db mutex. Performance is now as fast as the underlying storage
      system. If you are doing buffered writes to rocksdb files, then
      you can observe write throughput upwards of 200 MB/sec.
      
      This is an early draft and not yet ready to be reviewed.
      
      Test Plan:
      make check
      
      Task ID: #
      
      Blame Rev:
      
      Reviewers: haobo
      
      Reviewed By: haobo
      
      CC: leveldb, haobo
      
      Differential Revision: https://reviews.facebook.net/D12987
      5e9f3a9a
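The lock-scope fix described above follows a general pattern: snapshot the vector under the mutex, then sort outside it. A minimal sketch of that pattern (illustrative names only, not the actual RocksDB code):

```cpp
#include <algorithm>
#include <mutex>
#include <string>
#include <vector>

// Sketch: flush a vector-backed memtable without holding the db mutex
// during the (potentially long) sort. Hold the lock only long enough to
// swap the vector out; new writes go into the now-empty vector.
std::vector<std::string> PrepareFlush(std::mutex& db_mutex,
                                      std::vector<std::string>& rep) {
  std::vector<std::string> snapshot;
  {
    std::lock_guard<std::mutex> l(db_mutex);
    snapshot.swap(rep);  // O(1) swap while the lock is held
  }
  // Sort outside the mutex, so concurrent writers are not blocked.
  std::sort(snapshot.begin(), snapshot.end());
  return snapshot;  // caller writes the sorted keys to storage
}
```

The swap is constant-time, so the critical section no longer depends on the size of the memtable.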
  7. 16 Sep, 2013 2 commits
  8. 14 Sep, 2013 2 commits
    • H
      [RocksDB] fix build env_test · 88664480
      Committed by Haobo Xu
      Summary: Move the TwoPools test to the end of the thread-related tests. Otherwise, the SetBackgroundThreads call would increase the Low pool size and affect the result of other tests.
      
      Test Plan: make env_test; ./env_test
      
      Reviewers: dhruba, emayanke, xjin
      
      Reviewed By: xjin
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D12939
      88664480
    • D
      Added a parameter to limit the maximum space amplification for universal compaction. · 4012ca1c
      Committed by Dhruba Borthakur
      Summary:
      Added a new field called max_size_amplification_ratio in the
      CompactionOptionsUniversal structure. This determines the maximum
      percentage overhead of space amplification.
      
      The size amplification is defined as the ratio of the size of
      the oldest file to the sum of the sizes of all other files. If the
      size amplification exceeds the specified value, then min_merge_width
      and max_merge_width are ignored and a full compaction of all files is done.
      A value of 10 means that a database that stores 100 bytes
      of user data could occupy 110 bytes of physical storage.
      
      Test Plan: Unit test DBTest.UniversalCompactionSpaceAmplification added.
      
      Reviewers: haobo, emayanke, xjin
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D12825
      4012ca1c
  9. 13 Sep, 2013 2 commits
    • H
      [RocksDB] Enhance Env to support two thread pools LOW and HIGH · 1565dab8
      Committed by Haobo Xu
      Summary:
      This is the groundwork for separating memtable flush jobs into their own thread pool.
      Both SetBackgroundThreads and Schedule take a third parameter, Priority, to indicate which thread pool they operate on. The names LOW and HIGH are just identifiers for two different thread pools and do not indicate a real difference in 'priority'. The number of threads in each pool can be set independently.
      The thread pool implementation is refactored.
      
      Test Plan: make check
      
      Reviewers: dhruba, emayanke
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D12885
      1565dab8
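A toy model of the two-pool idea, with hypothetical names (the real Env uses actual worker threads; this sketch only tracks per-pool sizes and queued jobs):

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <utility>

// Two independently sized pools, selected by a Priority tag. LOW and HIGH
// are just labels for the two pools, mirroring the commit's description.
enum class Priority { LOW, HIGH };

class ToyScheduler {
 public:
  void SetBackgroundThreads(int n, Priority pri) { pool_size_[pri] = n; }
  void Schedule(std::function<void()> job, Priority pri) {
    queue_[pri].push(std::move(job));  // a real Env would wake a worker here
  }
  std::size_t Pending(Priority pri) { return queue_[pri].size(); }
  int PoolSize(Priority pri) { return pool_size_[pri]; }

 private:
  std::map<Priority, std::queue<std::function<void()>>> queue_;
  std::map<Priority, int> pool_size_;
};
```

Scheduling into one pool leaves the other untouched, which is the point: flush jobs in one pool cannot be starved by compaction jobs queued in the other.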
    • H
      [RocksDB] Remove Log file immediately after memtable flush · 0e422308
      Committed by Haobo Xu
      Summary: As title. The DB log file life cycle is tied up with the memtable it backs. Once the memtable is flushed to sst and committed, we should be able to delete the log file, without holding the mutex. This is part of the bigger change to avoid FindObsoleteFiles at runtime. It deals with log files. sst files will be dealt with later.
      
      Test Plan: make check; db_bench
      
      Reviewers: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11709
      0e422308
  10. 08 Sep, 2013 1 commit
    • H
      [RocksDB] Added nano second stopwatch and new perf counters to track block read cost · f2f4c807
      Committed by Haobo Xu
      Summary: The purpose of this diff is to expose per user-call level precise timing of block reads, so that we can answer questions like: a Get() costs me 100ms; is that somehow related to loading blocks from the file system, or something else? We will answer that with EXACTLY how many blocks have been read, how much time was spent on transferring the bytes from the os, how much time was spent on checksum verification, and how much time was spent on block decompression, just for that one Get. A nanosecond stopwatch was introduced to track time with higher precision. The cost/precision of the stopwatch is also measured in a unit test. On my dev box, retrieving one time instance costs about 30ns, on average. The deviation of timing results is good enough to track 100ns-1us level events. And the overhead can be safely ignored for 100us level events (10000 instances/s), for example, a viewstate thrift call.
      
      Test Plan: perf_context_test, also testing with viewstate shadow traffic.
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb, xjin
      
      Differential Revision: https://reviews.facebook.net/D12351
      f2f4c807
  11. 07 Sep, 2013 1 commit
    • D
      An iterator may automatically invoke reseeks. · 197034e4
      Committed by Dhruba Borthakur
      Summary:
      An iterator invokes reseek if the number of sequential skips over the
      same userkey exceeds a configured number. This makes iter->Next()
      faster (because of fewer key compares) if a large number of
      adjacent internal keys in a table (sst or memtable) have the
      same userkey.
      
      Test Plan: Unit test DBTest.IterReseek.
      
      Reviewers: emayanke, haobo, xjin
      
      Reviewed By: xjin
      
      CC: leveldb, xjin
      
      Differential Revision: https://reviews.facebook.net/D11865
      197034e4
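The reseek heuristic can be sketched as a skip counter over user keys (names and threshold handling are hypothetical):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// While advancing over internal keys, count sequential skips of the same
// user key; once the count exceeds a threshold, a Seek() past that key is
// cheaper than continuing to call Next() and compare keys one by one.
std::size_t CountSkipsBeforeReseek(const std::vector<std::string>& user_keys,
                                   std::size_t start,
                                   std::size_t max_sequential_skips) {
  std::size_t skips = 0;
  for (std::size_t i = start + 1; i < user_keys.size(); ++i) {
    if (user_keys[i] != user_keys[start]) break;  // new user key: no reseek
    if (++skips > max_sequential_skips) break;    // would switch to Seek() here
  }
  return skips;
}
```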
  12. 05 Sep, 2013 1 commit
    • X
      New ldb command to convert compaction style · 42c109cc
      Committed by Xing Jin
      Summary:
      Add new command "change_compaction_style" to ldb tool. For
      universal->level, it shows "nothing to do". For level->universal, it
      compacts all files into a single one and moves the file to level 0.
      
      Also add check for number of files at level 1+ when opening db with
      universal compaction style.
      
      Test Plan:
      'make all check'. New unit test for the internal conversion function. Also manually tested various
      cmds like:
      
      ./ldb change_compaction_style --old_compaction_style=0
      --new_compaction_style=1 --db=/tmp/leveldbtest-3088/db_test
      
      Reviewers: haobo, dhruba
      
      Reviewed By: haobo
      
      CC: vamsi, emayanke
      
      Differential Revision: https://reviews.facebook.net/D12603
      42c109cc
  13. 24 Aug, 2013 2 commits
  14. 23 Aug, 2013 1 commit
    • J
      Add three new MemTableRep's · 74781a0c
      Committed by Jim Paton
      Summary:
      This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep.
      
      UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that.
      
      VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector.
      
      PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets.
      
      I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed).
      
      Test Plan:
      make -j32 check
      ./db_stress --memtablerep=vector
      ./db_stress --memtablerep=unsorted
      ./db_stress --memtablerep=prefixhash --prefix_size=10
      
      Reviewers: dhruba, haobo, emayanke
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D12117
      74781a0c
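A toy version of the PrefixHashRep idea (prefixes hashed into buckets, each bucket an ordered set; the names and the fixed prefix length are illustrative, not the actual MemTableRep interface):

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>

// Keys are hashed by their prefix into an unordered_map; each bucket keeps
// its keys in an ordered std::set, so iteration within a prefix is sorted.
class ToyPrefixHashRep {
 public:
  explicit ToyPrefixHashRep(std::size_t prefix_len)
      : prefix_len_(prefix_len) {}

  void Insert(const std::string& key) {
    buckets_[key.substr(0, prefix_len_)].insert(key);
  }

  // Ordered iteration is cheap within a single prefix bucket.
  const std::set<std::string>& Bucket(const std::string& prefix) {
    return buckets_[prefix];
  }

 private:
  std::size_t prefix_len_;
  std::unordered_map<std::string, std::set<std::string>> buckets_;
};
```

This trades global ordering for fast prefix-scoped lookups, which is why a full iterator over such a rep needs an extra merge or dump-and-sort step.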
  15. 15 Aug, 2013 1 commit
    • T
      Add options to dump. · a8f47a40
      Committed by Tyler Harter
      Summary: Added options to Dump() that I missed in D12027.  I also ran a script to look for other missing options and found a couple, which I added.  Should we also print anything for "PrepareForBulkLoad", "memtable_factory", and "statistics"?  Or should we leave those alone since it's not easy to print useful info for those?
      
      Test Plan: run anything and look at LOG file to make sure these are printed now.
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D12219
      a8f47a40
  16. 14 Aug, 2013 2 commits
    • T
      Prefix filters for scans (v4) · f5f18422
      Committed by Tyler Harter
      Summary: Similar to v2 (db and table code understands prefixes), but use ReadOptions as in v3.  Also, make the CreateFilter code faster and cleaner.
      
      Test Plan: make db_test; export LEVELDB_TESTS=PrefixScan; ./db_test
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: haobo, emayanke
      
      Differential Revision: https://reviews.facebook.net/D12027
      f5f18422
    • S
      Separate compaction filter for each compaction · 3b81df34
      Committed by sumeet
      Summary:
      If we have the same compaction filter for every compaction,
      the application cannot distinguish between the different compaction processes.
      Later on, we can put more details into the compaction filter for the
      application to consume and use according to its needs. For example, in
      universal compaction we have a compaction process involving all the
      files while others don't involve all the files. Applications may want to
      collect some stats only during full compaction.
      
      Test Plan: run existing unit tests
      
      Reviewers: haobo, dhruba
      
      Reviewed By: dhruba
      
      CC: xinyaohu, leveldb
      
      Differential Revision: https://reviews.facebook.net/D12057
      3b81df34
  17. 09 Aug, 2013 1 commit
  18. 08 Aug, 2013 1 commit
    • X
      Fix unit tests/bugs for universal compaction (first step) · 17b8f786
      Committed by Xing Jin
      Summary:
      This is the first step to fix unit tests and bugs for universal
      compaction. I added the universal compaction option to ChangeOptions(), and
      fixed all unit tests calling ChangeOptions(). Some of these tests
      obviously assume more than 1 level and check file numbers/values in level
      1 or above. I set kSkipUniversalCompaction for these tests.
      
      The major bug I found is manual compaction with universal compaction never stops. I have put a fix for
      it.
      
      I have also set universal compaction as the default compaction and found
      at least 20+ unit tests failing. I haven't looked into the details. The
      next step is to check all unit tests without calling ChangeOptions().
      
      Test Plan: make all check
      
      Reviewers: dhruba, haobo
      
      Differential Revision: https://reviews.facebook.net/D12051
      17b8f786
  19. 06 Aug, 2013 2 commits
    • M
      Expose base db object from ttl wrapper · 1d7b4765
      Committed by Mayank Agarwal
      Summary: rocksdb replication will need this when writing value+TS from master to slave 'as is'
      
      Test Plan: make
      
      Reviewers: dhruba, vamsi, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11919
      1d7b4765
    • J
      Add soft and hard rate limit support · 1036537c
      Committed by Jim Paton
      Summary:
      This diff adds support for both soft and hard rate limiting. The following changes are included:
      
      1) Options.rate_limit is renamed to Options.hard_rate_limit.
      2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds.
      3) Options.soft_rate_limit is added.
      4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit.
      5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit.
      6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default.
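The delay behavior in points 4-6 can be sketched as a function of the compaction score (illustrative only; the real implementation delays inside the write path, 1 ms at a time, and re-checks the score):

```cpp
// Returns the delay in milliseconds to apply to one write, given the current
// max compaction score and the two limits. A limit of 0 disables that limit,
// matching point 6 above. Names are hypothetical.
double RateLimitDelayMs(double score, double soft_limit, double hard_limit) {
  if (hard_limit > 0 && score > hard_limit) {
    return 1.0;  // hard limit exceeded: delay 1 ms (then re-check the score)
  }
  if (soft_limit > 0 && score > soft_limit && score <= hard_limit) {
    // Between soft and hard limits: scale the delay from 0 up to 1 ms,
    // proportional to how close the score is to the hard limit.
    return (score - soft_limit) / (hard_limit - soft_limit);
  }
  return 0.0;  // below the soft limit, or limits disabled: no delay
}
```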
      
      Test Plan:
      make -j32 check
      ./db_stress
      
      Reviewers: dhruba, haobo, MarkCallaghan
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D12003
      1036537c
  20. 01 Aug, 2013 1 commit
    • X
      Make arena block size configurable · 0f0a24e2
      Committed by Xing Jin
      Summary:
      Add an option for arena block size, default value 4096 bytes. Arena will allocate blocks of that size.
      
      I am not sure about passing the parameter to the skiplist in the new virtualized framework, though I talked to Jim a bit. So I am adding Jim as a reviewer.
      
      Test Plan:
      new unit test, I am running db_test.
      
      For passing the parameter from the configured option to Arena, I tried tests like:
      
        TEST(DBTest, Arena_Option) {
          std::string dbname = test::TmpDir() + "/db_arena_option_test";
          DestroyDB(dbname, Options());
      
          DB* db = nullptr;
          Options opts;
          opts.create_if_missing = true;
          opts.arena_block_size = 1000000; // tested 99, 999999
          Status s = DB::Open(opts, dbname, &db);
          db->Put(WriteOptions(), "a", "123");
        }
      
      and printed some debug info. The results look good. Any suggestion for such a unit test?
      
      Reviewers: haobo, dhruba, emayanke, jpaton
      
      Reviewed By: dhruba
      
      CC: leveldb, zshao
      
      Differential Revision: https://reviews.facebook.net/D11799
      0f0a24e2
  21. 24 Jul, 2013 2 commits
    • J
      Virtualize SkipList Interface · 52d7ecfc
      Committed by Jim Paton
      Summary: This diff virtualizes the skiplist interface so that users can provide their own implementation of a backing store for MemTables. Eventually, the backing store will be responsible for its own synchronization, allowing users (and us) to experiment with different lockless implementations.
      
      Test Plan:
      make clean
      make -j32 check
      ./db_stress
      
      Reviewers: dhruba, emayanke, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11739
      52d7ecfc
    • M
      Use KeyMayExist for WriteBatch-Deletes · bf66c10b
      Committed by Mayank Agarwal
      Summary:
      Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete.
      Added code to skip getting Table from disk if not already present in table_cache.
      Some renaming of variables.
      Introduced KeyMayExistImpl, which allows checking since a specified sequence number in GetImpl, useful to check a partially written writebatch.
      Changed KeyMayExist to not be pure virtual and provided a default implementation.
      Expanded unit-tests in db_test to check appropriately.
      Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1.
      
      Test Plan: db_stress;make check
      
      Reviewers: dhruba, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb, xjin
      
      Differential Revision: https://reviews.facebook.net/D11745
      bf66c10b
  22. 18 Jul, 2013 1 commit
  23. 12 Jul, 2013 1 commit
    • M
      Make rocksdb-deletes faster using bloom filter · 2a986919
      Committed by Mayank Agarwal
      Summary:
      Wrote a new function in db_impl.cc, CheckKeyMayExist, that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that the key is not in the database. Delete calls this function, and if the option deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped, saving:
      1. Put of delete type
      2. Space in the db, and
      3. Compaction time
      
      Test Plan:
      make all check;
      will run db_stress and db_bench and enhance unit-test once the basic design gets approved
      
      Reviewers: dhruba, haobo, vamsi
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11607
      2a986919
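The filter-deletes idea can be sketched with a membership check standing in for the bloom filter (a set here, so it happens to have no false positives; the class and method names are hypothetical):

```cpp
#include <string>
#include <unordered_set>

// Before writing a delete marker, consult a membership check that may return
// false positives but never false negatives (the role a bloom filter plays).
// If the key cannot possibly exist, the delete is dropped entirely.
class ToyFilteredDeletes {
 public:
  void Put(const std::string& key) { maybe_present_.insert(key); }

  bool KeyMayExist(const std::string& key) const {
    return maybe_present_.count(key) > 0;
  }

  // Returns true if the delete marker was actually written.
  bool Delete(const std::string& key) {
    if (!KeyMayExist(key)) return false;  // drop: key is surely absent
    maybe_present_.erase(key);
    return true;
  }

 private:
  std::unordered_set<std::string> maybe_present_;
};
```

Dropping the delete is safe precisely because the check never produces false negatives: a key reported absent truly is absent.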
  24. 10 Jul, 2013 1 commit
  25. 04 Jul, 2013 2 commits
  26. 01 Jul, 2013 2 commits
    • D
      Reduce write amplification by merging files in L0 back into L0 · 47c4191f
      Committed by Dhruba Borthakur
      Summary:
      There is a new option called hybrid_mode which, when switched on,
      causes HBase style compactions.  Files from L0 are
      compacted back into L0. The meat of this compaction algorithm
      is in PickCompactionHybrid().
      
      All files reside in L0. That means all files have overlapping
      keys. Each file has a time-bound, i.e. each file contains a
      range of keys that were inserted around the same time. The
      start-seqno and the end-seqno refer to the timeframe when
      these keys were inserted.  Files that have contiguous seqnos
      are compacted together into a larger file. All files are
      ordered from most recent to the oldest.
      
      The current compaction algorithm starts to look for
      candidate files starting from the most recent file. It continues to
      add more files to the same compaction run as long as the
      sum of the files chosen till now is smaller than the next
      candidate file size. This logic needs to be debated
      and validated.
      
      The above logic should reduce write amplification to a
      large extent... will publish numbers shortly.
      
      Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
      
      Differential Revision: https://reviews.facebook.net/D11289
      47c4191f
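The candidate-picking rule described above (keep adding files while the size accumulated so far is smaller than the next candidate) can be sketched as follows; the function name is hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Walk files from most recent to oldest. Starting from the newest file,
// keep adding the next file to the compaction run as long as the sum of the
// files chosen so far is smaller than that next candidate's size.
// Returns the number of files chosen.
std::size_t PickCompactionRun(
    const std::vector<uint64_t>& sizes_newest_first) {
  if (sizes_newest_first.empty()) return 0;
  uint64_t sum = sizes_newest_first[0];
  std::size_t picked = 1;
  for (std::size_t i = 1; i < sizes_newest_first.size(); ++i) {
    if (sum >= sizes_newest_first[i]) break;  // run is already as big as next
    sum += sizes_newest_first[i];
    ++picked;
  }
  return picked;
}
```

The effect is that small recent files are repeatedly folded into a growing run, which is how the write amplification reduction is achieved.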
    • D
      Reduce write amplification by merging files in L0 back into L0 · 554c06dd
      Committed by Dhruba Borthakur
      Summary:
      There is a new option called hybrid_mode which, when switched on,
      causes HBase style compactions.  Files from L0 are
      compacted back into L0. The meat of this compaction algorithm
      is in PickCompactionHybrid().
      
      All files reside in L0. That means all files have overlapping
      keys. Each file has a time-bound, i.e. each file contains a
      range of keys that were inserted around the same time. The
      start-seqno and the end-seqno refer to the timeframe when
      these keys were inserted.  Files that have contiguous seqnos
      are compacted together into a larger file. All files are
      ordered from most recent to the oldest.
      
      The current compaction algorithm starts to look for
      candidate files starting from the most recent file. It continues to
      add more files to the same compaction run as long as the
      sum of the files chosen till now is smaller than the next
      candidate file size. This logic needs to be debated
      and validated.
      
      The above logic should reduce write amplification to a
      large extent... will publish numbers shortly.
      
      Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
      
      Differential Revision: https://reviews.facebook.net/D11289
      554c06dd
  27. 22 Jun, 2013 1 commit
    • M
      Simplify bucketing logic in ldb-ttl · b858da70
      Committed by Mayank Agarwal
      Summary: [start_time, end_time) is what I'm following for the buckets and the whole time-range. Also cleaned up some code in db_ttl.* Not correcting the spacing/indenting convention for util/ldb_cmd.cc in this diff.
      
      Test Plan: python ldb_test.py, make ttl_test, run the mcrocksdb-backup tool, run the ldb tool on 2 mcrocksdb production backups from sigmafio033.prn1
      
      Reviewers: vamsi, haobo
      
      Reviewed By: vamsi
      
      Differential Revision: https://reviews.facebook.net/D11433
      b858da70
  28. 20 Jun, 2013 2 commits
    • M
      Introducing timeranged scan, timeranged dump in ldb. Also the ability to count... · 61f1baae
      Committed by Mayank Agarwal
      Introducing timeranged scan, timeranged dump in ldb. Also the ability to count in time-batches during Dump
      
      Summary:
      Scan and Dump commands in ldb use an iterator. We need to also print the timestamp for ttl databases for debugging. For this I create a TtlIterator class pointer in these functions and assign it the value of the Iterator pointer, which actually points to a TtlIterator object, and access the new function ValueWithTS which can return the TS also. Buckets feature for the dump command: gives a count of the different key-values in the specified time-range, distributed across the time-range partitioned according to bucket-size. start_time and end_time are specified as unix timestamps and bucket in seconds on the user command line.
      Have commented out 3 lines from ldb_test.py so that the test does not break right now. It breaks because the timestamp is also printed now and I have to look at wildcards in python to compare properly.
      
      Test Plan: python tools/ldb_test.py
      
      Reviewers: vamsi, dhruba, haobo, sheki
      
      Reviewed By: vamsi
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11403
      61f1baae
    • H
      [RocksDB] Add mmap_read option for db_stress · 96be2c4e
      Committed by Haobo Xu
      Summary: as title, also removed an incorrect assertion
      
      Test Plan: make check; db_stress --mmap_read=1; db_stress --mmap_read=0
      
      Reviewers: dhruba, emayanke
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11367
      96be2c4e
  29. 19 Jun, 2013 2 commits