1. 01 Jul, 2013 3 commits
    • D
      Merge branch 'performance' of github.com:facebook/rocksdb into performance · a84f5470
      Dhruba Borthakur authored
      Conflicts:
      	db/db_bench.cc
      	db/version_set.cc
      	include/leveldb/options.h
      	util/options.cc
      a84f5470
    • D
      Reduce write amplification by merging files in L0 back into L0 · 47c4191f
      Dhruba Borthakur authored
      Summary:
      There is a new option called hybrid_mode which, when switched on,
      causes HBase style compactions.  Files from L0 are
      compacted back into L0. The meat of this compaction algorithm
      is in PickCompactionHybrid().
      
      All files reside in L0. That means all files have overlapping
      keys. Each file has a time-bound, i.e. each file contains a
      range of keys that were inserted around the same time. The
      start-seqno and the end-seqno refer to the timeframe when
      these keys were inserted.  Files that have contiguous seqno
      are compacted together into a larger file. All files are
      ordered from most recent to the oldest.
      
      The current compaction algorithm starts to look for
      candidate files starting from the most recent file. It continues to
      add more files to the same compaction run as long as the
      total size of the files chosen so far is smaller than the
      size of the next candidate file. This logic needs to be debated
      and validated.
      
      The above logic should reduce write amplification to a
      large extent... will publish numbers shortly.
      
      Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
      
      Differential Revision: https://reviews.facebook.net/D11289
      47c4191f
    • D
      Reduce write amplification by merging files in L0 back into L0 · 554c06dd
      Dhruba Borthakur authored
      Summary:
      There is a new option called hybrid_mode which, when switched on,
      causes HBase style compactions.  Files from L0 are
      compacted back into L0. The meat of this compaction algorithm
      is in PickCompactionHybrid().
      
      All files reside in L0. That means all files have overlapping
      keys. Each file has a time-bound, i.e. each file contains a
      range of keys that were inserted around the same time. The
      start-seqno and the end-seqno refer to the timeframe when
      these keys were inserted.  Files that have contiguous seqno
      are compacted together into a larger file. All files are
      ordered from most recent to the oldest.
      
      The current compaction algorithm starts to look for
      candidate files starting from the most recent file. It continues to
      add more files to the same compaction run as long as the
      total size of the files chosen so far is smaller than the
      size of the next candidate file. This logic needs to be debated
      and validated.
      
      The above logic should reduce write amplification to a
      large extent... will publish numbers shortly.
      
      Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
      
      Differential Revision: https://reviews.facebook.net/D11289
      554c06dd
  2. 27 Jun, 2013 2 commits
    • H
      [RocksDB] Expose count for WriteBatch · 71e0f695
      Haobo Xu authored
      Summary: As title. Exposed a Count function that returns the number of updates in a batch. Could be handy for replication sequence number check.
      
      Test Plan: make check;
      
      Reviewers: emayanke, sheki, dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11523
      71e0f695
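The Count() accessor described above can be modeled with a tiny batch that tracks how many updates it holds (leveldb-style batches keep this count in a fixed header). This is a simplified sketch, not the real WriteBatch:

```cpp
#include <cassert>
#include <string>

// Minimal model of a write batch exposing the number of updates it contains,
// in the spirit of the Count() function this commit adds. Simplified sketch
// only; the real WriteBatch serializes operations into a byte buffer.
class MiniWriteBatch {
 public:
  void Put(const std::string& key, const std::string& value) { ++count_; }
  void Delete(const std::string& key) { ++count_; }
  int Count() const { return count_; }  // number of updates in the batch
 private:
  int count_ = 0;
};
```

A replication layer could compare this count against applied sequence numbers, as the summary suggests.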
    • D
      Added stringappend_test back into the unit tests. · 34ef8732
      Deon Nicholas authored
      Summary:
      With the Makefile now updated to correctly update all .o files, this
      should fix the issues recompiling stringappend_test. This should also fix the
      "segmentation-fault" that we were getting earlier. Now, stringappend_test should
      be clean, and I have added it back to the unit-tests. Also made some minor updates
      to the tests themselves.
      
      Test Plan:
      1. make clean; make stringappend_test -j 32	(will test it by itself)
      2. make clean; make all check -j 32		(to run all unit tests)
      3. make clean; make release			(test in release mode)
      4. valgrind ./stringappend_test 		(valgrind tests)
      
      Reviewers: haobo, jpaton, dhruba
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11505
      34ef8732
  3. 26 Jun, 2013 1 commit
    • D
      Updated "make clean" to remove all .o files · 6894a50a
      Deon Nicholas authored
      Summary:
      The old Makefile did not remove ALL .o and .d files, but rather only
      those that happened to be in the root folder and one-level deep. This was causing
      issues when recompiling files in deeper folders. This fix now causes make clean
      to find ALL .o and .d files via a unix "find" command, and then remove them.
      
      Test Plan:
      make clean;
      make all -j 32;
      
      Reviewers: haobo, jpaton, dhruba
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11493
      6894a50a
  4. 22 Jun, 2013 1 commit
    • M
      Simplify bucketing logic in ldb-ttl · b858da70
      Mayank Agarwal authored
      Summary: [start_time, end_time) is what I'm following for the buckets and the whole time-range. Also cleaned up some code in db_ttl.* Not correcting the spacing/indenting convention for util/ldb_cmd.cc in this diff.
      
      Test Plan: python ldb_test.py, make ttl_test, Run mcrocksdb-backup tool, Run the ldb tool on 2 mcrocksdb production backups from sigmafio033.prn1
      
      Reviewers: vamsi, haobo
      
      Reviewed By: vamsi
      
      Differential Revision: https://reviews.facebook.net/D11433
      b858da70
  5. 20 Jun, 2013 4 commits
    • M
      Introducing timeranged scan, timeranged dump in ldb. Also the ability to count... · 61f1baae
      Mayank Agarwal authored
      Introducing timeranged scan, timeranged dump in ldb. Also the ability to count in time-batches during Dump
      
      Summary:
      Scan and Dump commands in ldb use iterator. We need to also print timestamp for ttl databases for debugging. For this I create a TtlIterator class pointer in these functions and assign it the value of Iterator pointer which actually points to t TtlIterator object, and access the new function ValueWithTS which can return TS also. Buckets feature for dump command: gives a count of different key-values in the specified time-range distributed across the time-range partitioned according to bucket-size. start_time and end_time are specified in unixtimestamp and bucket in seconds on the user-commandline
      Have commented out 3 ines from ldb_test.py so that the test does not break right now. It breaks because timestamp is also printed now and I have to look at wildcards in python to compare properly.
      
      Test Plan: python tools/ldb_test.py
      
      Reviewers: vamsi, dhruba, haobo, sheki
      
      Reviewed By: vamsi
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11403
      61f1baae
    • H
      [RocksDB] add back --mmap_read options to crashtest · 0f78fad9
      Haobo Xu authored
      Summary: As title, now that db_stress supports --mmap_read properly
      
      Test Plan: make crash_test
      
      Reviewers: vamsi, emayanke, dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11391
      0f78fad9
    • H
      [RocksDB] Minor change to statistics.h · 4deaa0d4
      Haobo Xu authored
      Summary: as title, use an initializer list so that lines fit in 80 chars.
      
      Test Plan: make check;
      
      Reviewers: sheki, dhruba
      
      Differential Revision: https://reviews.facebook.net/D11385
      4deaa0d4
    • H
      [RocksDB] Add mmap_read option for db_stress · 96be2c4e
      Haobo Xu authored
      Summary: as title, also removed an incorrect assertion
      
      Test Plan: make check; db_stress --mmap_read=1; db_stress --mmap_read=0
      
      Reviewers: dhruba, emayanke
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11367
      96be2c4e
  6. 19 Jun, 2013 5 commits
  7. 18 Jun, 2013 4 commits
  8. 15 Jun, 2013 4 commits
  9. 14 Jun, 2013 2 commits
  10. 13 Jun, 2013 3 commits
    • H
      [RocksDB] Sync file to disk incrementally · 778e1790
      Haobo Xu authored
      Summary:
      During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic.
      This diff simply asks the OS to sync data incrementally as it is written, in the background. The hope is that, at the final sync, most of the data is already on disk and we will block less on the sync call. Thus, each compaction runs faster and we can use fewer compaction threads to saturate IO.
      In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too.
      
      Some quick tests show 10~20% improvement in per thread compaction throughput. Combined with posix advice on compaction read, just 5 threads are enough to almost saturate the udb flash bandwidth for 800 bytes write only benchmark.
      What's more promising is that, with saturated IO, iostat shows average wait time is actually smoother and much smaller.
      For the write-only 800-byte test:
      Before the change: await oscillates between 3ms and 10ms
      After the change: await ranges from 1ms to 3ms
      
      Will test against a read-modify-write workload too, to see if the high read P99 latency can be resolved.
      
      Will introduce a parameter to control the sync interval in a follow up diff after cleaning up EnvOptions.
      
      Test Plan: make check; db_bench; db_stress
      
      Reviewers: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11115
      778e1790
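The decision the change describes, asking the OS to write back accumulated bytes in the background so the final sync has little left to flush, can be modeled as a small tracker. The 1MB interval in the usage below is an illustrative value, and the real change issues an OS range-sync call (e.g. sync_file_range on Linux) where this sketch just returns true:

```cpp
#include <cassert>
#include <cstdint>

// Track bytes appended to a compaction output file and decide when a
// background range sync should be issued for the pending bytes.
// Illustrative sketch of the policy only, not the actual file code.
class IncrementalSyncTracker {
 public:
  explicit IncrementalSyncTracker(uint64_t sync_interval)
      : sync_interval_(sync_interval) {}

  // Called after each append; returns true when the accumulated bytes
  // should be handed to the OS for background writeback.
  bool OnAppend(uint64_t bytes_written) {
    pending_ += bytes_written;
    if (pending_ >= sync_interval_) {
      pending_ = 0;  // range handed off; counting starts over
      return true;
    }
    return false;
  }

 private:
  uint64_t sync_interval_;
  uint64_t pending_ = 0;
};
```

Spreading writeback this way is what smooths the burst at the final sync and, per the numbers above, the IO wait times.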
    • D
      [Rocksdb] [Multiget] Introduced multiget into db_bench · 4985a9f7
      Deon Nicholas authored
      Summary:
      Preliminary! Introduced the --use_multiget=1 and --keys_per_multiget=n
      flags for db_bench. Also updated and tested the ReadRandom() method
      to include an option to use multiget. By default,
      keys_per_multiget=100.
      
      Preliminary tests imply that multiget is at least 1.25x faster per
      key than regular get.
      
      Will continue adding Multiget for ReadMissing, ReadHot,
      RandomWithVerify, ReadRandomWriteRandom; soon. Will also think
      about ways to better verify benchmarks.
      
      Test Plan:
      1. make db_bench
      2. ./db_bench --benchmarks=fillrandom
      3. ./db_bench --benchmarks=readrandom --use_existing_db=1
      	      --use_multiget=1 --threads=4 --keys_per_multiget=100
      4. ./db_bench --benchmarks=readrandom --use_existing_db=1
      	      --threads=4
      5. Verify ops/sec (and 1000000 of 1000000 keys found)
      
      Reviewers: haobo, MarkCallaghan, dhruba
      
      Reviewed By: MarkCallaghan
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11127
      4985a9f7
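The benchmark change above amounts to issuing reads in groups of keys_per_multiget through one batched call instead of one Get() per key. A sketch of that batching loop, with a callback standing in for the database's batched-read entry point (the callback signature is an assumption, not the db_bench code):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Split `keys` into groups of `keys_per_multiget` and hand each group to one
// batched read call. Returns the number of batched calls issued.
// Illustrative sketch; `multiget` stands in for the real batched-read API.
size_t ReadInMultigetBatches(
    const std::vector<std::string>& keys, size_t keys_per_multiget,
    const std::function<void(const std::vector<std::string>&)>& multiget) {
  size_t calls = 0;
  for (size_t i = 0; i < keys.size(); i += keys_per_multiget) {
    size_t end = std::min(i + keys_per_multiget, keys.size());
    multiget(std::vector<std::string>(keys.begin() + i, keys.begin() + end));
    ++calls;
  }
  return calls;
}
```

With the default keys_per_multiget=100, a million-key read phase becomes ten thousand batched calls, which is where the reported per-key speedup comes from.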
    • H
      [RocksDB] cleanup EnvOptions · bdf10859
      Haobo Xu authored
      Summary:
      This diff simplifies EnvOptions by treating it as POD, similar to Options.
      - virtual functions are removed and member fields are accessed directly.
      - StorageOptions is removed.
      - Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
      - Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite
      
      Test Plan: make check; db_stress
      
      Reviewers: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11175
      bdf10859
  11. 12 Jun, 2013 1 commit
    • D
      Completed the implementation and test cases for Redis API. · 5679107b
      Deon Nicholas authored
      Summary:
      Completed the implementation for the Redis API for Lists.
      The Redis API uses rocksdb as a backend to persistently
      store maps from key->list. It supports basic operations
      for appending, inserting, pushing, popping, and accessing
      a list, given its key.
      
      Test Plan:
        - Compile with: make redis_test
        - Test with: ./redis_test
        - Run all unit tests (for all rocksdb) with: make all check
        - To use an interactive REDIS client use: ./redis_test -m
        - To clean the database before use:       ./redis_test -m -d
      
      Reviewers: haobo, dhruba, zshao
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D10833
      5679107b
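The key-to-list mapping described above can be modeled with a toy class where an in-memory map stands in for the rocksdb backend. Method names mirror Redis commands; this is an illustrative sketch, not the actual API exercised by redis_test:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model of the persistent key -> list mapping the Redis API provides,
// with an in-memory map standing in for the rocksdb backend.
class ToyRedisLists {
 public:
  void RPush(const std::string& key, const std::string& value) {
    store_[key].push_back(value);  // append to the right of the list
  }
  void LPush(const std::string& key, const std::string& value) {
    auto& list = store_[key];
    list.insert(list.begin(), value);  // push onto the left of the list
  }
  std::string Index(const std::string& key, size_t i) const {
    return store_.at(key).at(i);  // random access into the list
  }
  size_t Length(const std::string& key) const {
    auto it = store_.find(key);
    return it == store_.end() ? 0 : it->second.size();
  }

 private:
  std::map<std::string, std::vector<std::string>> store_;
};
```

The real implementation must additionally serialize each list into a rocksdb value so it survives restarts, which the toy model omits.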
  12. 11 Jun, 2013 6 commits
    • D
      Do not submit multiple simultaneous seek-compaction requests. · e673d5d2
      Dhruba Borthakur authored
      Summary:
      The code was such that if multi-threaded-compactions as well
      as seek compaction are enabled then it submits multiple
      compaction request for the same range of keys. This causes
      extraneous sst-files to accumulate at various levels.
      
      Test Plan:
      I am not able to write a very good unit test for this one
      but can easily reproduce this bug with 'dbstress' with the
      following options.
      
      batch=1;maxk=100000000;ops=100000000;ro=0;fm=2;bpl=10485760;of=500000; wbn=3; mbc=20; mb=2097152; wbs=4194304; dds=1; sync=0;  t=32; bs=16384; cs=1048576; of=500000; ./db_stress --disable_seek_compaction=0 --mmap_read=0 --threads=$t --block_size=$bs --cache_size=$cs --open_files=$of --verify_checksum=1 --db=/data/mysql/leveldb/dbstress.dir --sync=$sync --disable_wal=1 --disable_data_sync=$dds --write_buffer_size=$wbs --target_file_size_base=$mb --target_file_size_multiplier=$fm --max_write_buffer_number=$wbn --max_background_compactions=$mbc --max_bytes_for_level_base=$bpl --reopen=$ro --ops_per_thread=$ops --max_key=$maxk --test_batches_snapshots=$batch
      
      Reviewers: leveldb, emayanke
      
      Reviewed By: emayanke
      
      Differential Revision: https://reviews.facebook.net/D11055
      e673d5d2
    • M
      Make Write API work for TTL databases · 3c35eda9
      Mayank Agarwal authored
      Summary: Added logic to make another WriteBatch with Timestamps during Write function execution in the TTL class. Also expanded ttl_test to test for it. Have done nothing for Merge for now.
      
      Test Plan: make ttl_test;./ttl_test
      
      Reviewers: haobo, vamsi, dhruba
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D10827
      3c35eda9
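The TTL write path described above rewrites each value so it carries a creation timestamp that reads can later check for freshness. A sketch of that idea using a 4-byte little-endian suffix; the actual on-disk encoding used by the TTL class is an assumption here:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append a 4-byte little-endian creation timestamp to a value before writing.
// Illustrative model of the TTL value rewriting, not the actual encoding.
std::string AppendTimestamp(const std::string& value, uint32_t ts) {
  std::string out = value;
  for (int i = 0; i < 4; ++i) {
    out.push_back(static_cast<char>((ts >> (8 * i)) & 0xff));
  }
  return out;
}

// Recover the timestamp from the last 4 bytes of a stored value.
uint32_t ExtractTimestamp(const std::string& stored) {
  uint32_t ts = 0;
  size_t base = stored.size() - 4;
  for (int i = 0; i < 4; ++i) {
    ts |= static_cast<uint32_t>(static_cast<unsigned char>(stored[base + i]))
          << (8 * i);
  }
  return ts;
}

// Strip the suffix to return the user-visible value on reads.
std::string StripTimestamp(const std::string& stored) {
  return stored.substr(0, stored.size() - 4);
}
```

Rewriting the batch entry-by-entry with AppendTimestamp is what lets a plain Write() go through the TTL layer unchanged from the caller's point of view.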
    • D
      Fix referring to freed memory in earlier commit · 1b69f1e5
      Dhruba Borthakur authored
      Summary: Fix referring to freed memory in the earlier commit https://reviews.facebook.net/D11181
      
      Test Plan: make check
      
      Reviewers: haobo, sheki
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11193
      1b69f1e5
    • A
      [Rocksdb] fix wrong assert · 4a8554d5
      Abhishek Kona authored
      Summary: The assert added in D11145 was wrong and broke the build.
      
      Test Plan: make db_bench; run it
      
      Reviewers: dhruba, haobo, emayanke
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11187
      4a8554d5
    • D
      Print name of user comparator in LOG. · c5de1b93
      Dhruba Borthakur authored
      Summary:
      The current code prints the name of the InternalKeyComparator
      in the log file. We would also like to print the name of the
      user-specified comparator for easier debugging.
      
      Test Plan: make check
      
      Reviewers: sheki
      
      Reviewed By: sheki
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11181
      c5de1b93
    • A
      [rocksdb] names for all metrics provided in statistics.h · a4913c51
      Abhishek Kona authored
      Summary: Provide a map of histograms and tickers to strings. Fb303 libraries can use this to provide the mapping, so we will not have to duplicate the code during release.
      
      Test Plan: db_bench with statistics=1
      
      Reviewers: dhruba, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11145
      a4913c51
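The shared mapping described above can be sketched as a single table that both the stats code and external reporting (e.g. fb303) consume. The enum values and names below are examples for illustration, not the real list in statistics.h:

```cpp
#include <cassert>
#include <map>
#include <string>

// Example ticker enum; the real statistics.h defines many more.
enum Tickers { BLOCK_CACHE_MISS = 0, BLOCK_CACHE_HIT = 1 };

// One shared ticker -> name table, so reporting layers look names up here
// instead of duplicating the strings. Illustrative sketch only.
const std::map<Tickers, std::string> TickersNameMap = {
    {BLOCK_CACHE_MISS, "rocksdb.block.cache.miss"},
    {BLOCK_CACHE_HIT, "rocksdb.block.cache.hit"},
};
```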
  13. 10 Jun, 2013 2 commits
    • M
      Max_mem_compaction_level can have maximum value of num_levels-1 · 184343a0
      Mayank Agarwal authored
      Summary:
      Without this, files could be written out to a level greater than the maximum level possible, and this was the source of the segfaults that wormhole was getting. The sequence of steps that was followed:
      1. WriteLevel0Table was called when memtable was to be flushed for a file.
      2. PickLevelForMemTableOutput was called to determine the level to which this file should be pushed.
      3. PickLevelForMemTableOutput returned a wrong result because max_mem_compaction_level was equal to 2 even when num_levels was equal to 0.
      The fix to re-initialize max_mem_compaction_level based on num_levels passed seems correct.
      
      Test Plan: make all check; Also made a dummy file to mimic the wormhole-file behaviour which was causing the segfaults, and found that the same segfault occurs without this change and not with it.
      
      Reviewers: dhruba, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11157
      184343a0
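The fix described above amounts to re-clamping the option against the actual number of levels, since a memtable flush may never target a level beyond num_levels - 1. A one-line sketch of that clamp (the function name is illustrative, not the actual initialization code):

```cpp
#include <algorithm>
#include <cassert>

// A memtable flush may never target a level index beyond num_levels - 1,
// so clamp the configured option against it. Illustrative sketch only.
int ClampMaxMemCompactionLevel(int max_mem_compaction_level, int num_levels) {
  return std::min(max_mem_compaction_level, num_levels - 1);
}
```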
    • M
      Modifying options to db_stress when it is run with db_crashtest · 7a6bd8e9
      Mayank Agarwal authored
      Summary: These extra options caught some bugs. They will now be run via Jenkins with the crash_test.
      
      Test Plan: make crash_test
      
      Reviewers: dhruba, vamsi
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11151
      7a6bd8e9
  14. 08 Jun, 2013 2 commits