1. 14 Jun, 2013 (1 commit)
  2. 13 Jun, 2013 (3 commits)
    • [RocksDB] Sync file to disk incrementally · 778e1790
      Committed by Haobo Xu
      Summary:
      During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic.
      This diff simply asks the OS to sync data incrementally as it is written, in the background. The hope is that, at the final sync, most of the data is already on disk and we will block less on the sync call. Thus, each compaction runs faster and we can use fewer compaction threads to saturate IO.
      In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too.
      
      Some quick tests show a 10~20% improvement in per-thread compaction throughput. Combined with posix advice on compaction reads, just 5 threads are enough to almost saturate the udb flash bandwidth for an 800-byte write-only benchmark.
      What's more promising is that, with saturated IO, iostat shows the average wait time is actually smoother and much smaller.
      For the write-only 800-byte test:
      Before the change: await oscillates between 10ms and 3ms
      After the change: await ranges from 1ms to 3ms
      
      Will test against read-modify-write workload too, see if high read latency P99 could be resolved.
      
      Will introduce a parameter to control the sync interval in a follow up diff after cleaning up EnvOptions.
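      The incremental approach above can be sketched as a small model. This is illustrative only: the class and parameter names are invented here, and in the real implementation the range-sync callback would map to something like sync_file_range() on Linux.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Minimal model of incremental syncing: every time `bytes_per_sync` new bytes
// accumulate, a range-sync callback is invoked for the dirty range, so the
// final sync at file close has little data left to flush.
class IncrementalSyncWriter {
 public:
  IncrementalSyncWriter(std::size_t bytes_per_sync,
                        std::function<void(std::size_t, std::size_t)> range_sync)
      : bytes_per_sync_(bytes_per_sync), range_sync_(std::move(range_sync)) {}

  // Record n appended bytes and kick off background writeback as needed.
  void Append(std::size_t n) {
    written_ += n;
    while (written_ - synced_ >= bytes_per_sync_) {
      range_sync_(synced_, bytes_per_sync_);  // offset, length of dirty range
      synced_ += bytes_per_sync_;
    }
  }

  std::size_t synced() const { return synced_; }

 private:
  std::size_t bytes_per_sync_;
  std::size_t written_ = 0;
  std::size_t synced_ = 0;
  std::function<void(std::size_t, std::size_t)> range_sync_;
};
```

      The key property is that writeback is requested continuously in the background rather than as one burst at file close.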
      
      Test Plan: make check; db_bench; db_stress
      
      Reviewers: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11115
    • [Rocksdb] [Multiget] Introduced multiget into db_bench · 4985a9f7
      Committed by Deon Nicholas
      Summary:
      Preliminary! Introduced the --use_multiget=1 and --keys_per_multiget=n
      flags for db_bench. Also updated and tested the ReadRandom() method
      to include an option to use multiget. By default,
      keys_per_multiget=100.
      
      Preliminary tests imply that multiget is at least 1.25x faster per
      key than regular get.
      
      Will continue adding Multiget to ReadMissing, ReadHot,
      RandomWithVerify, and ReadRandomWriteRandom soon. Will also think
      about ways to better verify benchmarks.
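      The batching step described above can be sketched as follows; this is a hypothetical illustration, not the actual db_bench code, and the function name is invented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Group the benchmark's key stream into groups of --keys_per_multiget keys,
// each of which would then be issued as a single MultiGet call instead of
// that many individual Get calls.
std::vector<std::vector<std::string>> MakeMultiGetBatches(
    const std::vector<std::string>& keys, std::size_t keys_per_multiget) {
  std::vector<std::vector<std::string>> batches;
  for (std::size_t i = 0; i < keys.size(); i += keys_per_multiget) {
    std::size_t end = std::min(i + keys_per_multiget, keys.size());
    batches.emplace_back(keys.begin() + i, keys.begin() + end);
  }
  return batches;
}
```

      Amortizing per-call overhead across a batch is where the reported ~1.25x per-key speedup would come from.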
      
      Test Plan:
      1. make db_bench
      2. ./db_bench --benchmarks=fillrandom
      3. ./db_bench --benchmarks=readrandom --use_existing_db=1
      	      --use_multiget=1 --threads=4 --keys_per_multiget=100
      4. ./db_bench --benchmarks=readrandom --use_existing_db=1
      	      --threads=4
      5. Verify ops/sec (and 1000000 of 1000000 keys found)
      
      Reviewers: haobo, MarkCallaghan, dhruba
      
      Reviewed By: MarkCallaghan
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11127
    • [RocksDB] cleanup EnvOptions · bdf10859
      Committed by Haobo Xu
      Summary:
      This diff simplifies EnvOptions by treating it as POD, similar to Options.
      - virtual functions are removed and member fields are accessed directly.
      - StorageOptions is removed.
      - Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
      - Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite
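      The POD treatment described above amounts to something like the sketch below; field names and defaults are examples, not the exact RocksDB definition.

```cpp
#include <cassert>

// Illustrative sketch of EnvOptions as plain-old-data: public fields that are
// read and assigned directly, with no virtual accessors to override.
struct EnvOptionsSketch {
  bool use_os_buffer = true;
  bool use_mmap_reads = false;
  bool use_mmap_writes = true;
};
```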
      
      Test Plan: make check; db_stress
      
      Reviewers: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11175
  3. 12 Jun, 2013 (1 commit)
    • Completed the implementation and test cases for Redis API. · 5679107b
      Committed by Deon Nicholas
      Summary:
      Completed the implementation for the Redis API for Lists.
      The Redis API uses rocksdb as a backend to persistently
      store maps from key->list. It supports basic operations
      for appending, inserting, pushing, popping, and accessing
      a list, given its key.
      
      Test Plan:
        - Compile with: make redis_test
        - Test with: ./redis_test
        - Run all unit tests (for all rocksdb) with: make all check
        - To use an interactive REDIS client use: ./redis_test -m
        - To clean the database before use:       ./redis_test -m -d
      
      Reviewers: haobo, dhruba, zshao
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D10833
  4. 11 Jun, 2013 (6 commits)
    • Do not submit multiple simultaneous seek-compaction requests. · e673d5d2
      Committed by Dhruba Borthakur
      Summary:
      The code was such that if multi-threaded compactions as well
      as seek compaction are enabled, then it submits multiple
      compaction requests for the same range of keys. This causes
      extraneous sst-files to accumulate at various levels.
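      The shape of the fix can be sketched as a scheduling guard; names here are illustrative, not the actual RocksDB code.

```cpp
#include <cassert>

// Remember that a compaction request is already queued for the background
// thread and refuse to queue another until the first one completes, so the
// same range of keys is not compacted twice concurrently.
struct CompactionScheduler {
  bool bg_compaction_scheduled = false;
  int submitted = 0;

  void MaybeScheduleCompaction() {
    if (bg_compaction_scheduled) {
      return;  // a request for this work is already in flight
    }
    bg_compaction_scheduled = true;
    ++submitted;
  }

  void BackgroundCompactionDone() { bg_compaction_scheduled = false; }
};
```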
      
      Test Plan:
      I am not able to write a very good unit test for this one,
      but I can easily reproduce this bug with 'db_stress' using the
      following options.
      
      batch=1;maxk=100000000;ops=100000000;ro=0;fm=2;bpl=10485760;of=500000; wbn=3; mbc=20; mb=2097152; wbs=4194304; dds=1; sync=0;  t=32; bs=16384; cs=1048576; of=500000; ./db_stress --disable_seek_compaction=0 --mmap_read=0 --threads=$t --block_size=$bs --cache_size=$cs --open_files=$of --verify_checksum=1 --db=/data/mysql/leveldb/dbstress.dir --sync=$sync --disable_wal=1 --disable_data_sync=$dds --write_buffer_size=$wbs --target_file_size_base=$mb --target_file_size_multiplier=$fm --max_write_buffer_number=$wbn --max_background_compactions=$mbc --max_bytes_for_level_base=$bpl --reopen=$ro --ops_per_thread=$ops --max_key=$maxk --test_batches_snapshots=$batch
      
      Reviewers: leveldb, emayanke
      
      Reviewed By: emayanke
      
      Differential Revision: https://reviews.facebook.net/D11055
    • Make Write API work for TTL databases · 3c35eda9
      Committed by Mayank Agarwal
      Summary: Added logic to build another WriteBatch with timestamps during the Write function execution in the TTL class. Also expanded ttl_test to test for it. Nothing has been done for Merge yet.
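      The timestamping idea can be sketched as below. The 4-byte host-order suffix is an assumption for illustration, not necessarily the exact on-disk encoding used by the TTL layer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// A creation timestamp is appended to each value as it goes into the
// rewritten WriteBatch, so reads can later detect and drop expired entries.
std::string AppendTimestamp(const std::string& value, std::uint32_t now) {
  std::string out = value;
  char buf[sizeof(now)];
  std::memcpy(buf, &now, sizeof(now));
  out.append(buf, sizeof(buf));  // value bytes followed by timestamp suffix
  return out;
}

std::uint32_t ExtractTimestamp(const std::string& stored) {
  std::uint32_t ts = 0;
  std::memcpy(&ts, stored.data() + stored.size() - sizeof(ts), sizeof(ts));
  return ts;
}
```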
      
      Test Plan: make ttl_test;./ttl_test
      
      Reviewers: haobo, vamsi, dhruba
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D10827
    • Fix referring to freed memory in earlier commit. · 1b69f1e5
      Committed by Dhruba Borthakur
      Summary: Fix referring to freed memory introduced by the earlier commit https://reviews.facebook.net/D11181.
      
      Test Plan: make check
      
      Reviewers: haobo, sheki
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11193
    • [Rocksdb] fix wrong assert · 4a8554d5
      Committed by Abhishek Kona
      Summary: The assert introduced in D11145 was wrong and broke the build.

      Test Plan: make db_bench and run it
      
      Reviewers: dhruba, haobo, emayanke
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11187
    • Print name of user comparator in LOG. · c5de1b93
      Committed by Dhruba Borthakur
      Summary:
      The current code prints the name of the InternalKeyComparator
      in the log file. We would also like to print the name of the
      user-specified comparator for easier debugging.
      
      Test Plan: make check
      
      Reviewers: sheki
      
      Reviewed By: sheki
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11181
    • [rocksdb] names for all metrics provided in statistics.h · a4913c51
      Committed by Abhishek Kona
      Summary: Provide a map from each histogram and ticker to its string name. Fb303 libraries can use this to provide the mapping, so the code does not have to be duplicated at release time.
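      The single-source-of-truth idea looks roughly like the sketch below; the enum members and strings here are illustrative stand-ins, not the exact entries of statistics.h.

```cpp
#include <cassert>
#include <map>
#include <string>

// One table mapping each ticker enum to its display string, which exporters
// such as fb303 can iterate instead of duplicating the names.
enum TickerSketch { BLOCK_CACHE_MISS, BLOCK_CACHE_HIT, BYTES_WRITTEN };

const std::map<TickerSketch, std::string> kTickerNames = {
    {BLOCK_CACHE_MISS, "rocksdb.block.cache.miss"},
    {BLOCK_CACHE_HIT, "rocksdb.block.cache.hit"},
    {BYTES_WRITTEN, "rocksdb.bytes.written"},
};
```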
      
      Test Plan: db_bench with statistics=1
      
      Reviewers: dhruba, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11145
  5. 10 Jun, 2013 (2 commits)
    • Max_mem_compaction_level can have maximum value of num_levels-1 · 184343a0
      Committed by Mayank Agarwal
      Summary:
      Without this fix, files could be written out to a level greater than the maximum level possible, which was the source of the segfaults that wormhole was getting. The sequence of steps that was followed:
      1. WriteLevel0Table was called when a memtable was to be flushed to a file.
      2. PickLevelForMemTableOutput was called to determine the level to which this file should be pushed.
      3. PickLevelForMemTableOutput returned a wrong result because max_mem_compaction_level was equal to 2 even when num_levels was equal to 0.
      The fix, re-initializing max_mem_compaction_level based on the num_levels passed in, seems correct.
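      The re-initialization described above amounts to a clamp; the function name below is illustrative.

```cpp
#include <cassert>

// The flush target level can never exceed num_levels - 1, so the configured
// max_mem_compaction_level is re-clamped against the num_levels actually
// passed in the options.
int SanitizeMaxMemCompactionLevel(int max_mem_compaction_level,
                                  int num_levels) {
  if (max_mem_compaction_level >= num_levels) {
    return num_levels - 1;
  }
  return max_mem_compaction_level;
}
```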
      
      Test Plan: make all check. Also made a dummy file to mimic the wormhole-file behaviour that was causing the segfaults, and found that the segfault occurs without this change but not with it.
      
      Reviewers: dhruba, haobo
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11157
    • Modifying options to db_stress when it is run with db_crashtest · 7a6bd8e9
      Committed by Mayank Agarwal
      Summary: These extra options caught some bugs. They will now be run via Jenkins with the crash_test.
      
      Test Plan: ./make crashtest
      
      Reviewers: dhruba, vamsi
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11151
  6. 08 Jun, 2013 (3 commits)
  7. 07 Jun, 2013 (1 commit)
  8. 06 Jun, 2013 (4 commits)
  9. 05 Jun, 2013 (1 commit)
  10. 04 Jun, 2013 (2 commits)
    • Improve output for GetProperty('leveldb.stats') · d9f538e1
      Committed by Mark Callaghan
      Summary:
      Display separate values for read, write & total compaction IO.
      Display compaction amplification and write amplification.
      Add similar values for the period since the last call to GetProperty. Results since the server started
      are reported as "cumulative" stats. Results since the last call to GetProperty are reported as
      "interval" stats.
      
      Level  Files Size(MB) Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count  Ln-stall
      ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
        0        7       13        21         0       211         0         0       211     0.0       0.0        10.1        0        0        0        0      113       0.0
        1       79      157        88       993       989       198       795       194     9.0      11.3        11.2      106      405      502       97       14       0.0
        2       19       36         5        63        63        37        27        36     2.4      12.3        12.2       19       14       32       18       12       0.0
      >>>>>>>>>>>>>>>>>>>>>>>>> text below is new and/or reformatted
      Uptime(secs): 122.2 total, 0.9 interval
      Compaction IO cumulative (GB): 0.21 new, 1.03 read, 1.23 write, 2.26 read+write
      Compaction IO cumulative (MB/sec): 1.7 new, 8.6 read, 10.3 write, 19.0 read+write
      Amplification cumulative: 6.0 write, 11.0 compaction
      Compaction IO interval (MB): 5.59 new, 0.00 read, 5.59 write, 5.59 read+write
      Compaction IO interval (MB/sec): 6.5 new, 0.0 read, 6.5 write, 6.5 read+write
      Amplification interval: 1.0 write, 1.0 compaction
      >>>>>>>>>>>>>>>>>>>>>>>> text above is new and/or reformatted
      Stalls(secs): 90.574 level0_slowdown, 0.000 level0_numfiles, 10.165 memtable_compaction, 0.000 leveln_slowdown
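      The two derived ratios in the output above can be reproduced from the cumulative IO numbers; function and field names below are illustrative.

```cpp
#include <cassert>
#include <cmath>

// write amplification      = compaction write bytes / new (user) bytes
// compaction amplification = (read + write) bytes   / new bytes
struct CompactionAmps {
  double write_amp;
  double compaction_amp;
};

CompactionAmps ComputeAmps(double new_gb, double read_gb, double write_gb) {
  return {write_gb / new_gb, (read_gb + write_gb) / new_gb};
}
```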
      
      Test Plan:
      make check, run db_bench
      
      Reviewers: haobo
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D11049
    • [RocksDB] Add score column to leveldb.stats · 2b1fb5b0
      Committed by Haobo Xu
      Summary: Added the 'score' column to the compaction stats output, which shows the level's total size divided by the level's target size. Could be useful when monitoring compaction decisions.
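      The new column is just the ratio described above; this tiny sketch makes the interpretation explicit (the function name is illustrative).

```cpp
#include <cassert>

// score = level total size / level target size. A value above 1.0 means the
// level is over its target and is a candidate for compaction.
double LevelScore(double level_bytes, double target_bytes) {
  return level_bytes / target_bytes;
}
```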
      
      Test Plan: make check; db_bench
      
      Reviewers: dhruba
      
      CC: leveldb, MarkCallaghan
      
      Differential Revision: https://reviews.facebook.net/D11025
  11. 02 Jun, 2013 (1 commit)
    • [RocksDB] Introduce Fast Mutex option · d897d33b
      Committed by Haobo Xu
      Summary:
      This diff adds an option to specify whether PTHREAD_MUTEX_ADAPTIVE_NP will be enabled for the rocksdb single big kernel lock. db_bench also has this option now.
      Quick test: 8 threads, CPU-bound, 100-byte random reads.
      No fast mutex: ~750k ops/s
      With fast mutex: ~880k ops/s
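      What the option enables can be sketched as below. PTHREAD_MUTEX_ADAPTIVE_NP is a glibc/Linux-specific attribute that spins briefly before sleeping; the function name is illustrative.

```cpp
#include <cassert>
#include <pthread.h>

// Initialize a mutex with the adaptive type. Short critical sections, like
// those under the single big DB mutex, benefit from the brief spin because
// the lock is usually released before the waiter would have slept.
int InitAdaptiveMutex(pthread_mutex_t* mu) {
  pthread_mutexattr_t attr;
  int rc = pthread_mutexattr_init(&attr);
  if (rc != 0) return rc;
  rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  if (rc == 0) {
    rc = pthread_mutex_init(mu, &attr);
  }
  pthread_mutexattr_destroy(&attr);
  return rc;
}
```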
      
      Test Plan: make check; db_bench; db_stress
      
      Reviewers: dhruba
      
      CC: MarkCallaghan, leveldb
      
      Differential Revision: https://reviews.facebook.net/D11031
  12. 31 May, 2013 (1 commit)
    • [RocksDB] [Performance] Allow different posix advice to be applied to the same table file · ab8d2f6a
      Committed by Haobo Xu
      Summary:
      The current posix advice implementation ties the access pattern hint to the creation of a file.
      It is not possible to apply different advice for different access patterns (random get vs compaction read)
      without keeping two open files for the same table. This patch extends the RandomAccessFile interface
      to accept a new access hint at any time. In particular, we are able to set different access hints on the same
      table file based on when/how the file is used.
      Two options are added to set the access hint: one applied after the file is first opened and one applied when
      the file is being compacted.
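      Re-applying a hint to an already-open descriptor boils down to another posix_fadvise call; a minimal sketch, with an illustrative function name:

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// POSIX_FADV_RANDOM while serving point reads, then POSIX_FADV_SEQUENTIAL
// when the same table file is read by compaction, without reopening the file.
int ApplyAccessHint(int fd, bool sequential) {
  return posix_fadvise(fd, 0, 0,  // offset 0, len 0 => whole file
                       sequential ? POSIX_FADV_SEQUENTIAL : POSIX_FADV_RANDOM);
}
```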
      
      Test Plan: make check; db_stress; db_bench
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: MarkCallaghan, leveldb
      
      Differential Revision: https://reviews.facebook.net/D10905
  13. 30 May, 2013 (1 commit)
  14. 29 May, 2013 (2 commits)
  15. 25 May, 2013 (3 commits)
  16. 24 May, 2013 (4 commits)
  17. 22 May, 2013 (4 commits)
    • [Kill randomly at various points in source code for testing] · 760dd475
      Committed by Vamsi Ponnekanti
      Summary:
      This is the initial version. A few ways in which this could
      be extended in the future are:
      (a) Killing from more places in source code
      (b) Hashing stack and using that hash in determining whether to crash.
          This is to avoid crashing more often at source lines that are executed
          more often.
      (c) Raising exceptions or returning errors instead of killing
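      A kill point can be sketched as below. This is a deterministic, counter-based stand-in for illustration: the real harness randomizes the decision and actually kills the process rather than returning a flag.

```cpp
#include <cassert>

// An instrumented site increments a counter and reports "crash now" on every
// `odds`-th pass; the real harness would call abort() or raise SIGTERM there.
static int g_kill_counter = 0;

bool ShouldKill(int odds) {
  if (odds <= 0) {
    return false;  // killing disabled
  }
  ++g_kill_counter;
  return g_kill_counter % odds == 0;
}
```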
      
      Test Plan:
      This whole thing is for testing.
      
      Here is part of output:
      
      python2.7 tools/db_crashtest2.py -d 600
      Running db_stress
      
      db_stress retncode -15 output LevelDB version     : 1.5
      Number of threads   : 32
      Ops per thread      : 10000000
      Read percentage     : 50
      Write-buffer-size   : 4194304
      Delete percentage   : 30
      Max key             : 1000
      Ratio #ops/#keys    : 320000
      Num times DB reopens: 0
      Batches/snapshots   : 1
      Purge redundant %   : 50
      Num keys per lock   : 4
      Compression         : snappy
      ------------------------------------------------
      No lock creation because test_batches_snapshots set
      2013/04/26-17:55:17  Starting database operations
      Created bg thread 0x7fc1f07ff700
      ... finished 60000 ops
      Running db_stress
      
      db_stress retncode -15 output LevelDB version     : 1.5
      Number of threads   : 32
      Ops per thread      : 10000000
      Read percentage     : 50
      Write-buffer-size   : 4194304
      Delete percentage   : 30
      Max key             : 1000
      Ratio #ops/#keys    : 320000
      Num times DB reopens: 0
      Batches/snapshots   : 1
      Purge redundant %   : 50
      Num keys per lock   : 4
      Compression         : snappy
      ------------------------------------------------
      Created bg thread 0x7ff0137ff700
      No lock creation because test_batches_snapshots set
      2013/04/26-17:56:15  Starting database operations
      ... finished 90000 ops
      
      Revert Plan: OK
      
      Task ID: #2252691
      
      Reviewers: dhruba, emayanke
      
      Reviewed By: emayanke
      
      CC: leveldb, haobo
      
      Differential Revision: https://reviews.facebook.net/D10581
    • [RocksDB] Introduce an option to skip log error on recovery · 87d0af15
      Committed by Haobo Xu
      Summary:
      Currently, with paranoid_check on, DB::Open will fail on any log read error during recovery.
      If the client is OK with losing the most recent updates, we could simply skip those errors.
      However, it's important to introduce an additional flag, so that paranoid_check can
      still guard against more serious problems.
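      The decision described above reduces to a small predicate; flag names follow the description and are not necessarily the exact option names.

```cpp
#include <cassert>

// A corrupted log record fails DB::Open only when paranoid checks are on AND
// the skip flag is off, so the new flag relaxes log-recovery errors without
// disabling the other paranoid checks.
bool FailOpenOnLogError(bool paranoid_checks, bool skip_log_error_on_recovery) {
  return paranoid_checks && !skip_log_error_on_recovery;
}
```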
      
      Test Plan: make check; db_stress
      
      Reviewers: dhruba, emayanke
      
      Reviewed By: emayanke
      
      CC: leveldb, emayanke
      
      Differential Revision: https://reviews.facebook.net/D10869
    • Ability to set different size fanout multipliers for every level. · d1aaaf71
      Committed by Dhruba Borthakur
      Summary:
      There is an existing field Options.max_bytes_for_level_multiplier that
      sets the multiplier for the size of each level in the database.
      
      This patch introduces the ability to set different multipliers
      for every level in the database. The size of a level is determined
      by using both max_bytes_for_level_multiplier as well as the
      per-level fanout.
      
      size of level[i] = size of level[i-1] * max_bytes_for_level_multiplier
                         * fanout[i-1]
      
      The default value of fanout is 1, so that it is backward compatible.
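      A worked example of the sizing rule above, with illustrative names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// size(level i) = size(level i-1) * max_bytes_for_level_multiplier * fanout[i-1]
// The per-level fanout defaults to 1, preserving the old pure-multiplier rule.
std::vector<std::uint64_t> LevelTargetSizes(std::uint64_t base_size,
                                            double multiplier,
                                            const std::vector<double>& fanout) {
  std::vector<std::uint64_t> sizes = {base_size};
  for (std::size_t i = 0; i < fanout.size(); ++i) {
    sizes.push_back(static_cast<std::uint64_t>(
        static_cast<double>(sizes.back()) * multiplier * fanout[i]));
  }
  return sizes;
}
```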
      
      Test Plan: make check
      
      Reviewers: haobo, emayanke
      
      Reviewed By: emayanke
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D10863
    • [RocksDB] [Performance Bug] MemTable::Get Slow · c3c13db3
      Committed by Haobo Xu
      Summary:
      The merge operator diff introduced a performance problem in MemTable::Get.
      An exit condition is missing when the current key does not match the user key.
      This could lead to full memtable scan if the user key is not found.
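      The fix can be sketched as below, with a std::multimap standing in for the memtable's skiplist (entries for one user key are adjacent and ordered); the function name is illustrative.

```cpp
#include <cassert>
#include <map>
#include <string>

// Once the scan reaches an entry whose user key differs from the lookup key,
// it must stop instead of walking to the end of the memtable.
bool MemTableGetSketch(const std::multimap<std::string, std::string>& table,
                       const std::string& user_key, std::string* value) {
  for (auto it = table.lower_bound(user_key); it != table.end(); ++it) {
    if (it->first != user_key) {
      break;  // the previously missing exit condition
    }
    *value = it->second;  // first (most recent) matching entry wins
    return true;
  }
  return false;
}
```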
      
      Test Plan: make check; db_bench
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D10851