1. 08 Nov 2012, 1 commit
    • Avoid doing an exhaustive search when looking for overlapping files. · 9b87a2ba
      Committed by Dhruba Borthakur
      Summary:
      The Version::GetOverlappingInputs() method is called multiple times in
      the compaction code path. Each invocation does a binary search
      for overlapping files in the specified key range.
      This patch remembers the offset of an overlapped file when
      GetOverlappingInputs() is called the first time within
      a compaction run. Succeeding calls to GetOverlappingInputs()
      use the remembered index to avoid repeating the binary search.
      
      I measured that 1000 iterations of GetOverlappingInputs
      take around 4500 microseconds without this patch. With this
      patch and the hint supplied on every invocation, 1000
      iterations take about 3900 microseconds.
      
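      Below is a minimal C++ sketch of the hint idea described above; FileMeta,
      FirstOverlap, and the hint parameter are illustrative names, not the
      patch's exact identifiers:
      
          // A sorted file list, a binary search for the first file whose
          // largest key is >= begin, and an optional hint that lets later
          // calls within the same compaction run skip the search.
          #include <algorithm>
          #include <cstddef>
          #include <string>
          #include <vector>
          
          struct FileMeta { std::string smallest, largest; };
          
          size_t FirstOverlap(const std::vector<FileMeta>& files,
                              const std::string& begin, size_t* hint) {
            if (hint != nullptr && *hint < files.size() &&
                files[*hint].largest >= begin) {
              return *hint;                      // reuse remembered offset: O(1)
            }
            auto it = std::lower_bound(          // first call: O(log n)
                files.begin(), files.end(), begin,
                [](const FileMeta& f, const std::string& k) {
                  return f.largest < k;
                });
            size_t idx = static_cast<size_t>(it - files.begin());
            if (hint != nullptr) *hint = idx;    // remember for the next call
            return idx;
          }
      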
      Test Plan: make check OPT=-g
      
      Reviewers: heyongqiang
      
      Reviewed By: heyongqiang
      
      CC: MarkCallaghan, emayanke, sheki
      
      Differential Revision: https://reviews.facebook.net/D6513
      9b87a2ba
  2. 07 Nov 2012, 2 commits
  3. 06 Nov 2012, 4 commits
    • The method GetOverlappingInputs should use binary search. · cb7a0022
      Committed by Dhruba Borthakur
      Summary:
      The method Version::GetOverlappingInputs used a sequential search
      to map a key range to a set of files. But the files are arranged
      in ascending order of key, so a binary search is more effective.
      
      This patch implements Version::GetOverlappingInputsBinarySearch
      that finds one file that corresponds to the specified key range
      and then iterates backwards and forwards to find all overlapping
      files.
      
      This patch is critical for making compactions efficient, especially
      when there are thousands of files in a single level.
      
      I measured that 1000 iterations of TEST_MaxNextLevelOverlappingBytes
      take 16000 microseconds without this patch. With this patch, the
      same method takes about 4600 microseconds.
      
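      A hedged sketch of the search-then-expand approach, reusing the FileMeta
      list and FirstOverlap helper from the sketch above; the walk directions
      follow the description, not the patch's exact code:
      
          // Binary-search for one file touching [begin, end], then walk
          // outwards in both directions until files stop overlapping.
          long mid = static_cast<long>(FirstOverlap(files, begin, nullptr));
          long lo = mid, hi = mid;
          while (lo - 1 >= 0 && files[lo - 1].largest >= begin) --lo;
          while (hi + 1 < static_cast<long>(files.size()) &&
                 files[hi + 1].smallest <= end) ++hi;
          // files[lo..hi] is the full set of overlapping files.
      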
      Test Plan: Almost all unit tests in db_test use this method to look up keys.
      
      Reviewers: heyongqiang
      
      Reviewed By: heyongqiang
      
      CC: MarkCallaghan, emayanke, sheki
      
      Differential Revision: https://reviews.facebook.net/D6465
      cb7a0022
    • Ability to invoke application hook for every key during compaction. · 5273c814
      Committed by Dhruba Borthakur
      Summary:
      There are certain use cases where the application intends to
      delete older keys after a certain time period has expired.
      One option for those applications is to periodically scan the
      entire database and delete the appropriate keys.
      
      A better way is to allow the application to hook into the
      compaction process. This patch allows the application to set
      a method callback for every key that is being compacted. If
      this method returns true, then the key is not preserved in
      the output of the compaction.
      
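      A hypothetical shape for such a hook (the patch's actual signature may
      differ; the timestamp-in-value layout and the kTtlSecs policy below are
      assumptions for illustration only):
      
          #include <cstdint>
          #include <cstring>
          #include <string>
          
          // Return true to drop the key from the compaction output.
          typedef bool (*KeyFilter)(int level, const std::string& key,
                                    const std::string& value);
          
          // Example policy: values begin with a little-endian unix timestamp;
          // drop anything older than now - kTtlSecs.
          bool ExpireOldEntries(int /*level*/, const std::string& /*key*/,
                                const std::string& value) {
            const uint64_t kTtlSecs = 86400, now = 1700000000;  // stand-in clock
            if (value.size() < sizeof(uint64_t)) return false;  // keep malformed
            uint64_t ts;
            std::memcpy(&ts, value.data(), sizeof(ts));
            return ts + kTtlSecs < now;  // expired: do not preserve
          }
      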
      Test Plan:
      This is mostly to preview the proposed new public api.
      Since it is a public api, please do due diligence on reviewing it.
      
      I will be writing test cases for this api in my next version of
      this patch.
      
      Reviewers: MarkCallaghan, heyongqiang
      
      Reviewed By: heyongqiang
      
      CC: sheki, adsharma
      
      Differential Revision: https://reviews.facebook.net/D6285
      5273c814
    • fix compile error · f1a7c735
      Committed by heyongqiang
      Summary:
      
      as subject
      
      Test Plan: n/a
      f1a7c735
    • Add a tool to change number of levels · d55c2ba3
      Committed by heyongqiang
      Summary: as subject.
      
      Test Plan: manually tested it; will add a test case
      
      Reviewers: dhruba, MarkCallaghan
      
      Differential Revision: https://reviews.facebook.net/D6345
      d55c2ba3
  4. 05 Nov 2012, 1 commit
  5. 03 Nov 2012, 1 commit
  6. 02 Nov 2012, 1 commit
  7. 30 Oct 2012, 7 commits
    • Use timer to measure sleep rather than assume it is 1000 usecs · 3e7e2692
      Committed by Mark Callaghan
      Summary:
      This makes the stall timers in MakeRoomForWrite more accurate by timing
      the sleeps. From looking at the logs the real sleep times are usually
      about 2000 usecs each when SleepForMicros(1000) is called. The modified LOG messages are:
      2012/10/29-12:06:33.271984 2b3cc872f700 delaying write 13 usecs for level0_slowdown_writes_trigger
      2012/10/29-12:06:34.688939 2b3cc872f700 delaying write 1728 usecs for rate limits with max score 3.83
      
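      A minimal sketch of timing the sleep rather than trusting the requested
      duration (std::chrono is used here for brevity; the patch itself
      presumably reads leveldb's Env clock):
      
          #include <chrono>
          #include <cstdint>
          #include <thread>
          
          int64_t SleepAndMeasureMicros(int64_t requested_usecs) {
            auto start = std::chrono::steady_clock::now();
            std::this_thread::sleep_for(std::chrono::microseconds(requested_usecs));
            auto end = std::chrono::steady_clock::now();
            // Often ~2x the request on a busy host, per the log data above.
            return std::chrono::duration_cast<std::chrono::microseconds>(
                       end - start).count();
          }
      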
      Test Plan: run db_bench, look at DB/LOG
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D6297
      3e7e2692
    • fix test failure · fb8d4373
      Committed by heyongqiang
      Summary: as subject
      
      Test Plan: db_test
      
      Reviewers: dhruba, MarkCallaghan
      
      Reviewed By: MarkCallaghan
      
      Differential Revision: https://reviews.facebook.net/D6309
      fb8d4373
    • add a test case to make sure changing num_levels will fail · 925f60d3
      Committed by heyongqiang
      Summary: as subject
      
      Test Plan: db_test
      
      Reviewers: dhruba, MarkCallaghan
      
      Reviewed By: MarkCallaghan
      
      Differential Revision: https://reviews.facebook.net/D6303
      925f60d3
    • Allow having different compression algorithms on different levels. · 321dfdc3
      Committed by Dhruba Borthakur
      Summary:
      The leveldb API is enhanced to support different compression algorithms at
      different levels.
      
      This adds the option min_level_to_compress to db_bench that specifies
      the minimum level for which compression should be done when
      compression is enabled. This can be used to disable compression for levels
      0 and 1 which are likely to suffer from stalls because of the CPU load
      for memtable flushes and (L0,L1) compaction.  Level 0 is special as it
      gets frequent memtable flushes. Level 1 is special as it frequently
      gets all:all file compactions between it and level 0. But all other levels
      could be the same. For any level N where N > 1, the rate of sequential
      IO for that level should be the same. The last level is the
      exception because it might not be full and because files from it are
      not read to compact with the next larger level.
      
      The same amount of time will be spent doing compaction at any
      level N excluding N=0, 1 or the last level. By this standard all
      of those levels should use the same compression. The difference is that
      the loss (using more disk space) from a faster compression algorithm
      is less significant for N=2 than for N=3. So we might be willing to
      trade disk space for faster write rates with no compression
      for L0 and L1, snappy for L2, zlib for L3. Using a faster compression
      algorithm for the mid levels also allows us to reclaim some cpu
      without trading off much loss in disk space overhead.
      
      Also note that little is to be gained by compressing levels 0 and 1. For
      a 4-level tree they account for 10% of the data. For a 5-level tree they
      account for 1% of the data.
      
      With compression enabled:
      * memtable flush rate is ~18MB/second
      * (L0,L1) compaction rate is ~30MB/second
      
      With compression enabled but min_level_to_compress=2
      * memtable flush rate is ~320MB/second
      * (L0,L1) compaction rate is ~560MB/second
      
      This practically takes the same code from https://reviews.facebook.net/D6225
      but makes the leveldb api more general purpose with a few additional
      lines of code.
      
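      A sketch of what per-level compression selection can look like; the
      option and enum names below follow the leveldb style but are
      illustrative, not the patch's exact API:
      
          #include <vector>
          
          enum CompressionType { kNoCompression, kSnappyCompression,
                                 kZlibCompression };
          
          struct Options {
            int num_levels = 7;
            // One entry per level; shorter vectors fall back to a default.
            std::vector<CompressionType> compression_per_level;
          };
          
          CompressionType CompressionForLevel(const Options& opt, int level) {
            if (level < static_cast<int>(opt.compression_per_level.size()))
              return opt.compression_per_level[level];
            return kSnappyCompression;  // assumed fallback
          }
          
          // Usage matching the summary: no compression for L0/L1, snappy
          // for L2, zlib for L3 and beyond.
          // Options opt;
          // opt.compression_per_level = {kNoCompression, kNoCompression,
          //                              kSnappyCompression, kZlibCompression};
      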
      Test Plan: make check
      
      Differential Revision: https://reviews.facebook.net/D6261
      321dfdc3
    • Add more rates to db_bench output · acc8567b
      Committed by Mark Callaghan
      Summary:
      Adds the "MB/sec in" and "MB/sec out" to this line:
      Amplification: 1.7 rate, 0.01 GB in, 0.02 GB out, 8.24 MB/sec in, 13.75 MB/sec out
      
      Changes all values to be reported per interval and since test start for this line:
      ... thread 0: (10000,60000) ops and (19155.6,27307.5) ops/second in (0.522041,2.197198) seconds
      
      Test Plan: run db_bench
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D6291
      acc8567b
    • Fix unit test failure caused by delaying deleting obsolete files. · de7689b1
      Committed by Dhruba Borthakur
      Summary:
      A previous commit 4c107587 introduced
      the idea that some version updates might not delete obsolete files.
      This means that if a unit test blindly counts the number of files
      in the db directory, it might not represent the true state of the database.
      
      Use GetLiveFiles() instead to count the number of live files in the database.
      
      Test Plan:
      make check
      de7689b1
    • Adds DB::GetNextCompaction and then uses that for rate limiting db_bench · 70c42bf0
      Committed by Mark Callaghan
      Summary:
      Adds a method that returns the score for the next level that most
      needs compaction. That method is then used by db_bench to rate limit threads.
      Threads are put to sleep at the end of each stats interval until the score
      is less than the limit. The limit is set via the --rate_limit=$double option.
      The specified value must be > 1.0. Also adds the option --stats_per_interval
      to enable additional metrics reported every stats interval.
      
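      A hedged sketch of the rate-limit loop described above; the method name
      GetNextCompactionScore is an illustrative stand-in for the new API, and
      FLAGS_rate_limit mirrors the --rate_limit option:
      
          // After each stats interval, stall the benchmark thread until the
          // most-in-need-of-compaction level's score drops below the limit.
          double limit = FLAGS_rate_limit;  // must be > 1.0
          while (db->GetNextCompactionScore() >= limit) {
            leveldb::Env::Default()->SleepForMicroseconds(1000 * 1000);  // 1s
          }
      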
      Test Plan: run db_bench
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D6243
      70c42bf0
  8. 27 Oct 2012, 4 commits
  9. 25 Oct 2012, 1 commit
    • Improve statistics · e7206f43
      Committed by Mark Callaghan
      Summary:
      This adds more statistics to be reported by GetProperty("leveldb.stats").
      The new stats include time spent waiting on stalls in MakeRoomForWrite.
      This also includes the total amplification rate where that is:
          (#bytes of sequential IO during compaction) / (#bytes from Put)
      This also includes a lot more data for the per-level compaction report.
      * Rn(MB) - MB read from level N during compaction between levels N and N+1
      * Rnp1(MB) - MB read from level N+1 during compaction between levels N and N+1
      * Wnew(MB) - new data written to the level during compaction
      * Amplify - ( Write(MB) + Rnp1(MB) ) / Rn(MB)
      * Rn - files read from level N during compaction between levels N and N+1
      * Rnp1 - files read from level N+1 during compaction between levels N and N+1
      * Wnp1 - files written to level N+1 during compaction between levels N and N+1
      * NewW - new files written to level N+1 during compaction
      * Count - number of compactions done for this level
      
      This is the new output from DB::GetProperty("leveldb.stats"). The old output stopped at Write(MB)
      
                                     Compactions
      Level  Files Size(MB) Time(sec) Read(MB) Write(MB)  Rn(MB) Rnp1(MB) Wnew(MB) Amplify Read(MB/s) Write(MB/s)   Rn Rnp1 Wnp1 NewW Count
      -------------------------------------------------------------------------------------------------------------------------------------
        0        3        6        33        0       576       0        0      576    -1.0       0.0         1.3     0    0    0    0   290
        1      127      242       351     5316      5314     570     4747      567    17.0      12.1        12.1   287 2399 2685  286    32
        2      161      328        54      822       824     326      496      328     4.0       1.9         1.9   160  251  411  160   161
      Amplification: 22.3 rate, 0.56 GB in, 12.55 GB out
      Uptime(secs): 439.8
      Stalls(secs): 206.938 level0_slowdown, 0.000 level0_numfiles, 24.129 memtable_compaction
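      As a worked check of the Amplify formula against the level-2 row above:
      ( Write(MB) + Rnp1(MB) ) / Rn(MB) = (824 + 496) / 326 ≈ 4.0, matching
      the reported value.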
      
      Test Plan: run db_bench
      (cherry picked from commit ecdeead38f86cc02e754d0032600742c4f02fec8)
      
      Reviewers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D6153
      e7206f43
  10. 24 Oct 2012, 1 commit
    • Fix broken build. Add stdint.h to get uint64_t · 51d2adfb
      Committed by Mark Callaghan
      Summary:
      I still get failures from this. Not sure whether there was a fix in progress.
      
      Test Plan: compile
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D6147
      51d2adfb
  11. 23 Oct 2012, 2 commits
  12. 20 Oct 2012, 2 commits
    • Do not enable checksums for zlib compression. · 507f5aac
      Committed by Dhruba Borthakur
      Summary:
      Leveldb code already calculates checksums for each block. There
      is no need to generate checksums inside zlib as well. This patch switches off checksum generation/checking in the zlib library.
      
      (The Inno support for zlib uses windowBits=14 as well.)
      
      Phabricator marks this file as binary, but here is the diff:
      
      diff --git a/port/port_posix.h b/port/port_posix.h
      index 86a0927..db4e0b8 100644
      --- a/port/port_posix.h
      +++ b/port/port_posix.h
      @@ -163,7 +163,7 @@ inline bool Snappy_Uncompress(const char* input, size_t length,
       }
      
       inline bool Zlib_Compress(const char* input, size_t length,
      -    ::std::string* output, int windowBits = 15, int level = -1,
      +    ::std::string* output, int windowBits = -14, int level = -1,
            int strategy = 0) {
       #ifdef ZLIB
         // The memLevel parameter specifies how much memory should be allocated for
      @@ -223,7 +223,7 @@ inline bool Zlib_Compress(const char* input, size_t length,
       }
      
       inline char* Zlib_Uncompress(const char* input_data, size_t input_length,
      -    int* decompress_size, int windowBits = 15) {
      +    int* decompress_size, int windowBits = -14) {
       #ifdef ZLIB
         z_stream _stream;
         memset(&_stream, 0, sizeof(z_stream));
      
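      For context on the diff above: in zlib, passing a negative windowBits to
      deflateInit2()/inflateInit2() selects a raw deflate stream with a window
      size of the absolute value, omitting the zlib header and the Adler-32
      checksum; that is what disables checksum generation and checking here.
      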
      Test Plan: run db_bench with zlib compression.
      
      Reviewers: heyongqiang, MarkCallaghan
      
      Reviewed By: heyongqiang
      
      Differential Revision: https://reviews.facebook.net/D6105
      507f5aac
    • db_bench was not correctly initializing the value for the delete_obsolete_files_period_micros option. · cf5adc80
      Committed by Dhruba Borthakur
      Summary:
      The parameter delete_obsolete_files_period_micros controls the
      periodicity of deleting obsolete files. db_bench was reading
      this parameter into a local variable called 'l' but was incorrectly
      using another local variable called 'n' while setting it in the
      db.options data structure.
      This patch also logs the value of delete_obsolete_files_period_micros
      in the LOG file at db startup time.
      
      I am hoping that this will improve the overall write throughput drastically.
      
      Test Plan: run db_bench
      
      Reviewers: MarkCallaghan, heyongqiang
      
      Reviewed By: MarkCallaghan
      
      Differential Revision: https://reviews.facebook.net/D6099
      cf5adc80
  13. 18 Oct 2012, 1 commit
  14. 17 Oct 2012, 1 commit
    • The deletion of obsolete files should not occur very frequently. · aa73538f
      Committed by Dhruba Borthakur
      Summary:
      The method DeleteObsoleteFiles is very costly, especially
      when the number of files in a system is large. It makes a list of
      all live files and then scans the directory to compute the diff.
      By default, this method is executed after every compaction run.
      
      This patch makes it such that DeleteObsoleteFiles is never
      invoked twice within a configured period.
      
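      A minimal sketch of the throttle, assuming a member that records the
      time of the last run (names here are illustrative, not the patch's):
      
          #include <cstdint>
          
          struct DBImpl {
            // Configured minimum gap between scans, in microseconds.
            uint64_t delete_obsolete_files_period_micros = 0;
            uint64_t last_delete_obsolete_micros = 0;
          
            void MaybeDeleteObsoleteFiles(uint64_t now_micros) {
              if (now_micros - last_delete_obsolete_micros <
                  delete_obsolete_files_period_micros) {
                return;  // ran recently; skip the costly directory scan
              }
              last_delete_obsolete_micros = now_micros;
              // ... list live files, scan the db dir, unlink the difference ...
            }
          };
      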
      Test Plan: run all unit tests
      
      Reviewers: heyongqiang, MarkCallaghan
      
      Reviewed By: MarkCallaghan
      
      Differential Revision: https://reviews.facebook.net/D6045
      aa73538f
  15. 16 Oct 2012, 1 commit
  16. 13 Oct 2012, 1 commit
  17. 11 Oct 2012, 1 commit
    • [tools] Add a tool to stress test concurrent writing to levelDB · 24f7983b
      Committed by Asad K Awan
      Summary:
      Created a tool that runs multiple threads that concurrently read and write to levelDB.
      All writes to the DB are stored in an in-memory hashtable and verified at the end of the
      test. All writes for a given key are serialized.
      
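      A sketch of the shadow-state idea under stated assumptions (16 lock
      stripes rather than the tool's actual count; all names illustrative):
      
          #include <functional>
          #include <map>
          #include <mutex>
          #include <string>
          #include "leveldb/db.h"
          
          // Per-stripe expected state; each stripe's map is only touched
          // while its lock is held, which also serializes writes per key.
          static std::map<std::string, std::string> shadow[16];
          static std::mutex stripe[16];
          
          static size_t StripeOf(const std::string& k) {
            return std::hash<std::string>{}(k) % 16;
          }
          
          void PutBoth(leveldb::DB* db, const std::string& k,
                       const std::string& v) {
            std::lock_guard<std::mutex> g(stripe[StripeOf(k)]);
            db->Put(leveldb::WriteOptions(), k, v);  // write to the DB...
            shadow[StripeOf(k)][k] = v;              // ...and to the shadow map
          }
          // At the end of the test, iterate the DB and compare with 'shadow'.
      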
      Test Plan:
       - Verified by writing only a few keys and logging all writes and verifying that values read and written are correct.
       - Verified correctness of value generator.
       - Ran with various parameters of number of keys, locks, and threads.
      
      Reviewers: dhruba, MarkCallaghan, heyongqiang
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D5829
      24f7983b
  18. 06 Oct 2012, 2 commits
  19. 04 Oct 2012, 4 commits
    • Implement RowLocks for assoc schema · f7975ac7
      Committed by Dhruba Borthakur
      Summary:
      Each assoc is identified by (id1, assocType). This is the rowkey.
      Each row has a read/write rowlock. There is a statically allocated array
      of 2000 read/write locks. A rowkey is murmur-hashed to one of the
      read/write locks.
      
      assocPut and assocDelete acquire the rowlock in write mode.
      The key updates are done within the rowlock with an atomic nosync
      batch write to leveldb. Then the rowlock is released and
      a write-with-sync is done to sync the leveldb transaction log.
      
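      A sketch of the striped row-lock scheme (std::shared_mutex stands in for
      the port's rwlock, and the hash mix below is illustrative; the patch
      murmur-hashes the rowkey bytes):
      
          #include <cstdint>
          #include <shared_mutex>
          
          static const int kNumRowLocks = 2000;
          static std::shared_mutex row_locks[kNumRowLocks];
          
          std::shared_mutex& LockFor(int64_t id1, int32_t assoc_type) {
            uint64_t h = static_cast<uint64_t>(id1) * 0x9E3779B97F4A7C15ULL ^
                         static_cast<uint32_t>(assoc_type);
            return row_locks[h % kNumRowLocks];
          }
          
          // assocPut/assocDelete:
          //   std::unique_lock<std::shared_mutex> g(LockFor(id1, type));
          //   ... atomic nosync batch write to leveldb ...
          //   g.unlock();  then a write-with-sync to flush the log.
      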
      Test Plan: added unit test
      
      Reviewers: heyongqiang
      
      Reviewed By: heyongqiang
      
      Differential Revision: https://reviews.facebook.net/D5859
      f7975ac7
    • A configurable option to write data using write instead of mmap. · c1006d42
      Committed by Dhruba Borthakur
      Summary:
      We have seen that reading data via the pread call (instead of
      mmap) is much faster on Linux 2.6.x kernels. This patch makes
      an equivalent option to switch off mmaps for the write path
      as well.
      
      db_bench --mmap_write=0 will use write() instead of mmap() to
      write data to a file.
      
      This change is backward compatible, the default
      option is to continue using mmap for writing to a file.
      
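      A minimal sketch of the switch, assuming a boolean option (the flag and
      function names here are illustrative):
      
          #include <cstddef>
          #include <sys/types.h>
          #include <unistd.h>
          
          ssize_t AppendToFile(int fd, const char* data, size_t n,
                               bool use_mmap_writes) {
            if (!use_mmap_writes) {
              return write(fd, data, n);  // plain write() path (the new option)
            }
            // mmap path: ftruncate() to grow the file, mmap() the tail,
            // memcpy() the data in, then munmap() -- elided in this sketch.
            return -1;
          }
      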
      Test Plan: "make check all"
      
      Differential Revision: https://reviews.facebook.net/D5781
      c1006d42
    • Add --stats_interval option to db_bench · e678a594
      Committed by Mark Callaghan
      Summary:
      The option is zero by default, and in that case reporting is unchanged:
      the interval at which stats are reported is scaled after each report,
      and no newline is issued after each report, so one line is rewritten in place.
      When non-zero it specifies the constant interval (in operations) at which
      statistics are reported, and the stats include the rate per interval. This
      makes it easier to determine whether QPS changes over the duration of the test.
      
      Test Plan: run db_bench
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: heyongqiang
      
      Differential Revision: https://reviews.facebook.net/D5817
      e678a594
    • Fix the bounds check for the --readwritepercent option · d8763abe
      Committed by Mark Callaghan
      Summary:
      see above
      
      Test Plan: run db_bench with an invalid value for the option
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: heyongqiang
      
      Differential Revision: https://reviews.facebook.net/D5823
      d8763abe
  20. 03 Oct 2012, 1 commit
    • Fix compiler warnings and errors in ldb.c · 98804f91
      Committed by Mark Callaghan
      Summary:
      stdlib.h is needed for exit()
      --readhead --> --readahead
      
      Test Plan: compile
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: heyongqiang
      
      Differential Revision: https://reviews.facebook.net/D5805
      98804f91
  21. 02 Oct 2012, 1 commit