1. 29 Mar 2013 (1 commit)
    • Use non-mmapd files for Write-Ahead Files · 7fdd5f5b
      Committed by Abhishek Kona
      Summary:
      Use non-mmap'd files for the write-ahead log.
      Previously, the use of mmap'd files made the log iterator read ahead and miss records.
      Now the reader and writer point to the same physical location.
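      
      As a minimal illustration (not the actual patch), a WAL writer built on plain write(2) keeps the file offset, and therefore the reader's view, exactly at the last appended byte, whereas an mmap'd file can expose zero-filled pages past the writer's logical end:
      
      ```cpp
      // Hypothetical sketch of an append-only WAL file using write(2).
      // With mmap, the mapped region can extend past the writer's logical
      // end, so a read-ahead iterator may see zero-filled pages and miss
      // records; with plain appends, reader and writer agree on the end.
      #include <fcntl.h>
      #include <unistd.h>
      #include <string>
      
      class PosixAppendFile {
       public:
        explicit PosixAppendFile(const std::string& fname)
            : fd_(::open(fname.c_str(), O_CREAT | O_WRONLY | O_APPEND, 0644)) {}
        ~PosixAppendFile() { if (fd_ >= 0) ::close(fd_); }
      
        bool Append(const std::string& data) {
          return ::write(fd_, data.data(), data.size()) ==
                 static_cast<ssize_t>(data.size());
        }
        bool Sync() { return ::fdatasync(fd_) == 0; }
      
       private:
        int fd_;
      };
      ```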
      
      There is no perf regression:
      ./db_bench --benchmarks=fillseq --db=/dev/shm/mmap_test --num=$(million 20) --use_existing_db=0 --threads=2
      with this diff:
      fillseq      :      10.756 micros/op 185281 ops/sec;   20.5 MB/s
      without this diff:
      fillseq      :      11.085 micros/op 179676 ops/sec;   19.9 MB/s
      
      Test Plan: unit test included
      
      Reviewers: dhruba, heyongqiang
      
      Reviewed By: heyongqiang
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D9741
  2. 22 Mar 2013 (2 commits)
  3. 21 Mar 2013 (1 commit)
    • Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database. · ad96563b
      Committed by Dhruba Borthakur
      Summary:
      This patch allows an application to specify, per database, whether to use
      buffered IO, reads via mmap, and writes via mmap. Previously, a global
      static variable configured this functionality.
      
      The default settings remain the same (and are backward compatible):
       1. use buffered IO
       2. do not use mmap for reads
       3. use mmap for writes
       4. use readahead for reads needed for compaction
      
      I also added a parameter to db_bench to be able to explicitly specify
      whether to do readaheads for compactions or not.
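      
      A sketch of what the per-database knobs amount to; the struct below is a hypothetical mirror of the new fields (the real ones live in the options header after this change), with the backward-compatible defaults listed above:
      
      ```cpp
      // Hypothetical mirror of the per-database IO options this patch adds;
      // field names are illustrative, defaults match the summary.
      struct DbIoOptions {
        bool use_buffered_io = true;        // default 1: buffered IO
        bool allow_mmap_reads = false;      // default 2: no mmap for reads
        bool allow_mmap_writes = true;      // default 3: mmap for writes
        bool compaction_readahead = true;   // default 4: readahead for compaction
      };
      
      // Each DB instance now carries its own copy instead of consulting a
      // global static variable, so two databases in one process can differ.
      ```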
      
      Test Plan: make check
      
      Reviewers: sheki, heyongqiang, MarkCallaghan
      
      Reviewed By: sheki
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D9429
  4. 20 Mar 2013 (2 commits)
  5. 07 Mar 2013 (1 commit)
    • Do not allow Transaction Log Iterator to fall ahead when writer is writing the same file · d68880a1
      Committed by Abhishek Kona
      Summary:
      Store the last flushed sequence number in db_impl and check against it in
      the transaction log iterator. Do not attempt to read ahead if we do not
      know whether the data has been flushed completely.
      This does not work if flush is disabled. Any ideas on fixing that?
      * Minor change: iter->Next is now called automatically the first time.
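      
      A minimal sketch of the guard described above, assuming the DB publishes the last flushed sequence number and the iterator checks it before advancing:
      
      ```cpp
      // Hypothetical sketch: the transaction log iterator stops at the last
      // sequence number known to be durably flushed, rather than chasing the
      // live WAL's tail.
      #include <cstdint>
      
      struct LogIteratorState {
        uint64_t current_seq;       // seq no of the record just returned
        uint64_t last_flushed_seq;  // published by db_impl after each flush
      };
      
      // Reading past last_flushed_seq risks observing a partially written
      // record while the writer is still appending to the same file.
      bool CanAdvance(const LogIteratorState& s) {
        return s.current_seq < s.last_flushed_seq;
      }
      ```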
      
      Test Plan:
      Existing tests pass.
      More ideas on testing this?
      Planning to run some stress tests.
      
      Reviewers: dhruba, heyongqiang
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D9087
  6. 04 Mar 2013 (2 commits)
    • Add rate_delay_limit_milliseconds · 993543d1
      Committed by Mark Callaghan
      Summary:
      This adds the rate_delay_limit_milliseconds option to make the delay
      configurable in MakeRoomForWrite when the max compaction score is too high.
      This delay is called the Ln slowdown. This change also counts the Ln slowdown
      per level to make it possible to see where the stalls occur.
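      
      A sketch of the idea under stated assumptions: only the configurable cap comes from this change, and the scaling of the sleep with the score is illustrative.
      
      ```cpp
      // Hypothetical sketch of the configurable Ln slowdown in
      // MakeRoomForWrite: sleep when the max compaction score is too high,
      // capped by rate_delay_limit_milliseconds instead of a hard constant.
      #include <algorithm>
      #include <chrono>
      #include <thread>
      
      void MaybeDelayWrite(double max_compaction_score,
                           int rate_delay_limit_milliseconds) {
        if (max_compaction_score <= 1.0) return;  // no backlog, no stall
        int delay_ms = std::min(rate_delay_limit_milliseconds,
                                static_cast<int>(max_compaction_score * 10));
        std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
        // A per-level stall counter would be incremented here.
      }
      ```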
      
      From IO-bound performance testing, the Level N stalls occur:
      * with compression -> at the largest uncompressed level. This makes sense
        because compaction for compressed levels is much slower. When Lx is
        uncompressed and Lx+1 is compressed, files pile up at Lx because the
        (Lx,Lx+1)->Lx+1 compaction process is the first to be slowed by
        compression.
      * without compression -> at level 1
      
      Task ID: #1832108
      
      Test Plan:
      run with real data, added test
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D9045
    • Ability for rocksdb to compact when flushing the in-memory memtable to a file in L0. · 806e2643
      Committed by Dhruba Borthakur
      Summary:
      Rocks accumulates recent writes and deletes in the in-memory memtable.
      When the memtable is full, it writes the contents of the memtable to
      a file in L0.
      
      This patch removes redundant records at the time of the flush. If there
      are multiple versions of the same key in the memtable, then only the
      most recent one is dumped into the output file. The purging of
      redundant records occurs only if the most recent snapshot is earlier
      than the earliest record in the memtable.
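      
      The gating condition reduces to a single comparison; a sketch, with names assumed for illustration:
      
      ```cpp
      // Hypothetical sketch of the flush-time purge rule: redundant versions
      // in the memtable may be dropped only when no snapshot can see them.
      #include <cstdint>
      
      bool FlushCanPurge(uint64_t newest_snapshot_seq,
                         uint64_t earliest_memtable_seq) {
        // Every record in the memtable is newer than every open snapshot,
        // so only the most recent version of each key needs to survive.
        return newest_snapshot_seq < earliest_memtable_seq;
      }
      ```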
      
      Should we switch on this feature by default or should we keep this feature
      turned off in the default settings?
      
      Test Plan: Added test case to db_test.cc
      
      Reviewers: sheki, vamsi, emayanke, heyongqiang
      
      Reviewed By: sheki
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D8991
  7. 01 Mar 2013 (1 commit)
  8. 19 Feb 2013 (1 commit)
  9. 26 Jan 2013 (1 commit)
    • Fix poor error on num_levels mismatch and a few other minor improvements · 0b83a831
      Committed by Chip Turner
      Summary:
      Previously, if you opened a db with num_levels set lower than the
      number of levels the database actually has, you received the unhelpful
      message "Corruption: VersionEdit: new-file entry."  Now you get a more
      verbose message describing the issue.
      
      Also, fix handling of compression_levels (both the run-over-the-end
      issue and its memory management).
      
      Lastly, unique_ptr'ify a couple of minor calls.
      
      Test Plan: make check
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D8151
  10. 25 Jan 2013 (1 commit)
    • Use fallocate to prevent excessive allocation of sst files and logs · 3dafdfb2
      Committed by Chip Turner
      Summary:
      On some filesystems, pre-allocation can claim a considerable amount of
      space. xfs in our production environment pre-allocates by 1GB, for
      instance. By using fallocate to inform the kernel of our expected file
      sizes, we eliminate this wastage (which isn't recovered until the file
      is closed and which, in the case of LOG files, can persist for a
      considerable amount of time).
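      
      For reference, the system call in question (Linux-specific; the wrapper below is a sketch, not the patch's code):
      
      ```cpp
      // Hypothetical sketch: hint the expected file size to the kernel so
      // filesystems like xfs do not hold a large speculative preallocation
      // (e.g. allocsize=4M or 1GB) until the file is closed.
      #define _GNU_SOURCE 1
      #include <fcntl.h>  // fallocate, FALLOC_FL_KEEP_SIZE (Linux)
      
      bool HintExpectedSize(int fd, off_t expected_size) {
        // Reserve blocks now but leave the visible file size unchanged.
        return ::fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, expected_size) == 0;
      }
      ```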
      
      Test Plan:
      created an xfs loopback filesystem, mounted with allocsize=4M, and ran
      db_stress. The LOG file without this change was 4M; with it, it was
      128k and then grew to normal size.
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: adsharma, leveldb
      
      Differential Revision: https://reviews.facebook.net/D7953
  11. 24 Jan 2013 (1 commit)
    • Fix a number of object lifetime/ownership issues · 2fdf91a4
      Committed by Chip Turner
      Summary:
      Replace manual memory management with std::unique_ptr in a number of
      places; not exhaustive, but this fixes a few leaks with file handles
      and clarifies the ownership semantics of file handles in the log
      classes.
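      
      The shape of the change, as a sketch with a hypothetical log class:
      
      ```cpp
      // Hypothetical sketch: the log writer owns its file via unique_ptr, so
      // the handle is released on every exit path without a manual delete.
      #include <memory>
      
      class WritableFile {
       public:
        virtual ~WritableFile() {}  // closing happens in the destructor
      };
      
      class LogWriter {
       public:
        explicit LogWriter(std::unique_ptr<WritableFile> file)
            : file_(std::move(file)) {}
        // No hand-written destructor: ownership is explicit in the
        // constructor signature, and the file cannot leak.
       private:
        std::unique_ptr<WritableFile> file_;
      };
      ```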
      
      Test Plan: db_stress, make check
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: zshao, leveldb, heyongqiang
      
      Differential Revision: https://reviews.facebook.net/D8043
  12. 17 Jan 2013 (1 commit)
    • rollover manifest file. · 7d5a4383
      Committed by Abhishek Kona
      Summary:
      In LogAndApply, check whether the manifest file size exceeds the limit
      set in Options.
      Things to consider: will this be expensive?
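      
      On the cost question, the check is one size comparison per LogAndApply, as this sketch suggests (names are illustrative):
      
      ```cpp
      // Hypothetical sketch of the rollover check inside LogAndApply.
      #include <cstdint>
      
      bool ShouldRollManifest(uint64_t current_manifest_size,
                              uint64_t max_manifest_file_size) {
        // One branch per version edit; a new manifest file is created only
        // on the rare occasions the limit is actually exceeded.
        return current_manifest_size > max_manifest_file_size;
      }
      ```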
      
      Test Plan: make all check. Inputs on a new unit test?
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D7701
  13. 11 Jan 2013 (1 commit)
  14. 18 Dec 2012 (1 commit)
    • Added meta-database support. · 62d48571
      Committed by Kosie van der Merwe
      Summary:
      Added kMetaDatabase for meta-databases in db/filename.h along with
      supporting functions.
      Fixed the switch in DBImpl so that it also handles kMetaDatabase.
      Fixed DestroyDB() so that it can destroy meta-databases.
      
      Test Plan: make check
      
      Reviewers: sheki, emayanke, vamsi, dhruba
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D7245
  15. 13 Dec 2012 (1 commit)
    • GetSequence API in write batch. · 2ba866e0
      Committed by Abhishek Kona
      Summary:
      WriteBatch is now used by the GetUpdatesSince API. This API is external
      and will be used by the rocks server. The rocks server and others will
      need to know the sequence number in the WriteBatch; this public method
      allows for that.
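      
      A usage sketch of the new accessor with a stand-in WriteBatch type; the real method lives on leveldb's WriteBatch, so treat the shape here as an assumption:
      
      ```cpp
      // Hypothetical stand-in showing how a replication consumer would use
      // the new public sequence accessor on a batch.
      #include <cstdint>
      
      struct WriteBatch {
        uint64_t sequence = 0;
        uint64_t GetSequence() const { return sequence; }  // the new method
      };
      
      uint64_t ReplicationCursor(const WriteBatch& batch) {
        // The rocks server persists this to know where to resume streaming.
        return batch.GetSequence();
      }
      ```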
      
      Test Plan: make all check.
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D7293
  16. 12 Dec 2012 (1 commit)
  17. 11 Dec 2012 (2 commits)
  18. 08 Dec 2012 (1 commit)
    • GetUpdatesSince API to enable replication. · 80550089
      Committed by Abhishek Kona
      Summary:
      How it works:
      * GetUpdatesSince takes a SequenceNumber.
      * The LogFile whose first SequenceNumber is nearest to, and less than, the requested sequence number is found.
      * Seek in the LogFile until the requested sequence number is found.
      * Return an iterator which contains the logic to return records one by one (usage sketch below).
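      
      A hedged sketch of what consuming the iterator looks like; the type and method names below are assumptions modeled on the description:
      
      ```cpp
      // Hypothetical consumer of the GetUpdatesSince iterator.
      #include <cstdint>
      
      struct BatchResult {
        uint64_t sequence;  // first seq no in the batch
        // ... WriteBatch payload elided ...
      };
      
      struct TransactionLogIterator {
        virtual ~TransactionLogIterator() {}
        virtual bool Valid() = 0;
        virtual void Next() = 0;
        virtual BatchResult GetBatch() = 0;
      };
      
      // Replay everything written at or after the requested sequence number.
      void Replay(TransactionLogIterator* iter, uint64_t* cursor) {
        for (; iter->Valid(); iter->Next()) {
          BatchResult res = iter->GetBatch();
          *cursor = res.sequence;  // apply res to the replica, then advance
        }
      }
      ```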
      
      Test Plan:
      * Test case included to check the good code path.
      * Will update with more test-cases.
      * Feedback required on test-cases.
      
      Reviewers: dhruba, emayanke
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D7119
  19. 29 Nov 2012 (3 commits)
    • Move WAL files to archive directory, instead of deleting. · d4627e6d
      Committed by sheki
      Summary:
      Create a directory "archive" in the DB directory.
      During DeleteObsoleteFiles, move the WAL files (*.log) to the archive
      directory instead of deleting them.
      
      Test Plan: Created a DB using db_bench. Reopened it. Checked that the files move.
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      Differential Revision: https://reviews.facebook.net/D6975
    • Fix all the lint errors. · d29f1819
      Committed by Abhishek Kona
      Summary:
      Scripted removal of all trailing spaces and converted all tabs to
      spaces.
      Also fixed other lint errors.
      All lint errors from this point on should be taken seriously.
      
      Test Plan: make all check
      
      Reviewers: dhruba
      
      Reviewed By: dhruba
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D7059
    • Delete non-visible keys during a compaction even in the presence of snapshots. · 9a357847
      Committed by Dhruba Borthakur
      Summary:
      LevelDB should delete almost-new keys when a long-open snapshot exists.
      The previous behavior is to keep all versions that were created after the
      oldest open snapshot. This can lead to database size bloat for
      high-update workloads when there are long-open snapshots, for example a
      snapshot used for a logical backup. By "almost new" I mean that the
      key was updated more than once after the oldest snapshot.
      
      If there are two snapshots with sequence numbers s1 and s2 (s1 < s2), and
      we find two instances of the same key k1 that lie entirely between s1 and
      s2 (i.e. s1 < k1 < s2), then the earlier version of k1 can be safely
      deleted because that version is not visible in any snapshot.
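      
      The rule can be stated as a predicate over sequence numbers; a sketch, assuming s1 and s2 are adjacent open snapshots:
      
      ```cpp
      // Hypothetical sketch of the visibility argument above: if two versions
      // of one key fall between the same pair of adjacent snapshots, the
      // older version is invisible to every snapshot and can be dropped.
      #include <cstdint>
      
      bool HiddenByNewerVersion(uint64_t older_seq, uint64_t newer_seq,
                                uint64_t s1, uint64_t s2) {
        // s1 < older_seq < newer_seq <= s2: snapshot s1 predates both, and
        // s2 (and anything later) resolves the key to newer_seq instead.
        return s1 < older_seq && older_seq < newer_seq && newer_seq <= s2;
      }
      ```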
      
      Test Plan:
      unit test attached
      make clean check
      
      Differential Revision: https://reviews.facebook.net/D6999
  20. 20 Nov 2012 (2 commits)
  21. 17 Nov 2012 (1 commit)
    • Fix a coding error in db_test.cc · de278a6d
      Committed by amayank
      Summary: The new function MinLevelToCompress in db_test.cc was incomplete. It needs to tell the calling TEST function whether the test has to be skipped or not.
      
      Test Plan: make all; ./db_test
      
      Reviewers: dhruba, heyongqiang
      
      Reviewed By: dhruba
      
      CC: sheki
      
      Differential Revision: https://reviews.facebook.net/D6771
  22. 14 Nov 2012 (1 commit)
  23. 07 Nov 2012 (2 commits)
  24. 06 Nov 2012 (1 commit)
    • Ability to invoke application hook for every key during compaction. · 5273c814
      Committed by Dhruba Borthakur
      Summary:
      There are certain use cases where the application intends to
      delete older keys after they have expired past a certain time period.
      One option for those applications is to periodically scan the
      entire database and delete the appropriate keys.
      
      A better way is to allow the application to hook into the
      compaction process. This patch allows the application to set
      a callback method for every key that is being compacted. If
      this method returns true, then the key is not preserved in
      the output of the compaction.
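      
      Roughly what the proposed hook could look like; the callback signature and option name here are assumptions for illustration, not the final API:
      
      ```cpp
      // Hypothetical per-key compaction hook: return true to drop the key
      // from the compaction output (e.g. entries past their TTL).
      #include <string>
      
      using CompactionKeyHook = bool (*)(int level, const std::string& key,
                                         const std::string& existing_value);
      
      bool DropIfExpired(int /*level*/, const std::string& /*key*/,
                         const std::string& value) {
        // Placeholder policy: suppose the value's first byte marks expiry.
        return !value.empty() && value[0] == 'X';
      }
      
      // At open time (hypothetical field name):
      //   options.compaction_key_hook = &DropIfExpired;
      ```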
      
      Test Plan:
      This is mostly to preview the proposed new public API.
      Since it is a public API, please do due diligence in reviewing it.
      
      I will be writing test cases for this API in my next version of
      this patch.
      
      Reviewers: MarkCallaghan, heyongqiang
      
      Reviewed By: heyongqiang
      
      CC: sheki, adsharma
      
      Differential Revision: https://reviews.facebook.net/D6285
  25. 05 Nov 2012 (1 commit)
  26. 03 Nov 2012 (1 commit)
  27. 30 Oct 2012 (4 commits)
    • fix test failure · fb8d4373
      Committed by heyongqiang
      Summary: as subject
      
      Test Plan: db_test
      
      Reviewers: dhruba, MarkCallaghan
      
      Reviewed By: MarkCallaghan
      
      Differential Revision: https://reviews.facebook.net/D6309
    • add a test case to make sure changing num_levels will fail · 925f60d3
      Committed by heyongqiang
      Summary: as subject
      
      Test Plan: db_test
      
      Reviewers: dhruba, MarkCallaghan
      
      Reviewed By: MarkCallaghan
      
      Differential Revision: https://reviews.facebook.net/D6303
    • Allow having different compression algorithms on different levels. · 321dfdc3
      Committed by Dhruba Borthakur
      Summary:
      The leveldb API is enhanced to support different compression algorithms at
      different levels.
      
      This adds the option min_level_to_compress to db_bench that specifies
      the minimum level for which compression should be done when
      compression is enabled. This can be used to disable compression for levels
      0 and 1 which are likely to suffer from stalls because of the CPU load
      for memtable flushes and (L0,L1) compaction.  Level 0 is special as it
      gets frequent memtable flushes. Level 1 is special as it frequently
      gets all:all file compactions between it and level 0. But all other levels
      could be the same. For any level N where N > 1, the rate of sequential
      IO for that level should be the same. The last level is the
      exception because it might not be full and because files from it are
      not read to compact with the next larger level.
      
      The same amount of time will be spent doing compaction at any
      level N, excluding N=0, 1, or the last level. By this standard all
      of those levels should use the same compression. The difference is that
      the loss (using more disk space) from a faster compression algorithm
      is less significant for N=2 than for N=3. So we might be willing to
      trade disk space for faster write rates with no compression
      for L0 and L1, snappy for L2, and zlib for L3. Using a faster compression
      algorithm for the mid levels also allows us to reclaim some CPU
      without giving up much in disk space overhead.
      
      Also note that little is to be gained by compressing levels 0 and 1. For
      a 4-level tree they account for 10% of the data. For a 5-level tree they
      account for 1% of the data.
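      
      Putting the recommendation into code, a sketch of a per-level assignment (the vector-valued option is what this patch describes; the option name in the comment is an assumption):
      
      ```cpp
      // Hypothetical helper building the per-level compression recommended
      // above: no compression for L0/L1, a fast codec for the mid level,
      // and a denser codec below that.
      #include <vector>
      
      enum CompressionType { kNoCompression, kSnappyCompression,
                             kZlibCompression };
      
      std::vector<CompressionType> RecommendedPerLevel(int num_levels) {
        std::vector<CompressionType> per_level(num_levels, kZlibCompression);
        if (num_levels > 0) per_level[0] = kNoCompression;      // hot flushes
        if (num_levels > 1) per_level[1] = kNoCompression;      // all:all with L0
        if (num_levels > 2) per_level[2] = kSnappyCompression;  // cheap mid level
        return per_level;  // e.g. assigned to options.compression_per_level
      }
      ```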
      
      With compression enabled:
      * memtable flush rate is ~18MB/second
      * (L0,L1) compaction rate is ~30MB/second
      
      With compression enabled but min_level_to_compress=2
      * memtable flush rate is ~320MB/second
      * (L0,L1) compaction rate is ~560MB/second
      
      This practically takes the same code from https://reviews.facebook.net/D6225
      but makes the leveldb API more general purpose with a few additional
      lines of code.
      
      Test Plan: make check
      
      Differential Revision: https://reviews.facebook.net/D6261
    • Fix unit test failure caused by delaying deleting obsolete files. · de7689b1
      Committed by Dhruba Borthakur
      Summary:
      A previous commit 4c107587 introduced
      the idea that some version updates might not delete obsolete files.
      This means that if a unit test blindly counts the number of files
      in the db directory it might not represent the true state of the database.
      
      Use GetLiveFiles() instead to count the number of live files in the database.
      
      Test Plan:
      make check
  28. 26 Oct 2012 (1 commit)
  29. 29 Sep 2012 (1 commit)
    • Trigger read compaction only if seeks to storage are incurred. · c1bb32e1
      Committed by Dhruba Borthakur
      Summary:
      In the current code, a Get() call can trigger compaction if it has to look at more than one file. This causes unnecessary compaction because looking at more than one file is a penalty only if the file is not yet in the cache. Also, the current code counts these files before the bloom filter check is applied.
      
      This patch counts a 'seek' only if the file fails the bloom filter check, i.e. the filter cannot rule the key out, and data block(s) have to be read from storage.
      
      This patch also counts a 'seek' if a file is not present in the file cache, because opening a file means that its index blocks need to be read into the cache.
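      
      The new accounting condenses to a small predicate; a sketch with assumed field names:
      
      ```cpp
      // Hypothetical sketch of when a Get() charges a file with a 'seek'.
      struct FileLookup {
        bool in_table_cache;    // table reader already open?
        bool bloom_ruled_out;   // bloom filter definitively excluded the key
        bool read_data_block;   // data block(s) were fetched from storage
      };
      
      bool ChargeSeek(const FileLookup& f) {
        // Opening a file reads its index blocks into cache: that is a seek.
        if (!f.in_table_cache) return true;
        // Otherwise charge only when the bloom filter could not rule the
        // key out and real data-block IO happened.
        return !f.bloom_ruled_out && f.read_data_block;
      }
      ```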
      
      Test Plan: unit test attached. I will probably add one more unit test.
      
      Reviewers: heyongqiang
      
      Reviewed By: heyongqiang
      
      CC: MarkCallaghan
      
      Differential Revision: https://reviews.facebook.net/D5709