1. 09 Jun, 2015 1 commit
    • Use nullptr for default compaction_filter_factory · 643bbbf0
      Islam AbdelRahman committed
      Summary:
      Replace the default value of compaction_filter_factory and compaction_filter_factory_v2 with nullptr instead of DefaultCompactionFilterFactory / DefaultCompactionFilterFactoryV2.
      This makes it easy to determine whether a compaction filter factory is set without depending on RTTI.
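
      A minimal sketch of why a nullptr default helps, assuming simplified stand-in types (the Options struct, MyFilterFactory and HasCompactionFilterFactory below are hypothetical and are not the RocksDB declarations). With a nullptr default, "is a factory configured?" becomes a pointer test rather than a dynamic_cast or typeid check against a default factory class:

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the factory interface; not the RocksDB class.
struct CompactionFilterFactory {
  virtual ~CompactionFilterFactory() = default;
  virtual const char* Name() const = 0;
};

// Hypothetical user-supplied factory.
struct MyFilterFactory : CompactionFilterFactory {
  const char* Name() const override { return "MyFilterFactory"; }
};

// Hypothetical options struct: the default is nullptr, meaning
// "no factory configured".
struct Options {
  std::shared_ptr<CompactionFilterFactory> compaction_filter_factory = nullptr;
};

// A plain pointer comparison; no dynamic_cast or typeid is needed.
inline bool HasCompactionFilterFactory(const Options& options) {
  return options.compaction_filter_factory != nullptr;
}
```

      Default-constructed Options report no factory; assigning std::make_shared<MyFilterFactory>() to the field makes the check return true.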
      
      Test Plan: make check
      
      Reviewers: yoshinorim, ott, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D39693
      643bbbf0
  2. 06 Jun, 2015 1 commit
  3. 02 Jun, 2015 1 commit
  4. 30 May, 2015 1 commit
    • Optimistic Transactions · dc9d70de
      agiardullo committed
      Summary: Optimistic transactions supporting begin/commit/rollback semantics.  Currently relies on checking the memtable to determine if there are any collisions at commit time.  Not yet implemented is a way of ensuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty.  You should probably start with transaction.h to get an overview of what is currently supported.
      
      Test Plan: Added a new test, but still need to look into stress testing.
      
      Reviewers: yhchiang, igor, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: adamretter, MarkCallaghan, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D33435
      dc9d70de
  5. 29 May, 2015 2 commits
    • Support saving history in memtable_list · c8153510
      agiardullo committed
      Summary:
      For transactions, we are using the memtables to validate that there are no write conflicts.  But after flushing, we don't have any memtables, and transactions could fail to commit.  So we want to somehow keep around some extra history to use for conflict checking.  In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.
      
      After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure).  It seems like the best place for this is abstracted inside the memtable_list.  I decided to create a separate list in MemtableListVersion, as using the same list complicated the flush/install-flush-results logic too much.
      
      This diff adds a new parameter to control how much memtable history to keep around after flushing.  However, it sounds like people aren't too fond of adding new parameters.  So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers.  This should not change the maximum amount of memory used, but makes it more likely that we're using close to the limit.  (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached.)  So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit).
      
      However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.
      
      Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests.  Added testing in memtablelist_test and planning on adding more testing here.
      
      Reviewers: sdong, rven, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D37443
      c8153510
    • [API Change] Move listeners from ColumnFamilyOptions to DBOptions · 672dda9b
      Yueh-Hsuan Chiang committed
      Summary: Move listeners from ColumnFamilyOptions to DBOptions
      
      Test Plan:
      listener_test
      compact_files_test
      
      Reviewers: rven, anthony, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D39087
      672dda9b
  6. 22 May, 2015 2 commits
  7. 20 May, 2015 2 commits
  8. 19 May, 2015 1 commit
  9. 25 Apr, 2015 1 commit
    • Task 6532943: Rocksdb - SetCapacity() can dynamically change cache capacity if feasible · 794ccfde
      Aashish Pant committed
      Summary:
      When the new capacity is larger than the existing capacity, simply update the capacity to the new value.
      When the new capacity is less than the existing capacity but more than the usage, simply update the capacity to the new value.
      When the new capacity is less than both the existing capacity and the existing usage, try to purge entries from the LRU list, if feasible, to make usage < capacity.
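
      The three cases above can be sketched with a toy LRU cache. This is an illustration under simplifying assumptions, not the RocksDB cache: ToyLRUCache and its methods are hypothetical, every entry has charge 1, nothing is pinned, and eviction stops once usage <= capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Toy LRU cache (hypothetical): each entry counts as 1 unit of usage.
class ToyLRUCache {
 public:
  explicit ToyLRUCache(size_t capacity) : capacity_(capacity) {}

  // Inserts a key (or refreshes it) as the most recently used entry.
  void Insert(const std::string& key) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      lru_.erase(it->second);
      index_.erase(it);
    }
    lru_.push_front(key);
    index_[key] = lru_.begin();
    EvictUntilWithinCapacity();
  }

  // Growing, or shrinking to a value still >= usage, only updates the limit.
  // Shrinking below usage additionally purges least recently used entries.
  void SetCapacity(size_t capacity) {
    capacity_ = capacity;
    EvictUntilWithinCapacity();
  }

  bool Contains(const std::string& key) const { return index_.count(key) > 0; }
  size_t usage() const { return lru_.size(); }
  size_t capacity() const { return capacity_; }

 private:
  void EvictUntilWithinCapacity() {
    while (lru_.size() > capacity_) {
      index_.erase(lru_.back());  // drop the least recently used key
      lru_.pop_back();
    }
  }

  size_t capacity_;
  std::list<std::string> lru_;  // front = most recently used
  std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```

      In the first two cases the eviction loop does not run at all, so SetCapacity() is just an assignment; only the third case removes entries. A real cache would also have to skip entries that are still referenced, which is why the commit says "if feasible".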
      
      Test Plan: Created unit tests in cache_test.cc
      
      Reviewers: sdong, rven, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D37527
      794ccfde
  10. 07 Apr, 2015 1 commit
    • A new callback in TablePropertiesCollector to let users know whether an entry is an add, delete or merge · 953a885e
      sdong committed
      Summary:
      Currently, the TablePropertiesCollector callback gives users no way to tell whether a key is an add, delete or merge. Add a new function that provides it.
      
      Also refactor the code so that
      (1) the table property collector and the internal table property collector are two separate data structures, with the latter now exposed
      (2) table builders only receive internal table properties
      
      Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
      
      Reviewers: yhchiang, igor.sugak, rven, igor
      
      Reviewed By: rven, igor
      
      Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D35373
      953a885e
  11. 31 Mar, 2015 2 commits
    • Universal Compactions with Small Files · b23bbaa8
      sdong committed
      Summary:
      With this change, we use L1 and up to store compaction outputs in universal compaction.
      The compaction pick logic stays the same. Outputs are stored in the largest possible "level".
      
      If options.num_levels=1, it behaves all the same as now.
      
      Test Plan:
      1) convert most of the existing unit tests for universal compaction to include the option of one level and multiple levels.
      2) add a unit test to cover parallel compaction in universal compaction and run it in one level and multiple levels
      3) add unit test to migrate from multiple level setting back to one level setting
      4) add a unit test to insert keys to trigger multiple rounds of compactions and verify results.
      
      Reviewers: rven, kradhakrishnan, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: meyering, leveldb, MarkCallaghan, dhruba
      
      Differential Revision: https://reviews.facebook.net/D34539
      b23bbaa8
    • db_bench can now disable flashcache for background threads · d61cb0b9
      Igor Canadi committed
      Summary: Most of the approach is copied from WebSQL's MySQL branch. It's nice that we can do this without touching core RocksDB code.
      
      Test Plan: Compiles and runs. Didn't test the flashcache code, as I don't have a flashcache device and most of it is copy/paste.
      
      Reviewers: MarkCallaghan, sdong
      
      Reviewed By: sdong
      
      Subscribers: rven, lgalanis, kradhakrishnan, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D35391
      d61cb0b9
  12. 26 Mar, 2015 1 commit
  13. 25 Mar, 2015 1 commit
  14. 20 Mar, 2015 1 commit
    • Don't delete files when column family is dropped · b088c83e
      Igor Canadi committed
      Summary:
      To understand the bug read t5943287 and check out the new test in column_family_test (ReadDroppedColumnFamily), iter 0.
      
      The RocksDB contract allows you to read a dropped column family as long as there is a live reference. However, since our iteration ignores dropped column families, AddLiveFiles() didn't mark the files of a dropped column family as live. So we deleted them.
      
      In this patch I no longer ignore dropped column families in the iteration. I think this behavior was confusing and it also led to this bug. Now if an iterator client wants to ignore dropped column families, it needs to do so explicitly.
      
      Test Plan: Added a new unit test that is failing on master. Unit test succeeds now.
      
      Reviewers: sdong, rven, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D32535
      b088c83e
  15. 18 Mar, 2015 2 commits
    • Create an abstract interface for write batches · 81345b90
      agiardullo committed
      Summary: WriteBatch and WriteBatchWithIndex now both inherit from a common abstract base class.  This makes it easier to write code that is agnostic toward the implementation of the particular write batch.  In particular, I plan on utilizing this abstraction to allow transactions to support using either implementation of a write batch.
      
      Test Plan: modified existing WriteBatchWithIndex tests to test new functions.  Running all tests.
      
      Reviewers: igor, rven, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D34017
      81345b90
    • Deprecate removeScanCountLimit in NewLRUCache · c88ff4ca
      Igor Canadi committed
      Summary: It is no longer used by the implementation, so we should also remove it from the public API.
      
      Test Plan: make check
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D34971
      c88ff4ca
  16. 03 Mar, 2015 1 commit
    • options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size... · db037393
      Igor Canadi committed
      options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
      
      Summary:
      When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
      
      In this improvement (proposed by Igor Kabiljo), we introduce a parameter options.level_compaction_dynamic_level_bytes. When it is turned on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
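
      One way to see how a base can land in that half-open range is to start from the size of the largest level and divide by the multiplier until the value no longer exceeds max_bytes_for_level_base. The helper below is a hypothetical sketch of that arithmetic only, not RocksDB's implementation; it assumes the largest level is bigger than max_bytes_for_level_base/max_bytes_for_level_multiplier and that the multiplier is greater than 1.

```cpp
#include <cassert>

// Hypothetical helper: derive a base-level target from the size of the
// largest level by repeatedly dividing by the level multiplier.
inline double PickLevelBase(double last_level_bytes,
                            double max_bytes_for_level_base,
                            double multiplier) {
  if (multiplier <= 1.0) {
    // Dividing would never shrink the value; fall back to the fixed base.
    return max_bytes_for_level_base;
  }
  double base = last_level_bytes;
  // Each division moves one level up the tree. The loop stops at the first
  // value that is <= max_bytes_for_level_base, which is therefore also
  // > max_bytes_for_level_base / multiplier.
  while (base > max_bytes_for_level_base) {
    base /= multiplier;
  }
  return base;
}
```

      For example, with a largest level of 30000 bytes, a base limit of 1000 and a multiplier of 10, the sketch yields 300, and every level above it is exactly 10 times the previous one, ending at the real size of the largest level.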
      
      Test Plan: New unit tests; passes test suites, including valgrind.
      
      Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
      
      Reviewed By: ikabiljo
      
      Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D31437
      db037393
  17. 28 Feb, 2015 1 commit
    • Fix a bug in ReadOnlyBackupEngine · b9ff6b05
      Igor Canadi committed
      Summary:
      This diff fixes a bug introduced by D28521. Read-only backup engine can delete a backup that is later than the latest -- we never check the condition.
      
      I also added a bunch of logging that will help with debugging cases like this in the future.
      
      See more discussion at t6218248.
      
      Test Plan: Added a unit test that was failing before the change. Also, see new LOG file contents: https://phabricator.fb.com/P19738984
      
      Reviewers: benj, sanketh, sumeet, yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D33897
      b9ff6b05
  18. 12 Feb, 2015 1 commit
    • Remember whole key/prefix filtering on/off in SST file · 68af7811
      sdong committed
      Summary: Remember whole key or prefix filtering on/off in SST files. If user opens the DB with a different setting that cannot be satisfied while reading the SST file, ignore the bloom filter.
      
      Test Plan: Add a unit test for it
      
      Reviewers: yhchiang, igor, rven
      
      Reviewed By: rven
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D32889
      68af7811
  19. 10 Feb, 2015 1 commit
  20. 05 Feb, 2015 1 commit
    • Get() to use prefix bloom filter when filter is not block based · e63140d5
      sdong committed
      Summary:
      Get() currently doesn't make use of the bloom filter if it is prefix based. Add the check.
      Didn't touch the block based bloom filter. I can't fully reason about whether it is correct to do that. But it's straightforward to do for the full bloom filter.
      
      Test Plan:
      make all check
      Add a test case in DBTest
      
      Reviewers: rven, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: MarkCallaghan, leveldb, dhruba, yoshinorim
      
      Differential Revision: https://reviews.facebook.net/D31941
      e63140d5
  21. 31 Jan, 2015 1 commit
  22. 30 Jan, 2015 1 commit
  23. 23 Jan, 2015 1 commit
  24. 15 Jan, 2015 1 commit
    • New BlockBasedTable version -- better compressed block format · 9ab5adfc
      Igor Canadi committed
      Summary:
      This diff adds BlockBasedTable format_version = 2. New format version brings better compressed block format for these compressions:
      1) Zlib -- encode decompressed size in compressed block header
      2) BZip2 -- encode decompressed size in compressed block header
      3) LZ4 and LZ4HC -- instead of doing a memcpy of size_t, encode the size as varint32. memcpy is very bad because the DB is not portable across big/little endian machines or even platforms where size_t might be 8 or 4 bytes.
      
      It does not affect format for snappy.
      
      If you write a new database with format_version = 2, it will not be readable by RocksDB versions before 3.10. DB::Open() will return corruption in that case.
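
      To illustrate why a varint32 header is portable where a raw memcpy of size_t is not, here is a minimal sketch of a base-128 varint encoder and decoder. The two functions are local illustrations written for this note (their signatures are assumptions and do not match the RocksDB coding helpers); the byte layout they produce does not depend on host endianness or on sizeof(size_t).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends `value` as a varint: 7 payload bits per byte, least significant
// group first, with the high bit set on every byte except the last.
inline void PutVarint32(std::string* dst, uint32_t value) {
  while (value >= 0x80) {
    dst->push_back(static_cast<char>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  dst->push_back(static_cast<char>(value));
}

// Decodes a varint32 from the start of `src`. Returns the number of bytes
// consumed, or 0 if the input is truncated or longer than 5 bytes.
inline size_t GetVarint32(const std::string& src, uint32_t* value) {
  uint32_t result = 0;
  for (size_t i = 0; i < src.size() && i < 5; ++i) {
    uint32_t byte = static_cast<unsigned char>(src[i]);
    result |= (byte & 0x7F) << (7 * i);
    if ((byte & 0x80) == 0) {
      *value = result;
      return i + 1;
    }
  }
  return 0;
}
```

      A decompressed size of 300 encodes as the two bytes 0xAC 0x02 on every platform, whereas a memcpy of size_t would write 4 or 8 bytes in host byte order.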
      
      Test Plan:
      Added a new test in db_test.
      I will also run db_bench and verify VSIZE when block_cache == 1GB
      
      Reviewers: yhchiang, rven, MarkCallaghan, dhruba, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D31461
      9ab5adfc
  25. 06 Jan, 2015 1 commit
    • Deprecating skip_log_error_on_recovery · 62ad0a9b
      Igor Canadi committed
      Summary:
      Since https://reviews.facebook.net/D16119, we ignore partial tailing writes. Because of that, we no longer need skip_log_error_on_recovery.
      
      The documentation says "Skip log corruption error on recovery (If client is ok with losing most recent changes)", while the option actually ignores any corruption of the WAL (not just the most recent changes). This is very dangerous and can lead to DB inconsistencies. This was originally set up to ignore partial tailing writes, which we now do automatically (after D16119). I have dug up old task t2416297, which confirms my findings.
      
      Test Plan: There were actually no tests that verified correct behavior of skip_log_error_on_recovery.
      
      Reviewers: yhchiang, rven, dhruba, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30603
      62ad0a9b
  26. 23 Dec, 2014 1 commit
  27. 22 Dec, 2014 1 commit
    • Speed up FindObsoleteFiles() · 0acc7388
      Igor Canadi committed
      Summary:
      There are two versions of FindObsoleteFiles():
      * full scan, which is executed every 6 hours (and it's terribly slow)
      * no full scan, which is executed every time a background process finishes and iterator is deleted
      
      This diff is optimizing the second case (no full scan). Here's what we do before the diff:
      * Get the list of obsolete files (files with ref==0). Some files in obsolete_files set might actually be live.
      * Get the list of live files to avoid deleting files that are live.
      * Delete files that are in obsolete_files and not in live_files.
      
      After this diff:
      * The only files with ref==0 that are still live are files that have been part of move compaction. Don't include moved files in obsolete_files.
      * Get the list of obsolete files (which exclude moved files).
      * No need to get the list of live files, since all files in obsolete_files need to be deleted.
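
      The "after" behavior can be sketched as follows. This is a simplified illustration with hypothetical types (the FileMetaData struct and Unref helper below are stand-ins, not the RocksDB definitions): a file is queued for deletion only when its last reference goes away and it was not carried into a newer version by a move compaction.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified file metadata.
struct FileMetaData {
  uint64_t number;  // file number on disk
  int refs;         // number of versions still referencing this metadata
  bool moved;       // true if a move compaction re-used the file elsewhere
};

// Drops one reference. The file is added to obsolete_files only if this was
// the last reference and the file was not moved, so everything that ends up
// in obsolete_files can be deleted without consulting a live-file list.
inline void Unref(FileMetaData* f, std::vector<uint64_t>* obsolete_files) {
  if (f->refs <= 0) {
    return;  // nothing to release
  }
  --f->refs;
  if (f->refs == 0 && !f->moved) {
    obsolete_files->push_back(f->number);
  }
}
```

      Without the moved check, a moved file whose old metadata drops to ref==0 would be deleted while a newer version still uses it, which is the case the live-file lookup used to guard against.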
      
      I'll post the benchmark results, but you can get the feel of it here: https://reviews.facebook.net/D30123
      
      This depends on D30123.
      
      P.S. We should do full scan only in failure scenarios, not every 6 hours. I'll do this in a follow-up diff.
      
      Test Plan:
      One new unit test. Made sure that the unit test fails if we don't have an `if (!f->moved)` safeguard in ~Version.
      
      make check
      
      Big number of compactions and flushes:
      
        ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
      
      Reviewers: yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30249
      0acc7388
  28. 17 Dec, 2014 1 commit
  29. 16 Dec, 2014 1 commit
    • RocksDB: Allow Level-Style Compaction to Place Files in Different Paths · 153f4f07
      Venkatesh Radhakrishnan committed
      Summary:
      Allow level-style compaction to place files in different paths.
      This diff provides the code for task 4854591. We now support level-compaction
      to place files in different paths by specifying them in db_paths along with
      the minimum level for files to store in that path.
      
      Test Plan: ManualLevelCompactionOutputPathId in db_test.cc
      
      Reviewers: yhchiang, MarkCallaghan, dhruba, yoshinorim, sdong
      
      Reviewed By: sdong
      
      Subscribers: yoshinorim, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D29799
      153f4f07
  30. 15 Dec, 2014 1 commit
    • Optimize default compile to compilation platform by default · 06eed650
      Igor Canadi committed
      Summary:
      This diff changes compile to optimize for native platform by default. This will automatically turn on crc32 optimizations for modern processors, which greatly improves rocksdb's performance.
      
      I also did some more changes to compilation documentation.
      
      Test Plan:
      compile with `make`, observe -march=native
      compile with `PORTABLE=1 make`, observe no -march=native
      
      Reviewers: sdong, rven, yhchiang, MarkCallaghan
      
      Reviewed By: MarkCallaghan
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30225
      06eed650
  31. 11 Dec, 2014 1 commit
    • Modified the LRU cache eviction code so that it doesn't evict blocks which have external references · ee95cae9
      Alexey Maykov committed
      Summary:
      Currently, blocks which have more than one reference (ie referenced by something other than cache itself) are evicted from cache. This doesn't make much sense:
      - blocks are still in RAM, so the RAM usage reported by the cache is incorrect
      - if the same block is needed by another iterator, it will be loaded and decompressed again
      
      This diff changes the reference counting scheme a bit. Previously, if the cache contained the block, this was accounted for in its refcount. After this change, the refcount is only used to track external references. There is a boolean flag which indicates whether or not the block is contained in the cache.
      This diff also changes how LRU list is used. Previously, both hashtable and the LRU list contained all blocks. After this change, the LRU list contains blocks with the refcount==0, ie those which can be evicted from the cache.
      
      Note that this change still allows the cache to grow beyond its capacity. This happens when all blocks are pinned (ie refcount>0). This is consistent with the current behavior. The cache's insert function never fails. I spent lots of time trying to make table_reader and other places work with an insert which might fail. It turned out to be pretty hard. It might really destabilize some customers, so finally, I decided against doing this.
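
      A toy sketch of the scheme described above, under simplifying assumptions: PinnedLRUCache and its methods are hypothetical, every entry has charge 1, and presence in the hash table plays the role of the "contained in the cache" flag. The refcount tracks external references only, the LRU list holds only entries with refcount==0, and insertion never fails.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

class PinnedLRUCache {
 public:
  explicit PinnedLRUCache(size_t capacity) : capacity_(capacity) {}

  // Adds a key holding one external reference. Never fails: if every entry
  // is pinned, usage simply exceeds capacity.
  void InsertPinned(const std::string& key) {
    if (table_.count(key) > 0) {
      Pin(key);
      return;
    }
    table_[key] = Entry{1, lru_.end()};
    EvictUnpinned();
  }

  // Takes another external reference; returns false on a cache miss.
  bool Pin(const std::string& key) {
    auto it = table_.find(key);
    if (it == table_.end()) return false;
    if (it->second.refs == 0) lru_.erase(it->second.lru_pos);  // not evictable
    ++it->second.refs;
    return true;
  }

  // Drops one external reference; at zero the entry becomes evictable.
  void Release(const std::string& key) {
    auto it = table_.find(key);
    if (it == table_.end() || it->second.refs == 0) return;
    if (--it->second.refs == 0) {
      lru_.push_front(key);
      it->second.lru_pos = lru_.begin();
      EvictUnpinned();
    }
  }

  bool Contains(const std::string& key) const { return table_.count(key) > 0; }
  size_t usage() const { return table_.size(); }

 private:
  struct Entry {
    int refs;  // external references only
    std::list<std::string>::iterator lru_pos;
  };

  // Evicts unreferenced entries, oldest first, while over capacity.
  void EvictUnpinned() {
    while (table_.size() > capacity_ && !lru_.empty()) {
      table_.erase(lru_.back());
      lru_.pop_back();
    }
  }

  size_t capacity_;
  std::list<std::string> lru_;  // only entries with refs == 0
  std::unordered_map<std::string, Entry> table_;
};
```

      With capacity 2 and three pinned inserts, usage is 3 until one of the entries is released, at which point that entry is evicted and usage drops back to 2.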
      
      table_cache_remove_scan_count_limit option will be unneeded after this change, but I will remove it in the following diff, if this one gets approved
      
      Test Plan: Ran tests, made sure they pass
      
      Reviewers: sdong, ljin
      
      Differential Revision: https://reviews.facebook.net/D25503
      ee95cae9
  32. 09 Dec, 2014 1 commit
  33. 03 Dec, 2014 1 commit
    • Enforce write buffer memory limit across column families · a14b7873
      Jonah Cohen committed
      Summary:
      Introduces a new class for managing write buffer memory across column
      families.  We supplement ColumnFamilyOptions::write_buffer_size with
      ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer
      instance that enforces memory limits before flushing out to disk.
      
      Test Plan: Added SharedWriteBuffer unit test to db_test.cc
      
      Reviewers: sdong, rven, ljin, igor
      
      Reviewed By: igor
      
      Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim
      
      Differential Revision: https://reviews.facebook.net/D22581
      a14b7873
  34. 21 Nov, 2014 2 commits
    • Moved checkpoint to utilities · 004f416b
      Venkatesh Radhakrishnan committed
      Summary:
      Moved checkpoint to utilities.
      Addressed comments by Igor, Siying, Dhruba
      
      Test Plan: db_test/SnapshotLink
      
      Reviewers: dhruba, igor, sdong
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D29079
      004f416b
    • Introduce GetThreadList API · d0c5f28a
      Yueh-Hsuan Chiang committed
      Summary:
      Add GetThreadList API, which allows developers to track the
      status of each process.  Currently, calling GetThreadList will
      only get the list of background threads in RocksDB with their
      thread-id and thread-type (priority) set.  Will add more support
      on this in the later diffs.
      
      ThreadStatus currently has the following properties:
      
        // A unique ID for the thread.
        const uint64_t thread_id;
      
        // The type of the thread, it could be ROCKSDB_HIGH_PRIORITY,
        // ROCKSDB_LOW_PRIORITY, and USER_THREAD
        const ThreadType thread_type;
      
        // The name of the DB instance where the thread is currently
        // involved with.  It would be set to empty string if the thread
        // does not involve in any DB operation.
        const std::string db_name;
      
        // The name of the column family where the thread is currently involved with.
        // It would be set to empty string if the thread does not involve
        // in any column family.
        const std::string cf_name;
      
        // The event that the current thread is involved.
        // It would be set to empty string if the information about event
        // is not currently available.
      
      Test Plan:
      ./thread_list_test
      export ROCKSDB_TESTS=GetThreadList
      ./db_test
      
      Reviewers: rven, igor, sdong, ljin
      
      Reviewed By: ljin
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D25047
      d0c5f28a