1. 07 11月, 2015 1 次提交
    • D
      Enable Windows warnings C4307 C4309 C4512 C4701 · 20f57b17
      Dmitri Smirnov 提交于
        Enable C4307 'operator' : integral constant overflow
        Longs and ints on Windows are 32-bit hence the overflow
        Enable C4309 'conversion' : truncation of constant value
        Enable C4512 'class' : assignment operator could not be generated
        Enable C4701 Potentially uninitialized local variable 'name' used
      20f57b17
  2. 06 11月, 2015 1 次提交
    • V
      Prefix-based iterating only shows keys in prefix · 9d50afc3
      Venkatesh Radhakrishnan 提交于
      Summary:
      MyRocks testing found an issue that while iterating over keys
      that are outside the prefix, sometimes wrong results were seen for keys
      outside the prefix. We now tighten the range of keys seen with a new
      read option called prefix_seen_at_start. This remembers the starting
      prefix and then compares it on a Next for equality of prefix. If they
      are from a different prefix, it sets valid to false.
      
      Test Plan: PrefixTest.PrefixValid
      
      Reviewers: IslamAbdelRahman, sdong, yhchiang, anthony
      
      Reviewed By: anthony
      
      Subscribers: spetrunia, hermanlee4, yoshinorim, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D50211
      9d50afc3
  3. 31 10月, 2015 1 次提交
  4. 30 10月, 2015 2 次提交
  5. 28 10月, 2015 1 次提交
    • D
      Implement smart buffer management. · 6fbc4f9f
      Dmitri Smirnov 提交于
        introduce a new DBOption random_access_max_buffer_size to limit
        the size of the random access buffer used for unbuffered access.
        Implement read ahead buffering when enabled.
        To that effect propagate compaction_readahead_size and the new option
        to the env options to make it available for the implementation.
        Add Hint() override so SetupForCompaction() call would call Hint()
        readahead can now be setup from both Hint() and EnableReadAhead()
        Add new option random_access_max_buffer_size support
        db_bench, options_helper to make it string parsable
        and the unit test.
      6fbc4f9f
  6. 20 10月, 2015 1 次提交
  7. 19 10月, 2015 1 次提交
  8. 14 10月, 2015 1 次提交
  9. 13 10月, 2015 1 次提交
  10. 08 10月, 2015 1 次提交
  11. 22 9月, 2015 1 次提交
    • S
      Add a mode to always pick the oldest file to compact for each level · f1b9f804
      sdong 提交于
      Summary:
      Add options.compaction_pri, which specifies the policy about which file to compact first.
      kCompactionPriByLargestSeq will compact oldest files first.
      Verified the behavior in db_bench but did not write unit tests yet. Also need to make it settable through option string and dynamically changeable.
      
      Test Plan: Will write unit tests
      
      Reviewers: igor, rven, anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, MarkCallaghan
      
      Reviewed By: yhchiang
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D45951
      f1b9f804
  12. 17 9月, 2015 1 次提交
  13. 16 9月, 2015 1 次提交
    • A
      Add DBOption.max_subcompaction to option dump · 2b683d49
      Ari Ekmekji 提交于
      Summary:
      RocksDB options can be dumped to the log file, and
      up to this point the max_subcompactions option was not included
      in this dump. This fixes that.
      
      Test Plan: makek all && make check
      
      Reviewers: MarkCallaghan, igor, noetzli, anthony, yhchiang, sdong
      
      Reviewed By: yhchiang, sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D46971
      2b683d49
  14. 15 9月, 2015 2 次提交
  15. 27 8月, 2015 1 次提交
    • I
      ReadaheadRandomAccessFile -- userspace readahead · 5f4166c9
      Igor Canadi 提交于
      Summary:
      ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS.
      
      We add ReadaheadRandomAccessFile layer only when file is read during compactions.
      
      D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff.
      
      Test Plan: make check
      
      Reviewers: MarkCallaghan, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D45123
      5f4166c9
  16. 22 8月, 2015 1 次提交
    • A
      Changed 'num_subcompactions' to the more accurate 'max_subcompactions' · b6def58f
      Ari Ekmekji 提交于
      Summary:
      Up until this point we had DbOptions.num_subcompactions, but
      it is semantically more correct to call this max_subcompactions since
      we will schedule *up to* DbOptions.max_subcompactions smaller compactions
      at a time during a compaction job.
      
      I also added a --subcompactions option to db_bench
      
      Test Plan: make all   make check
      
      Reviewers: sdong, igor, anthony, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D45069
      b6def58f
  17. 21 8月, 2015 1 次提交
    • S
      Add options.new_table_reader_for_compaction_inputs · 9130873a
      sdong 提交于
      Summary: Currently compaction inputs share the same file descriptor and table reader as other foreground threads. It makes fadvise works less predictable. Add options.new_table_reader_for_compaction_inputs to enforce to create a new file descriptor and new table reader for it.
      
      Test Plan: Add the option.
      
      Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman, igor, yhchiang
      
      Reviewed By: igor
      
      Subscribers: igor, MarkCallaghan, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D43311
      9130873a
  18. 14 8月, 2015 1 次提交
    • S
      Add options.compaction_measure_io_stats to print write I/O stats in compactions · 603b6da8
      sdong 提交于
      Summary:
      Add options.compaction_measure_io_stats to print out / pass to listener accumulated time spent on write calls. Example outputs in info logs:
      
      2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]}
      
      Add two more counters in iostats_context.
      
      Also add a parameter of db_bench.
      
      Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D44115
      603b6da8
  19. 12 8月, 2015 1 次提交
    • I
      Parallelize LoadTableHandlers · cee1e8a0
      Islam AbdelRahman 提交于
      Summary: Add a new option that all LoadTableHandlers to use multiple threads to load files on DB Open and Recover
      
      Test Plan:
      make check -j64
      COMPILE_WITH_TSAN=1 make check -j64
      DISABLE_JEMALLOC=1 make all valgrind_check -j64 (still running)
      
      Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D43755
      cee1e8a0
  20. 11 8月, 2015 1 次提交
  21. 05 8月, 2015 2 次提交
    • I
      Support delete rate limiting · c45a57b4
      Islam AbdelRahman 提交于
      Summary:
      Introduce DeleteScheduler that allow enforcing a rate limit on file deletion
      Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed.
      
      I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile
      
      Test Plan:
      added delete_scheduler_test
      existing unit tests
      
      Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D43221
      c45a57b4
    • Y
      Add DBOptions::skip_sats_update_on_db_open · 14d0bfa4
      Yueh-Hsuan Chiang 提交于
      Summary:
      UpdateAccumulatedStats() is used to optimize compaction decision
      esp. when the number of deletion entries are high, but this function
      can slowdown DBOpen esp. in disk environment.
      
      This patch adds DBOptions::skip_sats_update_on_db_open, which skips
      UpdateAccumulatedStats() in DB::Open() time when it's set to true.
      
      Test Plan: Add DBCompactionTest.SkipStatsUpdateTest
      
      Reviewers: igor, anthony, IslamAbdelRahman, sdong
      
      Reviewed By: sdong
      
      Subscribers: tnovak, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D42843
      14d0bfa4
  22. 04 8月, 2015 1 次提交
    • A
      Parallelize L0-L1 Compaction: Restructure Compaction Job · 40c64434
      Ari Ekmekji 提交于
      Summary:
      As of now compactions involving files from Level 0 and Level 1 are single
      threaded because the files in L0, although sorted, are not range partitioned like
      the other levels. This means that during L0-L1 compaction each file from L1
      needs to be merged with potentially all the files from L0.
      
      This attempt to parallelize the L0-L1 compaction assigns a thread and a
      corresponding iterator to each L1 file that then considers only the key range
      found in that L1 file and only the L0 files that have those keys (and only the
      specific portion of those L0 files in which those keys are found). In this way
      the overlap is minimized and potentially eliminated between different iterators
      focusing on the same files.
      
      The first step is to restructure the compaction logic to break L0-L1 compactions
      into multiple, smaller, sequential compactions. Eventually each of these smaller
      jobs will be run simultaneously. Areas to pay extra attention to are
      
        # Correct aggregation of compaction job statistics across multiple threads
        # Proper opening/closing of output files (make sure each thread's is unique)
        # Keys that span multiple L1 files
        # Skewed distributions of keys within L0 files
      
      Test Plan: Make and run db_test (newer version has separate compaction tests) and compaction_job_stats_test
      
      Reviewers: igor, noetzli, anthony, sdong, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: MarkCallaghan, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D42699
      40c64434
  23. 18 7月, 2015 2 次提交
    • I
      Don't let flushes preempt compactions · 35ca5936
      Igor Canadi 提交于
      Summary:
      When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler.
      
      We have a special case that when you set max_background_flushes to 0, we
      schedule the flush to execute on the compaction thread.
      
      Test Plan: make check (there might be some unit tests that depend on this behavior)
      
      Reviewers: IslamAbdelRahman, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D41931
      35ca5936
    • I
      Deprecate CompactionFilterV2 · a96fcd09
      Igor Canadi 提交于
      Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature.
      
      Test Plan: make check
      
      Reviewers: noetzli, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D42405
      a96fcd09
  24. 14 7月, 2015 2 次提交
    • I
      Deprecate purge_redundant_kvs_while_flush · a9c51095
      Igor Canadi 提交于
      Summary: This option is guarding the feature implemented 2 and a half years ago: D8991. The feature was enabled by default back then and has been running without issues. There is no reason why any client would turn this feature off. I found no reference in fbcode.
      
      Test Plan: none
      
      Reviewers: sdong, yhchiang, anthony, dhruba
      
      Reviewed By: dhruba
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D42063
      a9c51095
    • S
      "make format" against last 10 commits · f9728640
      sdong 提交于
      Summary: This helps Windows port to format their changes, as discussed. Might have formatted some other codes too becasue last 10 commits include more.
      
      Test Plan: Build it.
      
      Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D41961
      f9728640
  25. 02 7月, 2015 2 次提交
    • D
      Address GCC compilation issues · ca2fe2c1
      Dmitri Smirnov 提交于
       invalid suffix on literal
       no return statement in function returning non-void CuckooStep::operator=
       extra qualification ‘rocksdb::spatial::Variant::
       dereferencing type-punned pointer will break strict-aliasing rules
      ca2fe2c1
    • D
      Windows Port from Microsoft · 18285c1e
      Dmitri Smirnov 提交于
       Summary: Make RocksDb build and run on Windows to be functionally
       complete and performant. All existing test cases run with no
       regressions. Performance numbers are in the pull-request.
      
       Test plan: make all of the existing unit tests pass, obtain perf numbers.
      
       Co-authored-by: Praveen Rao praveensinghrao@outlook.com
       Co-authored-by: Sherlock Huang baihan.huang@gmail.com
       Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com
       Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com
      18285c1e
  26. 24 6月, 2015 1 次提交
    • G
      Implement a table-level row cache · 782a1590
      Giuseppe Ottaviano 提交于
      Summary:
      Implementation of a table-level row cache.
      It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache.
      
      Supports snapshots and merge operations.
      
      Test Plan: Ran `make valgrind_check commit-prereq`
      
      Reviewers: igor, philipp, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D39849
      782a1590
  27. 23 6月, 2015 1 次提交
    • K
      Introduce WAL recovery consistency levels · de85e4ca
      krad 提交于
      Summary:
      The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL.
      
      1. RecoverAfterRestart (kTolerateCorruptedTailRecords)
      
      This mocks the current recovery mode.
      
      2. RecoverAfterCleanShutdown (kAbsoluteConsistency)
      
      This is ideal for unit test and cases where the store is shutdown cleanly. We tolerate no corruption or incomplete writes.
      
      3. RecoverPointInTime (kPointInTimeRecovery)
      
      This is ideal when using devices with controller cache or file systems which can loose data on restart. We recover upto the point were is no corruption or incomplete write.
      
      4. RecoverAfterDisaster (kSkipAnyCorruptRecord)
      
      This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of salvaging as many records as possible.
      
      Test Plan:
      (1) Run added unit test to cover all levels.
      (2) Run make check.
      
      Reviewers: leveldb, sdong, igor
      
      Subscribers: yoshinorim, dhruba
      
      Differential Revision: https://reviews.facebook.net/D38487
      de85e4ca
  28. 19 6月, 2015 2 次提交
    • I
      Fail DB::Open() when the requested compression is not available · 760e9a94
      Igor Canadi 提交于
      Summary:
      Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users.
      
      This patch fails DB::Open() if we don't support the compression that is specified in the options.
      
      Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails.
      
      Reviewers: sdong, MarkCallaghan, rven, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D39687
      760e9a94
    • I
      Don't dump DBOptions for each column family · 4b8bb62f
      Igor Canadi 提交于
      Summary: Currently we dump DBOptions for each column family options we dump. This leads to duplicate lines in our LOG file. This diff fixes that.
      
      Test Plan: Check out the LOG
      
      Reviewers: sdong, rven, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: IslamAbdelRahman, yoshinorim, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D39729
      4b8bb62f
  29. 12 6月, 2015 1 次提交
    • S
      Slow down writes by bytes written · 7842920b
      sdong 提交于
      Summary:
      We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch.
      
      The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work
      
      hard_rate_limit is deprecated.
      
      options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up.
      
      Test Plan: Add new unit tests in db_test
      
      Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor
      
      Reviewed By: igor
      
      Subscribers: ikabiljo, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D36351
      7842920b
  30. 09 6月, 2015 1 次提交
    • I
      Use nullptr for default compaction_filter_factory · 643bbbf0
      Islam AbdelRahman 提交于
      Summary:
      Replacing the default value for compaction_filter_factory and compaction_filter_factory_v2 to be nullptr instead of DefaultCompactionFilterFactory / DefaultCompactionFilterFactoryV2
      The reason for this is to be able to determine easily if we have compaction filter factory or not without depending on RTTI
      
      Test Plan: make check
      
      Reviewers: yoshinorim, ott, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D39693
      643bbbf0
  31. 30 5月, 2015 1 次提交
  32. 29 5月, 2015 2 次提交
    • A
      Support saving history in memtable_list · c8153510
      agiardullo 提交于
      Summary:
      For transactions, we are using the memtables to validate that there are no write conflicts.  But after flushing, we don't have any memtables, and transactions could fail to commit.  So we want to someone keep around some extra history to use for conflict checking.  In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.
      
      After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure).  It seems like the best place for this is abstracted inside the memtable_list.  I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much.
      
      This diff adds a new parameter to control how much memtable history to keep around after flushing.  However, it sounds like people aren't too fond of adding new parameters.  So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers.  This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit.  (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached).  So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit).
      
      However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.
      
      Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests.  Added testing in memtablelist_test and planning on adding more testing here.
      
      Reviewers: sdong, rven, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D37443
      c8153510
    • Y
      [API Change] Move listeners from ColumnFamilyOptions to DBOptions · 672dda9b
      Yueh-Hsuan Chiang 提交于
      Summary: Move listeners from ColumnFamilyOptions to DBOptions
      
      Test Plan:
      listener_test
      compact_files_test
      
      Reviewers: rven, anthony, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D39087
      672dda9b