1. 03 Apr 2018, 1 commit
    • Level Compaction with TTL · 04c11b86
      Sagar Vemuri committed
      Summary:
      Level Compaction with TTL.
      
      As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.
      
      Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
      This also has the (good) side effect that all data in the non-bottommost levels is newer than ttl, and all data in the bottommost level is older than ttl. It may cost some extra write I/O in exchange for the space savings.
      
      This functionality can be controlled by the newly introduced column family option -- ttl.
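      
      A minimal sketch, assuming the option keeps this shape, of enabling a 30-day TTL on a column family (the path and value are illustrative, not from the commit):
      
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      
      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Files whose data is older than this many seconds become
        // candidates for compaction even without other triggers.
        options.ttl = 30 * 24 * 60 * 60;  // 30 days
      
        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/ttl_db", &db);
        if (s.ok()) {
          delete db;
        }
        return 0;
      }
      ```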
      
      TODO for later:
      - Make ttl mutable
      - Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
      - Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
      Closes https://github.com/facebook/rocksdb/pull/3591
      
      Differential Revision: D7275442
      
      Pulled By: sagar0
      
      fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
  2. 06 Mar 2018, 1 commit
  3. 23 Feb 2018, 2 commits
  4. 20 Oct 2017, 1 commit
    • Make FIFO compaction options dynamically configurable · f0804db7
      Sagar Vemuri committed
      Summary:
      ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now.
      
      Some of the ways in which the fifo compaction options can be set are:
      - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})`
      - `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})`
      - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})`
      - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})`
      
      Most of the code has been made generic enough that it can be reused later to make universal compaction options (and other such nested types) dynamic with very few lines of parsing/serializing code changes.
      Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`.
      The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`, so they are now just simple wrappers.
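      
      As the list above shows, nested struct options are passed as a single "{k=v;...}" string. A hedged sketch of applying one of these on a live DB (the values are illustrative):
      
      ```
      #include <string>
      #include <unordered_map>
      #include "rocksdb/db.h"
      
      // 'db' is an open DB using FIFO compaction; values are illustrative.
      rocksdb::Status TightenFifoLimits(rocksdb::DB* db) {
        // The whole nested struct is one option string.
        return db->SetOptions(
            {{"compaction_options_fifo",
              "{max_table_files_size=1073741824;ttl=600;allow_compaction=true;}"}});
      }
      ```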
      Closes https://github.com/facebook/rocksdb/pull/3006
      
      Differential Revision: D6058619
      
      Pulled By: sagar0
      
      fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92
  5. 22 Jul 2017, 2 commits
  6. 16 Jul 2017, 1 commit
  7. 29 Jun 2017, 1 commit
    • Improve Status message for block checksum mismatches · 397ab111
      Mike Kolupaev committed
      Summary:
      We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages.
      
      It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild.
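      
      For reference, a sketch (not from the PR) of where these messages typically surface: checking the iterator's status after a scan.
      
      ```
      #include <iostream>
      #include <memory>
      #include "rocksdb/db.h"
      
      void ScanAll(rocksdb::DB* db) {
        std::unique_ptr<rocksdb::Iterator> it(
            db->NewIterator(rocksdb::ReadOptions()));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // consume it->key() / it->value()
        }
        if (!it->status().ok()) {
          // With this change the message can include file name and offset.
          std::cerr << it->status().ToString() << std::endl;
        }
      }
      ```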
      Closes https://github.com/facebook/rocksdb/pull/2507
      
      Differential Revision: D5345702
      
      Pulled By: al13n321
      
      fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339
  8. 28 Apr 2017, 1 commit
  9. 22 Apr 2017, 1 commit
  10. 14 Apr 2017, 1 commit
    • change use_direct_writes to use_direct_io_for_flush_and_compaction · 44fa8ece
      Aaron Gao committed
      Summary:
      Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction
      Now if Options::use_direct_io_for_flush_and_compaction = true, direct I/O is enabled for both reads and writes in flush and compaction jobs, whereas Options::use_direct_reads controls user reads such as iterators and Get().
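      
      A small illustrative sketch (not from the PR) of the resulting split between user reads and background I/O:
      
      ```
      #include "rocksdb/options.h"
      
      rocksdb::Options MakeDirectIoOptions() {
        rocksdb::Options options;
        // Direct I/O for user reads (Get(), iterators).
        options.use_direct_reads = true;
        // Direct I/O for background flush and compaction reads/writes;
        // replaces the old use_direct_writes.
        options.use_direct_io_for_flush_and_compaction = true;
        return options;
      }
      ```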
      Closes https://github.com/facebook/rocksdb/pull/2117
      
      Differential Revision: D4860912
      
      Pulled By: lightmark
      
      fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19
  11. 14 Mar 2017, 1 commit
    • Pinnableslice (2nd attempt) · 11526252
      Maysam Yabandeh committed
      Summary:
      PinnableSlice
      
          Summary:
          Currently the point lookup values are copied to a string provided by the
          user. This incurs an extra memcpy cost. This patch allows doing point lookup
          via a PinnableSlice which pins the source memory location (instead of
          copying their content) and releases them after the content is consumed
          by the user. The old API of Get(string) is translated to the new API
          underneath.
      
          Here is the summary for improvements:
      
          value 100 byte: 1.8% regular, 1.2% merge values
          value 1k byte: 11.5% regular, 7.5% merge values
          value 10k byte: 26% regular, 29.9% merge values
          The improvement for merge could be more if we extend this approach to
          pin the merge output and delay the full merge operation until the user
          actually needs it. We have put that for future work.
      
          PS:
          Sometimes we observe a small decrease in performance when switching from
          t5452014 to this patch but with the old Get(string) API. The difference
          is small and could be noise. More importantly, it is safely cancelled
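      
      A minimal sketch of a point lookup with the new API (helper name and signature are illustrative, not from the patch):
      
      ```
      #include <string>
      #include "rocksdb/db.h"
      
      // 'db' is an open DB.
      rocksdb::Status PointLookup(rocksdb::DB* db, const rocksdb::Slice& key,
                                  std::string* out) {
        rocksdb::PinnableSlice value;
        rocksdb::Status s = db->Get(rocksdb::ReadOptions(),
                                    db->DefaultColumnFamily(), key, &value);
        if (s.ok()) {
          // The data stays pinned (no memcpy) until 'value' is Reset or
          // destroyed; copy out only if it must outlive this scope.
          out->assign(value.data(), value.size());
        }
        return s;
      }
      ```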
      Closes https://github.com/facebook/rocksdb/pull/1756
      
      Differential Revision: D4391738
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 6f3edd3
  12. 24 Feb 2017, 1 commit
  13. 14 Feb 2017, 1 commit
    • Remove disableDataSync option · eb912a92
      Sagar Vemuri committed
      Summary:
      Remove disableDataSync, and the similarly named disable_data_sync option.
      This is being done to simplify the options, and also because the performance gains of this feature can be achieved by other means.
      Closes https://github.com/facebook/rocksdb/pull/1859
      
      Differential Revision: D4541292
      
      Pulled By: sagar0
      
      fbshipit-source-id: 5b3a6ca
  14. 09 Jan 2017, 2 commits
    • Revert "PinnableSlice" · d0ba8ec8
      Maysam Yabandeh committed
      Summary:
      This reverts commit 54d94e9c.
      
      The pull request was landed by mistake.
      Closes https://github.com/facebook/rocksdb/pull/1755
      
      Differential Revision: D4391678
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 36d5149
    • PinnableSlice · 54d94e9c
      Maysam Yabandeh committed
      Summary:
      Currently the point lookup values are copied to a string provided by the user.
      This incurs an extra memcpy cost. This patch allows doing point lookup
      via a PinnableSlice which pins the source memory location (instead of
      copying their content) and releases them after the content is consumed
      by the user. The old API of Get(string) is translated to the new API
      underneath.
      
       Here is the summary for improvements:
       1. value 100 byte: 1.8%  regular, 1.2% merge values
       2. value 1k   byte: 11.5% regular, 7.5% merge values
       3. value 10k byte: 26% regular,    29.9% merge values
      
       The improvement for merge could be more if we extend this approach to
       pin the merge output and delay the full merge operation until the user
       actually needs it. We have put that for future work.
      
      PS:
      Sometimes we observe a small decrease in performance when switching from
      t5452014 to this patch but with the old Get(string) API. The difference
      is small and could be noise. More importantly, it is safely
      cancelled
      Closes https://github.com/facebook/rocksdb/pull/1732
      
      Differential Revision: D4374613
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a077f1a
  15. 23 Dec 2016, 1 commit
    • direct io write support · 972f96b3
      Aaron Gao committed
      Summary:
      rocksdb direct io support
      
      ```
      [gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 5.0
      Date:       Wed Nov 23 13:17:43 2016
      CPU:        40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
      CPUCache:   25600 KB
      Keys:       16 bytes each
      Values:     100 bytes each (50 bytes after compression)
      Entries:    1000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    110.6 MB (estimated)
      FileSize:   62.9 MB (estimated)
      Write rate: 0 bytes/second
      Compression: Snappy
      Memtablerep: skip_list
      Perf Level: 1
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      ------------------------------------------------
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      DB path: [/tmp/rocksdbtest-112628/dbbench]
      fillseq      :       4.393 micros/op 227639 ops/sec;   25.2 MB/s
      
      ```
      Closes https://github.com/facebook/rocksdb/pull/1564
      
      Differential Revision: D4241093
      
      Pulled By: lightmark
      
      fbshipit-source-id: 98c29e3
  16. 03 Nov 2016, 1 commit
  17. 08 Oct 2016, 1 commit
    • Support running consistency checks in release mode · 2ad68b97
      Islam AbdelRahman committed
      Summary:
      We always run consistency checks when compiling in debug mode.
      Allow users to set Options::force_consistency_checks to true so that such checks run even when compiling in release mode.
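      
      A short illustrative sketch of opting in:
      
      ```
      #include "rocksdb/options.h"
      
      rocksdb::Options MakeCheckedOptions() {
        rocksdb::Options options;
        // Run LSM consistency checks even in release builds.
        options.force_consistency_checks = true;
        return options;
      }
      ```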
      
      Test Plan:
      make check -j64
      make release
      
      Reviewers: lightmark, sdong, yiwu
      
      Reviewed By: yiwu
      
      Subscribers: hermanlee4, andrewkr, yoshinorim, jkedgar, dhruba
      
      Differential Revision: https://reviews.facebook.net/D64701
  18. 14 Sep 2016, 1 commit
    • Refactor GetMutableOptionsFromStrings · 8e061f97
      Yi Wu committed
      Summary: Add mutable options info into `OptionsTypeInfo` and use it to parse mutable options map. Also support `max_bytes_for_level_multiplier_additional` in option file.
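      
      A hedged sketch of exercising the refactored parsing through SetOptions on a live DB; the option values are illustrative, and the runtime mutability of these particular options is an assumption:
      
      ```
      #include "rocksdb/db.h"
      
      // 'db' is an open DB. Plain and vector-valued mutable options are
      // parsed through the same options-map path.
      rocksdb::Status TuneLevels(rocksdb::DB* db) {
        return db->SetOptions(
            {{"max_bytes_for_level_multiplier_additional", "1:2:4:8"},
             {"level0_file_num_compaction_trigger", "4"}});
      }
      ```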
      
      Test Plan: unit test
      
      Reviewers: yhchiang, IslamAbdelRahman, sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D63843
  19. 08 Sep 2016, 1 commit
  20. 02 Sep 2016, 1 commit
    • Merge options source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes · 32149059
      sdong committed
      
      Summary: To reduce the number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.
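      
      An illustrative sketch of the consolidated option (the 64 MB cap is an assumed value, not from the commit):
      
      ```
      #include "rocksdb/options.h"
      
      rocksdb::Options MakeCompactionOptions() {
        rocksdb::Options options;
        // Single cap replacing expanded_compaction_factor,
        // source_compaction_factor and max_grandparent_overlap_bytes.
        options.max_compaction_bytes = 64ULL << 20;  // 64 MB per compaction
        return options;
      }
      ```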
      
      Test Plan: Add two new unit tests. Run all existing tests, including jtest.
      
      Reviewers: yhchiang, igor, IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: leveldb, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D59829
  21. 27 Jul 2016, 1 commit
    • Change options memtable_prefix_bloom_huge_page_tlb_size => memtable_huge_page_size and cover huge page to memtable too · e5b5f12b
      sdong committed
      
      Summary: Extend the option memtable_prefix_bloom_huge_page_tlb_size so that huge pages are used not only for the memtable bloom filter but for the memtable itself too.
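      
      An illustrative sketch of the renamed option (the 2 MB huge-page size is an assumption):
      
      ```
      #include "rocksdb/options.h"
      
      rocksdb::Options MakeHugePageOptions() {
        rocksdb::Options options;
        // Was memtable_prefix_bloom_huge_page_tlb_size; now also covers
        // the memtable allocation itself, not just its bloom filter.
        options.memtable_huge_page_size = 2 * 1024 * 1024;
        return options;
      }
      ```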
      
      Test Plan: Run all existing tests.
      
      Reviewers: IslamAbdelRahman, yhchiang, andrewkr
      
      Reviewed By: andrewkr
      
      Subscribers: leveldb, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D60513
  22. 18 Jun 2016, 1 commit
    • Deprecate filter_deletes · 7b79238b
      sdong committed
      Summary: filter_deletes is not a frequently used feature. Remove it.
      
      Test Plan: Run all test suites.
      
      Reviewers: igor, yhchiang, IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: leveldb, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D59427
  23. 11 Jun 2016, 1 commit
    • memtable_prefix_bloom_bits -> memtable_prefix_bloom_bits_ratio and deprecate memtable_prefix_bloom_probes · 20699df8
      sdong committed
      
      Summary:
      memtable_prefix_bloom_probes is not a critical option; remove it to reduce the number of options.
      It's easy for users to make mistakes with memtable_prefix_bloom_bits, so turn it into memtable_prefix_bloom_bits_ratio.
      
      Test Plan: Run all existing tests
      
      Reviewers: yhchiang, igor, IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: gunnarku, yoshinorim, MarkCallaghan, leveldb, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D59199
  24. 16 Apr 2016, 1 commit
  25. 02 Apr 2016, 1 commit
    • Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes. · 9b519875
      Marton Trencseni committed
      Summary:
      When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
      What this feature adds: when an L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, i.e. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
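      
      A hedged configuration sketch; mapping the prefetch behavior described above to the cache_index_and_filter_blocks option is an assumption of this example:
      
      ```
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"
      
      rocksdb::Options MakePinningOptions() {
        rocksdb::BlockBasedTableOptions table_options;
        // Load index/filter blocks through the block cache...
        table_options.cache_index_and_filter_blocks = true;
        // ...but keep L0 files' index/filter blocks pinned so the LRU
        // cache never evicts them and lookups skip the cache lock.
        table_options.pin_l0_filter_and_index_blocks_in_cache = true;
      
        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```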
      
      Test Plan:
      'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
      I didn't run the Java tests, I don't have Java set up on my devserver.
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D56133
  26. 22 Mar 2016, 1 commit
  27. 18 Mar 2016, 1 commit
    • Adding pin_l0_filter_and_index_blocks_in_cache feature. · 522de4f5
      Marton Trencseni committed
      Summary:
      When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
      What this feature adds: when an L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, i.e. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
      When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction.
      
      Test Plan:
      Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false).
      
      DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32
        Mac: OK.
        Linux: with D55287 patched in it's OK.
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D54801
  28. 17 Mar 2016, 1 commit
  29. 10 Feb 2016, 1 commit
  30. 06 Feb 2016, 1 commit
  31. 12 Nov 2015, 1 commit
    • Add OptionsUtil::LoadOptionsFromFile() API · e11f676e
      Yueh-Hsuan Chiang committed
      Summary:
      This patch adds OptionsUtil::LoadOptionsFromFile() and
      OptionsUtil::LoadLatestOptionsFromDB(), which allow developers
      to construct DBOptions and ColumnFamilyOptions from a RocksDB
      options file.  Note that most pointer-typed options such as
      merge_operator will not be constructed.
      
      With this API, developers no longer need to remember all the
      options in order to reopen an existing rocksdb instance like
      the following:
      
        DBOptions db_options;
        std::vector<std::string> cf_names;
        std::vector<ColumnFamilyOptions> cf_opts;
      
        // Load primitive-typed options from an existing DB
        OptionsUtil::LoadLatestOptionsFromDB(
            dbname, &db_options, &cf_names, &cf_opts);
      
        // Initialize necessary pointer-typed options
        cf_opts[0].merge_operator.reset(new MyMergeOperator());
        ...
      
        // Construct the vector of ColumnFamilyDescriptor
        std::vector<ColumnFamilyDescriptor> cf_descs;
        for (size_t i = 0; i < cf_opts.size(); ++i) {
          cf_descs.emplace_back(cf_names[i], cf_opts[i]);
        }
      
        // Open the DB
        DB* db = nullptr;
        std::vector<ColumnFamilyHandle*> cf_handles;
        auto s = DB::Open(db_options, dbname, cf_descs,
                          &cf_handles, &db);
      
      Test Plan:
      Augment existing tests in column_family_test
      options_test
      db_test
      
      Reviewers: igor, IslamAbdelRahman, sdong, anthony
      
      Reviewed By: anthony
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D49095
  32. 18 Oct 2015, 1 commit
  33. 18 Sep 2015, 1 commit
    • Support for SingleDelete() · 014fd55a
      Andres Noetzli committed
      Summary:
      This patch fixes #7460559. It introduces SingleDelete as a new database
      operation. This operation can be used to delete keys that were never
      overwritten (no put following another put of the same key). If an overwritten
      key is single deleted, the behavior is undefined. Single deletion of a
      non-existent key has no effect, but multiple consecutive single deletions are
      not allowed (see limitations).
      
      In contrast to the conventional Delete() operation, the deletion entry is
      removed along with the value when the two are lined up in a compaction. Note:
      The semantics are similar to @igor's prototype that allowed this
      behavior at the granularity of a column family (
      https://reviews.facebook.net/D42093 ). This new patch, however, is more
      aggressive when it comes to removing tombstones: It removes the SingleDelete
      together with the value whenever there is no snapshot between them while the
      older patch only did this when the sequence number of the deletion was older
      than the earliest snapshot.
      
      Most of the complex additions are in the Compaction Iterator, all other changes
      should be relatively straightforward. The patch also includes basic support for
      single deletions in db_stress and db_bench.
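      
      A minimal sketch of the intended usage pattern (key and value are illustrative): each key receives exactly one Put and is later removed with SingleDelete.
      
      ```
      #include "rocksdb/db.h"
      
      // 'db' is an open DB. SingleDelete is only safe for keys written once.
      rocksdb::Status WriteOnceThenRemove(rocksdb::DB* db) {
        rocksdb::WriteOptions wo;
        rocksdb::Status s = db->Put(wo, "session:42", "payload");
        if (!s.ok()) return s;
        // Tombstone and value can annihilate at the first compaction that
        // lines them up; a plain Delete() tombstone would live longer.
        return db->SingleDelete(wo, "session:42");
      }
      ```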
      
      Limitations:
      - Not compatible with cuckoo hash tables
      - Single deletions cannot be used in combination with merges and normal
        deletions on the same key (other keys are not affected by this)
      - Consecutive single deletions are currently not allowed (and older version of
        this patch supported this so it could be resurrected if needed)
      
      Test Plan: make all check
      
      Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
      
      Reviewed By: igor
      
      Subscribers: maykov, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D43179
  34. 18 Aug 2015, 1 commit
    • Simplify querying of merge results · f32a5720
      Andres Notzli committed
      Summary:
      While working on supporting mixing merge operators with
      single deletes ( https://reviews.facebook.net/D43179 ),
      I realized that returning and dealing with merge results
      can be made simpler. Submitting this as a separate diff
      because it is not directly related to single deletes.
      
      Before, callers of merge helper had to retrieve the merge
      result in one of two ways depending on whether the merge
      was successful or not (success = result of merge was single
      kTypeValue). For successful merges, the caller could query
      the resulting key/value pair and for unsuccessful merges,
      the result could be retrieved in the form of two deques of
      keys and values. However, with single deletes, a successful merge
      does not return a single key/value pair (if merge
      operands are merged with a single delete, we have to generate
      a value and keep the original single delete around to make
      sure that we are not accidentally producing a key overwrite).
      In addition, the two existing call sites of the merge
      helper were taking the same actions independently from whether
      the merge was successful or not, so this patch simplifies that.
      
      Test Plan: make clean all check
      
      Reviewers: rven, sdong, yhchiang, anthony, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D43353
  35. 18 Jul 2015, 1 commit
    • Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env · 6e9fbeb2
      sdong committed
      
      Summary: We want to keep Env a thin layer for better portability, so less platform-dependent code should live in Env. In this patch, I create wrappers for file readers and writers, and move rate limiting, write buffering, as well as most perf context instrumentation and random kill, out of Env. This will make it easier to maintain multiple Env implementations in the future.
      
      Test Plan: Run all existing unit tests.
      
      Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D42321
  36. 09 Sep 2014, 1 commit
  37. 28 Aug 2014, 1 commit