1. 28 February 2018, 1 commit
    • skip CompactRange flush based on memtable contents · 3ae00472
      Committed by Andrew Kryczka
      Summary:
      CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done unconditionally, so it happened even if the memtable contained no keys in the range specified by the user. This created unnecessarily many L0 files, leading to write stalls in some cases. This PR checks the memtable's contents and flushes only if they overlap with `CompactRange`'s range.
      
      - Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables`
      - Reuse the above logic in `CompactRange` and skip flushing if no overlap
      Closes https://github.com/facebook/rocksdb/pull/3520
      
      Differential Revision: D7018897
      
      Pulled By: ajkr
      
      fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934
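      A minimal sketch of the overlap check described above, treating the memtable as an ordered map; the names are illustrative and this is not the actual ColumnFamilyData::RangesOverlapWithMemtables code.
      ```cpp
      #include <map>
      #include <string>

      // Stand-in for a memtable: an ordered key -> value map.
      using MockMemTable = std::map<std::string, std::string>;

      // Returns true if the memtable contains any key in [begin, end].
      bool RangeOverlapsMemTable(const MockMemTable& mem,
                                 const std::string& begin,
                                 const std::string& end) {
        // Find the first key >= begin; if it exists and is <= end,
        // the user's range overlaps the memtable's contents.
        auto it = mem.lower_bound(begin);
        return it != mem.end() && it->first <= end;
      }

      // CompactRange would then flush only when an overlap is found, e.g.:
      //   if (RangeOverlapsMemTable(mem, begin, end)) { /* flush */ }
      ```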
  2. 17 February 2018, 1 commit
    • Fix deadlock in ColumnFamilyData::InstallSuperVersion() · 97307d88
      Committed by Mike Kolupaev
      Summary:
      Deadlock: a memtable flush holds DB::mutex_ and calls ThreadLocalPtr::Scrape(), which locks ThreadLocalPtr mutex; meanwhile, a thread exit handler locks ThreadLocalPtr mutex and calls SuperVersionUnrefHandle, which tries to lock DB::mutex_.
      
      This deadlock is hit all the time on our workload. It blocks our release.
      
      In general, the problem is that ThreadLocalPtr takes an arbitrary callback and calls it while holding a lock on a global mutex. The same global mutex is (at least in some cases) locked by almost all ThreadLocalPtr methods, on any instance of ThreadLocalPtr. So, there'll be a deadlock if the callback tries to do anything to any instance of ThreadLocalPtr, or waits for another thread to do so.
      
      So, probably the only safe way to use ThreadLocalPtr callbacks is to do only simple and lock-free things in them.
      
      This PR fixes the deadlock by making sure that local_sv_ never holds the last reference to a SuperVersion, and therefore SuperVersionUnrefHandle never has to do any nontrivial cleanup.
      
      I also searched for other uses of ThreadLocalPtr to see if they may have similar bugs. There's only one other use, in transaction_lock_mgr.cc, and it looks fine.
      Closes https://github.com/facebook/rocksdb/pull/3510
      
      Reviewed By: sagar0
      
      Differential Revision: D7005346
      
      Pulled By: al13n321
      
      fbshipit-source-id: 37575591b84f07a891d6659e87e784660fde815f
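      A simplified sketch of the "never the last reference" invariant described above, using a hypothetical reference-counted SuperVersion; this is not the actual RocksDB code.
      ```cpp
      #include <atomic>
      #include <cassert>

      struct SuperVersion {
        std::atomic<int> refs{0};
        void Ref() { refs.fetch_add(1, std::memory_order_relaxed); }
        // Returns true when the caller dropped the last reference and must do
        // heavyweight cleanup (which would require DB::mutex_).
        bool Unref() { return refs.fetch_sub(1, std::memory_order_acq_rel) == 1; }
      };

      // The fix: the column family keeps its own reference for as long as the
      // SuperVersion may be cached in thread-local storage (local_sv_), so the
      // thread-exit handler's Unref() can never observe the last reference and
      // therefore never needs to lock DB::mutex_.
      void CacheInThreadLocal(SuperVersion* sv) {
        sv->Ref();  // reference owned by the thread-local cache
        // ... store sv in local_sv_ ...
      }

      void SuperVersionUnrefHandle(void* ptr) {
        auto* sv = static_cast<SuperVersion*>(ptr);
        bool was_last = sv->Unref();
        assert(!was_last);  // guaranteed by the extra reference held elsewhere
        (void)was_last;
      }
      ```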
  3. 16 February 2018, 1 commit
    • Several small "fixes" · 4e7a182d
      Committed by jsteemann
      Summary:
      - removed a few unneeded variables
      - fused some variable declarations and their assignments
      - fixed right-trimming code in string_util.cc to not underflow
      - simplified an assertion
      - moved a non-nullptr assertion before the dereference of that pointer
      - passed a std::string function parameter by const reference instead of by value (avoiding a potential copy)
      Closes https://github.com/facebook/rocksdb/pull/3507
      
      Differential Revision: D7004679
      
      Pulled By: sagar0
      
      fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4
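      A small sketch of the right-trimming fix mentioned above: with an unsigned index, walking past the front of an all-whitespace string would wrap around, so the loop must stop at zero. This is illustrative, not the exact string_util.cc code.
      ```cpp
      #include <string>

      void RightTrim(std::string* s) {
        size_t end = s->size();
        // Walk backwards over trailing spaces; the `end > 0` guard keeps the
        // unsigned index from underflowing when the string is all spaces.
        while (end > 0 && (*s)[end - 1] == ' ') {
          --end;
        }
        s->resize(end);
      }
      ```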
  4. 13 February 2018, 1 commit
    • Add delay before flush in CompactRange to avoid write stalling · ee1c8026
      Committed by Andrew Kryczka
      Summary:
      - Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur.
      - Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true.
      - Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3).
      Closes https://github.com/facebook/rocksdb/pull/3381
      
      Differential Revision: D6754983
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972
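      A hedged sketch of the decoupling described above: a pure helper predicts the stall state from counts, and CompactRange can call it with "what if there were one more immutable memtable / L0 file?" inputs. Names and parameters are illustrative, not the actual GetWriteStallConditionAndCause signature.
      ```cpp
      enum class WriteStallCondition { kNormal, kDelayed, kStopped };

      WriteStallCondition PredictWriteStall(int num_immutable_memtables,
                                            int num_l0_files,
                                            int max_write_buffer_number,
                                            int l0_slowdown_trigger,
                                            int l0_stop_trigger) {
        if (num_immutable_memtables >= max_write_buffer_number ||
            num_l0_files >= l0_stop_trigger) {
          return WriteStallCondition::kStopped;
        }
        if (num_l0_files >= l0_slowdown_trigger) {
          return WriteStallCondition::kDelayed;
        }
        return WriteStallCondition::kNormal;
      }

      // CompactRange would flush only once
      // PredictWriteStall(imm + 1, l0 + 1, ...) == kNormal,
      // waiting on bg_cv_ otherwise.
      ```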
  5. 10 February 2018, 1 commit
  6. 03 February 2018, 1 commit
  7. 01 February 2018, 1 commit
  8. 24 January 2018, 1 commit
  9. 19 January 2018, 1 commit
    • Fix Flush() keep waiting after flush finish · f1cb83fc
      Committed by Yi Wu
      Summary:
      A Flush() call could wait indefinitely if min_write_buffer_number_to_merge is used. Consider the sequence:
      1. User calls Flush() with flush_options.wait = true
      2. The manual flush starts in the background
      3. A new memtable becomes immutable because of writes. The new memtable will not trigger a flush if min_write_buffer_number_to_merge is not reached.
      4. The manual flush finishes.
      
      Because the new memtable created at step 3 has not been flushed, the previous logic of WaitForFlushMemTable() keeps waiting, even though the memtables it intended to flush have already been flushed.
      
      Instead of only checking whether there are any more memtables to flush, WaitForFlushMemTable() now also checks the id of the earliest memtable. If that id is larger than the id of the latest memtable at the time the flush was initiated, then all memtables present when the flush started have been flushed.
      Closes https://github.com/facebook/rocksdb/pull/3378
      
      Differential Revision: D6746789
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 35e698f71c7f90b06337a93e6825f4ea3b619bfa
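      A sketch of the new waiting condition, using a mock column family and a condition variable guarded by the DB mutex; this is not the literal WaitForFlushMemTable() code.
      ```cpp
      #include <condition_variable>
      #include <cstdint>
      #include <mutex>

      struct MockColumnFamily {
        uint64_t earliest_memtable_id;  // oldest memtable not yet flushed
        uint64_t latest_memtable_id;    // newest (active) memtable
      };

      void WaitForFlushMemTable(MockColumnFamily& cfd, std::mutex& db_mutex,
                                std::condition_variable& bg_cv) {
        std::unique_lock<std::mutex> lock(db_mutex);
        // Remember the newest memtable at the time the flush was requested.
        const uint64_t flush_memtable_id = cfd.latest_memtable_id;
        // Done once every memtable that existed then has been flushed, i.e.
        // the earliest surviving memtable is newer than flush_memtable_id.
        // Memtables created afterwards no longer keep us waiting.
        bg_cv.wait(lock, [&] {
          return cfd.earliest_memtable_id > flush_memtable_id;
        });
      }
      ```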
  10. 11 November 2017, 1 commit
  11. 03 November 2017, 2 commits
    • pass key/value samples through zstd compression dictionary generator · 24ad4306
      Committed by Andrew Kryczka
      Summary:
      Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before.
      
      Note this is the first step of #2987, extracted into a separate PR per reviewer request.
      Closes https://github.com/facebook/rocksdb/pull/3057
      
      Differential Revision: D6116891
      
      Pulled By: ajkr
      
      fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497
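      A usage sketch for the option described above, assuming the CompressionOptions field names match the summary: with zstd_max_train_bytes set to zero the samples are used directly as the dictionary, and with a nonzero value they are passed through zstd's trainer first.
      ```cpp
      #include "rocksdb/options.h"

      rocksdb::Options MakeDictionaryCompressionOptions() {
        rocksdb::Options options;
        options.compression = rocksdb::kZSTD;
        // Final dictionary size per SST file.
        options.compression_opts.max_dict_bytes = 16 * 1024;
        // Sample bytes fed to zstd's dictionary trainer
        // (0 = use samples directly, the old behavior).
        options.compression_opts.zstd_max_train_bytes = 100 * 16 * 1024;
        return options;
      }
      ```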
    • dynamically change current memtable size · c4c1f961
      Committed by Andrew Kryczka
      Summary:
      Previously setting `write_buffer_size` with `SetOptions` would only apply to new memtables. An internal user wanted it to take effect immediately, instead of at an arbitrary future point, to prevent OOM.
      
      This PR makes the memtable's size mutable, and makes `SetOptions()` mutate it. There is one case when we preserve the old behavior, which is when memtable prefix bloom filter is enabled and the user is increasing the memtable's capacity. That's because the prefix bloom filter's size is fixed and wouldn't work as well on a larger memtable.
      Closes https://github.com/facebook/rocksdb/pull/3119
      
      Differential Revision: D6228304
      
      Pulled By: ajkr
      
      fbshipit-source-id: e44bd9d10a5f8c9d8c464bf7436070bb3eafdfc9
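      A usage sketch of the behavior described above: shrinking the write buffer at runtime through SetOptions(), which now also caps the current memtable rather than only future ones. Error handling is omitted.
      ```cpp
      #include "rocksdb/db.h"

      rocksdb::Status ShrinkWriteBuffer(rocksdb::DB* db,
                                        rocksdb::ColumnFamilyHandle* cf) {
        // 8 MB; takes effect on the active memtable as well as new ones.
        return db->SetOptions(cf, {{"write_buffer_size", "8388608"}});
      }
      ```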
  12. 24 October 2017, 1 commit
  13. 12 October 2017, 1 commit
  14. 06 October 2017, 1 commit
  15. 08 September 2017, 1 commit
  16. 16 July 2017, 1 commit
  17. 13 July 2017, 1 commit
  18. 23 June 2017, 1 commit
    • Fix Data Race Between CreateColumnFamily() and GetAggregatedIntProperty() · 6837a176
      Committed by Siying Dong
      Summary:
      CreateColumnFamily() releases the DB mutex after adding the column family to the set but before its super version is installed (the mutex is released in order to write the options file). If a user calls GetAggregatedIntProperty() in that window, the super version is null and the process crashes. Fix it by skipping column families that have no super version installed.
      
      Maybe we should also fix the problem of releasing the lock when reading the option file, but that is riskier, so I'm doing a quick and safer fix and we can investigate it later.
      Closes https://github.com/facebook/rocksdb/pull/2475
      
      Differential Revision: D5298053
      
      Pulled By: siying
      
      fbshipit-source-id: 4b3c8f91c60400b163fcc6cda8a0c77723be0ef6
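      A simplified sketch of the guard described above: when aggregating an int property across column families, skip any column family whose super version has not been installed yet. The MockCF type is illustrative.
      ```cpp
      #include <cstdint>
      #include <vector>

      struct MockCF {
        const void* super_version;  // nullptr until installation completes
        uint64_t int_property;
      };

      uint64_t GetAggregatedIntProperty(const std::vector<MockCF>& cfs) {
        uint64_t sum = 0;
        for (const auto& cf : cfs) {
          if (cf.super_version == nullptr) {
            continue;  // CF creation still in progress; skip it
          }
          sum += cf.int_property;
        }
        return sum;
      }
      ```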
  19. 06 June 2017, 2 commits
  20. 03 June 2017, 1 commit
    • Pass CF ID to MemTableRepFactory · a4d9c025
      Committed by Andrew Kryczka
      Summary:
      Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff:
      
      - adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload.
      - updates MemTable to create MemTableRep's using the new overload.
      Closes https://github.com/facebook/rocksdb/pull/2346
      
      Differential Revision: D5108061
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616
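      A self-contained mock (not the real RocksDB classes) showing the shape of the change: the factory gains an overload that also receives the column family id, and its default implementation forwards to the old overload for compatibility.
      ```cpp
      #include <cstdint>
      #include <iostream>

      struct MockMemTableRep {};

      class MockMemTableRepFactory {
       public:
        virtual ~MockMemTableRepFactory() = default;

        // Old overload: no column family information.
        virtual MockMemTableRep* CreateMemTableRep() {
          return new MockMemTableRep();
        }

        // New overload: receives the CF id; the default implementation keeps
        // backward compatibility by forwarding to the old overload.
        virtual MockMemTableRep* CreateMemTableRep(uint32_t column_family_id) {
          (void)column_family_id;
          return CreateMemTableRep();
        }
      };

      // A custom factory can now tell which CF it is building a memtable for.
      class MonitoringFactory : public MockMemTableRepFactory {
       public:
        using MockMemTableRepFactory::CreateMemTableRep;
        MockMemTableRep* CreateMemTableRep(uint32_t column_family_id) override {
          std::cout << "creating memtable for CF " << column_family_id << "\n";
          return MockMemTableRepFactory::CreateMemTableRep();
        }
      };
      ```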
  21. 27 May 2017, 1 commit
  22. 18 May 2017, 1 commit
    • Support ingest_behind for IngestExternalFile · ba685a47
      Committed by Mikhail Antonov
      Summary:
      First cut for early review; there are a few conceptual points to answer and some code structure issues.
      
      For conceptual points -
      
       - restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true || disable_auto_compaction=false); the user is responsible for properly opening and closing the DB with the required params
       - we want to ingest into a reserved bottommost level. Should we fail fast if the bottom level isn't empty, or should we attempt to ingest if the file fits there key-range-wise?
       - Modifying AssignLevelForIngestedFile seems like the place where we'd handle that.
      
      On code structure - going to refactor the GenerateAndAddExternalFile call in the test class to allow passing an instance of IngestionOptions; that's just going to incur lots of changes at call sites.
      Closes https://github.com/facebook/rocksdb/pull/2144
      
      Differential Revision: D4873732
      
      Pulled By: lightmark
      
      fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a
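      A usage sketch for the feature discussed above, assuming the ingest-behind flag lives on IngestExternalFileOptions and that the DB was opened with ingest-behind support enabled; error handling and the restrictions listed above are not shown.
      ```cpp
      #include <string>
      #include <vector>

      #include "rocksdb/db.h"

      rocksdb::Status IngestBehind(rocksdb::DB* db,
                                   const std::vector<std::string>& sst_files) {
        rocksdb::IngestExternalFileOptions ifo;
        ifo.ingest_behind = true;  // place the files in the reserved bottom level
        return db->IngestExternalFile(sst_files, ifo);
      }
      ```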
  23. 05 May 2017, 1 commit
  24. 02 May 2017, 1 commit
  25. 28 April 2017, 1 commit
  26. 07 April 2017, 1 commit
    • Refactor compaction picker code · ff972870
      Committed by Siying Dong
      Summary:
      1. Move the universal compaction picker to separate files, compaction_picker_universal.cc and compaction_picker_universal.h.
      2. Rename some functions to make the code easier to understand.
      3. Move the leveled compaction picking code to a dedicated class, so that we don't need to pass common variables around when calling functions. It also allowed us to break down LevelCompactionPicker::PickCompaction() into smaller functions.
      Closes https://github.com/facebook/rocksdb/pull/2100
      
      Differential Revision: D4845948
      
      Pulled By: siying
      
      fbshipit-source-id: efa0ab4
  27. 06 April 2017, 1 commit
  28. 16 March 2017, 1 commit
    • Add macros to include file name and line number during Logging · e1916368
      Committed by Islam AbdelRahman
      Summary:
      Current logging:
      ```
      2017/03/14-14:20:30.393432 7fedde9f5700 (Original Log Time 2017/03/14-14:20:30.393414) [default] Level summary: base level 1 max bytes base 268435456 files[1 0 0 0 0 0 0] max score 0.25
      2017/03/14-14:20:30.393438 7fedde9f5700 [JOB 2] Try to delete WAL files size 61417909, prev total WAL file size 73820858, number of live WAL files 2.
      2017/03/14-14:20:30.393464 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//MANIFEST-000001 type=3 #1 -- OK
      2017/03/14-14:20:30.393472 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//000003.log type=0 #3 -- OK
      2017/03/14-14:20:31.427103 7fedd49f1700 [default] New memtable created with log file: #9. Immutable memtables: 0.
      2017/03/14-14:20:31.427179 7fedde9f5700 [JOB 3] Syncing log #6
      2017/03/14-14:20:31.427190 7fedde9f5700 (Original Log Time 2017/03/14-14:20:31.427170) Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1
      2017/03/14-14:20:31.
      ```
      Closes https://github.com/facebook/rocksdb/pull/1990
      
      Differential Revision: D4708695
      
      Pulled By: IslamAbdelRahman
      
      fbshipit-source-id: cb8968f
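      A hypothetical illustration of prefixing log lines with the call site via a macro; the actual RocksDB macro names and Logger plumbing differ.
      ```cpp
      #include <cstdio>

      // Expands __FILE__ and __LINE__ at the call site, so every message
      // carries the file name and line number that produced it.
      #define LOG_WITH_LOCATION(fmt, ...) \
        std::fprintf(stderr, "[%s:%d] " fmt "\n", __FILE__, __LINE__, __VA_ARGS__)

      int main() {
        LOG_WITH_LOCATION("Delete %s type=%d -- %s", "MANIFEST-000001", 3, "OK");
        return 0;
      }
      ```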
  29. 24 February 2017, 1 commit
  30. 20 January 2017, 1 commit
    • Fix for 2PC causing WAL to grow too large · 5cf176ca
      Committed by Reid Horuff
      Summary:
      Consider the following single column family scenario:
      prepare in log A
      commit in log B
      *WAL is too large, flush all CFs to release log A*
      *CFA is on log B, so we do not see that CFA depends on log A, so no flush is requested*
      
      To fix this we must also consider the log containing the prepare section when determining which log a CF depends on.
      Closes https://github.com/facebook/rocksdb/pull/1768
      
      Differential Revision: D4403265
      
      Pulled By: reidHoruff
      
      fbshipit-source-id: ce800ff
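      A sketch of the idea behind the fix: the oldest WAL a column family still needs is the minimum of the log holding its unflushed data and the log holding its earliest outstanding prepare section. Types and field names are illustrative.
      ```cpp
      #include <algorithm>
      #include <cstdint>

      struct MockCF {
        uint64_t oldest_log_with_data;     // log with the CF's unflushed writes
        uint64_t oldest_log_with_prepare;  // log with its earliest live prepare
      };

      uint64_t MinLogNeeded(const MockCF& cf) {
        // Before the fix only oldest_log_with_data was considered when deciding
        // which CFs to flush, so in the prepare-in-A / commit-in-B scenario the
        // CF did not appear to depend on log A, was never flushed, and log A
        // (and the WAL) kept growing.
        return std::min(cf.oldest_log_with_data, cf.oldest_log_with_prepare);
      }
      ```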
  31. 30 November 2016, 1 commit
  32. 29 November 2016, 1 commit
  33. 24 November 2016, 1 commit
    • Improve Write Stalling System · cd7c4143
      Committed by Siying Dong
      Summary:
      The current write stalling system lacks positive feedback when the restricted rate is already too low, so users sometimes get stuck at a very low slowdown value. With this diff, we add positive feedback (increasing the slowdown value) when we recover from the slowdown state back to normal. To keep that positive feedback from pushing the slowdown value too high, we also issue negative feedback every time we get close to the stop condition. Experiments show it is easier to reach a relative balance than before.
      
      Also increase the level0_stop_writes_trigger default from 24 to 32. Since the level0_slowdown_writes_trigger default is 20, a stop trigger of 24 gives only four files of buffer in which to slow writes down. To avoid stopping within four files after 20 files have already accumulated, the slowdown value must be very low, which is almost the same as stopping. It also doesn't give the slowdown value enough time to converge. Increasing it to 32 smooths out the system.
      Closes https://github.com/facebook/rocksdb/pull/1562
      
      Differential Revision: D4218519
      
      Pulled By: siying
      
      fbshipit-source-id: 95e4088
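      A toy sketch of the feedback loop described above, acting on a single delayed-write rate; the adjustment factors are illustrative, not the tuned values in the real WriteController.
      ```cpp
      #include <algorithm>
      #include <cstdint>

      uint64_t AdjustDelayedWriteRate(uint64_t rate_bytes_per_sec,
                                      bool recovered_to_normal,
                                      bool near_stop_condition) {
        if (recovered_to_normal) {
          // Positive feedback: we left the slowdown state, so speed back up.
          rate_bytes_per_sec += rate_bytes_per_sec / 2;
        } else if (near_stop_condition) {
          // Negative feedback: approaching the stop trigger, so slow down,
          // while keeping the rate from collapsing to zero.
          rate_bytes_per_sec = std::max<uint64_t>(rate_bytes_per_sec / 2, 1024);
        }
        return rate_bytes_per_sec;
      }
      ```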
  34. 23 November 2016, 1 commit
    • Unified InlineSkipList::Insert algorithm with hinting · dfb6fe67
      Committed by Yi Wu
      Summary:
      This PR is based on nbronson's diff with small
      modifications to wire it up with the existing interface. Compared to the
      previous version, this approach works better for inserting keys in
      decreasing order or updating the same key, and imposes fewer restrictions
      on the prefix extractor.
      
      ---- Summary from original diff ----
      
      This diff introduces a single InlineSkipList::Insert that unifies
      the existing sequential insert optimization (prev_), concurrent insertion,
      and insertion using externally-managed insertion point hints.
      
      There's a deep symmetry between insertion hints (cursors) and the
      concurrent algorithm.  In both cases we have partial information from
      the recent past that is likely but not certain to be accurate.  This diff
      introduces the struct InlineSkipList::Splice, which encodes predecessor
      and successor information in the same form that was previously only used
      within a single call to InsertConcurrently.  Splice holds information
      about an insertion point that can be used to levera
      Closes https://github.com/facebook/rocksdb/pull/1561
      
      Differential Revision: D4217283
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 33ee437
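      A conceptual sketch of what a splice holds, per the description above: cached predecessor/successor pointers at every level that bound the insertion point and can be revalidated and reused by the next insert. This mirrors the idea, not the real InlineSkipList internals.
      ```cpp
      #include <vector>

      struct MockNode;  // a skip-list node

      struct MockSplice {
        // Invariant while valid: prev_[i] < key <= next_[i] at level i.
        std::vector<MockNode*> prev_;
        std::vector<MockNode*> next_;
        int height_ = 0;  // number of levels for which the hint is populated
        // Insert(key, splice): if the splice still brackets `key`, start the
        // search from it instead of the head; otherwise recompute it. The same
        // structure serves sequential hints and concurrent insertion.
      };
      ```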
  35. 16 November 2016, 2 commits
  36. 14 November 2016, 1 commit
    • Optimize sequential insert into memtable - Part 1: Interface · 1ea79a78
      Committed by Yi Wu
      Summary:
      Currently our skip-list has an optimization to speed up sequential
      inserts from a single stream, by remembering the last insert position.
      We extend the idea to support sequential inserts from multiple streams,
      and even tolerate small reordering within each stream.
      
      This PR is the interface part adding the following:
      - Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key.
      - Add `InsertWithHint()` interface to memtable, to allow underlying
        implementation to return a hint of insert position, which can be later
        pass back to optimize inserts.
      - Memtable will maintain a map from prefix to hints and pass the hint
        via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null.
      Closes https://github.com/facebook/rocksdb/pull/1419
      
      Differential Revision: D4079367
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 3555326
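      A self-contained sketch of the hint plumbing described above: the memtable keeps a map from key prefix to an opaque hint returned by the rep and passes it back on the next insert with the same prefix. All names here are illustrative mocks.
      ```cpp
      #include <map>
      #include <string>

      struct MockRep {
        // A real rep would use *hint as the search start position and update
        // it; here we only record the last key to keep the sketch minimal.
        void InsertWithHint(const std::string& key, void** hint) {
          last_inserted_ = key;
          *hint = &last_inserted_;
        }
        std::string last_inserted_;
      };

      class MockMemTable {
       public:
        void Add(const std::string& key) {
          // The first 4 bytes stand in for the user-supplied prefix extractor.
          const std::string prefix = key.substr(0, 4);
          rep_.InsertWithHint(key, &insert_hints_[prefix]);
        }

       private:
        MockRep rep_;
        std::map<std::string, void*> insert_hints_;
      };
      ```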
  37. 13 November 2016, 1 commit