1. 14 Apr, 2017 1 commit
    • Simplify write thread logic · e9e6e532
      Yi Wu committed
      Summary:
      The concept of early exit in the write thread implementation is confusing. If early exit is allowed, the batch group leader is not responsible for exiting the batch group; the last writer to finish is. If we need to mark the log synced, or we encounter a memtable insert error, early exit is disallowed.

      This patch removes the concept:
      * In all cases, the last writer to finish (not necessarily the leader) is responsible for exiting the batch group.
      * In the case of parallel memtable writes, the leader also marks the log synced after its memtable insert and before signaling finish (calling `CompleteParallelWorker()`). This lets marking the log synced (which requires locking the mutex) run in parallel with memtable inserts in other writers.
      * The last writer to finish should handle memtable insert errors (update bg_error_) before exiting the batch group.
      Closes https://github.com/facebook/rocksdb/pull/2134
      
      Differential Revision: D4869667
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: aec170847c85b90f4179d6a4608a4fe1361544e3
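
The "last writer to finish exits the batch group" rule from this commit can be illustrated with an atomic countdown. This is a minimal sketch, not RocksDB's code: `BatchGroup` and `CompleteWriter` are hypothetical names, and the real `WriteThread` tracks considerably more state.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a batch group; not RocksDB's WriteThread types.
struct BatchGroup {
  explicit BatchGroup(std::size_t size) : running(size) {
    assert(size > 0);  // a group always contains at least its leader
  }
  std::atomic<std::size_t> running;  // writers that have not finished yet
};

// Each writer calls this once after its memtable insert.
// Returns true for exactly one caller: the last writer to finish,
// which is then responsible for exiting the batch group.
bool CompleteWriter(BatchGroup& group) {
  // fetch_sub returns the value before the decrement, so 1 means "I was last".
  return group.running.fetch_sub(1, std::memory_order_acq_rel) == 1;
}
```

Because the decrement is atomic, the same check holds whether the leader or a follower happens to finish last, which is why no separate early-exit flag is needed in this sketch.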
  2. 06 Apr, 2017 2 commits
  3. 05 Apr, 2017 1 commit
    • Refactor WriteImpl (pipeline write part 1) · 9e445318
      Yi Wu committed
      Summary:
      Refactor WriteImpl() so that when I plug in the pipeline write code (an
      alternative approach to WriteThread), some of the logic can be
      reused. I split out the following methods from WriteImpl():

      * PreprocessWrite()
      * HandleWALFull() (previously MaybeFlushColumnFamilies())
      * HandleWriteBufferFull()
      * WriteToWAL()

      Also add a constructor to WriteThread::Writer, and move WriteContext into db_impl.h.
      No real logic change in this patch.
      Closes https://github.com/facebook/rocksdb/pull/2042
      
      Differential Revision: D4781014
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: d45ca18
  4. 30 Mar, 2017 1 commit
  5. 23 Mar, 2017 1 commit
  6. 14 Mar, 2017 1 commit
    • Pinnableslice (2nd attempt) · 11526252
      Maysam Yabandeh committed
      Summary:
      PinnableSlice
      
          Summary:
          Currently, point lookup values are copied to a string provided by the
          user. This incurs an extra memcpy cost. This patch allows doing point lookups
          via a PinnableSlice, which pins the source memory location (instead of
          copying its content) and releases it after the content is consumed
          by the user. The old API of Get(string) is translated to the new API
          underneath.
      
          Here is the summary for improvements:
      
          value 100 byte: 1.8% regular, 1.2% merge values
          value 1k byte: 11.5% regular, 7.5% merge values
          value 10k byte: 26% regular, 29.9% merge values
          The improvement for merge could be more if we extend this approach to
          pin the merge output and delay the full merge operation until the user
          actually needs it. We have put that for future work.
      
          PS:
          Sometimes we observe a small decrease in performance when switching from
          t5452014 to this patch but with the old Get(string) API. The d
      Closes https://github.com/facebook/rocksdb/pull/1756
      
      Differential Revision: D4391738
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 6f3edd3
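
To show the difference between copying a value and pinning its source, here is a minimal self-contained sketch. It is not RocksDB's `PinnableSlice`: `PinnedValue` and `CopyValue` are hypothetical, and shared ownership through `std::shared_ptr` is an assumption standing in for whatever mechanism keeps the source block alive.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical illustration only; this is not RocksDB's PinnableSlice.
class PinnedValue {
 public:
  // Pin: share ownership of the source buffer instead of copying its bytes.
  void Pin(std::shared_ptr<const std::string> source) {
    source_ = std::move(source);
  }
  const char* data() const { return source_ ? source_->data() : nullptr; }
  std::size_t size() const { return source_ ? source_->size() : 0; }
  // Release the pin once the caller has consumed the value.
  void Reset() { source_.reset(); }

 private:
  std::shared_ptr<const std::string> source_;
};

// Copy-based lookup for comparison: duplicates the bytes into a
// caller-owned string, which costs one extra memcpy per lookup.
void CopyValue(const std::shared_ptr<const std::string>& source,
               std::string* out) {
  assert(source != nullptr && out != nullptr);
  out->assign(*source);
}
```

The pinned view reads the value in place, so the saving grows with value size, which is consistent with the larger improvements reported above for 1k and 10k values.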
  7. 01 Mar, 2017 1 commit
  8. 14 Feb, 2017 1 commit
    • Make DBImpl::has_unpersisted_data_ atomic · c2247dc1
      Yi Wu committed
      Summary:
      It seems to me that `has_unpersisted_data_` is read from the read thread and
      written from the write thread concurrently without synchronization. Make it
      an atomic.

      I updated the logic not because I saw any problem with it, but because it
      felt confusing.
      Closes https://github.com/facebook/rocksdb/pull/1869
      
      Differential Revision: D4555837
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: eff2ab8
  9. 07 Feb, 2017 1 commit
  10. 25 Jan, 2017 1 commit
  11. 21 Jan, 2017 1 commit
  12. 20 Jan, 2017 1 commit
    • Fix for 2PC causing WAL to grow too large · 5cf176ca
      Reid Horuff committed
      Summary:
      Consider the following single column family scenario:
      prepare in log A
      commit in log B
      *WAL is too large, flush all CFs to release log A*
      *CFA is on log B, so we do not see that CFA depends on log A, and no flush is requested*

      To fix this, we must also consider the log containing the prepare section when determining which log a CF depends on.
      Closes https://github.com/facebook/rocksdb/pull/1768
      
      Differential Revision: D4403265
      
      Pulled By: reidHoruff
      
      fbshipit-source-id: ce800ff
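
The fix can be pictured as taking the minimum of two log numbers when deciding which log a column family depends on. This is a sketch under assumptions: both helper functions are hypothetical, and using 0 to mean "no outstanding prepare section" is a convention chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not a RocksDB function.
// cf_log_number: log holding the column family's unflushed memtable data.
// min_prep_log:  oldest log holding a prepare section whose commit is still
//                unflushed; 0 means there is none (assumed convention).
std::uint64_t OldestLogNeeded(std::uint64_t cf_log_number,
                              std::uint64_t min_prep_log) {
  assert(cf_log_number > 0);
  if (min_prep_log == 0) return cf_log_number;
  return std::min(cf_log_number, min_prep_log);
}

// A flush is needed to release `log` when the column family still depends
// on that log or on an older one.
bool NeedsFlushToRelease(std::uint64_t log, std::uint64_t cf_log_number,
                         std::uint64_t min_prep_log) {
  return OldestLogNeeded(cf_log_number, min_prep_log) <= log;
}
```

With log A numbered 1 and log B numbered 2, ignoring the prepare section makes the column family look independent of log A, while including it requests the flush.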
  13. 09 Jan, 2017 2 commits
    • Revert "PinnableSlice" · d0ba8ec8
      Maysam Yabandeh committed
      Summary:
      This reverts commit 54d94e9c.
      
      The pull request was landed by mistake.
      Closes https://github.com/facebook/rocksdb/pull/1755
      
      Differential Revision: D4391678
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 36d5149
    • PinnableSlice · 54d94e9c
      Maysam Yabandeh committed
      Summary:
      Currently, point lookup values are copied to a string provided by the user.
      This incurs an extra memcpy cost. This patch allows doing point lookups
      via a PinnableSlice, which pins the source memory location (instead of
      copying its content) and releases it after the content is consumed
      by the user. The old API of Get(string) is translated to the new API
      underneath.
      
       Here is the summary for improvements:
       1. value 100 byte: 1.8% regular, 1.2% merge values
       2. value 1k byte: 11.5% regular, 7.5% merge values
       3. value 10k byte: 26% regular, 29.9% merge values
      
       The improvement for merge could be more if we extend this approach to
       pin the merge output and delay the full merge operation until the user
       actually needs it. We have put that for future work.
      
      PS:
      Sometimes we observe a small decrease in performance when switching from
      t5452014 to this patch but with the old Get(string) API. The difference
      is small and could be noise. More importantly it is safely
      cancelled
      Closes https://github.com/facebook/rocksdb/pull/1732
      
      Differential Revision: D4374613
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a077f1a
  14. 29 Dec, 2016 1 commit
  15. 07 Dec, 2016 1 commit
  16. 06 Dec, 2016 1 commit
  17. 22 Nov, 2016 1 commit
  18. 15 Nov, 2016 1 commit
  19. 12 Nov, 2016 1 commit
  20. 05 Nov, 2016 1 commit
    • DeleteRange user iterator support · 9e7cf346
      Andrew Kryczka committed
      Summary:
      Note: reviewed in https://reviews.facebook.net/D65115

      - DBIter maintains a range tombstone accumulator. We don't clean up obsolete tombstones yet, so if the user seeks back and forth, the same tombstones are added to the accumulator multiple times.
      - DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones; L1+ range tombstones are added on demand during NewSecondaryIterator() (see D62205)
      - DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones
      Closes https://github.com/facebook/rocksdb/pull/1464
      
      Differential Revision: D4131753
      
      Pulled By: ajkr
      
      fbshipit-source-id: be86559
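
The coverage check that the iterator performs while advancing can be sketched as follows. The types here are hypothetical and much simpler than RocksDB's range-deletion classes; bytewise key comparison and a linear scan are assumptions made for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical types for illustration; not RocksDB's range-deletion classes.
struct RangeTombstone {
  std::string start;  // inclusive
  std::string end;    // exclusive
  std::uint64_t seq;  // sequence number of the DeleteRange call
};

struct TombstoneAccumulator {
  std::vector<RangeTombstone> tombstones;

  // Duplicates are allowed: re-adding a tombstone after seeking back and
  // forth wastes space but does not change the answer of ShouldDelete().
  void Add(const RangeTombstone& t) { tombstones.push_back(t); }

  // A key is covered when a newer tombstone's [start, end) contains it.
  bool ShouldDelete(const std::string& user_key, std::uint64_t key_seq) const {
    for (const RangeTombstone& t : tombstones) {
      if (t.start <= user_key && user_key < t.end && key_seq < t.seq) {
        return true;
      }
    }
    return false;
  }
};
```

A key written after the range deletion (higher sequence number) stays visible, as does any key outside the half-open range.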
  21. 22 Oct, 2016 1 commit
  22. 21 Oct, 2016 1 commit
    • Support IngestExternalFile (remove AddFile restrictions) · 869ae5d7
      Islam AbdelRahman committed
      Summary:
      Changes in the diff
      
      API changes:
      - Introduce IngestExternalFile to replace AddFile (I think this makes the API clearer)
      - Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
      - Deprecate AddFile() API
      
      Logic changes:
      - If our file overlaps with the memtable, we will flush the memtable
      - We will find the first level in the LSM tree whose keys overlap our file's key range
      - We will find the lowest level in the LSM tree above the level we found in step 2 that our file can fit in, and ingest our file into it
      - We will assign a global sequence number to our new file
      - Remove AddFile restrictions by using global sequence numbers
      
      Other changes:
      - Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob
      
      Test Plan:
      unit tests (still need to add more)
      addfile_stress (https://reviews.facebook.net/D65037)
      
      Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: jkedgar, hcz, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D65061
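
The level-selection steps under "Logic changes" can be sketched as a scan from L0 downward. This is a simplified sketch, not `ExternalSstFileIngestionJob`: the types and `PickIngestLevel` are hypothetical, "fits" is reduced to "no key-range overlap", and falling back to L0 when L0 itself overlaps is an assumption based on the global sequence number making that safe.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical key range of an SST file; both bounds inclusive.
struct KeyRange {
  std::string smallest;
  std::string largest;
};

bool Overlaps(const KeyRange& a, const KeyRange& b) {
  return a.smallest <= b.largest && b.smallest <= a.largest;
}

bool LevelOverlaps(const std::vector<KeyRange>& level, const KeyRange& file) {
  for (const KeyRange& f : level) {
    if (Overlaps(f, file)) return true;
  }
  return false;
}

// levels[0] is L0. Returns the deepest level that sits above the first
// overlapping level, or the bottom level when nothing overlaps.
std::size_t PickIngestLevel(const std::vector<std::vector<KeyRange>>& levels,
                            const KeyRange& file) {
  assert(!levels.empty());
  std::size_t target = 0;  // assumed fallback: ingest into L0
  for (std::size_t lvl = 0; lvl < levels.size(); ++lvl) {
    if (LevelOverlaps(levels[lvl], file)) break;  // must stay above this level
    target = lvl;
  }
  return target;
}
```

Placing the file as deep as possible keeps newer data above older data for the same keys while avoiding unnecessary growth of the upper levels.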
  23. 19 Oct, 2016 1 commit
  24. 15 Oct, 2016 1 commit
  25. 14 Oct, 2016 1 commit
    • Fix compaction conflict with running compaction · 5691a1d8
      Islam AbdelRahman committed
      Summary:
      Issue scenario:
      (1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2
      (2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decides to move this file to L1 and then move it again to L2 (this file doesn't overlap with any other files)
      (3) Compaction (1) finishes and installs the file it generated in L2, but this file overlaps with the file we generated in (2), so we break the LSM consistency

      It looks like this issue can be triggered by using non-exclusive manual compaction or AddFile()
      
      Test Plan: unit tests
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim
      
      Differential Revision: https://reviews.facebook.net/D64947
  26. 29 Sep, 2016 1 commit
    • Fix conflict between AddFile() and CompactRange() · 87dfc1d2
      Islam AbdelRahman committed
      Summary:
      Fix the conflict bug between AddFile() and CompactRange() by:
      - Making sure that no AddFile calls are running when asking CompactionPicker to pick a compaction for manual compaction
      - If AddFile() runs after we pick the compaction for the manual compaction, it will be aware of it, since we add the manual compaction to running_compactions_ after picking it

      This solves these 2 scenarios:
      - If AddFile() is running, we wait for it to finish before we pick a compaction for the manual compaction
      - If we already picked a manual compaction and then AddFile() started, we ensure that it never ingests a file into a level that will overlap with the manual compaction
      
      Test Plan: unit tests
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, yoshinorim, jkedgar, dhruba
      
      Differential Revision: https://reviews.facebook.net/D64449
  27. 27 Sep, 2016 1 commit
    • Fix AddFile() conflict with compaction output [WaitForAddFile()] · 5c64fb67
      Islam AbdelRahman committed
      Summary:
      Since AddFile unlocks and relocks the mutex inside LogAndApply(), we need to ensure that other compactions cannot run during this period, since such compactions are not aware of the file we are ingesting and could create a compaction that overlaps with this file

      This diff adds:
      - A WaitForAddFile() call that ensures that no AddFile() calls are being processed right now
      - Calls to `WaitForAddFile()` in 3 locations
      -- When doing manual Compaction
      -- When starting automatic Compaction
      -- When doing CompactFiles()
      
      Test Plan: unit test
      
      Reviewers: lightmark, yiwu, andrewkr, sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, yoshinorim, jkedgar, dhruba
      
      Differential Revision: https://reviews.facebook.net/D64383
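
The waiting behavior described in this commit resembles a counter guarded by a mutex and a condition variable. The class below is a hypothetical sketch of that pattern, not RocksDB's `DBImpl`; in particular, using a dedicated mutex rather than the DB mutex is an assumption made to keep the example self-contained.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical gate illustrating the pattern; not RocksDB's DBImpl.
class AddFileGate {
 public:
  // Called when an ingestion starts.
  void BeginAddFile() {
    std::lock_guard<std::mutex> lock(mu_);
    ++running_;
  }

  // Called when an ingestion finishes; wakes any waiting compaction.
  void EndAddFile() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      assert(running_ > 0);
      --running_;
    }
    cv_.notify_all();
  }

  // Called before picking a compaction: blocks until no ingestion is in flight.
  void WaitForAddFile() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return running_ == 0; });
  }

  int running() {
    std::lock_guard<std::mutex> lock(mu_);
    return running_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int running_ = 0;  // number of ingestions currently in flight
};
```

A compaction path would call `WaitForAddFile()` at each of the three locations listed above before choosing its input files.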
  28. 24 Sep, 2016 2 commits
    • Split DBOptions into ImmutableDBOptions and MutableDBOptions · 9ed928e7
      Yi Wu committed
      Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is merely a placeholder for now. I'll start moving options to MutableDBOptions in following diffs.
      
      Test Plan:
        make all check
      
      Reviewers: yhchiang, IslamAbdelRahman, sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D64065
    • Recover same sequence id from WAL (#1350) · 4bc8c88e
      yiwu-arbug committed
      Summary:
      Revert the behavior where we don't read the sequence id from the WAL, but increase it as we replay the log. We still keep that behavior for 2PC for now but will fix it later.
      
      This change fixes github issue 1339, where some writes come with WAL disabled and we may recover records with wrong sequence id.
      
      Test Plan: Added unit test.
      
      Subscribers: andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D64275
  29. 20 Sep, 2016 2 commits
    • DBImpl::GetWalPreallocateBlockSize() should return size_t · d78a4401
      sdong committed
      Summary: WritableFile::SetPreallocationBlockSize() takes a size_t parameter, and the options used in DBImpl::GetWalPreallocateBlockSize() are all size_t. DBImpl::GetWalPreallocateBlockSize() should return size_t to avoid a build break when size_t is not uint64_t.
      
      Test Plan: Run existing tests.
      
      Reviewers: andrewkr, IslamAbdelRahman, yiwu
      
      Reviewed By: yiwu
      
      Subscribers: leveldb, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D64137
    • Consider more factors when determining preallocation size of WAL files · b666f854
      sdong committed
      Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, is an overestimate if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller.
      
      Test Plan: Add a unit test.
      
      Reviewers: andrewkr, yiwu
      
      Reviewed By: yiwu
      
      Subscribers: leveldb, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D63957
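
One way to read this change is as capping the 1.1x estimate by the other two options when they are set. The function below is a hypothetical sketch of that idea, not the actual `DBImpl::GetWalPreallocateBlockSize()`; treating 0 as "unset" and the exact order of the caps are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical free function; parameter names mirror the options named in
// the commit message. A value of 0 is assumed to mean "not set".
std::size_t WalPreallocateBlockSize(std::size_t write_buffer_size,
                                    std::size_t max_total_wal_size,
                                    std::size_t db_write_buffer_size) {
  // Previous estimate: 1.1 * write_buffer_size, in integer arithmetic.
  std::size_t bsize = write_buffer_size + write_buffer_size / 10;
  // A WAL cannot usefully grow past either limit, so cap the estimate.
  if (max_total_wal_size > 0) bsize = std::min(bsize, max_total_wal_size);
  if (db_write_buffer_size > 0) bsize = std::min(bsize, db_write_buffer_size);
  return bsize;
}
```

Returning `std::size_t` matches the type expected by WritableFile::SetPreallocationBlockSize(), as discussed in the commit listed just above.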
  30. 15 Sep, 2016 1 commit
  31. 09 Sep, 2016 1 commit
  32. 30 Aug, 2016 1 commit
    • support Prev() in prefix seek mode · 2482d5fb
      Aaron Gao committed
      Summary: As the title says, make sure Prev() works as expected with Next() when the current iter->key() is in the range of the same prefix in prefix seek mode
      
      Test Plan: make all check -j64 (add prefix_test with PrefixSeekModePrev test case)
      
      Reviewers: andrewkr, sdong, IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: yoshinorim, andrewkr, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D61419
  33. 20 Aug, 2016 1 commit
    • TableBuilder / TableReader support for range deletion · 78837f5d
      Wanning Jiang committed
      Summary:
      1. Range Deletion Tombstone structure
      2. Modify Add() in table_builder to make it usable for adding range del tombstones
      3. Expose NewTombstoneIterator() API in table_reader

      Test Plan: table_test.cc (BlockBasedTableBuilder::Add() now only accepts InternalKey, so I make table_test only pass InternalKey to BlockBasedTableBuilder. Also test writing/reading range deletion tombstones in table_test)
      
      Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr
      
      Reviewed By: andrewkr
      
      Subscribers: andrewkr, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D61473
  34. 11 Aug, 2016 1 commit
    • read_options.background_purge_on_iterator_cleanup to cover forward iterator... · 56dd0341
      sdong committed
      read_options.background_purge_on_iterator_cleanup to cover forward iterator and log file closing too.
      
      Summary: With read_options.background_purge_on_iterator_cleanup=true, file deletion and closing can still happen in the forward iterator or during WAL file closing. Cover those cases too.
      
      Test Plan: I am adding unit tests.
      
      Reviewers: andrewkr, IslamAbdelRahman, yiwu
      
      Reviewed By: yiwu
      
      Subscribers: leveldb, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D61503
  35. 10 Aug, 2016 1 commit
  36. 03 Aug, 2016 1 commit
    • Ignore write stall triggers when auto-compaction is disabled · ee027fc1
      Yi Wu committed
      Summary:
      My understanding is that the purpose of write stall triggers is to wait for auto-compaction to catch up. Without auto-compaction, we don't need to stall writes.

      Also with this diff, flush/compaction conditions are recalculated on dynamic option change. Previously the conditions were recalculated only when write stall options changed.
      
      Test Plan: See the new test. Removed two tests that are no longer valid.
      
      Reviewers: IslamAbdelRahman, sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D61437
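
The rule in this commit can be sketched as an early return ahead of the usual trigger checks. Everything below is hypothetical: the commit text speaks of write stall triggers in general, and using L0 file counts with a slowdown and a stop threshold is an assumption chosen to make the example concrete.

```cpp
#include <cassert>

// Hypothetical inputs; field names are illustrative, not RocksDB's structs.
struct StallInputs {
  int l0_files;                   // current number of L0 files
  int l0_slowdown_trigger;        // delay writes at or above this count
  int l0_stop_trigger;            // stop writes at or above this count
  bool disable_auto_compactions;  // true when auto-compaction is off
};

enum class WriteStall { kNone, kDelayed, kStopped };

WriteStall ComputeWriteStall(const StallInputs& in) {
  assert(in.l0_slowdown_trigger <= in.l0_stop_trigger);
  // Stalling exists to let auto-compaction catch up. With auto-compaction
  // disabled nothing will catch up, so the triggers are ignored.
  if (in.disable_auto_compactions) return WriteStall::kNone;
  if (in.l0_files >= in.l0_stop_trigger) return WriteStall::kStopped;
  if (in.l0_files >= in.l0_slowdown_trigger) return WriteStall::kDelayed;
  return WriteStall::kNone;
}
```

Because the result depends on `disable_auto_compactions`, it has to be recomputed whenever that option changes dynamically, which matches the second paragraph of the summary.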