1. 14 6月, 2016 9 次提交
    • L
      drbd: al_write_transaction: skip re-scanning of bitmap page pointer array · 27ea1d87
      Lars Ellenberg 提交于
      For larger devices, the array of bitmap page pointers can grow very
      large (8000 pointers per TB of storage).
      
      For each activity log transaction, we need to flush the associated
      bitmap pages to stable storage. Currently, we just "mark" the respective
      pages while setting up the transaction, then tell the bitmap code to
      write out all marked pages, but skip unchanged pages.
      
      But one such transaction can affect only a small number of bitmap pages,
      there is no need to scan the full array of several (ten-)thousand
      page pointers to find the few marked ones.
      
      Instead, remember the index numbers of the few affected pages,
      and later only re-check those to skip duplicates and unchanged ones.
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      27ea1d87
    • F
      drbd: code cleanups without semantic changes · 7e5fec31
      Fabian Frederick 提交于
      This contains various cosmetic fixes ranging from simple typos to
      const-ifying, and using booleans properly.
      
      Original commit messages from Fabian's patch set:
      drbd: debugfs: constify drbd_version_fops
      drbd: use seq_put instead of seq_print where possible
      drbd: include linux/uaccess.h instead of asm/uaccess.h
      drbd: use const char * const for drbd strings
      drbd: kerneldoc warning fix in w_e_end_data_req()
      drbd: use unsigned for one bit fields
      drbd: use bool for peer is_ states
      drbd: fix typo
      drbd: use | for bitmask combination
      drbd: use true/false for bool
      drbd: fix drbd_bm_init() comments
      drbd: introduce peer state union
      drbd: fix maybe_pull_ahead() locking comments
      drbd: use bool for growing
      drbd: remove redundant declarations
      drbd: replace if/BUG by BUG_ON
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NRoland Kammerer <roland.kammerer@linbit.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7e5fec31
    • L
      drbd: introduce WRITE_SAME support · 9104d31a
      Lars Ellenberg 提交于
      We will support WRITE_SAME, if
       * all peers support WRITE_SAME (both in kernel and DRBD version),
       * all peer devices support WRITE_SAME
       * logical_block_size is identical on all peers.
      
      We may at some point introduce a fallback on the receiving side
      for devices/kernels that do not support WRITE_SAME,
      by open-coding a submit loop. But not yet.
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      9104d31a
    • L
      drbd: introduce unfence-peer handler · 26a96110
      Lars Ellenberg 提交于
      When resync is finished, we already call the "after-resync-target"
      handler (on the former sync target, obviously), once per volume.
      
      Paired with the before-resync-target handler, you can create snapshots,
      before the resync causes the volumes to become inconsistent,
      and discard those snapshots again, once they are no longer needed.
      
      It was also overloaded to be paired with the "fence-peer" handler,
      to "unfence" once the volumes are up-to-date and known good.
      
      This has some disadvantages, though: we call "fence-peer" for the whole
      connection (once for the group of volumes), but would call unfence as
      side-effect of after-resync-target once for each volume.
      
      Also, we fence on a (current, or about to become) Primary,
      which will later become the sync-source.
      
      Calling unfence only as a side effect of the after-resync-target
      handler opens a race window, between a new fence on the Primary
      (SyncTarget) and the unfence on the SyncTarget, which is difficult to
      close without some kind of "cluster wide lock" in those handlers.
      
      We would not need those handlers if we could still communicate.
      Which makes trying to aquire a cluster wide lock from those handlers
      seem like a very bad idea.
      
      This introduces the "unfence-peer" handler, which will be called
      per connection (once for the group of volumes), just like the fence
      handler, only once all volumes are back in sync, and on the SyncSource.
      
      Which is expected to be the node that previously called "fence", the
      node that is currently allowed to be Primary, and thus the only node
      that could trigger a new "fence" that could race with this unfence.
      
      Which makes us not need any cluster wide synchronization here,
      serializing two scripts running on the same node is trivial.
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      26a96110
    • L
      drbd: finish resync on sync source only by notification from sync target · 5052fee2
      Lars Ellenberg 提交于
      If the replication link breaks exactly during "resync finished" detection,
      finishing too early on the sync source could again lead to UUIDs rotated
      too fast, and potentially a spurious full resync on next handshake.
      
      Always wait for explicit resync finished state change notification from
      the sync target.
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      5052fee2
    • L
      drbd: allow larger max_discard_sectors · 505675f9
      Lars Ellenberg 提交于
      Make sure we have at least 67 (> AL_UPDATES_PER_TRANSACTION)
      al-extents available, and allow up to half of that to be
      discarded in one bio.
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      505675f9
    • L
      drbd: zero-out partial unaligned discards on local backend · 7435e901
      Lars Ellenberg 提交于
      For consistency, also zero-out partial unaligned chunks of discard
      requests on the local backend.
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7435e901
    • L
      drbd: when receiving P_TRIM, zero-out partial unaligned chunks · dd4f699d
      Lars Ellenberg 提交于
      We can avoid spurious data divergence caused by partially-ignored
      discards on certain backends with discard_zeroes_data=0, if we
      translate partial unaligned discard requests into explicit zero-out.
      
      The relevant use case is LVM/DM thin.
      
      If on different nodes, DRBD is backed by devices with differing
      discard characteristics, discards may lead to data divergence
      (old data or garbage left over on one backend, zeroes due to
      unmapped areas on the other backend). Online verify would now
      potentially report tons of spurious differences.
      
      While probably harmless for most use cases (fstrim on a file system),
      DRBD cannot have that, it would violate our promise to upper layers
      that our data instances on the nodes are identical.
      
      To be correct and play safe (make sure data is identical on both copies),
      we would have to disable discard support, if our local backend (on a
      Primary) does not support "discard_zeroes_data=true".
      
      We'd also have to translate discards to explicit zero-out on the
      receiving (typically: Secondary) side, unless the receiving side
      supports "discard_zeroes_data=true".
      
      Which both would allocate those blocks, instead of unmapping them,
      in contrast with expectations.
      
      LVM/DM thin does set discard_zeroes_data=0,
      because it silently ignores discards to partial chunks.
      
      We can work around this by checking the alignment first.
      For unaligned (wrt. alignment and granularity) or too small discards,
      we zero-out the initial (and/or) trailing unaligned partial chunks,
      but discard all the aligned full chunks.
      
      At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
      
      Arguably it should behave this way internally, by default,
      and we'll try to make that happen.
      
      But our workaround is still valid for already deployed setups,
      and for other devices that may behave this way.
      
      Setting discard-zeroes-if-aligned=yes will allow DRBD to use
      discards, and to announce discard_zeroes_data=true, even on
      backends that announce discard_zeroes_data=false.
      
      Setting discard-zeroes-if-aligned=no will cause DRBD to always
      fall-back to zero-out on the receiving side, and to not even
      announce discard capabilities on the Primary, if the respective
      backend announces discard_zeroes_data=false.
      
      We used to ignore the discard_zeroes_data setting completely.
      To not break established and expected behaviour, and suddenly
      cause fstrim on thin-provisioned LVs to run out-of-space,
      instead of freeing up space, the default value is "yes".
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dd4f699d
    • P
      drbd: Implement handling of thinly provisioned storage on resync target nodes · 700ca8c0
      Philipp Reisner 提交于
      If during resync we read only zeroes for a range of sectors assume
      that these secotors can be discarded on the sync target node.
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      700ca8c0
  2. 08 6月, 2016 1 次提交
  3. 05 4月, 2016 1 次提交
  4. 27 1月, 2016 1 次提交
  5. 23 1月, 2016 1 次提交
  6. 26 11月, 2015 10 次提交
  7. 08 11月, 2015 1 次提交
  8. 14 8月, 2015 1 次提交
    • K
      block: kill merge_bvec_fn() completely · 8ae12666
      Kent Overstreet 提交于
      As generic_make_request() is now able to handle arbitrarily sized bios,
      it's no longer necessary for each individual block driver to define its
      own ->merge_bvec_fn() callback. Remove every invocation completely.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: NDongsu Park <dpark@posteo.net>
      Signed-off-by: NMing Lin <ming.l@ssi.samsung.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8ae12666
  9. 29 7月, 2015 1 次提交
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig 提交于
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not beeing persistent
      when bios are queued up, and are not passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4246a0b6
  10. 02 6月, 2015 1 次提交
    • T
      writeback: separate out include/linux/backing-dev-defs.h · 66114cad
      Tejun Heo 提交于
      With the planned cgroup writeback support, backing-dev related
      declarations will be more widely used across block and cgroup;
      unfortunately, including backing-dev.h from include/linux/blkdev.h
      makes cyclic include dependency quite likely.
      
      This patch separates out backing-dev-defs.h which only has the
      essential definitions and updates blkdev.h to include it.  c files
      which need access to more backing-dev details now include
      backing-dev.h directly.  This takes backing-dev.h off the common
      include dependency chain making it a lot easier to use it across block
      and cgroup.
      
      v2: fs/fat build failure fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      66114cad
  11. 11 11月, 2014 3 次提交
  12. 11 9月, 2014 3 次提交
  13. 11 7月, 2014 7 次提交
    • L
      drbd: debugfs: add callback_history · 944410e9
      Lars Ellenberg 提交于
      Add a per-connection worker thread callback_history
      with timing details, call site and callback function.
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      944410e9
    • L
      drbd: debugfs: Add in_flight_summary · f418815f
      Lars Ellenberg 提交于
      * Add details about pending meta data operations to in_flight_summary.
      
      * Report number of requests waiting for activity log transactions.
      
      * timing details of peer_requests to in_flight_summary.
      
      * FLUSH details
        DRBD devides the incoming request stream into "epochs",
        in which peers are allowed to re-order writes independendly.
      
        These epochs are separated by P_BARRIER on the replication link.
        Such barrier packets, depending on configuration, may cause
        the receiving side to drain the lower level device request queues
        and call blkdev_issue_flush().
      
        This is known to be an other major source of latency in DRBD.
      
        Track timing details of calls to blkdev_issue_flush(),
        and add them to in_flight_summary.
      
      * data socket stats
        To be able to diagnose bottlenecks and root causes of "slow" IO on DRBD,
        it is useful to see network buffer stats along with the timing details of
        requests, peer requests, and meta data IO.
      
      * pending bitmap IO timing details to in_flight_summary.
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      f418815f
    • L
      drbd: debugfs: add basic hierarchy · 4d3d5aa8
      Lars Ellenberg 提交于
      Add new debugfs hierarchy /sys/kernel/debug/
        drbd/
          resources/
            $resource_name/connections/peer/$volume_number/
            $resource_name/volumes/$volume_number/
          minors/$minor_number -> ../resources/$resource_name/volumes/$volume_number/
      
      Followup commits will populate this hierarchy with files containing
      statistics, diagnostic information and some attribute data.
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      4d3d5aa8
    • L
      drbd: track details of bitmap IO · 4ce49266
      Lars Ellenberg 提交于
      Track start and submit time of bitmap operations, and
      add pending bitmap IO contexts to a new pending_bitmap_io list.
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      4ce49266
    • L
      drbd: track timing details of peer_requests · 21ae5d7f
      Lars Ellenberg 提交于
      To be able to present timing details in debugfs,
      we need to track preparation/submit times of peer requests.
      
      Track peer request flags early,
      before they are put on the epoch_entry lists.
      
      Waiting for activity log transactions may be a major latency factor.
      We want to be able to present the peer_request state accurately in
      debugfs, and what it is waiting for.
      
      Consistently mark/unmark peer requests with EE_CALL_AL_COMPLETE_IO.
      Set it only *after* calling drbd_al_begin_io(),
      clear it as soon as we call drbd_al_complete_io().
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      21ae5d7f
    • L
      drbd: improve throttling decisions of background resynchronisation · ad3fee79
      Lars Ellenberg 提交于
      Background resynchronisation does some "side-stepping", or throttles
      itself, if it detects application IO activity, and the current resync
      rate estimate is above the configured "cmin-rate".
      
      What was not detected: if there is no application IO,
      because it blocks on activity log transactions.
      
      Introduce a new atomic_t ap_actlog_cnt, tracking such blocked requests,
      and count non-zero as application IO activity.
      This counter is exposed at proc_details level 2 and above.
      
      Also make sure to release the currently locked resync extent
      if we side-step due to such voluntary throttling.
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      ad3fee79
    • L
      drbd: add caching oldest request pointers for replication stages · 7753a4c1
      Lars Ellenberg 提交于
      A request that is to be shipped to the peer goes through a few stages:
      - queued
      - sent, waiting for ack
      - ack received, waiting for "barrier ack", which is re-order epoch being
        closed on the peer by acknowledging a "cache flush" equivalent
        on the lower level device.
      
      In the later two stages, depending on protocol, we may have already
      completed this request to the upper layers, so it won't be found anymore
      on device->pending_master_completion[] lists.
      
      Track the oldest request yet to be sent (req_next), the oldest not yet
      acknowledged (req_ack_pending) and the oldest "still waiting for
      something from the peer" (req_not_net_done), doing short list walks on
      the transfer log to find the next pending one whenever such a request
      makes progress.
      
      Now we have a fast way to look up the oldest requests,
      don't do a transfer log walk every time.
      Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      7753a4c1