1. 17 Mar, 2017 8 commits
    • S
      md/r5cache: improve recovery with read ahead page pool · effe6ee7
      Song Liu committed
      In r5cache recovery, the journal device is scanned page by page.
      Currently, we use sync_page_io() to read the journal device. This is
      not efficient when we have to recover many stripes from the journal.
      
      To improve the speed of recovery, this patch introduces a read ahead
      page pool (ra_pool) to recovery_ctx. With ra_pool, multiple consecutive
      pages are read in one IO. Then the recovery code reads the journal from
      ra_pool.
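
The read-ahead idea can be illustrated with a small userspace sketch. It is not the kernel code: the "device" is a plain memory buffer, and `struct ra_pool`, `ra_read_page()`, `RA_POOL_PAGES` and `PAGE_SZ` are hypothetical names chosen for this example. A miss refills the pool with several consecutive pages in one simulated IO; later page reads are served from the pool.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ       4096
#define RA_POOL_PAGES 4      /* hypothetical pool size, in pages */

struct ra_pool {
    const uint8_t *dev;      /* stand-in for the journal device */
    size_t dev_pages;        /* device size in pages */
    size_t start;            /* first page currently held in the pool */
    size_t valid;            /* number of valid pages in the pool */
    unsigned io_count;       /* number of simulated IOs issued */
    uint8_t buf[RA_POOL_PAGES * PAGE_SZ];
};

/* Read one page through the pool; returns 0 on success, -1 if out of range. */
static int ra_read_page(struct ra_pool *p, size_t page, uint8_t *out)
{
    if (page >= p->dev_pages)
        return -1;

    /* Miss: refill the pool with up to RA_POOL_PAGES pages in one "IO". */
    if (page < p->start || page >= p->start + p->valid) {
        size_t n = p->dev_pages - page;

        if (n > RA_POOL_PAGES)
            n = RA_POOL_PAGES;
        memcpy(p->buf, p->dev + page * PAGE_SZ, n * PAGE_SZ);
        p->start = page;
        p->valid = n;
        p->io_count++;
    }

    /* Hit (or freshly refilled): copy the page out of the pool. */
    memcpy(out, p->buf + (page - p->start) * PAGE_SZ, PAGE_SZ);
    return 0;
}
```

Scanning eight pages sequentially through this sketch issues two simulated IOs instead of eight, which is the effect the patch is after. The pool buffer is also why the real context structure grew too large to keep on the stack.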
      
      With ra_pool, r5l_recovery_ctx has become much bigger. Therefore,
      r5l_recovery_log() is refactored so r5l_recovery_ctx is not using
      stack space.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      effe6ee7
    • S
      md/raid5: sort bios · aaf9f12e
      Shaohua Li committed
      Previous patch (raid5: only dispatch IO from raid5d for harddisk raid)
      defers IO dispatching. The goal is to create a better IO pattern. At that
      time, we didn't sort the deferred IO and hoped the block layer could do IO
      merge and sort. Now the raid5-cache writeback could create a large number
      of bios. And if we enable multi-thread for stripe handling, we can't
      control when to dispatch IO to raid disks. Much of the time, we are
      dispatching IO which the block layer can't merge effectively.
      
      This patch goes further with deferring the IO dispatching. We accumulate
      bios, but we don't dispatch all the bios after a threshold is met. This
      'dispatch partial portion of bios' strategy allows bios arriving over a
      large time window to be sent to disks together. At dispatch time,
      there is a good chance the block layer can merge the bios. To make this
      more effective, we dispatch IO in ascending order. This increases
      the request merge chance and reduces disk seeks.
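
A minimal userspace sketch of the "accumulate, sort, dispatch a portion" strategy follows. It is an illustration under assumptions, not the raid5 implementation: requests are reduced to start sectors, and `PENDING_MAX`, `DISPATCH_BATCH`, `struct pending_queue` and `defer_io()` are hypothetical names and values. The sketch simply dispatches the lowest sectors first; the kernel's choice of which portion to dispatch may differ.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PENDING_MAX    8   /* hypothetical threshold that triggers dispatch */
#define DISPATCH_BATCH 4   /* hypothetical size of the dispatched portion */

struct pending_queue {
    uint64_t sector[PENDING_MAX];
    size_t count;
};

static int cmp_sector(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/*
 * Queue one request. Once the threshold is reached, sort the pending
 * sectors in ascending order and hand only the first DISPATCH_BATCH of
 * them to the caller through 'out'; the rest stay queued so that later
 * arrivals can still be sorted in with them. Returns the number of
 * sectors written to 'out'.
 */
static size_t defer_io(struct pending_queue *q, uint64_t sector, uint64_t *out)
{
    q->sector[q->count++] = sector;
    if (q->count < PENDING_MAX)
        return 0;

    qsort(q->sector, q->count, sizeof(q->sector[0]), cmp_sector);
    memcpy(out, q->sector, DISPATCH_BATCH * sizeof(out[0]));
    q->count -= DISPATCH_BATCH;
    memmove(q->sector, q->sector + DISPATCH_BATCH,
            q->count * sizeof(q->sector[0]));
    return DISPATCH_BATCH;
}
```

Keeping part of the queue back widens the time window over which neighbouring requests can meet, which is what gives the block layer a chance to merge them.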
      Signed-off-by: Shaohua Li <shli@fb.com>
      aaf9f12e
    • S
      md/raid5-cache: bump flush stripe batch size · 84890c03
      Shaohua Li committed
      Bump the flush stripe batch size to 2048. For my 12-disk raid
      array, the stripes take:
      12 * 4k * 2048 = 96MB
      
      This is still quite small. A hardware raid card generally has a 1GB cache,
      and we suggest that raid5-cache have a similar cache size.
      
      The advantage of a big batch size is that we can dispatch a lot of IO at
      the same time, then we can do some scheduling to create a better IO pattern.
      
      The previous patch prioritizes stripes, so we don't need to worry that a
      big flush stripe batch will starve normal stripes.
      Signed-off-by: Shaohua Li <shli@fb.com>
      84890c03
    • S
      md/raid5: prioritize stripes for writeback · 535ae4eb
      Shaohua Li committed
      In raid5-cache writeback mode, we have two types of stripes to handle.
      - stripes which aren't cached yet
      - stripes which are cached and flushing out to raid disks
      
      The upper layer is generally more sensitive to the latency of the first
      type of stripes. But we only have one handle list for all these stripes,
      where the two types of stripes are mixed together. When reclaim flushes a
      lot of stripes, the first type of stripes could be noticeably delayed. On
      the other hand, if the log space is tight, we'd like to handle the second
      type of stripes faster and free log space.
      
      This patch distinguishes the two types of stripes. They are added to
      different handle lists. When we try to get a stripe to handle, we prefer
      the first type of stripes unless log space is tight.
      
      This should have no impact for !writeback case.
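
The selection rule can be sketched in a few lines of userspace C. The types and names here (`struct stripe`, `struct handle_lists`, `next_stripe()`) are hypothetical and only model the policy described above: prefer stripes that are not cached yet, unless log space is tight, in which case stripes being flushed go first.

```c
#include <stdbool.h>
#include <stddef.h>

struct stripe {
    int id;
    struct stripe *next;
};

struct handle_lists {
    struct stripe *normal;     /* stripes which aren't cached yet */
    struct stripe *writeback;  /* cached stripes flushing out to raid disks */
};

/* Detach and return the first stripe of a list, or NULL if it is empty. */
static struct stripe *pop(struct stripe **head)
{
    struct stripe *s = *head;

    if (s)
        *head = s->next;
    return s;
}

/* Pick the next stripe to handle, falling back to the other list if empty. */
static struct stripe *next_stripe(struct handle_lists *l, bool log_space_tight)
{
    struct stripe **first = log_space_tight ? &l->writeback : &l->normal;
    struct stripe **second = log_space_tight ? &l->normal : &l->writeback;
    struct stripe *s = pop(first);

    return s ? s : pop(second);
}
```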
      Signed-off-by: Shaohua Li <shli@fb.com>
      535ae4eb
    • G
      md-cluster: add the support for resize · 818da59f
      Guoqing Jiang committed
      To update the size for cluster raid, we need to make
      sure all nodes can perform the change successfully.
      However, it is possible that some of them can't do
      it due to failure (bitmap_resize could fail). So
      we need to consider the issue before we set the
      capacity unconditionally, and we use the steps below
      to perform a sanity check.
      
      1. A changes the size, then broadcasts the METADATA_UPDATED
         msg.
      2. B and C receive METADATA_UPDATED and change the size,
         except that they do not call set_capacity; sync_size is
         not updated if the change failed. They also call
         bitmap_update_sb to sync the sb to disk.
      3. A checks the other nodes' sync_size. If sync_size has
         been updated on all nodes, it sends the CHANGE_CAPACITY
         msg, otherwise it sends a msg to revert the previous change.
      4. B and C call set_capacity if they receive the CHANGE_CAPACITY
         msg, otherwise pers->resize will be called to restore
         the old value.
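
The four steps amount to a two-phase change: apply tentatively, verify on every node, then commit or revert. The following userspace sketch models that flow with an array of nodes. All names (`struct node`, `apply_size()`, `cluster_resize()`, the `resize_fails` fault-injection flag) are hypothetical, and message passing is replaced by direct function calls.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct node {
    uint64_t sync_size;   /* size recorded in the node's bitmap */
    uint64_t capacity;    /* size currently exposed to users */
    bool resize_fails;    /* hypothetical flag to simulate a failing node */
};

/* Steps 1 and 2: apply the new size without touching the capacity. */
static void apply_size(struct node *n, uint64_t new_size)
{
    if (!n->resize_fails)
        n->sync_size = new_size;
}

/* Returns true if every node accepted the new size and it was committed. */
static bool cluster_resize(struct node *nodes, size_t count, uint64_t new_size)
{
    uint64_t old_size = nodes[0].sync_size;
    bool ok = true;
    size_t i;

    for (i = 0; i < count; i++)
        apply_size(&nodes[i], new_size);

    /* Step 3: the initiator checks every node's sync_size. */
    for (i = 0; i < count; i++)
        if (nodes[i].sync_size != new_size)
            ok = false;

    /* Step 4: commit the capacity everywhere, or restore the old size. */
    for (i = 0; i < count; i++) {
        if (ok)
            nodes[i].capacity = new_size;
        else
            nodes[i].sync_size = old_size;
    }
    return ok;
}
```

The point of deferring the capacity change is that no node exposes the new size until all nodes are known to agree on it.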
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      818da59f
    • G
      md-cluster: introduce cluster_check_sync_size · b98938d1
      Guoqing Jiang committed
      Supporting resize is a little complex for clustered
      raid, since we need to ensure all the nodes share
      the same knowledge about the size of the raid.
      
      We achieve the goal by checking the sync_size which
      is in each node's bitmap; we can only change the
      capacity after cluster_check_sync_size returns 0.
      
      Also, get_bitmap_from_slot is added to get a slot's
      bitmap. And we exported some funcs since they are
      used in cluster_check_sync_size().
      
      We can also reuse get_bitmap_from_slot to remove
      redundant code that existed in bitmap_copy_from_slot.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      b98938d1
    • G
      md-cluster: add CHANGE_CAPACITY message type · 7da3d203
      Guoqing Jiang committed
      The msg type CHANGE_CAPACITY is introduced to support
      resizing clustered raid in a later patch. It is sent
      after all the nodes have the same sync_size, and the
      receiving node just needs to set the new capacity once
      it receives this msg.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      7da3d203
    • G
      md-cluster: use sync way to handle METADATA_UPDATED msg · 0ba95977
      Guoqing Jiang committed
      Previously, when a node received the METADATA_UPDATED msg, it just
      needed to wake up mddev->thread, and md_reload_sb would eventually
      be called.
      
      We took the asynchronous approach to avoid a deadlock, which
      could happen when one node is receiving the
      METADATA_UPDATED msg (wants reconfig_mutex) while trying to run
      the path:
      
      md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                        -> md_update_sb-metadata_update_start
      		     (want EX on token however token is
      		      got by the sending node)
      
      Since we will support resizing for clustered raid, we
      need the metadata update handling to be synchronous so that
      the initiating node can detect failure, so we need to change
      the way the METADATA_UPDATED msg is handled.
      
      But we obviously need to avoid the above deadlock with the
      synchronous approach. To make this happen, we do not hold
      reconfig_mutex when calling md_reload_sb if some other thread
      has already taken reconfig_mutex and is waiting for the 'token';
      in that case process_recvd_msg() can safely call md_reload_sb()
      without taking the mutex. This is because we can be certain
      that no other thread will take the mutex, and we are also certain
      that the actions performed by md_reload_sb() won't interfere
      with anything that the other thread is in the middle of.
      
      To make this more concrete, we added a new cinfo->state bit
              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      
      Which is set in lock_token() just before dlm_lock_sync() is
      called, and cleared just after. As lock_token() is always
      called with reconfig_mutex() held (the specific case is the
      resync_info_update which is distinguished well in previous
      patch), if process_recvd_msg() finds that the new bit is set,
      then the mutex must be held by some other thread, and it will
      keep waiting.
      
      So process_metadata_update() can call md_reload_sb() if either
      mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      is set. The tricky bit is what to do if neither of these apply.
      We need to wait. Fortunately mddev_unlock() always calls wake_up()
      on mddev->thread->wqueue. So we can get lock_token() to call
      wake_up() on that when it sets the bit.
      
      There are also some related changes inside this commit:
      1. remove RELOAD_SB related code since it is not valid anymore.
      2. mddev is added into md_cluster_info then we can get mddev inside
         lock_token.
      3. add new parameter for lock_token to distinguish reconfig_mutex
         is held or not.
      
      We also need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in the cases below:
      1. set it before unregistering the thread, otherwise a deadlock could
         appear when stopping a resyncing array.
         This is because md_unregister_thread(&cinfo->recv_thread) is
         blocked by recv_daemon -> process_recvd_msg
      			  -> process_metadata_update.
         To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
         also needs to be set before unregistering the thread.
      2. set it in metadata_update_start to fix another deadlock.
      	a. Node A sends METADATA_UPDATED msg (held Token lock).
      	b. Node B wants to do resync, and is blocked since it can't
      	   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
      	   not set since the callchain
      	   (md_do_sync -> sync_request
              	       -> resync_info_update
      		       -> sendmsg
      		       -> lock_comm -> lock_token)
      	   doesn't hold reconfig_mutex.
      	c. Node B trys to update sb (held reconfig_mutex), but stopped
      	   at wait_event() in metadata_update_start since we have set
      	   MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
      	d. Then Node B receives METADATA_UPDATED msg from A, of course
      	   recv_daemon is blocked forever.
         Since metadata_update_start always calls lock_token with reconfig_mutex,
         we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
         lock_token doesn't need to set it twice unless lock_token is invoked from
         lock_comm.
      
      Finally, thanks to Neil for his great idea and help!
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      0ba95977
  2. 15 Mar, 2017 2 commits
  3. 12 Mar, 2017 1 commit
    • N
      blk: Ensure users for current->bio_list can see the full list. · f5fe1b51
      NeilBrown committed
      Commit 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      changed current->bio_list so that it did not contain *all* of the
      queued bios, but only those submitted by the currently running
      make_request_fn.
      
      There are two places which walk the list and requeue selected bios,
      and others that check if the list is empty.  These are no longer
      correct.
      
      So redefine current->bio_list to point to an array of two lists, which
      contain all queued bios, and adjust various code to test or walk both
      lists.
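
To make the consequence for callers concrete, here is a small userspace sketch of "test or walk both lists". The structures are simplified stand-ins modelled on the kernel's `struct bio_list`, and the helpers `all_lists_empty()` and `count_queued()` are hypothetical names for this example only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's bio and bio_list structures. */
struct bio {
    struct bio *bi_next;
};

struct bio_list {
    struct bio *head;
    struct bio *tail;
};

/* Append a bio to the tail of a list. */
static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
    bio->bi_next = NULL;
    if (bl->tail)
        bl->tail->bi_next = bio;
    else
        bl->head = bio;
    bl->tail = bio;
}

/* An emptiness test is only correct if it looks at both lists. */
static bool all_lists_empty(const struct bio_list lists[2])
{
    return lists[0].head == NULL && lists[1].head == NULL;
}

/* Likewise, a walk over "all queued bios" has to visit both lists. */
static size_t count_queued(const struct bio_list lists[2])
{
    size_t n = 0;

    for (int i = 0; i < 2; i++)
        for (const struct bio *b = lists[i].head; b; b = b->bi_next)
            n++;
    return n;
}
```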
      Signed-off-by: NeilBrown <neilb@suse.com>
      Fixes: 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f5fe1b51
  4. 11 Mar, 2017 2 commits
  5. 10 Mar, 2017 8 commits
  6. 02 Mar, 2017 8 commits
    • I
      sched/headers: Prepare to move the get_task_struct()/put_task_struct() and... · 0881e7bd
      Ingo Molnar committed
      sched/headers: Prepare to move the get_task_struct()/put_task_struct() and related APIs from <linux/sched.h> to <linux/sched/task.h>
      
      But first update usage sites with the new header dependency.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0881e7bd
    • I
      sched/headers: Prepare to use <linux/rcupdate.h> instead of <linux/rculist.h> in <linux/sched.h> · b2d09103
      Ingo Molnar committed
      We don't actually need the full rculist.h header in sched.h anymore,
      we will be able to include the smaller rcupdate.h header instead.
      
      But first update code that relied on the implicit header inclusion.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b2d09103
    • I
      sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h> · 68db0cf1
      Ingo Molnar committed
      We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
      will have to be picked up from other headers and a couple of .c files.
      
      Create a trivial placeholder <linux/sched/task_stack.h> file that just
      maps to <linux/sched.h> to make this patch obviously correct and
      bisectable.
      
      Include the new header in the files that are going to need it.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      68db0cf1
    • I
      sched/headers: Prepare to move the memalloc_noio_*() APIs to <linux/sched/mm.h> · 5b3cc15a
      Ingo Molnar committed
      Update the .c files that depend on these APIs.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5b3cc15a
    • I
      sched/headers: Prepare to move signal wakeup & sigpending methods from... · 174cd4b1
      Ingo Molnar committed
      sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
      
      Fix up affected files that include this signal functionality via sched.h.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      174cd4b1
    • I
      sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h> · 3f07c014
      Ingo Molnar committed
      We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
      will have to be picked up from other headers and a couple of .c files.
      
      Create a trivial placeholder <linux/sched/signal.h> file that just
      maps to <linux/sched.h> to make this patch obviously correct and
      bisectable.
      
      Include the new header in the files that are going to need it.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3f07c014
    • I
      sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h> · e6017571
      Ingo Molnar committed
      We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
      will have to be picked up from other headers and .c files.
      
      Create a trivial placeholder <linux/sched/clock.h> file that just
      maps to <linux/sched.h> to make this patch obviously correct and
      bisectable.
      
      Include the new header in the files that are going to need it.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e6017571
    • D
      KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload() · 0837e49a
      David Howells committed
      rcu_dereference_key() and user_key_payload() are currently being used in
      two different, incompatible ways:
      
       (1) As a wrapper to rcu_dereference() - when only the RCU read lock is
           used to protect the key.
      
       (2) As a wrapper to rcu_dereference_protected() - when the key semaphore
           is used to protect the key and the key may be being modified.
      
      Fix this by splitting both of the key wrappers to produce:
      
       (1) RCU accessors for keys when caller has the key semaphore locked:
      
      	dereference_key_locked()
      	user_key_payload_locked()
      
       (2) RCU accessors for keys when caller holds the RCU read lock:
      
      	dereference_key_rcu()
      	user_key_payload_rcu()
      
      This should fix the following warning in the NFS idmapper:
      
        ===============================
        [ INFO: suspicious RCU usage. ]
        4.10.0 #1 Tainted: G        W
        -------------------------------
        ./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
        other info that might help us debug this:
        rcu_scheduler_active = 2, debug_locks = 0
        1 lock held by mount.nfs/5987:
          #0:  (rcu_read_lock){......}, at: [<d000000002527abc>] nfs_idmap_get_key+0x15c/0x420 [nfsv4]
        stack backtrace:
        CPU: 1 PID: 5987 Comm: mount.nfs Tainted: G        W       4.10.0 #1
        Call Trace:
          dump_stack+0xe8/0x154 (unreliable)
          lockdep_rcu_suspicious+0x140/0x190
          nfs_idmap_get_key+0x380/0x420 [nfsv4]
          nfs_map_name_to_uid+0x2a0/0x3b0 [nfsv4]
          decode_getfattr_attrs+0xfac/0x16b0 [nfsv4]
          decode_getfattr_generic.constprop.106+0xbc/0x150 [nfsv4]
          nfs4_xdr_dec_lookup_root+0xac/0xb0 [nfsv4]
          rpcauth_unwrap_resp+0xe8/0x140 [sunrpc]
          call_decode+0x29c/0x910 [sunrpc]
          __rpc_execute+0x140/0x8f0 [sunrpc]
          rpc_run_task+0x170/0x200 [sunrpc]
          nfs4_call_sync_sequence+0x68/0xa0 [nfsv4]
          _nfs4_lookup_root.isra.44+0xd0/0xf0 [nfsv4]
          nfs4_lookup_root+0xe0/0x350 [nfsv4]
          nfs4_lookup_root_sec+0x70/0xa0 [nfsv4]
          nfs4_find_root_sec+0xc4/0x100 [nfsv4]
          nfs4_proc_get_rootfh+0x5c/0xf0 [nfsv4]
          nfs4_get_rootfh+0x6c/0x190 [nfsv4]
          nfs4_server_common_setup+0xc4/0x260 [nfsv4]
          nfs4_create_server+0x278/0x3c0 [nfsv4]
          nfs4_remote_mount+0x50/0xb0 [nfsv4]
          mount_fs+0x74/0x210
          vfs_kern_mount+0x78/0x220
          nfs_do_root_mount+0xb0/0x140 [nfsv4]
          nfs4_try_mount+0x60/0x100 [nfsv4]
          nfs_fs_mount+0x5ec/0xda0 [nfs]
          mount_fs+0x74/0x210
          vfs_kern_mount+0x78/0x220
          do_mount+0x254/0xf70
          SyS_mount+0x94/0x100
          system_call+0x38/0xe0
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: James Morris <james.l.morris@oracle.com>
      0837e49a
  7. 01 Mar, 2017 3 commits
    • M
      dm raid: bump the target version · 2664f3c9
      Mike Snitzer committed
      This version bump reflects that the reshape corruption fix (commit
      92a39f6cc "dm raid: fix data corruption on reshape request") is
      present.
      
      Done as a separate fix because the above referenced commit is marked for
      stable and target version bumps in a stable@ fix are a recipe for the
      fix to never get backported to stable@ kernels (because of target
      version number conflicts).
      
      Also, move RESUME_STAY_FROZEN_FLAGS up with the rest of the _FLAGS
      definitions now that we don't need to worry about stable@ conflicts as a
      result of missing context.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      2664f3c9
    • H
      dm raid: fix data corruption on reshape request · d36a1954
      Heinz Mauelshagen committed
      The lvm2 sequence to manage dm-raid constructor flags that trigger a
      rebuild or a reshape is defined as:
      
      1) load table with flags (e.g. rebuild/delta_disks/data_offset)
      2) clear out the flags in lvm2 metadata
      3) store the lvm2 metadata, reload the table to reset the flags
         previously established during the initial load (1) -- in order to
         prevent repeatedly requesting a rebuild or a reshape on activation
      
      Currently, loading an inactive table with rebuild/reshape flags
      specified will cause dm-raid to rebuild/reshape on resume and thus start
      updating the raid metadata (about the progress).  When the second table
      reload, to reset the flags, occurs the constructor accesses the volatile
      progress state kept in the raid superblocks.  Because the active mapping
      is still processing the rebuild/reshape, that position will be stale by
      the time the device is resumed.
      
      In the reshape case, this causes data corruption by processing already
      reshaped stripes again.  In the rebuild case, it does _not_ cause data
      corruption but instead involves superfluous rebuilds.
      
      Fix by keeping the raid set frozen during the first resume and then
      allow the rebuild/reshape during the second resume.
      
      Fixes: 9dbd1aa3 ("dm raid: add reshaping support to the target")
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.8+
      d36a1954
    • M
      dm raid: fix raid "check" regression due to improper cleanup in raid_message() · ad470472
      Mike Snitzer committed
      While cleaning up awkward branching in raid_message() a raid set "check"
      regression was introduced because "check" needs both MD_RECOVERY_SYNC
      and MD_RECOVERY_REQUESTED flags set.
      
      Fix this regression by explicitly setting both flags for the "check"
      case (like is also done for the "repair" case, but redundant set_bit()s
      are perfectly fine because it adds clarity to what is needed in response
      to both messages -- in addition this isn't fast path code).
      
      Fixes: 105db599 ("dm raid: cleanup awkward branching in raid_message() option processing")
      Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      ad470472
  8. 25 Feb, 2017 1 commit
  9. 24 Feb, 2017 3 commits
  10. 20 Feb, 2017 3 commits
    • S
      md/raid1: fix a use-after-free bug · af5f42a7
      Shaohua Li committed
      Commit fd76863e (RAID1: a new I/O barrier implementation to remove resync
      window) introduced a use-after-free bug.
      Signed-off-by: Shaohua Li <shli@fb.com>
      af5f42a7
    • C
      RAID1: avoid unnecessary spin locks in I/O barrier code · 824e47da
      colyli@suse.de committed
      When I ran a parallel read performance test on an md raid1 device with
      two NVMe SSDs, I observed surprisingly bad throughput: with fio using a
      64KB block size, 40 seq read I/O jobs and 128 iodepth, overall throughput
      is only 2.7GB/s, which is around 50% of the ideal performance number.
      
      perf reports that lock contention happens in the allow_barrier() and
      wait_barrier() code,
       - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
         - _raw_spin_lock_irqsave
               + 89.92% allow_barrier
               + 9.34% __wake_up
       - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
         - _raw_spin_lock_irq
               - 100.00% wait_barrier
      
      The reason is, in these I/O barrier related functions,
       - raise_barrier()
       - lower_barrier()
       - wait_barrier()
       - allow_barrier()
      they always take conf->resync_lock first, even when there are only regular
      read I/Os and no resync I/O at all. This is a huge performance penalty.
      
      The solution is a lockless-like algorithm in I/O barrier code, and only
      holding conf->resync_lock when it has to.
      
      The original idea is from Hannes Reinecke, and Neil Brown provides
      comments to improve it. I continue to work on it, and make the patch into
      current form.
      
      In the new simpler raid1 I/O barrier implementation, there are two
      wait barrier functions,
       - wait_barrier()
         Which calls _wait_barrier(), is used for regular write I/O. If there is
         resync I/O happening on the same I/O barrier bucket, or the whole
         array is frozen, the task will wait until there is no barrier on the
         same barrier bucket, or the whole array is unfrozen.
       - wait_read_barrier()
         Since regular read I/O won't interfere with resync I/O (read_balance()
         will make sure only uptodate data will be read out), it is unnecessary
         to wait for the barrier in regular read I/Os; waiting is only necessary
         when the whole array is frozen.
      
      The operations on conf->nr_pending[idx], conf->nr_waiting[idx] and
      conf->barrier[idx] are very carefully designed in raise_barrier(),
      lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
      avoid unnecessary spin locks in these functions. Once
      conf->nr_pending[idx] is increased, a resync I/O with the same barrier
      bucket index has to wait in raise_barrier(). Then in _wait_barrier(), if
      no barrier is raised in the same barrier bucket index and the array is not
      frozen, the regular I/O doesn't need to hold conf->resync_lock; it can
      just increase conf->nr_pending[idx] and return to its caller.
      wait_read_barrier() is very similar to _wait_barrier(); the only
      difference is that it only waits when the array is frozen. For heavy
      parallel read I/O, the lockless I/O barrier code gets rid of almost all
      spin lock cost.
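
The fast path described in this paragraph can be sketched with C11 atomics in userspace. This is a simplified illustration, not the raid1 code: `struct bucket` and `try_wait_barrier_fast()` are hypothetical names, default sequentially consistent atomics stand in for the kernel's explicit barriers, and the slow path (taking the lock and sleeping) is left to the caller.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Per-bucket counters, modelled on nr_pending[idx] and barrier[idx]. */
struct bucket {
    atomic_int nr_pending;
    atomic_int barrier;
};

/*
 * Try to enter the bucket without taking any lock. The pending count is
 * raised first so that a resync raising a barrier afterwards will see it
 * and wait. If a barrier is already raised or the array is frozen, the
 * count is dropped again and the caller must use the locked slow path.
 */
static bool try_wait_barrier_fast(struct bucket *b, atomic_int *array_frozen)
{
    atomic_fetch_add(&b->nr_pending, 1);

    if (atomic_load(&b->barrier) == 0 && atomic_load(array_frozen) == 0)
        return true;    /* no lock was needed */

    atomic_fetch_sub(&b->nr_pending, 1);
    return false;       /* fall back to the locked slow path */
}
```

In the patch itself the slow path also wakes up waiters after dropping the pending count, as the V3 changelog entry notes; that part is omitted here.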
      
      This patch significantly improves raid1 read performance. In my
      testing, a raid1 device built from two NVMe SSDs, running fio with 64KB
      blocksize, 40 seq read I/O jobs and 128 iodepth, sees overall throughput
      increase from 2.7GB/s to 4.6GB/s (+70%).
      
      Changelog
      V4:
      - Change conf->nr_queued[] to atomic_t.
      - Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t)))
      V3:
      - Add smp_mb__after_atomic() as Shaohua and Neil suggested.
      - Change conf->nr_queued[] from atomic_t to int.
      - Change conf->array_frozen from atomic_t back to int, and use
        READ_ONCE(conf->array_frozen) to check value of conf->array_frozen
        in _wait_barrier() and wait_read_barrier().
      - In _wait_barrier() and wait_read_barrier(), add a call to
        wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]),
        to fix a deadlock between  _wait_barrier()/wait_read_barrier and
        freeze_array().
      V2:
      - Remove a spin_lock/unlock pair in raid1d().
      - Add more code comments to explain why there is no race when checking two
        atomic_t variables at the same time.
      V1:
      - Original RFC patch for comments.
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Guoqing Jiang <gqjiang@suse.com>
      Reviewed-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      824e47da
    • C
      RAID1: a new I/O barrier implementation to remove resync window · fd76863e
      colyli@suse.de committed
      'Commit 79ef3a8a ("raid1: Rewrite the implementation of iobarrier.")'
      introduces a sliding resync window for the raid1 I/O barrier. This idea
      limits I/O barriers to happen only inside a sliding resync window; regular
      I/Os outside this resync window don't need to wait for the barrier any
      more. On a large raid1 device, it helps a lot to improve parallel write
      I/O throughput when there are background resync I/Os running at the
      same time.
      
      The idea of a sliding resync window is awesome, but code complexity is a
      challenge. The sliding resync window requires several variables to work
      together, which is complex and very hard to get right.
      Just grep "Fixes: 79ef3a8a" in the kernel git log: there are 8 more patches
      that fix the original resync window patch. This is not the end; any further
      related modification may easily introduce more regressions.
      
      Therefore I decided to implement a much simpler raid1 I/O barrier. By
      removing the resync window code, I believe life will be much easier.
      
      The brief idea of the simpler barrier is,
       - Do not maintain a global unique resync window
       - Use multiple hash buckets to reduce I/O barrier conflicts; regular
         I/O only has to wait for a resync I/O when both of them have the same
         barrier bucket index, and vice versa.
       - I/O barrier can be reduced to an acceptable number if there are enough
         barrier buckets
      
      Here I explain how the barrier buckets are designed,
       - BARRIER_UNIT_SECTOR_SIZE
         The whole LBA address space of a raid1 device is divided into multiple
         barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
         Bio requests won't go across border of barrier unit size, that means
         maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
         For random I/O 64MB is large enough for both read and write requests,
         for sequential I/O considering underlying block layer may merge them
         into larger requests, 64MB is still good enough.
         Neil also points out that for resync operation, "we want the resync to
         move from region to region fairly quickly so that the slowness caused
         by having to synchronize with the resync is averaged out over a fairly
         small time frame". For full speed resync, 64MB should take less than 1
         second. When resync is competing with other I/O, it could take up to a
         few minutes. Therefore 64MB is a fairly good range for resync.
      
       - BARRIER_BUCKETS_NR
         There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
              #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
              #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
         This patch changes the members of struct r1conf below from integers
         to arrays of integers,
              -       int                     nr_pending;
              -       int                     nr_waiting;
              -       int                     nr_queued;
              -       int                     barrier;
              +       int                     *nr_pending;
              +       int                     *nr_waiting;
              +       int                     *nr_queued;
              +       int                     *barrier;
         The number of array elements is BARRIER_BUCKETS_NR. For a 4KB
         kernel page size, (PAGE_SHIFT - 2) indicates there are 1024 I/O
         barrier buckets, and each array of integers occupies a single memory
         page. With 1024 buckets, a request smaller than the I/O barrier unit
         size has a ~0.1% chance of having to wait for resync to pause, which
         is a small enough fraction. Also, requesting a single memory page is
         friendlier to the kernel page allocator than a larger allocation.
      
       - I/O barrier bucket is indexed by bio start sector
         If multiple I/O requests hit different I/O barrier units, each only
         needs to compete for the I/O barrier with other I/Os that hit the same
         I/O barrier bucket index. The index of the barrier bucket that a
         bio should look up is calculated by sector_to_idx(), which is defined
         in raid1.h as an inline function,
              static inline int sector_to_idx(sector_t sector)
              {
                      return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
                                      BARRIER_BUCKETS_NR_BITS);
              }
         Here sector is the start sector number of a bio.
      
       - A single bio won't cross the boundary of an I/O barrier unit
         If a request crosses a barrier unit boundary, it will be split. A
         bio may be split in raid1_make_request() or raid1_sync_request() if
         the number of sectors returned by align_to_barrier_unit_end() is
         smaller than the original bio size.
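
The bucket design above can be sketched in user space as follows. This is a minimal illustration, not the kernel code. The 4KB page size, the value 17 for BARRIER_UNIT_SECTOR_BITS (derived from the 64MB unit size stated above), the multiplicative hash standing in for hash_long() on a 64-bit machine, and the helper name sectors_to_unit_end() are all assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4KB pages, so PAGE_SHIFT is 12. */
#define PAGE_SHIFT_ASSUMED        12
/* Assumption: 2^17 sectors * 512 bytes = 64MB per barrier unit. */
#define BARRIER_UNIT_SECTOR_BITS  17
#define BARRIER_UNIT_SECTOR_SIZE  (1ULL << BARRIER_UNIT_SECTOR_BITS)
#define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT_ASSUMED - 2)
#define BARRIER_BUCKETS_NR        (1 << BARRIER_BUCKETS_NR_BITS)

/*
 * Map a start sector to a barrier bucket index. The multiply-and-shift
 * below is a user-space stand-in for the kernel's hash_long(); it keeps
 * the top BARRIER_BUCKETS_NR_BITS bits of the product.
 */
static int sector_to_idx(uint64_t sector)
{
	uint64_t unit = sector >> BARRIER_UNIT_SECTOR_BITS;

	return (int)((unit * 0x61C8864680B583EBULL) >>
		     (64 - BARRIER_BUCKETS_NR_BITS));
}

/*
 * Hypothetical helper in the spirit of align_to_barrier_unit_end():
 * return how many of `sectors` fit between `start` and the end of the
 * barrier unit that `start` belongs to.
 */
static uint64_t sectors_to_unit_end(uint64_t start, uint64_t sectors)
{
	uint64_t unit_end;
	uint64_t len;

	assert(sectors > 0);
	unit_end = ((start >> BARRIER_UNIT_SECTOR_BITS) + 1)
			<< BARRIER_UNIT_SECTOR_BITS;
	len = unit_end - start;

	return len < sectors ? len : sectors;
}
```

For example, a 16-sector bio starting 8 sectors before a unit boundary would be limited to 8 sectors by sectors_to_unit_end(), and the remaining 8 sectors would be handled as a separate bio in the next barrier unit. With 1024 buckets, two unrelated barrier units share a bucket with a probability of about 1/1024, which is the ~0.1% figure quoted above.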
      
      Compared to the single sliding resync window,
       - Currently resync I/O grows linearly, therefore regular and resync I/O
         will conflict within a single barrier unit. So the I/O behavior is
         similar to the single sliding resync window.
       - But a barrier bucket is shared by all barrier units with an identical
         barrier bucket index, so the probability of conflict might be higher
         than with the single sliding resync window, on condition that write
         I/Os always hit barrier units which have the same barrier bucket index
         as the resync I/Os. This is a very rare condition in real I/O
         workloads; I cannot imagine how it could happen in practice.
       - Therefore we can achieve a sufficiently low conflict rate with a much
         simpler barrier algorithm and implementation.
      
      The following changes should be noted,
       - In raid1d(), I change the code to decrease conf->nr_queued[idx]
         inside a single loop; it looks like this,
              spin_lock_irqsave(&conf->device_lock, flags);
              conf->nr_queued[idx]--;
              spin_unlock_irqrestore(&conf->device_lock, flags);
         This change generates more spin lock operations, but in the next patch
         of this patch set it will be replaced by a single line of code,
              atomic_dec(&conf->nr_queued[idx]);
         So we don't need to worry about the spin lock cost here.
       - Mainline raid1 code split the original raid1_make_request() into
         raid1_read_request() and raid1_write_request(). If the original bio
         crosses an I/O barrier unit boundary, this bio will be split before
         calling raid1_read_request() or raid1_write_request(). This change
         makes the code logic simpler and clearer.
       - In this patch wait_barrier() is moved from raid1_make_request() to
         raid1_write_request(). In raid1_read_request(), the original
         wait_barrier() is replaced by wait_read_barrier().
         The difference is that wait_read_barrier() only waits if the array is
         frozen. Using different barrier functions in different code paths
         makes the code cleaner and easier to read.
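
The move from a lock-protected decrement to atomic_dec() described above can be illustrated in user space with C11 atomics. This is only a sketch under assumptions: struct conf_sketch and the queued_*() helpers are hypothetical names, and C11 atomic_fetch_sub() stands in for the kernel's atomic_dec(). The real follow-up patch may differ in detail.

```c
#include <assert.h>
#include <stdatomic.h>

#define BARRIER_BUCKETS_NR 1024

/* Hypothetical stand-in for the per-bucket counters in struct r1conf. */
struct conf_sketch {
	atomic_int nr_queued[BARRIER_BUCKETS_NR];
};

/* Increment the queued counter of one bucket without taking a lock. */
static void queued_inc(struct conf_sketch *conf, int idx)
{
	assert(idx >= 0 && idx < BARRIER_BUCKETS_NR);
	atomic_fetch_add(&conf->nr_queued[idx], 1);
}

/* Decrement the queued counter; plays the role of atomic_dec(). */
static void queued_dec(struct conf_sketch *conf, int idx)
{
	assert(idx >= 0 && idx < BARRIER_BUCKETS_NR);
	atomic_fetch_sub(&conf->nr_queued[idx], 1);
}

/* Read the current value of one bucket's queued counter. */
static int queued_read(struct conf_sketch *conf, int idx)
{
	assert(idx >= 0 && idx < BARRIER_BUCKETS_NR);
	return atomic_load(&conf->nr_queued[idx]);
}
```

Because each bucket has its own counter, updates to different buckets never touch the same memory location, and updates to the same bucket are serialized by the atomic operation rather than by a device-wide lock.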
      Changelog
      V4:
      - Add alloc_r1bio() to remove redundant r1bio memory allocation code.
      - Fix many typos in patch comments.
      - Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
      V3:
      - Rebase the patch against latest upstream kernel code.
      - Many fixes by review comments from Neil,
        - Go back to using pointers instead of arrays in struct r1conf
        - Remove total_barriers from struct r1conf
        - Add more patch comments to explain how/why the values of
          BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
        - Use get_unqueued_pending() to replace get_all_pendings() and
          get_all_queued()
        - Increase bucket number from 512 to 1024
      - Change code comments format by review from Shaohua.
      V2:
      - Use bio_split() to split the original bio if it crosses a barrier unit
        boundary, to make the code simpler, by suggestion from Shaohua and
        Neil.
      - Use hash_long() to replace the original linear hash, to avoid a possible
        conflict between resync I/O and sequential write I/O, by suggestion from
        Shaohua.
      - Add conf->total_barriers to record barrier depth, which is used to
        control the number of parallel sync I/O barriers, by suggestion from
        Shaohua.
      - In the V1 patch the following barrier bucket related members in r1conf
        are allocated in a memory page. To make the code simpler, the V2 patch
        moves the memory space into struct r1conf, like this,
              -       int                     nr_pending;
              -       int                     nr_waiting;
              -       int                     nr_queued;
              -       int                     barrier;
              +       int                     nr_pending[BARRIER_BUCKETS_NR];
              +       int                     nr_waiting[BARRIER_BUCKETS_NR];
              +       int                     nr_queued[BARRIER_BUCKETS_NR];
              +       int                     barrier[BARRIER_BUCKETS_NR];
        This change is by the suggestion from Shaohua.
      - Remove some irrelevant code comments, by suggestion from Guoqing.
      - Add a missing wait_barrier() before jumping to retry_write, in
        raid1_make_write_request().
      V1:
      - Original RFC patch for comments
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Guoqing Jiang <gqjiang@suse.com>
      Reviewed-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      fd76863e
  11. 17 Feb, 2017 1 commit
    • M
      dm: flush queued bios when process blocks to avoid deadlock · d67a5f4b
      Mikulas Patocka committed
      Commit df2cb6da ("block: Avoid deadlocks with bio allocation by
      stacking drivers") created a workqueue for every bio set and code
      in bio_alloc_bioset() that tries to resolve some low-memory deadlocks
      by redirecting bios queued on current->bio_list to the workqueue if the
      system is low on memory.  However other deadlocks (see below **) may
      happen, without any low memory condition, because generic_make_request
      is queuing bios to current->bio_list (rather than submitting them).
      
      ** the related dm-snapshot deadlock is detailed here:
      https://www.redhat.com/archives/dm-devel/2016-July/msg00065.html
      
      Fix this deadlock by redirecting any bios on current->bio_list to the
      bio_set's rescue workqueue on every schedule() call.  Consequently,
      when the process blocks on a mutex, the bios queued on
      current->bio_list are dispatched to independent workqueues and they can
      complete without waiting for the mutex to be available.
      
      The structure blk_plug contains an entry cb_list and this list can contain
      arbitrary callback functions that are called when the process blocks.
      To implement this fix DM (ab)uses the onstack plug's cb_list interface
      to get its flush_current_bio_list() called at schedule() time.
      
      This fixes the snapshot deadlock - if the map method blocks,
      flush_current_bio_list() will be called and it redirects bios waiting
      on current->bio_list to appropriate workqueues.
      
      Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
      Depends-on: df2cb6da ("block: Avoid deadlocks with bio allocation by stacking drivers")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      d67a5f4b