1. 17 Mar, 2017 9 commits
    •
      md: superblock changes for PPL · ea0213e0
      Artur Paszkiewicz committed
      Include information about PPL location and size in mdp_superblock_1
      and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
      put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
      'feature_map', analogous to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
      to mddev->flags to indicate that PPL is enabled on an array.
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      ea0213e0
    •
      md/r5cache: improve recovery with read ahead page pool · effe6ee7
      Song Liu committed
      In r5cache recovery, the journal device is scanned page by page.
      Currently, we use sync_page_io() to read the journal device. This is
      not efficient when we have to recover many stripes from the journal.

      To improve the speed of recovery, this patch introduces a read ahead
      page pool (ra_pool) to recovery_ctx. With ra_pool, multiple consecutive
      pages are read in one IO. Then the recovery code reads the journal from
      ra_pool.
      
      With ra_pool, r5l_recovery_ctx has become much bigger. Therefore,
      r5l_recovery_log() is refactored so r5l_recovery_ctx is not using
      stack space.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      effe6ee7
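The read ahead idea in the commit above can be modelled in user space. This is a minimal Python sketch, not the kernel code: `JournalDevice`, `ReadAheadPool`, `PAGE_SIZE` and the pool size are hypothetical names and values chosen for illustration. It only shows how serving page reads from a pool filled by one larger read reduces the number of IOs.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class JournalDevice:
    """In-memory stand-in for the journal device; counts read IOs."""

    def __init__(self, data: bytes):
        self.data = data
        self.io_count = 0

    def read(self, offset: int, length: int) -> bytes:
        self.io_count += 1
        return self.data[offset:offset + length]


class ReadAheadPool:
    """Serves single-page reads from a buffer filled by one larger read."""

    def __init__(self, dev: JournalDevice, pool_pages: int = 256):
        self.dev = dev
        self.pool_pages = pool_pages  # hypothetical pool size
        self.start_page = None        # first page currently held in the pool
        self.buf = b""

    def read_page(self, page_no: int) -> bytes:
        valid = len(self.buf) // PAGE_SIZE
        hit = (self.start_page is not None
               and self.start_page <= page_no < self.start_page + valid)
        if not hit:
            # Miss: refill the pool with consecutive pages in a single IO.
            self.buf = self.dev.read(page_no * PAGE_SIZE,
                                     self.pool_pages * PAGE_SIZE)
            self.start_page = page_no
        off = (page_no - self.start_page) * PAGE_SIZE
        return self.buf[off:off + PAGE_SIZE]
```

Scanning eight pages with a four-page pool issues two reads instead of eight; the page-by-page scan itself is unchanged.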
    •
      md/raid5: sort bios · aaf9f12e
      Shaohua Li committed
      Previous patch (raid5: only dispatch IO from raid5d for harddisk raid)
      defers IO dispatching. The goal is to create a better IO pattern. At that
      time, we didn't sort the deferred IO and hoped the block layer could do IO
      merge and sort. Now the raid5-cache writeback could create a large number
      of bios. And if we enable multi-thread for stripe handling, we can't
      control when to dispatch IO to raid disks. Much of the time, we are
      dispatching IO which the block layer can't merge effectively.

      This patch goes further in deferring IO dispatching. We accumulate
      bios, but we don't dispatch all the bios after a threshold is met. This
      'dispatch partial portion of bios' strategy allows bios arriving in a
      large time window to be sent to disks together. At dispatch time,
      there is a good chance the block layer can merge the bios. To make this
      more effective, we dispatch IO in ascending order. This increases
      the chance of request merging and reduces disk seeks.
      Signed-off-by: Shaohua Li <shli@fb.com>
      aaf9f12e
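The "accumulate, sort, dispatch only part" strategy described above can be sketched as a small user-space model. The class name, the `threshold` and `batch` values, and the choice to dispatch the lowest-sector bios first are all assumptions made for illustration; the commit message does not give the kernel's actual constants or selection policy.

```python
class PendingIO:
    """Toy model of deferred, sorted, partial IO dispatch."""

    def __init__(self, threshold: int = 8, batch: int = 4):
        self.threshold = threshold  # hypothetical: start dispatching at this many bios
        self.batch = batch          # hypothetical: bios sent per partial dispatch
        self.pending = []           # list of (sector, payload) tuples
        self.dispatched = []        # one list of sectors per dispatch call

    def defer(self, sector: int, payload=None) -> None:
        self.pending.append((sector, payload))
        if len(self.pending) >= self.threshold:
            # Only part of the queue is sent; the rest keeps accumulating.
            self.dispatch(self.batch)

    def dispatch(self, count: int):
        # Ascending sector order gives the lower layer a better chance to merge.
        self.pending.sort(key=lambda bio: bio[0])
        batch, self.pending = self.pending[:count], self.pending[count:]
        self.dispatched.append([sector for sector, _ in batch])
        return batch

    def flush(self):
        """Dispatch everything that is still pending."""
        return self.dispatch(len(self.pending))
```

With `threshold=4` and `batch=2`, deferring sectors 40, 8, 32, 16 dispatches 8 and 16 together and leaves 32 and 40 queued, so a later bio at sector 24 can still be sorted in ahead of them.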
    •
      md/raid5-cache: bump flush stripe batch size · 84890c03
      Shaohua Li committed
      Bump the flush stripe batch size to 2048. For my 12-disk raid
      array, the stripes take:
      12 * 4k * 2048 = 96MB

      This is still quite small. A hardware raid card generally has a 1GB
      cache, and we suggest the raid5-cache have a similar cache size.

      The advantage of a big batch size is that we can dispatch a lot of IO at
      the same time, then do some scheduling to create a better IO pattern.

      The last patch prioritizes stripes, so we don't need to worry that a big
      flush stripe batch will starve normal stripes.
      Signed-off-by: Shaohua Li <shli@fb.com>
      84890c03
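The 96MB figure above can be checked with a few lines of Python. The function name is hypothetical, and the sketch assumes "4k" means 4096 bytes per disk per stripe and "MB" means 2^20 bytes.

```python
MIB = 1024 * 1024


def flush_batch_bytes(disks: int, page_bytes: int, batch_stripes: int) -> int:
    """Memory covered by one flush batch: one page per disk per stripe."""
    return disks * page_bytes * batch_stripes


# 12 disks * 4 KiB * 2048 stripes, expressed in MiB.
batch_mib = flush_batch_bytes(12, 4096, 2048) / MIB
# Fraction of a 1 GiB cache that one batch occupies.
fraction_of_1gib = flush_batch_bytes(12, 4096, 2048) / (1024 * MIB)
```

Under these assumptions one batch is 96 MiB, a little under a tenth of a 1GB cache, which matches the message's remark that the batch is still quite small.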
    •
      md/raid5: prioritize stripes for writeback · 535ae4eb
      Shaohua Li committed
      In raid5-cache writeback mode, we have two types of stripes to handle.
      - stripes which aren't cached yet
      - stripes which are cached and flushing out to raid disks
      
      The upper layer is generally more sensitive to the latency of the first
      type of stripes. But we have only one handle list for all these stripes,
      where the two types of stripes are mixed together. When reclaim flushes a
      lot of stripes, the first type of stripes could be noticeably delayed. On
      the other hand, if the log space is tight, we'd like to handle the second
      type of stripes faster and free log space.

      This patch distinguishes the two types of stripes. They are added to
      different handle lists. When we try to get a stripe to handle, we prefer
      the first type of stripes unless log space is tight.

      This should have no impact on the !writeback case.
      Signed-off-by: Shaohua Li <shli@fb.com>
      535ae4eb
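The selection rule described above (prefer not-yet-cached stripes unless log space is tight) can be sketched with two queues. `HandleLists` and its method names are hypothetical, and falling back to the other list when the preferred one is empty is an assumption; the commit message only states the preference.

```python
from collections import deque


class HandleLists:
    """Toy model of two handle lists with a log-space-dependent preference."""

    def __init__(self):
        self.normal = deque()     # stripes which aren't cached yet
        self.writeback = deque()  # cached stripes being flushed to raid disks

    def add(self, stripe, cached: bool) -> None:
        (self.writeback if cached else self.normal).append(stripe)

    def next_stripe(self, log_space_tight: bool):
        # Prefer uncached stripes for latency, unless the log needs space back.
        if log_space_tight:
            order = (self.writeback, self.normal)
        else:
            order = (self.normal, self.writeback)
        for handle_list in order:
            if handle_list:
                return handle_list.popleft()
        return None  # nothing to handle
```

Keeping two lists means a large flush batch queued on `writeback` cannot delay a stripe on `normal`, while a tight log simply flips the preference.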
    •
      md-cluster: add the support for resize · 818da59f
      Guoqing Jiang committed
      To update the size of a cluster raid, we need to make
      sure all nodes can perform the change successfully.
      However, it is possible that some of them can't do
      it due to failure (bitmap_resize could fail). So
      we need to consider this before we set the
      capacity unconditionally, and we use the steps below
      to perform a sanity check.

      1. A changes the size, then broadcasts a METADATA_UPDATED
         msg.
      2. B and C receive METADATA_UPDATED and change the size,
         except that they don't call set_capacity; sync_size is
         not updated if the change failed. They also call
         bitmap_update_sb to sync the sb to disk.
      3. A checks the other nodes' sync_size. If sync_size has
         been updated on all nodes, it sends a CHANGE_CAPACITY
         msg; otherwise it sends a msg to revert the previous change.
      4. B and C call set_capacity if they receive the CHANGE_CAPACITY
         msg; otherwise pers->resize will be called to restore
         the old value.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      818da59f
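The four steps in the commit above amount to a two-phase change: every node records the new size first, and capacity is only exposed once all of them agree. The following Python model is a sketch under that reading; `Node`, `cluster_resize` and the `can_resize` flag (standing in for a possible bitmap_resize failure) are hypothetical, and real message passing is replaced by direct method calls.

```python
class Node:
    """Toy cluster node: sync_size is recorded first, capacity exposed later."""

    def __init__(self, name: str, size: int, can_resize: bool = True):
        self.name = name
        self.sync_size = size         # size recorded in this node's bitmap
        self.capacity = size          # size visible to users (set_capacity)
        self.can_resize = can_resize  # stands in for bitmap_resize succeeding

    def on_metadata_updated(self, new_size: int) -> None:
        # Step 2: change the size without setting capacity; on failure
        # sync_size is left untouched.
        if self.can_resize:
            self.sync_size = new_size

    def on_change_capacity(self) -> None:
        # Step 4, success path.
        self.capacity = self.sync_size

    def on_revert(self, old_size: int) -> None:
        # Step 4, failure path: restore the old value.
        self.sync_size = old_size


def cluster_resize(initiator: Node, others: list, new_size: int) -> bool:
    old_size = initiator.sync_size
    initiator.sync_size = new_size                     # step 1
    for node in others:
        node.on_metadata_updated(new_size)             # step 2
    nodes = [initiator] + others
    if all(n.sync_size == new_size for n in others):   # step 3
        for node in nodes:
            node.on_change_capacity()
        return True
    for node in nodes:
        node.on_revert(old_size)
    return False
```

If any node fails to record the new size, no node ever exposes the new capacity, which is the point of the sanity check.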
    •
      md-cluster: introduce cluster_check_sync_size · b98938d1
      Guoqing Jiang committed
      Supporting resize is a little complex for clustered
      raid, since we need to ensure all the nodes share
      the same knowledge about the size of the raid.

      We achieve this by checking the sync_size which
      is in each node's bitmap; we can only change the
      capacity after cluster_check_sync_size returns 0.

      Also, get_bitmap_from_slot is added to get a slot's
      bitmap. And we exported some funcs since they are
      used in cluster_check_sync_size().

      We can also reuse get_bitmap_from_slot to remove
      redundant code in bitmap_copy_from_slot.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      b98938d1
    •
      md-cluster: add CHANGE_CAPACITY message type · 7da3d203
      Guoqing Jiang committed
      The msg type CHANGE_CAPACITY is introduced to support
      resizing clustered raid in a later patch. It is sent
      after all the nodes have the same sync_size; the receiver
      node just needs to set the new capacity once it receives this
      msg.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      7da3d203
    •
      md-cluster: use sync way to handle METADATA_UPDATED msg · 0ba95977
      Guoqing Jiang committed
      Previously, when a node received a METADATA_UPDATED msg, it just
      needed to wake up mddev->thread, then md_reload_sb would be called
      eventually.

      We took the asynchronous way to avoid a deadlock, which
      could happen when one node is receiving the
      METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
      the path:
      
      md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                        -> md_update_sb-metadata_update_start
      		     (want EX on token however token is
      		      got by the sending node)
      
      Since we will support resizing for clustered raid, and we
      need the metadata update handling to be synchronous so that
      the initiating node can detect failure, we need to change
      the way the METADATA_UPDATED msg is handled.

      But we obviously need to avoid the above deadlock with the
      sync way. To make this happen, we considered not holding
      reconfig_mutex when calling md_reload_sb: if some other thread
      has already taken reconfig_mutex and is waiting for the 'token',
      then process_recvd_msg() can safely call md_reload_sb()
      without taking the mutex. This is because we can be certain
      that no other thread will take the mutex, and we are also certain
      that the actions performed by md_reload_sb() won't interfere
      with anything that the other thread is in the middle of.
      
      To make this more concrete, we added a new cinfo->state bit
              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      
      which is set in lock_token() just before dlm_lock_sync() is
      called, and cleared just after. As lock_token() is always
      called with reconfig_mutex held (the one exception is
      resync_info_update, which was handled separately in the previous
      patch), if process_recvd_msg() finds that the new bit is set,
      then the mutex must be held by some other thread, and it will
      keep waiting.
      
      So process_metadata_update() can call md_reload_sb() if either
      mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      is set. The tricky bit is what to do if neither of these applies.
      We need to wait. Fortunately mddev_unlock() always calls wake_up()
      on mddev->thread->wqueue. So we can get lock_token() to call
      wake_up() on that when it sets the bit.
      
      There are also some related changes inside this commit:
      1. remove RELOAD_SB related code since it is not valid anymore.
      2. mddev is added into md_cluster_info so that we can get mddev inside
         lock_token.
      3. add a new parameter for lock_token to distinguish whether
         reconfig_mutex is held or not.
      
      And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in the cases below:
      1. set it before unregistering the thread, otherwise a deadlock could
         appear when stopping a resyncing array.
         This is because md_unregister_thread(&cinfo->recv_thread) is
         blocked by recv_daemon -> process_recvd_msg
      			  -> process_metadata_update.
         To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
         also needs to be set before unregistering the thread.
      2. set it in metadata_update_start to fix another deadlock.
      	a. Node A sends METADATA_UPDATED msg (held Token lock).
      	b. Node B wants to do resync, and is blocked since it can't
      	   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
      	   not set since the callchain
      	   (md_do_sync -> sync_request
              	       -> resync_info_update
      		       -> sendmsg
      		       -> lock_comm -> lock_token)
      	   doesn't hold reconfig_mutex.
      	c. Node B tries to update sb (holding reconfig_mutex), but stops
      	   at wait_event() in metadata_update_start since we have set
      	   MD_CLUSTER_SEND_LOCK flag in lock_comm (step b).
      	d. Then Node B receives METADATA_UPDATED msg from A, of course
      	   recv_daemon is blocked forever.
         Since metadata_update_start always calls lock_token with reconfig_mutex held,
         we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
         lock_token doesn't need to set it twice unless lock_token is invoked from
         lock_comm.
      
      Finally, thanks to Neil for his great idea and help!
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      0ba95977
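The rule stated in the commit above (reload the superblock if the trylock succeeds or if the holder has announced that it is waiting for the token, otherwise wait) can be modelled with a single decision function. This is a user-space sketch only: `ClusterInfo`, `try_process_metadata_update` and the `reload_sb` callback are hypothetical stand-ins, a `threading.Lock` plays the role of reconfig_mutex, and a plain boolean plays the role of the MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD bit.

```python
import threading


class ClusterInfo:
    """Toy stand-in for the state consulted by the receive path."""

    def __init__(self):
        self.reconfig_mutex = threading.Lock()
        # Models the MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD state bit.
        self.holding_mutex_for_recvd = False


def try_process_metadata_update(cinfo: ClusterInfo, reload_sb) -> bool:
    """Return True if reload_sb ran, False if the caller must wait and retry."""
    if cinfo.reconfig_mutex.acquire(blocking=False):
        # Case 1: nobody holds the mutex, so take it for the reload.
        try:
            reload_sb()
        finally:
            cinfo.reconfig_mutex.release()
        return True
    if cinfo.holding_mutex_for_recvd:
        # Case 2: the holder is parked waiting for the token and has said so,
        # so the reload may run without the mutex.
        reload_sb()
        return True
    # Neither applies: wait until the mutex is released or the bit is set.
    return False
```

In the model, a `False` return corresponds to sleeping on the wait queue; the commit message notes that both an unlock and setting the bit issue a wake-up, so the check is retried at the right moments.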
  2. 15 Mar, 2017 11 commits
  3. 14 Mar, 2017 5 commits
  4. 13 Mar, 2017 4 commits
  5. 12 Mar, 2017 1 commit
    •
      blk: Ensure users for current->bio_list can see the full list. · f5fe1b51
      NeilBrown committed
      Commit 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      changed current->bio_list so that it did not contain *all* of the
      queued bios, but only those submitted by the currently running
      make_request_fn.
      
      There are two places which walk the list and requeue selected bios,
      and others that check if the list is empty.  These are no longer
      correct.
      
      So redefine current->bio_list to point to an array of two lists, which
      contain all queued bios, and adjust various code to test or walk both
      lists.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Fixes: 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f5fe1b51
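The consequence of the change above for code that inspects the queued bios can be sketched as follows. This is a Python model, not the block layer: `Task`, `start_submission`, `has_queued_bios` and `all_queued_bios` are hypothetical helpers, and the assignment of roles to the two lists (index 0 for bios from the currently running make_request_fn, index 1 for the remaining queued bios) is an assumption for illustration.

```python
from collections import deque


class Task:
    """Toy stand-in for a task; bio_list is None when nothing is being submitted."""

    def __init__(self):
        self.bio_list = None


def start_submission(task: Task) -> None:
    # bio_list now refers to an array of two lists instead of a single list.
    task.bio_list = [deque(), deque()]


def has_queued_bios(task: Task) -> bool:
    # An emptiness check must look at both lists, not only the first.
    return task.bio_list is not None and any(task.bio_list)


def all_queued_bios(task: Task) -> list:
    # Walking "the list" means walking both lists.
    if task.bio_list is None:
        return []
    return [bio for bio_queue in task.bio_list for bio in bio_queue]
```

A check that only looked at the first list would report "nothing queued" while bios were still parked on the second, which is the class of bug the commit describes.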
  6. 11 Mar, 2017 7 commits
  7. 10 Mar, 2017 3 commits
    •
      net: phy: marvell: Fix double free of hwmon device · 29673983
      Andrew Lunn committed
      The hwmon temperature sensor device is registered using a devm_hwmon
      API call.  The marvell_release() would then manually free the device,
      not using a devm_hwmon API, resulting in the device being removed
      twice, leading to a crash in kernfs_find_ns() during the second
      removal.
      
      Remove the manual removal, which makes marvell_release() empty, so
      remove it as well.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Fixes: 0b04680f ("phy: marvell: Add support for temperature sensor")
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29673983
    •
      powerpc/pmac: Fix crash in dma-mapping.h with NULL dma_ops · 46f401c4
      Larry Finger committed
      Commit 5657933d ("treewide: Move dma_ops from struct dev_archdata
      into struct device") introduced a crash for macio devices, an example
      backtrace being:
      
        kernel BUG at ./include/linux/dma-mapping.h:465!
        Oops: Exception in kernel mode, sig: 5 [#1]
        ...
        NIP [c031ddb0] dmam_alloc_coherent+0x74/0x140
        LR [c031de70] dmam_alloc_coherent+0x134/0x140
        Call Trace:
         dmam_alloc_coherent+0x134/0x140 (unreliable)
         pata_macio_port_start+0x3c/0x8c
         ata_host_start.part.5+0xfc/0x208
         ata_host_activate+0x128/0x154
         pata_macio_common_init+0x2f0/0x538
         pata_macio_attach+0xd8/0x180
         macio_device_probe+0x5c/0xec
         driver_probe_device+0x21c/0x314
         __driver_attach+0xcc/0xd0
         bus_for_each_dev+0x68/0xb4
         bus_add_driver+0x1dc/0x244
         driver_register+0x88/0x130
         pata_macio_init+0x5c/0x88
         do_one_initcall+0x40/0x170
         kernel_init_freeable+0x134/0x1d0
         kernel_init+0x18/0x110
         ret_from_kernel_thread+0x5c/0x64
      
      This was caused by the device having NULL dma_ops, triggering the
      BUG_ON(). Previously the device inherited its dma_ops via the assignment
      to dev->ofdev.dev.archdata. However after commit 5657933d the
      dma_ops are moved into dev->ofdev.dev, and so they need to be explicitly
      copied.
      
      Fixes: 5657933d ("treewide: Move dma_ops from struct dev_archdata into struct device")
      Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
      Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      [mpe: Rewrite change log, add backtrace]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      46f401c4
    •
      net: bcmgenet: decouple flow control from bcmgenet_tx_reclaim · 6d22fe14
      Doug Berger committed
      The bcmgenet_tx_reclaim() function is used to reclaim transmit
      resources in different places within the driver.  Most of them
      should not affect the state of the transmit flow control.
      
      This commit relocates the logic for waking tx queues based on
      freed resources to the napi polling function where it is more
      appropriate.
      
      Fixes: 1c1008c7 ("net: bcmgenet: add main driver file")
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6d22fe14