1. 28 8月, 2013 3 次提交
    • S
      raid5: offload stripe handle to workqueue · 851c30c9
      Shaohua Li 提交于
      This is another attempt to create multiple threads to handle raid5 stripes.
      This time I use workqueue.
      
      raid5 handles request (especially write) in stripe unit. A stripe is page size
      aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
      state machine for the corresponding stripe, which includes reading some disks
      of the stripe, calculating parity, and writing some disks of the stripe. The
      state machine is running in raid5d thread currently. Since there is only one
      thread, it doesn't scale well for high speed storage. An obvious solution is
      multi-threading.
      
      To get better performance, we have some requirements:
      a. locality. stripe corresponding to request submitted from one cpu is better
      handled in thread in local cpu or local node. local cpu is preferred but some
      times could be a bottleneck, for example, parity calculation is too heavy.
      local node running has wide adaptability.
      b. configurablity. Different setup of raid5 array might need diffent
      configuration. Especially the thread number. More threads don't always mean
      better performance because of lock contentions.
      
      My original implementation is creating some kernel threads. There are
      interfaces to control which cpu's stripe each thread should handle. And
      userspace can set affinity of the threads. This provides biggest flexibility
      and configurability. But it's hard to use and apparently a new thread pool
      implementation is disfavor.
      
      Recent workqueue improvement is quite promising. unbound workqueue will be
      bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
      do affinity setting. For example, we can only include one HT sibling in
      affinity. Since work is non-reentrant by default, and we can control running
      thread number by limiting dispatched work_struct number.
      
      In this patch, I created several stripe worker group. A group is a numa node.
      stripes from cpus of one node will be added to a group list. Workqueue thread
      of one node will only handle stripes of worker group of the node. In this way,
      stripe handling has numa node locality. And as I said, we can control thread
      number by limiting dispatched work_struct number.
      
      The work_struct callback function handles several stripes in one run. A typical
      work queue usage is to run one unit in each work_struct. In raid5 case, the
      unit is a stripe. But we can't do that:
      a. Though handling a stripe doesn't need lock because of reference accounting
      and stripe isn't in any list, queuing a work_struct for each stripe will make
      workqueue lock contended very heavily.
      b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
      might dispatch request. If each work_struct only handles one stripe, such block
      plug is meaningless.
      
      This implementation can't do very fine grained configuration. But the numa
      binding is most popular usage model, should be enough for most workloads.
      
      Note: since we have only one stripe queue, switching to multi-thread might
      decrease request size dispatching down to low level layer. The impact depends
      on thread number, raid configuration and workload. So multi-thread raid5 might
      not be proper for all setups.
      
      Changes V1 -> V2:
      1. remove WQ_NON_REENTRANT
      2. disabling multi-threading by default
      3. Add more descriptions in changelog
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      851c30c9
    • S
      raid5: fix stripe release order · d265d9dc
      Shaohua Li 提交于
      patch "make release_stripe lockless" changes the order stripes are released.
      Originally I thought block layer can take care of request merge, but it appears
      there are still some requests not merged. It's easy to fix the order.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d265d9dc
    • S
      raid5: make release_stripe lockless · 773ca82f
      Shaohua Li 提交于
      release_stripe still has big lock contention. We just add the stripe to a llist
      without taking device_lock. We let the raid5d thread to do the real stripe
      release, which must hold device_lock anyway. In this way, release_stripe
      doesn't hold any locks.
      
      The side effect is the released stripes order is changed. But sounds not a big
      deal, stripes are never handled in order. And I thought block layer can already
      do nice request merge, which means order isn't that important.
      
      I kept the unplug release batch, which is unnecessary with this patch from lock
      contention avoid point of view, and actually if we delete it, the stripe_head
      release_list and lru can share storage. But the unplug release batch is also
      helpful for request merge. We probably can delay wakeup raid5d till unplug, but
      I'm still afraid of the case which raid5d is running.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      773ca82f
  2. 27 8月, 2013 7 次提交
    • N
      md: avoid deadlock when dirty buffers during md_stop. · 260fa034
      NeilBrown 提交于
      When the last process closes /dev/mdX sync_blockdev will be called so
      that all buffers get flushed.
      So if it is then opened for the STOP_ARRAY ioctl to be sent there will
      be nothing to flush.
      
      However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
      moments before some other process which was writing closes their file
      descriptor, then there won't be a 'last close' and the buffers might
      not get flushed.
      
      So do_md_stop() calls sync_blockdev().  However at this point it is
      holding ->reconfig_mutex.  So if the array is currently 'clean' then
      the writes from sync_blockdev() will not complete until the array
      can be marked dirty and that won't happen until some other thread
      can get ->reconfig_mutex.  So we deadlock.
      
      We need to move the sync_blockdev() call to before we take
      ->reconfig_mutex.
      However then some other thread could open /dev/mdX and write to it
      after we call sync_blockdev() and before we actually stop the array.
      This can leave dirty data in the page cache which is awkward.
      
      So introduce new flag MD_STILL_CLOSED.  Set it before calling
      sync_blockdev(), clear it if anyone does open the file, and abort the
      STOP_ARRAY attempt if it gets set before we lock against further
      opens.
      
      It is still possible to get problems if you open /dev/mdX, write to
      it, then issue the STOP_ARRAY ioctl.  Just don't do that.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      260fa034
    • N
      md: Don't test all of mddev->flags at once. · 7a0a5355
      NeilBrown 提交于
      mddev->flags is mostly used to record if an update of the
      metadata is needed.  Sometimes the whole field is tested
      instead of just the important bits.  This makes it difficult
      to introduce more state bits.
      
      So replace all bare tests of mddev->flags with tests for the bits
      that actually need testing.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      7a0a5355
    • D
      md: Fix apparent cut-and-paste error in super_90_validate · c9ad020f
      Dave Jones 提交于
      Setting a variable to itself probably wasn't the intention here.
      Signed-off-by: NDave Jones <davej@fedoraproject.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c9ad020f
    • M
      raid6/test: replace echo -e with printf · c28399b5
      Max Filippov 提交于
      -e is a non-standard echo option, echo output is
      implementation-dependent when it is used. Replace echo -e with printf as
      suggested by POSIX echo manual.
      
      Cc: NeilBrown <neilb@suse.de>
      Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c28399b5
    • K
      RAID: add tilegx SIMD implementation of raid6 · ae77cbc1
      Ken Steele 提交于
      This change adds TILE-Gx SIMD instructions to the software raid
      (md), modeling the Altivec implementation. This is only for Syndrome
      generation; there is more that could be done to improve recovery,
      as in the recent Intel SSE3 recovery implementation.
      
      The code unrolls 8 times; this turns out to be the best on tilegx
      hardware among the set 1, 2, 4, 8 or 16.  The code reads one
      cache-line of data from each disk, stores P and Q then goes to the
      next cache-line.
      
      The test code in sys/linux/lib/raid6/test reports 2008 MB/s data
      read rate for syndrome generation using 18 disks (16 data and 2
      parity). It was 1512 MB/s before this SIMD optimizations. This is
      running on 1 core with all the data in cache.
      
      This is based on the paper The Mathematics of RAID-6.
      (http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf).
      Signed-off-by: NKen Steele <ken@tilera.com>
      Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      ae77cbc1
    • N
      md: fix safe_mode buglet. · 275c51c4
      NeilBrown 提交于
      Whe we set the safe_mode_timeout to a smaller value we trigger a timeout
      immediately - otherwise the small value might not be honoured.
      However if the previous timeout was 0 meaning "no timeout", we didn't.
      This would mean that no timeout happens until the next write completes,
      which could be a long time.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      275c51c4
    • N
      md: don't call md_allow_write in get_bitmap_file. · 60559da4
      NeilBrown 提交于
      There is no really need as GFP_NOIO is very likely sufficient,
      and failure is not catastrophic.
      
      Calling md_allow_write here will convert a read-auto array to
      read/write which could be confusing when you are just performing
      a read operation.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      60559da4
  3. 26 8月, 2013 5 次提交
  4. 25 8月, 2013 8 次提交
  5. 24 8月, 2013 13 次提交
    • L
      Merge branch 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata · 89b53e50
      Linus Torvalds 提交于
      Pull libata fixes from Tejun Heo:
       "This contains three commits all of which are updates for specific
        devices which aren't too widespread.  Pretty limited scope and nothing
        too interesting or dangerous"
      
      * 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
        sata_fsl: save irqs while coalescing
        libata: apply behavioral quirks to sil3826 PMP
        sata, highbank: fix ordering of SGPIO signals
      89b53e50
    • L
      Merge branch 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · e2982a04
      Linus Torvalds 提交于
      Pull cgroup fix from Tejun Heo:
       "A late fix for cgroup.
      
        This fixes a behavior regression visible to userland which was created
        by a commit merged during -rc1.  While the behavior change isn't too
        likely to be noticeable, the fix is relatively low risk and we'll need
        to backport it through -stable anyway if the bug gets released"
      
      * 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cpuset: fix a regression in validating config change
      e2982a04
    • L
      Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux · f07823e1
      Linus Torvalds 提交于
      Pull drm fixes from Dave Airlie:
       "Ben was on holidays for a week so a few nouveau regression fixes
        backed up, but they all seem necessary.
      
        Otherwise one i915 and one gma500 fix"
      
      * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
        gma500: Fix SDVO turning off randomly
        drm/nv04/disp: fix framebuffer pin refcounting
        drm/nouveau/mc: fix race condition between constructor and request_irq()
        drm/nouveau: fix reclocking on nv40
        drm/nouveau/ltcg: fix allocating memory as free
        drm/nouveau/ltcg: fix ltcg memory initialization after suspend
        drm/nouveau/fb: fix null derefs in nv49 and nv4e init
        drm/i915: Invalidate TLBs for the rings after a reset
      f07823e1
    • A
      usb: phy: fix build breakage · 52d5b9ab
      Anatolij Gustschin 提交于
      Commit 94ae9843 (usb: phy: rename all phy drivers to phy-$name-usb.c)
      renamed drivers/usb/phy/otg_fsm.h to drivers/usb/phy/phy-fsm-usb.h
      but changed drivers/usb/phy/phy-fsm-usb.c to include not existing
      "phy-otg-fsm.h" instead of new "phy-fsm-usb.h". This breaks building:
        ...
        drivers/usb/phy/phy-fsm-usb.c:32:25: fatal error: phy-otg-fsm.h: No such file or directory
        compilation terminated.
        make[3]: *** [drivers/usb/phy/phy-fsm-usb.o] Error 1
      
      This commit also missed to modify drivers/usb/phy/phy-fsl-usb.h
      to include new "phy-fsm-usb.h" instead of "otg_fsm.h" resulting
      in another build breakage:
        ...
        In file included from drivers/usb/phy/phy-fsl-usb.c:46:0:
        drivers/usb/phy/phy-fsl-usb.h:18:21: fatal error: otg_fsm.h: No such file or directory
        compilation terminated.
        make[3]: *** [drivers/usb/phy/phy-fsl-usb.o] Error 1
      
      Fix both issues.
      Signed-off-by: NAnatolij Gustschin <agust@denx.de>
      Cc: stable <stable@vger.kernel.org> # 3.10+
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      52d5b9ab
    • A
      USB: OHCI: add missing PCI PM callbacks to ohci-pci.c · 9a11899c
      Alan Stern 提交于
      Commit c1117afb (USB: OHCI: make ohci-pci a separate driver)
      neglected to preserve the entries for the pci_suspend and pci_resume
      driver callbacks.  As a result, OHCI controllers don't work properly
      during suspend and after hibernation.
      
      This patch adds the missing callbacks to the driver.
      Signed-off-by: NAlan Stern <stern@rowland.harvard.edu>
      Reported-and-tested-by: NSteve Cotton <steve@s.cotton.clara.co.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9a11899c
    • I
      staging: comedi: bug-fix NULL pointer dereference on failed attach · 3955dfa8
      Ian Abbott 提交于
      Commit dcd7b8bd ("staging: comedi: put
      module _after_ detach" by myself) reversed a couple of calls in
      `comedi_device_attach()` when recovering from an error returned by the
      low-level driver's 'attach' handler.  Unfortunately, that introduced a
      NULL pointer dereference bug as `dev->driver` is NULL after the call to
      `comedi_device_detach()`.   We still have a pointer to the low-level
      comedi driver structure in the `driv` variable, so use that instead.
      Signed-off-by: NIan Abbott <abbotti@mev.co.uk>
      Cc: <stable@vger.kernel.org> # 3.10+
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3955dfa8
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 41a00f79
      Linus Torvalds 提交于
      Merge networking fixes from David Miller:
      
       1) Revert Johannes Berg's genetlink locking fix, because it causes
          regressions.
      
          Johannes and Pravin Shelar are working on fixing things properly.
      
       2) Do not drop ipv6 ICMP messages without a redirected header option,
          they are legal.  From Duan Jiong.
      
       3) Missing error return propagation in probing of via-ircc driver.
          From Alexey Khoroshilov.
      
       4) Do not clear out broadcast/multicast/unicast/WOL bits in r8169 when
          initializing, from Peter Wu.
      
       5) realtek phy driver programs wrong interrupt status bit, from
          Giuseppe CAVALLARO.
      
       6) Fix statistics regression in AF_PACKET code, from Willem de Bruijn.
      
       7) Bridge code uses wrong bitmap length, from Toshiaki Makita.
      
       8) SFC driver uses wrong indexes to look up MAC filters, from Ben
          Hutchings.
      
       9) Don't pass stack buffers into usb control operations in hso driver,
          from Daniel Gimpelevich.
      
      10) Multiple ipv6 fragmentation headers in one packet is illegal and
          such packets should be dropped, from Hannes Frederic Sowa.
      
      11) When TCP sockets are "repaired" as part of checkpoint/restart, the
          timestamp field of SKBs need to be refreshed otherwise RTOs can be
          wildly off.  From Andrey Vagin.
      
      12) Fix memcpy args (uses 'address of pointer' instead of 'pointer') in
          hostp driver.  From Dan Carpenter.
      
      13) nl80211hdr_put() doesn't return an ERR_PTR, but some code believes
          it does.  From Dan Carpenter.
      
      14) Fix regression in wireless SME disconnects, from Johannes Berg.
      
      15) Don't use a stack buffer for DMA in zd1201 USB wireless driver, from
          Jussi Kivilinna.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
        ipv4: expose IPV4_DEVCONF
        ipv6: handle Redirect ICMP Message with no Redirected Header option
        be2net: fix disabling TX in be_close()
        Revert "genetlink: fix family dump race"
        hso: Fix stack corruption on some architectures
        hso: Earlier catch of error condition
        sfc: Fix lookup of default RX MAC filters when steered using ethtool
        bridge: Use the correct bit length for bitmap functions in the VLAN code
        packet: restore packet statistics tp_packets to include drops
        net: phy: rtl8211: fix interrupt on status link change
        r8169: remember WOL preferences on driver load
        via-ircc: don't return zero if via_ircc_open() failed
        macvtap: Ignore tap features when VNET_HDR is off
        macvtap: Correctly set tap features when IFF_VNET_HDR is disabled.
        macvtap: simplify usage of tap_features
        tcp: set timestamps for restored skb-s
        bnx2x: set VF DMAE when first function has 0 supported VFs
        bnx2x: Protect against VFs' ndos when SR-IOV is disabled
        bnx2x: prevent VF benign attentions
        bnx2x: Consider DCBX remote error
        ...
      41a00f79
    • L
      Merge branch 'akpm' (patches from Andrew Morton) · 3db0d4de
      Linus Torvalds 提交于
      Merge fixes from Andrew Morton:
       "A few fixes.  One is a licensing change and I don't do licensing, so
        please eyeball that one"
      
      Licensing eye-balled.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        lib/lz4: correct the LZ4 license
        memcg: get rid of swapaccount leftovers
        nilfs2: fix issue with counting number of bio requests for BIO_EOPNOTSUPP error detection
        nilfs2: remove double bio_put() in nilfs_end_bio_write() for BIO_EOPNOTSUPP error
        drivers/platform/olpc/olpc-ec.c: initialise earlier
      3db0d4de
    • R
      lib/lz4: correct the LZ4 license · ee8a99bd
      Richard Laager 提交于
      The LZ4 code is listed as using the "BSD 2-Clause License".
      Signed-off-by: NRichard Laager <rlaager@wiktel.com>
      Acked-by: NKyungsik Lee <kyungsik.lee@lge.com>
      Cc: Chanho Min <chanho.min@lge.com>
      Cc: Richard Yao <ryao@gentoo.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      [ The 2-clause BSD can be just converted into GPL, but that's rude and
        pointless, so don't do it   - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee8a99bd
    • M
      memcg: get rid of swapaccount leftovers · 07555ac1
      Michal Hocko 提交于
      The swapaccount kernel parameter without any values has been removed by
      commit a2c8990a ("memsw: remove noswapaccount kernel parameter") but
      it seems that we didn't get rid of all the left overs.
      
      Make sure that menuconfig help text and kernel-parameters.txt are clear
      about value for the paramter and remove the stalled comment which is not
      very much useful on its own.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NGergely Risko <gergely@risko.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07555ac1
    • V
      nilfs2: fix issue with counting number of bio requests for BIO_EOPNOTSUPP error detection · 4bf93b50
      Vyacheslav Dubeyko 提交于
      Fix the issue with improper counting number of flying bio requests for
      BIO_EOPNOTSUPP error detection case.
      
      The sb_nbio must be incremented exactly the same number of times as
      complete() function was called (or will be called) because
      nilfs_segbuf_wait() will call wail_for_completion() for the number of
      times set to sb_nbio:
      
        do {
            wait_for_completion(&segbuf->sb_bio_event);
        } while (--segbuf->sb_nbio > 0);
      
      Two functions complete() and wait_for_completion() must be called the
      same number of times for the same sb_bio_event.  Otherwise,
      wait_for_completion() will hang or leak.
      Signed-off-by: NVyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4bf93b50
    • V
      nilfs2: remove double bio_put() in nilfs_end_bio_write() for BIO_EOPNOTSUPP error · 2df37a19
      Vyacheslav Dubeyko 提交于
      Remove double call of bio_put() in nilfs_end_bio_write() for the case of
      BIO_EOPNOTSUPP error detection.  The issue was found by Dan Carpenter
      and he suggests first version of the fix too.
      Signed-off-by: NVyacheslav Dubeyko <slava@dubeyko.com>
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2df37a19
    • D
      drivers/platform/olpc/olpc-ec.c: initialise earlier · 93dbc1b3
      Daniel Drake 提交于
      Being a low-level component, various drivers (e.g.  olpc-battery) assume
      that it is ok to communicate with the OLPC Embedded Controller during
      probe.  Therefore the OLPC EC driver must be initialised before other
      drivers try to use it.  This was the case until it was recently moved
      out of arch/x86 and restructured around commits ac250415 ("Platform:
      OLPC: turn EC driver into a platform_driver") and 85f90cf6 ("x86:
      OLPC: switch over to using new EC driver on x86").
      
      Use arch_initcall so that olpc-ec is readied earlier, matching the
      previous behaviour.
      
      Fixes a regression introduced in Linux-3.6 where various drivers such as
      olpc-battery and olpc-xo1-sci failed to load due to an inability to
      communicate with the EC.  The user-visible effect was a lack of battery
      monitoring, missing ebook/lid switch input devices, etc.
      Signed-off-by: NDaniel Drake <dsd@laptop.org>
      Cc: Andres Salomon <dilinger@queued.net>
      Cc: Paul Fox <pgf@laptop.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      93dbc1b3
  6. 23 8月, 2013 4 次提交