1. 29 Oct 2013, 1 commit
    • blk-mq: fix for flush deadlock · 3228f48b
      Committed by Christoph Hellwig
      The flush state machine takes in a struct request, which is then
      submitted multiple times to the underlying driver.  The old block code
      reuses the same request for each of those steps, so it has no
      issue with tapping into the request pool.  The new code, on the other
      hand, allocates a new request for each of the actual steps of the
      flush sequence.  If we have already allocated all of the tags for IO,
      allocating the flush request will fail.
      
      Set aside a reserved request just for flushes.
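      
      A minimal sketch of the idea, with a hypothetical helper name (the
      exact blk-mq entry point is not shown in this log):
      
      #include <linux/blkdev.h>
      
      /* hypothetical: allocate from the tag set aside at init time */
      extern struct request *alloc_request_reserved(struct request_queue *q, int rw);
      
      static struct request *get_flush_request(struct request_queue *q)
      {
              /*
               * A reserved allocation dips into a tag excluded from the
               * normal pool, so it cannot fail merely because in-flight
               * IO holds every regular tag.
               */
              return alloc_request_reserved(q, WRITE_FLUSH);
      }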
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 25 Oct 2013, 7 commits
    • blk-mq: add blk_mq_stop_hw_queues · 280d45f6
      Committed by Christoph Hellwig
      Add a helper to iterate over all hw queues and stop them.  This is useful
      for drivers that implement PM suspend functionality.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      
      Modified to just call blk_mq_stop_hw_queue() by Jens.
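      
      Given that note, the helper most likely reduces to a loop of this
      shape (a sketch, assuming the queue_for_each_hw_ctx iterator from the
      blk-mq patch below):
      
      void blk_mq_stop_hw_queues(struct request_queue *q)
      {
              struct blk_mq_hw_ctx *hctx;
              int i;
      
              /* walk every hardware queue and stop each one */
              queue_for_each_hw_ctx(q, hctx, i)
                      blk_mq_stop_hw_queue(hctx);
      }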
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: new multi-queue block IO queueing mechanism · 320ae51f
      Committed by Jens Axboe
      Linux currently has two models for block devices:
      
      - The classic request_fn based approach, where drivers use struct
        request units for IO. The block layer provides various helper
        functionalities to let drivers share code, things like tag
        management, timeout handling, queueing, etc.
      
      - The "stacked" approach, where a driver squeezes in between the
        block layer and IO submitter. Since this bypasses the IO stack,
        driver generally have to manage everything themselves.
      
      With drivers being written for new high IOPS devices, the classic
      request_fn based driver doesn't work well enough. The design dates
      back to when both SMP and high IOPS were rare. It has problems with
      scaling to bigger machines, and runs into scaling issues even on
      smaller machines when you have IOPS in the hundreds of thousands
      per device.
      
      The stacked approach is then most often selected as the model
      for the driver. But this means that everybody has to re-invent
      everything, and along with that we get all the problems again
      that the shared approach solved.
      
      This commit introduces blk-mq, block multi queue support. The
      design is centered around per-cpu queues for queueing IO, which
      then funnel down into x number of hardware submission queues.
      We might have a 1:1 mapping between the two, or it might be
      an N:M mapping. That all depends on what the hardware supports.
      
      blk-mq provides various helper functions, which include:
      
      - Scalable support for request tagging. Most devices need to
        be able to uniquely identify a request both in the driver and
        to the hardware. The tagging uses per-cpu caches for freed
        tags, to enable cache hot reuse.
      
      - Timeout handling without tracking requests on a per-device
        basis. Basically, the driver should be able to get a notification
        if a request happens to fail.
      
      - Optional support for non 1:1 mappings between issue and
        submission queues. blk-mq can redirect IO completions to the
        desired location.
      
      - Support for per-request payloads. Drivers almost always need
        to associate a request structure with some driver private
        command structure. Drivers can tell blk-mq this at init time,
        and then any request handed to the driver will have the
        required size of memory associated with it (a sketch follows
        this list).
      
      - Support for merging of IO, and plugging. The stacked model
        gets neither of these. Even for high IOPS devices, merging
        sequential IO reduces per-command overhead and thus
        increases bandwidth.
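      
      As a hedged illustration of the per-request payload point above: early
      blk-mq placed the driver payload directly behind struct request, so an
      accessor of roughly this shape works (struct my_cmd and the helper are
      illustrative, not part of the patch):
      
      struct my_cmd {
              int status;     /* driver-private completion state */
      };
      
      /* blk-mq allocated cmd_size extra bytes behind each request */
      static inline struct my_cmd *my_cmd_from_rq(struct request *rq)
      {
              return (struct my_cmd *)(rq + 1);
      }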
      
      For now, this is provided as a potential 3rd queueing model, with
      the hope being that, as it matures, it can replace both the classic
      and stacked model. That would get us back to having just 1 real
      model for block devices, leaving the stacked approach to dm/md
      devices (as it was originally intended).
      
      Contributions in this patch from the following people:
      
      Shaohua Li <shli@fusionio.com>
      Alexander Gordeev <agordeev@redhat.com>
      Christoph Hellwig <hch@infradead.org>
      Mike Christie <michaelc@cs.wisc.edu>
      Matias Bjorling <m@bjorling.me>
      Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • percpu_ida: add an API to return free tags · 1dddc01a
      Committed by Shaohua Li
      Add an API to return free tags; blk-mq-tag will use it.
      
      Note, this just returns a snapshot of the number of free tags. blk-mq-tag
      has two usages for it. One is info output for diagnosis. The other is a
      quick check for free tags when deciding whether a dispatch can proceed.
      Neither requires a very precise value.
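      
      A hedged usage sketch, assuming the new call returns an approximate
      per-cpu count (tags->free_tags and cpu are illustrative):
      
      /* cheap, imprecise availability check before trying a dispatch */
      if (!percpu_ida_free_tags(&tags->free_tags, cpu))
              return false;   /* probably no free tag; being wrong is fine */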
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • percpu_ida: add percpu_ida_for_each_free · 7fc2ba17
      Committed by Shaohua Li
      Add a new API to iterate free ids. blk-mq-tag will use it.
      
      Note, this doesn't guarantee to iterate all free ids strictly. The caller
      should be aware of this. blk-mq uses it to sanity check request
      timeouts, so it can tolerate the limitation.
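      
      A hedged sketch of such a sanity pass, assuming a callback-style
      iterator (check_free_id and the bitmap argument are illustrative):
      
      static int check_free_id(unsigned id, void *data)
      {
              /* an id sitting on a free list cannot be a timed-out request */
              clear_bit(id, (unsigned long *)data);
              return 0;       /* non-zero would stop the iteration */
      }
      
      percpu_ida_for_each_free(&tags->free_tags, check_free_id,
                               maybe_timedout_bitmap);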
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • percpu_ida: make percpu_ida percpu size/batch configurable · e26b53d0
      Committed by Shaohua Li
      Make the percpu_ida per-cpu size/batch configurable. blk-mq-tag will
      use it.
      
      After blk-mq uses percpu_ida to manage tags, performance is improved.
      My test was done on a 2-socket machine with 12 processes spread across
      the 2 sockets, so any lock contention or IPI cost should be stressed
      heavily.  Testing was done with null-blk.
      
      hw_queue_depth    nopatch iops    patch iops
                  64        ~800k/s      ~1470k/s
                2048       ~4470k/s      ~4340k/s
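      
      A hedged sketch of how a caller might use the configurable variant,
      assuming an init function that takes explicit per-cpu cache size and
      batch parameters (the exact signature is an assumption here):
      
      /* small per-cpu caches for a small tag space, so one cpu cannot
       * hoard most of the 64 tags; defaults suit the 2048-tag case */
      err = __percpu_ida_init(&tags->free_tags, nr_tags,
                              min(nr_tags / 2, 32UL),   /* per-cpu size */
                              max(nr_tags / 16, 1UL));  /* batch move   */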
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: remove request ref_count · 71fe07d0
      Committed by Christoph Hellwig
      This reference count has been around since before git history, but the only
      place where it is used is blk_execute_rq, and there it is entirely useless:
      it is incremented before submitting the request and decremented in the
      end_io handler before waking up the submitter thread.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: make rq->cmd_flags be 64-bit · 5953316d
      Committed by Jens Axboe
      We have officially run out of flags in a 32-bit space. Extend it
      to 64-bit even on 32-bit archs.
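      
      A hedged illustration of why the width matters (REQ_EXAMPLE is
      hypothetical): flag bits past bit 31 need both a 64-bit field and
      64-bit constants.
      
      u64 cmd_flags;                       /* was: unsigned int cmd_flags; */
      
      #define REQ_EXAMPLE (1ULL << 32)     /* a 33rd flag: without the ULL
                                              suffix and the u64 field this
                                              would silently truncate to 0 */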
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 11 Oct 2013, 3 commits
  4. 04 Oct 2013, 1 commit
  5. 01 Oct 2013, 2 commits
    • skbuff: size of hole is wrong in a comment · 45906723
      Committed by Nicolas Dichtel
      Since commit c93bdd0e ("netvm: allow skb allocation to use PFMEMALLOC
      reserves"), hole size is one bit less than what is written in the comment.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mm: avoid reinserting isolated balloon pages into LRU lists · 117aad1e
      Committed by Rafael Aquini
      Isolated balloon pages can wrongly end up in LRU lists when
      migrate_pages() finishes its round without draining all the isolated
      page list.
      
      The same issue can happen when reclaim_clean_pages_from_list() tries to
      reclaim pages from an isolated page list, before migration, in the CMA
      path.  Such balloon page leak opens a race window against LRU lists
      shrinkers that leads us to the following kernel panic:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
        IP: [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
        PGD 3cda2067 PUD 3d713067 PMD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        RIP: shrink_page_list+0x24e/0x897
        RSP: 0000:ffff88003da499b8  EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
        RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
        RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
        R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
        R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
        FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
        Call Trace:
          shrink_inactive_list+0x240/0x3de
          shrink_lruvec+0x3e0/0x566
          __shrink_zone+0x94/0x178
          shrink_zone+0x3a/0x82
          balance_pgdat+0x32a/0x4c2
          kswapd+0x2f0/0x372
          kthread+0xa2/0xaa
          ret_from_fork+0x7c/0xb0
        Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 <48> 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
        RIP  [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
         RSP <ffff88003da499b8>
        CR2: 0000000000000028
        ---[ end trace 703d2451af6ffbfd ]---
        Kernel panic - not syncing: Fatal exception
      
      This patch fixes the issue by ensuring the proper tests are made at
      putback_movable_pages() & reclaim_clean_pages_from_list() so that
      isolated balloon pages are not wrongly reinserted into LRU lists.
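      
      A hedged sketch of the kind of test added, assuming the
      balloon-compaction helpers of that era (the surrounding loop is
      illustrative):
      
      /* in the putback loop: an isolated balloon page must return to the
       * balloon list, never to an LRU list */
      if (unlikely(isolated_balloon_page(page))) {
              balloon_page_putback(page);
              continue;
      }
      putback_lru_page(page);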
      
      [akpm@linux-foundation.org: clarify awkward comment text]
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
      Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 29 Sep 2013, 1 commit
  7. 28 Sep 2013, 1 commit
  8. 27 Sep 2013, 2 commits
    • Drivers: hv: util: Correctly support ws2008R2 and earlier · 3a491605
      Committed by K. Y. Srinivasan
      The current code does not correctly negotiate the version numbers for the
      util driver when running on earlier hosts. The version numbers presented
      by this driver were not compatible with those supported by Windows Server
      2008. Fix this problem.
      
      I would like to thank Olaf Hering (ohering@suse.com) for identifying the problem.
      Reported-by: Olaf Hering <ohering@suse.com>
      Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bcma: make bcma_core_pci_{up,down}() callable from atomic context · 2bedea8f
      Committed by Arend van Spriel
      This patch removes the bcma_core_pci_power_save() call from
      the bcma_core_pci_{up,down}() functions, as it may sleep and thus
      requires calling from non-atomic context. The function
      bcma_core_pci_power_save() is now exported so the calling module
      can explicitly use it in non-atomic context. This fixes the
      'scheduling while atomic' issue reported by Tod Jackson and
      Joe Perches.
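      
      A hedged usage sketch of the resulting split (the power_save signature
      is assumed from context):
      
      /* process context, may sleep: do the MDIO power-save dance first */
      bcma_core_pci_power_save(bus, true);
      
      /* safe in atomic context now that the sleeping part is split out */
      bcma_core_pci_up(bus);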
      
      [   13.210710] BUG: scheduling while atomic: dhcpcd/1800/0x00000202
      [   13.210718] Modules linked in: brcmsmac nouveau coretemp kvm_intel kvm cordic brcmutil bcma dell_wmi atl1c ttm mxm_wmi wmi
      [   13.210756] CPU: 2 PID: 1800 Comm: dhcpcd Not tainted 3.11.0-wl #1
      [   13.210762] Hardware name: Alienware M11x R2/M11x R2, BIOS A04 11/23/2010
      [   13.210767]  ffff880177c92c40 ffff880170fd1948 ffffffff8169af5b 0000000000000007
      [   13.210777]  ffff880170fd1ab0 ffff880170fd1958 ffffffff81697ee2 ffff880170fd19d8
      [   13.210785]  ffffffff816a19f5 00000000000f4240 000000000000d080 ffff880170fd1fd8
      [   13.210794] Call Trace:
      [   13.210813]  [<ffffffff8169af5b>] dump_stack+0x4f/0x84
      [   13.210826]  [<ffffffff81697ee2>] __schedule_bug+0x43/0x51
      [   13.210837]  [<ffffffff816a19f5>] __schedule+0x6e5/0x810
      [   13.210845]  [<ffffffff816a1c34>] schedule+0x24/0x70
      [   13.210855]  [<ffffffff816a04fc>] schedule_hrtimeout_range_clock+0x10c/0x150
      [   13.210867]  [<ffffffff810684e0>] ? update_rmtp+0x60/0x60
      [   13.210877]  [<ffffffff8106915f>] ? hrtimer_start_range_ns+0xf/0x20
      [   13.210887]  [<ffffffff816a054e>] schedule_hrtimeout_range+0xe/0x10
      [   13.210897]  [<ffffffff8104f6fb>] usleep_range+0x3b/0x40
      [   13.210910]  [<ffffffffa00371af>] bcma_pcie_mdio_set_phy.isra.3+0x4f/0x80 [bcma]
      [   13.210921]  [<ffffffffa003729f>] bcma_pcie_mdio_write.isra.4+0xbf/0xd0 [bcma]
      [   13.210932]  [<ffffffffa0037498>] bcma_pcie_mdio_writeread.isra.6.constprop.13+0x18/0x30 [bcma]
      [   13.210942]  [<ffffffffa00374ee>] bcma_core_pci_power_save+0x3e/0x80 [bcma]
      [   13.210953]  [<ffffffffa003765d>] bcma_core_pci_up+0x2d/0x60 [bcma]
      [   13.210975]  [<ffffffffa03dc17c>] brcms_c_up+0xfc/0x430 [brcmsmac]
      [   13.210989]  [<ffffffffa03d1a7d>] brcms_up+0x1d/0x20 [brcmsmac]
      [   13.211003]  [<ffffffffa03d2498>] brcms_ops_start+0x298/0x340 [brcmsmac]
      [   13.211020]  [<ffffffff81600a12>] ? cfg80211_netdev_notifier_call+0xd2/0x5f0
      [   13.211030]  [<ffffffff815fa53d>] ? packet_notifier+0xad/0x1d0
      [   13.211064]  [<ffffffff81656e75>] ieee80211_do_open+0x325/0xf80
      [   13.211076]  [<ffffffff8106ac09>] ? __raw_notifier_call_chain+0x9/0x10
      [   13.211086]  [<ffffffff81657b41>] ieee80211_open+0x71/0x80
      [   13.211101]  [<ffffffff81526267>] __dev_open+0x87/0xe0
      [   13.211109]  [<ffffffff8152650c>] __dev_change_flags+0x9c/0x180
      [   13.211117]  [<ffffffff815266a3>] dev_change_flags+0x23/0x70
      [   13.211127]  [<ffffffff8158cd68>] devinet_ioctl+0x5b8/0x6a0
      [   13.211136]  [<ffffffff8158d5c5>] inet_ioctl+0x75/0x90
      [   13.211147]  [<ffffffff8150b38b>] sock_do_ioctl+0x2b/0x70
      [   13.211155]  [<ffffffff8150b681>] sock_ioctl+0x71/0x2a0
      [   13.211169]  [<ffffffff8114ed47>] do_vfs_ioctl+0x87/0x520
      [   13.211180]  [<ffffffff8113f159>] ? ____fput+0x9/0x10
      [   13.211198]  [<ffffffff8106228c>] ? task_work_run+0x9c/0xd0
      [   13.211202]  [<ffffffff8114f271>] SyS_ioctl+0x91/0xb0
      [   13.211208]  [<ffffffff816aa252>] system_call_fastpath+0x16/0x1b
      [   13.211217] NOHZ: local_softirq_pending 202
      
      The issue was introduced in the v3.11 kernel by the following commit:
      
      commit aa51e598
      Author: Hauke Mehrtens <hauke@hauke-m.de>
      Date:   Sat Aug 24 00:32:31 2013 +0200
      
          brcmsmac: use bcma PCIe up and down functions
      
          replace the calls to bcma_core_pci_extend_L1timer() by calls to the
          newly introduced bcma_core_pci_up() and bcma_core_pci_down()
          Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
          Cc: Arend van Spriel <arend@broadcom.com>
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      
      This fix has been discussed with Hauke Mehrtens [1] (selecting
      option 3) and is intended for v3.12.
      
      Ref:
      [1] http://mid.gmane.org/5239B12D.3040206@hauke-m.de
      
      Cc: <stable@vger.kernel.org> # 3.11.x
      Cc: Tod Jackson <tod.jackson@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Rafal Milecki <zajec5@gmail.com>
      Cc: Hauke Mehrtens <hauke@hauke-m.de>
      Reviewed-by: Hante Meuleman <meuleman@broadcom.com>
      Signed-off-by: Arend van Spriel <arend@broadcom.com>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
  9. 26 Sep 2013, 2 commits
    • NFSv4: Honour the 'opened' parameter in the atomic_open() filesystem method · 5bc2afc2
      Committed by Trond Myklebust
      Determine if we've created a new file by examining the directory change
      attribute and/or the O_EXCL flag.
      
      This fixes a regression when doing a non-exclusive create of a new file.
      If the FILE_CREATED flag is not set, the atomic_open() call will
      perform full file access permission checks instead of just checking
      for MAY_OPEN.
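      
      A hedged sketch of what honouring the parameter looks like inside a
      filesystem's ->atomic_open() (file_created is illustrative):
      
      /* once we know a fresh file was created (O_EXCL, or the directory
       * change attribute moved), tell the VFS */
      if (file_created)
              *opened |= FILE_CREATED;   /* VFS then checks only MAY_OPEN */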
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • HID: uhid: allocate static minor · 19872d20
      Committed by David Herrmann
      udev has this nice feature of creating "dead" /dev/<node> device-nodes if
      it finds a devnode:<node> modalias. Once the node is accessed, the kernel
      automatically loads the module that provides the node. However, this
      requires udev to know the major:minor code to use for the node. This
      feature was introduced by:
      
        commit 578454ff
        Author: Kay Sievers <kay.sievers@vrfy.org>
        Date:   Thu May 20 18:07:20 2010 +0200
      
            driver core: add devname module aliases to allow module on-demand auto-loading
      
      However, uhid uses dynamic minor numbers so this doesn't actually work. We
      need to load uhid to know which minor it's going to use.
      
      Hence, allocate a static minor (just like uinput does) and we're good
      to go.
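      
      A hedged sketch of the registration, assuming UHID_MINOR is reserved
      in miscdevice.h and uhid_fops is the driver's file operations:
      
      static struct miscdevice uhid_misc = {
              .fops  = &uhid_fops,
              .minor = UHID_MINOR,    /* fixed, not MISC_DYNAMIC_MINOR */
              .name  = "uhid",
      };
      
      MODULE_ALIAS_MISCDEV(UHID_MINOR);  /* lets udev derive major:minor */
      MODULE_ALIAS("devname:uhid");      /* requests the dead /dev/uhid node */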
      Reported-by: Tom Gundersen <teg@jklm.no>
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
  10. 25 Sep 2013, 5 commits
    • of: clean-up ifdefs in of_irq.h · b0b8c960
      Committed by Rob Herring
      Much of of_irq.h is needlessly ifdef'ed. Clean this up and minimize the
      amount of ifdef'ed code. This fixes some build warnings when CONFIG_OF
      is not enabled (seen on i386 and x86_64):
      
      include/linux/of_irq.h:82:7: warning: 'struct device_node' declared inside parameter list [enabled by default]
      include/linux/of_irq.h:82:7: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
      include/linux/of_irq.h:87:47: warning: 'struct device_node' declared inside parameter list [enabled by default]
      
      Compile tested on i386, sparc and arm.
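      
      The usual shape of such a clean-up, sketched with one real prototype
      (irq_of_parse_and_map): declare the struct once, keep prototypes
      shared, and stub only the bodies.
      
      struct device_node;     /* forward declaration keeps prototypes valid
                                 even when CONFIG_OF is off */
      
      #ifdef CONFIG_OF
      extern unsigned int irq_of_parse_and_map(struct device_node *node,
                                               int index);
      #else
      static inline unsigned int irq_of_parse_and_map(struct device_node *node,
                                                      int index)
      {
              return 0;
      }
      #endif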
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Grant Likely <grant.likely@linaro.org>
      Signed-off-by: Rob Herring <rob.herring@calxeda.com>
    • revert "memcg, vmscan: integrate soft reclaim tighter with zone shrinking code" · 0608f43d
      Committed by Andrew Morton
      Revert commit 3b38722e ("memcg, vmscan: integrate soft reclaim
      tighter with zone shrinking code")
      
      I merged this prematurely - Michal and Johannes still disagree about the
      overall design direction and the future remains unclear.
      
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • revert "vmscan, memcg: do softlimit reclaim also for targeted reclaim" · b1aff7fc
      Committed by Andrew Morton
      Revert commit a5b7c87f ("vmscan, memcg: do softlimit reclaim also
      for targeted reclaim")
      
      I merged this prematurely - Michal and Johannes still disagree about the
      overall design direction and the future remains unclear.
      
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • revert "memcg: enhance memcg iterator to support predicates" · 694fbc0f
      Committed by Andrew Morton
      Revert commit de57780d ("memcg: enhance memcg iterator to support
      predicates")
      
      I merged this prematurely - Michal and Johannes still disagree about the
      overall design direction and the future remains unclear.
      
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • watchdog: update watchdog_thresh properly · 9809b18f
      Committed by Michal Hocko
      watchdog_thresh controls how often the NMI perf event counter checks the
      per-cpu hrtimer_interrupts counter and blows up if the counter hasn't
      changed since the last check.  The counter is updated by the per-cpu
      watchdog_hrtimer hrtimer, which is scheduled with a period of 2/5
      watchdog_thresh, guaranteeing that the hrtimer fires 2 times per main
      period.  Both the hrtimer and the perf event are started together when
      the watchdog is enabled.
      
      So far so good.  But...
      
      But what happens when watchdog_thresh is updated from sysctl handler?
      
      proc_dowatchdog will set a new sampling period and hrtimer callback
      (watchdog_timer_fn) will use the new value in the next round.  The
      problem, however, is that nobody tells the perf event that the sampling
      period has changed so it is ticking with the period configured when it
      has been set up.
      
      This might result in an ear ripping dissonance between perf and hrtimer
      parts if the watchdog_thresh is increased.  And even worse it might lead
      to KABOOM if the watchdog is configured to panic on such a spurious
      lockup.
      
      This patch fixes the issue by updating both the NMI perf event counter
      and the hrtimers if the threshold value has changed.
      
      The nmi one is disabled and then reinitialized from scratch.  This has
      an unpleasant side effect that the allocation of the new event might
      fail theoretically so the hard lockup detector would be disabled for
      such cpus.  On the other hand such a memory allocation failure is very
      unlikely because the original event is deallocated right before.
      
      It would be much nicer if we could just change the perf event period,
      but there doesn't seem to be any API to do that right now.  It is also
      unfortunate that perf_event_alloc uses a GFP_KERNEL allocation
      unconditionally, so we cannot use on_each_cpu() and do the same thing
      from the per-cpu context.  The update from the current CPU should be
      safe because perf_event_disable removes the event atomically before it
      clears the per-cpu watchdog_ev, so it cannot change anything under the
      running handler's feet.
      
      The hrtimer is simply restarted (thanks to Don Zickus who pointed
      this out) if it is queued, because we cannot rely on it firing and
      adapting to the new sampling period before the next NMI event triggers
      (when the threshold is decreased).
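      
      A hedged sketch of the per-cpu update path described above (the hrtimer
      restart helper is hypothetical):
      
      static void update_timers(int cpu)
      {
              watchdog_nmi_disable(cpu);      /* free the old perf event     */
              restart_watchdog_hrtimer(cpu);  /* hypothetical: re-queue the
                                                 hrtimer with the new period */
              watchdog_nmi_enable(cpu);       /* recreate the perf event with
                                                 the new sample period       */
      }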
      
      [akpm@linux-foundation.org: the UP version of __smp_call_function_single ended up in the wrong place]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Fabio Estevam <festevam@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 24 Sep 2013, 1 commit
  12. 23 Sep 2013, 1 commit
  13. 22 Sep 2013, 1 commit
  14. 21 Sep 2013, 1 commit
  15. 20 Sep 2013, 1 commit
    • dm mpath: disable WRITE SAME if it fails · f84cb8a4
      Committed by Mike Snitzer
      Work around the SCSI layer's problematic WRITE SAME heuristics by
      disabling WRITE SAME in the DM multipath device's queue_limits if an
      underlying device disabled it.
      
      The WRITE SAME heuristics, with both the original commit 5db44863
      ("[SCSI] sd: Implement support for WRITE SAME") and the updated commit
      66c28f97 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling
      WRITE SAME(10) even without successfully determining it is supported.
      After the first failed WRITE SAME the SCSI layer will disable WRITE SAME
      for the device (by setting sdkp->device->no_write_same which results in
      'max_write_same_sectors' in device's queue_limits to be set to 0).
      
      When a device is stacked on top of such a SCSI device, any changes to
      that SCSI device's queue_limits do not automatically propagate up the
      stack.  As such, a DM multipath device will not have its WRITE SAME
      support disabled.  This causes the block layer to continue to issue
      WRITE SAME requests to the mpath device, which causes paths to fail and
      (if mpath IO isn't configured to queue when no paths are available)
      results in actual IO errors to the upper layers.
      
      This fix doesn't help configurations that have additional devices
      stacked on top of the mpath device (e.g. LVM-created linear DM devices
      on top).  A proper fix that restacks all the queue_limits from the
      bottom of the device stack up will need to be explored if SCSI will
      continue to use this model of optimistically allowing op codes and then
      disabling them after they fail for the first time.
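      
      A hedged sketch of the knob being cleared (the queue_limits field
      matches the sysfs output below; the trigger condition is illustrative):
      
      /* on the first failed WRITE SAME, stop advertising support so the
       * block layer stops issuing these requests to the mpath device */
      if (write_same_failed)
              q->limits.max_write_same_sectors = 0;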
      
      Before this patch:
      
      EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
      device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
      device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121
      end_request: critical target error, dev dm-6, sector 528
      dm-6: WRITE SAME failed. Manually zeroing.
      device-mapper: multipath: Failing path 8:112.
      end_request: I/O error, dev dm-6, sector 4616
      dm-6: WRITE SAME failed. Manually zeroing.
      end_request: I/O error, dev dm-6, sector 4616
      end_request: I/O error, dev dm-6, sector 5640
      end_request: I/O error, dev dm-6, sector 6664
      end_request: I/O error, dev dm-6, sector 7688
      end_request: I/O error, dev dm-6, sector 524288
      Buffer I/O error on device dm-6, logical block 65536
      lost page write due to I/O error on dm-6
      JBD2: Error -5 detected when updating journal superblock for dm-6-8.
      end_request: I/O error, dev dm-6, sector 524296
      Aborting journal on device dm-6-8.
      end_request: I/O error, dev dm-6, sector 524288
      Buffer I/O error on device dm-6, logical block 65536
      lost page write due to I/O error on dm-6
      JBD2: Error -5 detected when updating journal superblock for dm-6-8.
      
      # cat /sys/block/sdh/queue/write_same_max_bytes
      0
      # cat /sys/block/dm-6/queue/write_same_max_bytes
      33553920
      
      After this patch:
      
      EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
      device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
      device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121
      end_request: critical target error, dev dm-6, sector 528
      dm-6: WRITE SAME failed. Manually zeroing.
      
      # cat /sys/block/sdh/queue/write_same_max_bytes
      0
      # cat /sys/block/dm-6/queue/write_same_max_bytes
      0
      
      It should be noted that WRITE SAME support wasn't enabled in DM
      multipath until v3.10.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: stable@vger.kernel.org # 3.10+
  16. 17 Sep 2013, 3 commits
  17. 16 Sep 2013, 1 commit
  18. 13 Sep 2013, 6 commits