1. 20 12月, 2008 3 次提交
  2. 19 12月, 2008 4 次提交
  3. 18 12月, 2008 1 次提交
  4. 16 12月, 2008 1 次提交
  5. 11 12月, 2008 6 次提交
  6. 10 12月, 2008 1 次提交
    • N
      netpoll: fix race on poll_list resulting in garbage entry · 7b363e44
      Neil Horman 提交于
      	A few months back a race was discused between the netpoll napi service
      path, and the fast path through net_rx_action:
      http://kerneltrap.org/mailarchive/linux-netdev/2007/10/16/345470
      
      A patch was submitted for that bug, but I think we missed a case.
      
      Consider the following scenario:
      
      INITIAL STATE
      CPU0 has one napi_struct A on its poll_list
      CPU1 is calling netpoll_send_skb and needs to call poll_napi on the same
      napi_struct A that CPU0 has on its list
      
      
      
      CPU0						CPU1
      net_rx_action					poll_napi
      !list_empty (returns true)			locks poll_lock for A
      						 poll_one_napi
      						  napi->poll
      						   netif_rx_complete
      						    __napi_complete
      						    (removes A from poll_list)
      list_entry(list->next)
      
      
      In the above scenario, net_rx_action assumes that the per-cpu poll_list is
      exclusive to that cpu.  netpoll of course violates that, and because the netpoll
      path can dequeue from the poll list, its possible for CPU0 to detect a non-empty
      list at the top of the while loop in net_rx_action, but have it become empty by
      the time it calls list_entry.  Since the poll_list isn't surrounded by any other
      structure, the returned data from that list_entry call in this situation is
      garbage, and any number of crashes can result based on what exactly that garbage
      is.
      
      Given that its not fasible for performance reasons to place exclusive locks
      arround each cpus poll list to provide that mutal exclusion, I think the best
      solution is modify the netpoll path in such a way that we continue to guarantee
      that the poll_list for a cpu is in fact exclusive to that cpu.  To do this I've
      implemented the patch below.  It adds an additional bit to the state field in
      the napi_struct.  When executing napi->poll from the netpoll_path, this bit will
      be set. When a driver calls netif_rx_complete, if that bit is set, it will not
      remove the napi_struct from the poll_list.  That work will be saved for the next
      iteration of net_rx_action.
      
      I've tested this and it seems to work well.  About the biggest drawback I can
      see to it is the fact that it might result in an extra loop through
      net_rx_action in the event that the device is actually contended for (i.e. the
      netpoll path actually preforms all the needed work no the device, and the call
      to net_rx_action winds up doing nothing, except removing the napi_struct from
      the poll_list.  However I think this is probably a small price to pay, given
      that the alternative is a crash.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7b363e44
  7. 09 12月, 2008 3 次提交
  8. 06 12月, 2008 1 次提交
  9. 04 12月, 2008 3 次提交
  10. 03 12月, 2008 4 次提交
    • M
      block: fix setting of max_segment_size and seg_boundary mask · 0e435ac2
      Milan Broz 提交于
      Fix setting of max_segment_size and seg_boundary mask for stacked md/dm
      devices.
      
      When stacking devices (LVM over MD over SCSI) some of the request queue
      parameters are not set up correctly in some cases by default, namely
      max_segment_size and and seg_boundary mask.
      
      If you create MD device over SCSI, these attributes are zeroed.
      
      Problem become when there is over this mapping next device-mapper mapping
      - queue attributes are set in DM this way:
      
      request_queue   max_segment_size  seg_boundary_mask
      SCSI                65536             0xffffffff
      MD RAID1                0                      0
      LVM                 65536                 -1 (64bit)
      
      Unfortunately bio_add_page (resp.  bio_phys_segments) calculates number of
      physical segments according to these parameters.
      
      During the generic_make_request() is segment cout recalculated and can
      increase bio->bi_phys_segments count over the allowed limit.  (After
      bio_clone() in stack operation.)
      
      Thi is specially problem in CCISS driver, where it produce OOPS here
      
          BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);
      
      (MAXSEGENTRIES is 31 by default.)
      
      Sometimes even this command is enough to cause oops:
      
        dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10
      
      This command generates bios with 250 sectors, allocated in 32 4k-pages
      (last page uses only 1024 bytes).
      
      For LVM layer, it allocates bio with 31 segments (still OK for CCISS),
      unfortunatelly on lower layer it is recalculated to 32 segments and this
      violates CCISS restriction and triggers BUG_ON().
      
      The patch tries to fix it by:
      
       * initializing attributes above in queue request constructor
         blk_queue_make_request()
      
       * make sure that blk_queue_stack_limits() inherits setting
      
       (DM uses its own function to set the limits because it
       blk_queue_stack_limits() was introduced later.  It should probably switch
       to use generic stack limit function too.)
      
       * sets the default seg_boundary value in one place (blkdev.h)
      
       * use this mask as default in DM (instead of -1, which differs in 64bit)
      
      Bugs related to this:
      https://bugzilla.redhat.com/show_bug.cgi?id=471639
      http://bugzilla.kernel.org/show_bug.cgi?id=8672Signed-off-by: NMilan Broz <mbroz@redhat.com>
      Reviewed-by: NAlasdair G Kergon <agk@redhat.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Mike Miller <mike.miller@hp.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      0e435ac2
    • T
      block: internal dequeue shouldn't start timer · 53a08807
      Tejun Heo 提交于
      blkdev_dequeue_request() and elv_dequeue_request() are equivalent and
      both start the timeout timer.  Barrier code dequeues the original
      barrier request but doesn't passes the request itself to lower level
      driver, only broken down proxy requests; however, as the original
      barrier code goes through the same dequeue path and timeout timer is
      started on it.  If barrier sequence takes long enough, this timer
      expires but the low level driver has no idea about this request and
      oops follows.
      
      Timeout timer shouldn't have been started on the original barrier
      request as it never goes through actual IO.  This patch unexports
      elv_dequeue_request(), which has no external user anyway, and makes it
      operate on elevator proper w/o adding the timer and make
      blkdev_dequeue_request() call elv_dequeue_request() and add timer.
      Internal users which don't pass the request to driver - barrier code
      and end_that_request_last() - are converted to use
      elv_dequeue_request().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      53a08807
    • J
      nfsd: fix vm overcommit crash fix #2 · 1b79cd04
      Junjiro R. Okajima 提交于
      The previous patch from Alan Cox ("nfsd: fix vm overcommit crash",
      commit 731572d3) fixed the problem where
      knfsd crashes on exported shmemfs objects and strict overcommit is set.
      
      But the patch forgot supporting the case when CONFIG_SECURITY is
      disabled.
      
      This patch copies a part of his fix which is mainly for detecting a bug
      earlier.
      Acked-by: NJames Morris <jmorris@namei.org>
      Signed-off-by: NAlan Cox <alan@redhat.com>
      Signed-off-by: NJunjiro R. Okajima <hooanon05@yahoo.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b79cd04
    • B
      amd74xx: workaround unreliable AltStatus register for nVidia controllers · 6636487e
      Bartlomiej Zolnierkiewicz 提交于
      It seems that on some nVidia controllers using AltStatus register
      can be unreliable so default to Status register if the PCI device
      is in Compatibility Mode.  In order to achieve this:
      
      * Add ide_pci_is_in_compatibility_mode() inline helper to <linux/ide.h>.
      
      * Add IDE_HFLAG_BROKEN_ALTSTATUS host flag and set it in amd74xx host
        driver for nVidia controllers in Compatibility Mode.
      
      * Teach actual_try_to_identify() and drive_is_ready() about the new flag.
      
      This fixes the regression caused by removal of CONFIG_IDEPCI_SHARE_IRQ
      config option in 2.6.25 and using AltStatus register unconditionally when
      available (kernel.org bugs #11659 and #10216).
      
      [ Moreover for CONFIG_IDEPCI_SHARE_IRQ=y (which is what most people
        and distributions use) it never worked correctly. ]
      
      Thanks to Remy LABENE and Lars Winterfeld for help with debugging the problem.
      
      More info at:
      http://bugzilla.kernel.org/show_bug.cgi?id=11659
      http://bugzilla.kernel.org/show_bug.cgi?id=10216Reported-by: NRemy LABENE <remy.labene@free.fr>
      Tested-by: NRemy LABENE <remy.labene@free.fr>
      Tested-by: NLars Winterfeld <lars.winterfeld@tu-ilmenau.de>
      Acked-by: NBorislav Petkov <petkovbb@gmail.com>
      Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      6636487e
  11. 02 12月, 2008 3 次提交
    • M
      lib/idr.c: fix rcu related race with idr_find · 6ff2d39b
      Manfred Spraul 提交于
      2nd part of the fixes needed for
      http://bugzilla.kernel.org/show_bug.cgi?id=11796.
      
      When the idr tree is either grown or shrunk, then the update to the number
      of layers and the top pointer were not atomic.  This race caused crashes.
      
      The attached patch fixes that by replicating the layers counter in each
      layer, thus idr_find doesn't need idp->layers anymore.
      Signed-off-by: NManfred Spraul <manfred@colorfullife.com>
      Cc: Clement Calmels <cboulte@gmail.com>
      Cc: Nadia Derbey <Nadia.Derbey@bull.net>
      Cc: Pierre Peiffer <peifferp@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ff2d39b
    • D
      epoll: introduce resource usage limits · 7ef9964e
      Davide Libenzi 提交于
      It has been thought that the per-user file descriptors limit would also
      limit the resources that a normal user can request via the epoll
      interface.  Vegard Nossum reported a very simple program (a modified
      version attached) that can make a normal user to request a pretty large
      amount of kernel memory, well within the its maximum number of fds.  To
      solve such problem, default limits are now imposed, and /proc based
      configuration has been introduced.  A new directory has been created,
      named /proc/sys/fs/epoll/ and inside there, there are two configuration
      points:
      
        max_user_instances = Maximum number of devices - per user
      
        max_user_watches   = Maximum number of "watched" fds - per user
      
      The current default for "max_user_watches" limits the memory used by epoll
      to store "watches", to 1/32 of the amount of the low RAM.  As example, a
      256MB 32bit machine, will have "max_user_watches" set to roughly 90000.
      That should be enough to not break existing heavy epoll users.  The
      default value for "max_user_instances" is set to 128, that should be
      enough too.
      
      This also changes the userspace, because a new error code can now come out
      from EPOLL_CTL_ADD (-ENOSPC).  The EMFILE from epoll_create() was already
      listed, so that should be ok.
      
      [akpm@linux-foundation.org: use get_current_user()]
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: <stable@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Reported-by: NVegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ef9964e
    • T
      libata: blacklist Seagate drives which time out FLUSH_CACHE when used with NCQ · ac70a964
      Tejun Heo 提交于
      Some recent Seagate harddrives have firmware bug which causes FLUSH
      CACHE to timeout under certain circumstances if NCQ is being used.
      This can be worked around by disabling NCQ and fixed by updating the
      firmware.  Implement ATA_HORKAGE_FIRMWARE_UPDATE and blacklist these
      devices.
      
      The wiki page has been updated to contain information on this issue.
      
        http://ata.wiki.kernel.org/index.php/Known_issuesSigned-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJeff Garzik <jgarzik@redhat.com>
      ac70a964
  12. 01 12月, 2008 3 次提交
  13. 29 11月, 2008 1 次提交
  14. 28 11月, 2008 1 次提交
    • R
      Allow architectures to override copy_user_highpage() · 487ff320
      Russell King 提交于
      With aliasing VIPT cache support, the ARM implementation of
      clear_user_page() and copy_user_page() sets up a temporary kernel space
      mapping such that we have the same cache colour as the userspace page.
      This avoids having to consider any userspace aliases from this operation.
      
      However, when highmem is enabled, kmap_atomic() have to setup mappings.
      The copy_user_highpage() and clear_user_highpage() call these functions
      before delegating the copies to copy_user_page() and clear_user_page().
      
      The effect of this is that each of the *_user_highpage() functions setup
      their own kmap mapping, followed by the *_user_page() functions setting
      up another mapping.  This is rather wasteful.
      
      Thankfully, copy_user_highpage() can be overriden by architectures by
      defining __HAVE_ARCH_COPY_USER_HIGHPAGE.  However, replacement of
      clear_user_highpage() is more difficult because its inline definition
      is not conditional.  It seems that you're expected to define
      __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE and provide a replacement
      __alloc_zeroed_user_highpage() implementation instead.
      
      The allocation itself is fine, so we don't want to override that.  What
      we really want to do is to override clear_user_highpage() with our own
      version which doesn't kmap_atomic() unnecessarily.
      
      Other VIPT architectures (PARISC and SH) would also like to override
      this function as well.
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NJames Bottomley <James.Bottomley@HansenPartnership.com>
      Acked-by: NPaul Mundt <lethal@linux-sh.org>
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      487ff320
  15. 27 11月, 2008 3 次提交
  16. 25 11月, 2008 2 次提交