1. 29 May, 2011 31 commits
    • M
      dm kcopyd: preallocate sub jobs to avoid deadlock · c6ea41fb
      Mikulas Patocka committed
      There's a possible theoretical deadlock in dm-kcopyd because multiple
      allocations from the same mempool are required to finish a request.
      Avoid this by preallocating sub jobs.
      
      There is a mempool of 512 entries. Each request requires up to 9
      entries from the mempool. If we have at least 57 concurrent requests
      running, the mempool may overflow and mempool allocations may start
      blocking until another entry is freed to the mempool. Because the same
      thread is used to free entries to the mempool and allocate entries from
      the mempool, this may result in a deadlock.
      
      This patch changes it so that one mempool entry contains all 9 "struct
      kcopyd_job" required to fulfill the whole request. The allocation is
      done only once in dm_kcopyd_copy and no further mempool allocations are
      done during request processing.
      
      If dm_kcopyd_copy is not run in the completion thread, this
      implementation is deadlock-free.
      
      MIN_JOBS needs reducing accordingly and we've chosen to reduce it
      further to 8.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      c6ea41fb
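The 57-request threshold in the message above follows from the pool arithmetic. A minimal Python sketch, assuming a simplified model in which every request holds its worst-case 9 entries at the same time (the function names are illustrative and are not kernel code):

```python
def max_fully_served(pool_entries: int, per_request: int) -> int:
    """Number of requests that can each hold per_request entries at once."""
    return pool_entries // per_request


def first_blocking_request_count(pool_entries: int, per_request: int) -> int:
    """Smallest number of concurrent requests at which an allocation may block."""
    return max_fully_served(pool_entries, per_request) + 1
```

With a 512-entry pool and 9 entries per request, 56 requests fit (504 entries) and the 57th can no longer obtain all of its entries, which matches the figure quoted in the commit message.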
    • M
      dm kcopyd: avoid pointless job splitting · a705a34a
      Mikulas Patocka committed
      Don't split SUB_JOB_SIZE jobs
      
      If the job size equals SUB_JOB_SIZE, there is no point in splitting it.
      Splitting it just unnecessarily wastes time, because the split job size
      is SUB_JOB_SIZE too.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      a705a34a
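The rule described above amounts to a strict comparison. A minimal sketch, assuming sizes are counted in sectors; the value 128 is a placeholder chosen for illustration, and `needs_split` is a hypothetical helper rather than a kernel function:

```python
SUB_JOB_SIZE = 128  # assumed value, for illustration only


def needs_split(job_size: int, sub_job_size: int = SUB_JOB_SIZE) -> bool:
    """Split only when the job is strictly larger than one sub job."""
    return job_size > sub_job_size
```

A job of exactly `SUB_JOB_SIZE` is left whole, because splitting it would only produce a single sub job of the same size.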
    • M
      dm mpath: do not fail paths after integrity errors · 6f13f6fb
      Martin K. Petersen committed
      Integrity errors need to be passed to the owner of the integrity
      metadata for processing. Consequently EILSEQ should be passed up the
      stack.
      
      Cc: stable@kernel.org
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      6f13f6fb
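The decision described above can be sketched as a small classifier. This is a simplified model, not the dm-mpath code: only `EILSEQ` is taken from the commit message, and both the set name and the two return labels are assumptions made for illustration.

```python
import errno

# Errors that describe the data rather than the path. Only EILSEQ comes from
# the commit message; keeping it in a set is an assumption of this sketch.
PASS_UP_ERRORS = {errno.EILSEQ}


def handle_io_error(error: int) -> str:
    """Return "pass_up" for integrity errors, otherwise "fail_path"."""
    if error in PASS_UP_ERRORS:
        return "pass_up"
    return "fail_path"
```

An integrity error is returned to the owner of the integrity metadata instead of being counted against the path that carried the I/O.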
    • M
      dm table: reject devices without request fns · f4808ca9
      Milan Broz committed
      This patch adds a check that a block device has a request function
      defined before it is used.  Otherwise, misconfiguration can cause an oops.
      
      Because we are allowing devices with zero size e.g. an offline multipath
      device as in commit 2cd54d9b
      ("dm: allow offline devices") there needs to be an additional check
      to ensure devices are initialised.  Some block devices, like a loop
      device without a backing file, exist but have no request function.
      
      Reproducer is trivial: dm-mirror on unbound loop device
      (no backing file on loop devices)
      
      dmsetup create x --table "0 8 mirror core 2 8 sync 2 /dev/loop0 0 /dev/loop1 0"
      
      and mirror resync will immediately cause an oops.
      
      BUG: unable to handle kernel NULL pointer dereference at   (null)
       ? generic_make_request+0x2bd/0x590
       ? kmem_cache_alloc+0xad/0x190
       submit_bio+0x53/0xe0
       ? bio_add_page+0x3b/0x50
       dispatch_io+0x1ca/0x210 [dm_mod]
       ? read_callback+0x0/0xd0 [dm_mirror]
       dm_io+0xbb/0x290 [dm_mod]
       do_mirror+0x1e0/0x748 [dm_mirror]
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      f4808ca9
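The check described above can be modelled as follows. This is a toy sketch: `BlockDevice`, its fields, and `device_is_usable` are hypothetical names, and a `None` request function stands in for a device that exists but was never initialised, such as a loop device without a backing file.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BlockDevice:
    name: str
    size_sectors: int
    # None models a device that exists but has no request function.
    request_fn: Optional[Callable[[], None]] = None


def device_is_usable(dev: BlockDevice) -> bool:
    """Zero size is allowed (offline devices); a missing request function is not."""
    return dev.request_fn is not None
```

The size alone cannot be used as the test, because zero-size devices such as offline multipath devices are deliberately accepted.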
    • M
      dm table: allow targets to support discards internally · 4c259327
      Mike Snitzer committed
      Permit a target to support discards regardless of whether or not all its
      underlying devices do.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      4c259327
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/vapier/blackfin · 139f37f5
      Linus Torvalds committed
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/vapier/blackfin:
        Blackfin: debug-mmrs: include RSI_PID[4567] MMRs
        Blackfin: bf51x: fix up RSI_PID# MMR defines
        Blackfin: bf52x/bf54x: fix up usb MMR defines
        Blackfin: debug-mmrs: fix typos with gptimers/mdma/ppi
        Blackfin: gptimers: add structure for hardware register layout
        Blackfin: wire up new sendmmsg syscall
        Blackfin: mach/bfin_serial_5xx.h: punt now-unused header
        Blackfin: bfin_serial.h: turn default port wrappers into stubs
      139f37f5
    • R
      scsi: fix scsi_proc new kernel-doc warning · 5be7ef00
      Randy Dunlap committed
      Fix kernel-doc warnings in scsi_proc.c:
      
        Warning(drivers/scsi/scsi_proc.c:390): No description found for parameter 'dev'
        Warning(drivers/scsi/scsi_proc.c:390): No description found for parameter 'data'
        Warning(drivers/scsi/scsi_proc.c:390): Excess function parameter 's' description in 'always_match'
        Warning(drivers/scsi/scsi_proc.c:390): Excess function parameter 'p' description in 'always_match'
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5be7ef00
    • H
      mm: fix page_lock_anon_vma leaving mutex locked · eee0f252
      Hugh Dickins committed
      On one machine I've been getting hangs, a page fault's anon_vma_prepare()
      waiting in anon_vma_lock(), other processes waiting for that page's lock.
      
      This is a replay of last year's f1819427 "mm: fix hang on
      anon_vma->root->lock".
      
      The new page_lock_anon_vma() places too much faith in its refcount: when
      it has acquired the mutex_trylock(), it's possible that a racing task in
      anon_vma_alloc() has just reallocated the struct anon_vma, set refcount
      to 1, and is about to reset its anon_vma->root.
      
      Fix this by saving anon_vma->root, and relying on the usual page_mapped()
      check instead of a refcount check: if page is still mapped, the anon_vma
      is still ours; if page is not still mapped, we're no longer interested.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eee0f252
    • H
      mm: fix kernel BUG at mm/rmap.c:1017! · 5dbe0af4
      Hugh Dickins committed
      I've hit the "address >= vma->vm_end" check in do_page_add_anon_rmap()
      just once.  The stack showed khugepaged allocation trying to compact
      pages: the call to page_add_anon_rmap() coming from remove_migration_pte().
      
      That path holds anon_vma lock, but does not hold mmap_sem: it can
      therefore race with a split_vma(), and in commit 5f70b962 "mmap:
      avoid unnecessary anon_vma lock" we just took away the anon_vma lock
      protection when adjusting vma->vm_end.
      
      I don't think that particular BUG_ON ever caught anything interesting,
      so better replace it by a comment, than reinstate the anon_vma locking.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5dbe0af4
    • H
      tmpfs: fix race between truncate and writepage · 826267cf
      Hugh Dickins committed
      While running fsx on tmpfs with a memhog then swapoff, swapoff was hanging
      (interruptibly), repeatedly failing to locate the owner of a 0xff entry in
      the swap_map.
      
      Although shmem_writepage() does abandon when it sees incoming page index
      is beyond eof, there was still a window in which shmem_truncate_range()
      could come in between writepage's dropping lock and updating swap_map,
      find the half-completed swap_map entry, and in trying to free it,
      leave it in a state that swap_shmem_alloc() could not correct.
      
      Arguably a bug in __swap_duplicate()'s and swap_entry_free()'s handling
      of the different cases, but easiest to fix by moving swap_shmem_alloc()
      under cover of the lock.
      
      More interesting than the bug: it's been there since 2.6.33, why could
      I not see it with earlier kernels?  The mmotm of two weeks ago seems to
      have some magic for generating races, this is just one of three I found.
      
      With yesterday's git I first saw this in mainline, bisected in search of
      that magic, but the easy reproducibility evaporated.  Oh well, fix the bug.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      826267cf
    • M
      Blackfin: debug-mmrs: include RSI_PID[4567] MMRs · c320afe9
      Mike Frysinger committed
      The documentation is a little iffy as to whether these are actual MMRs,
      but reading them on the hardware works, and the previous version of this
      logic (the SDH) had PID[4567].  So add it for RSI too.
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      c320afe9
    • M
      Blackfin: bf51x: fix up RSI_PID# MMR defines · fcb24391
      Mike Frysinger committed
      Looks like the copying of MMR defines from the SDH block missed updating
      the addresses of the RSI_PID# registers.  So tweak them to reflect the
      actual hardware.
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      fcb24391
    • M
      Blackfin: bf52x/bf54x: fix up usb MMR defines · 61aa818f
      Mike Frysinger committed
      The bf52x/bf54x have the incorrect addresses for USB_EP_NI7_RXINTERVAL
      and USB_EP_NI7_TXCOUNT, so adjust those.
      
      Further, the bf54x header puts the USB defines in the wrong place, so
      shuffle them back to the right grouping.
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      61aa818f
    • M
      Blackfin: debug-mmrs: fix typos with gptimers/mdma/ppi · d09fb602
      Mike Frysinger committed
      This code was mostly developed against a BF54x, so some BF537-specific
      issues were missed.
      
      The PPI block starts at PPI_CONTROL, not PPI_STATUS (which is the reverse
      of the EPPI block).
      
      The MDMA block starts at MDMA_NEXT_DESC_PTR, not MDMA_CONFIG.  Seems the
      sim does not catch misreads here so that'll need to get fixed.
      
      The gptimer block is mostly 32bit regs, not 16bit.  Use the gptimer struct
      to figure that out rather than hardcoding it locally.
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      d09fb602
    • M
    • M
      Blackfin: wire up new sendmmsg syscall · 427472c9
      Mike Frysinger committed
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      427472c9
    • M
      Blackfin: mach/bfin_serial_5xx.h: punt now-unused header · 63917efc
      Mike Frysinger committed
      Now that the serial code has been unified in bfin_serial.h, and the
      Blackfin UART driver pushed its resources to the boards files, we
      don't need these headers anymore.
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      63917efc
    • M
      Blackfin: bfin_serial.h: turn default port wrappers into stubs · 091c7598
      Mike Frysinger committed
      Any consumer that needs to access the MMRs has to provide these helpers,
      so make the default into useless stubs.
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      091c7598
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 · 36947a76
      Linus Torvalds committed
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (36 commits)
        Cache xattr security drop check for write v2
        fs: block_page_mkwrite should wait for writeback to finish
        mm: Wait for writeback when grabbing pages to begin a write
        configfs: remove unnecessary dentry_unhash on rmdir, dir rename
        fat: remove unnecessary dentry_unhash on rmdir, dir rename
        hpfs: remove unnecessary dentry_unhash on rmdir, dir rename
        minix: remove unnecessary dentry_unhash on rmdir, dir rename
        fuse: remove unnecessary dentry_unhash on rmdir, dir rename
        coda: remove unnecessary dentry_unhash on rmdir, dir rename
        afs: remove unnecessary dentry_unhash on rmdir, dir rename
        affs: remove unnecessary dentry_unhash on rmdir, dir rename
        9p: remove unnecessary dentry_unhash on rmdir, dir rename
        ncpfs: fix rename over directory with dangling references
        ncpfs: document dentry_unhash usage
        ecryptfs: remove unnecessary dentry_unhash on rmdir, dir rename
        hostfs: remove unnecessary dentry_unhash on rmdir, dir rename
        hfsplus: remove unnecessary dentry_unhash on rmdir, dir rename
        hfs: remove unnecessary dentry_unhash on rmdir, dir rename
        omfs: remove unnecessary dentry_unhash on rmdir, dir rneame
        udf: remove unnecessary dentry_unhash from rmdir, dir rename
        ...
      36947a76
    • L
      Merge branch 'x86-urgent-for-linus' of... · a947e23a
      Linus Torvalds committed
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
        x86, asm: Clean up desc.h a bit
        x86, amd: Do not enable ARAT feature on AMD processors below family 0x12
        x86: Move do_page_fault()'s error path under unlikely()
        x86, efi: Retain boot service code until after switching to virtual mode
        x86: Remove unnecessary check in detect_ht()
        x86: Reorder mm_context_t to remove x86_64 alignment padding and thus shrink mm_struct
        x86, UV: Clean up uv_tlb.c
        x86, UV: Add support for SGI UV2 hub chip
        x86, cpufeature: Update CPU feature RDRND to RDRAND
      a947e23a
    • L
      Merge branch 'sched-urgent-for-linus' of... · 08a8b796
      Linus Torvalds committed
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
        cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed
        sched: Fix ->min_vruntime calculation in dequeue_entity()
        sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW
        sched: More sched_domain iterations fixes
      08a8b796
    • L
      Merge branch 'core-urgent-for-linus' of... · 1ba4b8cb
      Linus Torvalds committed
      Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
      
      * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
        rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state
        rcu: Remove waitqueue usage for cpu, node, and boost kthreads
        rcu: Avoid acquiring rcu_node locks in timer functions
        atomic: Add atomic_or()
        Documentation: Add statistics about nested locks
        rcu: Decrease memory-barrier usage based on semi-formal proof
        rcu: Make rcu_enter_nohz() pay attention to nesting
        rcu: Don't do reschedule unless in irq
        rcu: Remove old memory barriers from rcu_process_callbacks()
        rcu: Add memory barriers
        rcu: Fix unpaired rcu_irq_enter() from locking selftests
      1ba4b8cb
    • L
      Merge branch 'perf-urgent-for-linus' of... · c4a227d8
      Linus Torvalds committed
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
        perf: Fix SIGIO handling
        perf top: Don't stop if no kernel symtab is found
        perf top: Handle kptr_restrict
        perf top: Remove unused macro
        perf events: initialize fd array to -1 instead of 0
        perf tools: Make sure kptr_restrict warnings fit 80 col terms
        perf tools: Fix build on older systems
        perf symbols: Handle /proc/sys/kernel/kptr_restrict
        perf: Remove duplicate headers
        ftrace: Add internal recursive checks
        tracing: Update btrfs's tracepoints to use u64 interface
        tracing: Add __print_symbolic_u64 to avoid warnings on 32bit machine
        ftrace: Set ops->flag to enabled even on static function tracing
        tracing: Have event with function tracer check error return
        ftrace: Have ftrace_startup() return failure code
        jump_label: Check entries limit in __jump_label_update
        ftrace/recordmcount: Avoid STT_FUNC symbols as base on ARM
        scripts/tags.sh: Add magic for trace-events for etags too
        scripts/tags.sh: Fix ctags for DEFINE_EVENT()
        x86/ftrace: Fix compiler warning in ftrace.c
        ...
      c4a227d8
    • L
      Merge branch 'for-usb-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sarah/xhci · 87367a0b
      Linus Torvalds committed
      * 'for-usb-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sarah/xhci:
        Intel xhci: Limit number of active endpoints to 64.
        Intel xhci: Ignore spurious successful event.
        Intel xhci: Support EHCI/xHCI port switching.
        Intel xhci: Add PCI id for Panther Point xHCI host.
        xhci: STFU: Be quieter during URB submission and completion.
        xhci: STFU: Don't print event ring dequeue pointer.
        xhci: STFU: Remove function tracing.
        xhci: Don't submit commands when the host is dead.
        xhci: Clear stopped_td when Stop Endpoint command completes.
      87367a0b
    • L
      Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx · 4cb865de
      Linus Torvalds committed
      * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (33 commits)
        x86: poll waiting for I/OAT DMA channel status
        maintainers: add dma engine tree details
        dmaengine: add TODO items for future work on dma drivers
        dmaengine: Add API documentation for slave dma usage
        dmaengine/dw_dmac: Update maintainer-ship
        dmaengine: move link order
        dmaengine/dw_dmac: implement pause and resume in dwc_control
        dmaengine/dw_dmac: Replace spin_lock* with irqsave variants and enable submission from callback
        dmaengine/dw_dmac: Divide one sg to many desc, if sg len is greater than DWC_MAX_COUNT
        dmaengine/dw_dmac: set residue as total len in dwc_tx_status if status is !DMA_SUCCESS
        dmaengine/dw_dmac: don't call callback routine in case dmaengine_terminate_all() is called
        dmaengine: at_hdmac: pause: no need to wait for FIFO empty
        pch_dma: modify pci device table definition
        pch_dma: Support new device ML7223 IOH
        pch_dma: Support I2S for ML7213 IOH
        pch_dma: Fix DMA setting issue
        pch_dma: modify for checkpatch
        pch_dma: fix dma direction issue for ML7213 IOH video-in
        dmaengine: at_hdmac: use descriptor chaining help function
        dmaengine: at_hdmac: implement pause and resume in atc_control
        ...
      
      Fix up trivial conflict in drivers/dma/dw_dmac.c
      4cb865de
    • L
      Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6 · 55f08e1b
      Linus Torvalds committed
      * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6:
        mfd: Fix build breakage caused by tps65910 gpio directory move
        mfd: Use mfd cell platform_data for db8500-prcmu cells platform bits
      55f08e1b
    • L
      Merge branch 'spi/next' of git://git.secretlab.ca/git/linux-2.6 · d02bf062
      Linus Torvalds committed
      * 'spi/next' of git://git.secretlab.ca/git/linux-2.6:
        spi/spi_bfin_sport: new driver for a SPI bus via the Blackfin SPORT peripheral
        spi/tle620x: add missing device_remove_file()
      d02bf062
    • L
      Merge branch 'gpio/next' of git://git.secretlab.ca/git/linux-2.6 · 04830fcc
      Linus Torvalds committed
      * 'gpio/next' of git://git.secretlab.ca/git/linux-2.6:
        gpio/pch_gpio: Support new device ML7223
        gpio: make gpio_{request,free}_array gpio array parameter const
        GPIO: OMAP: move to drivers/gpio
        GPIO: OMAP: move register offset defines into <plat/gpio.h>
        gpio: Convert gpio_is_valid to return bool
        gpio: Move the s5pc100 GPIO to drivers/gpio
        gpio: Move the s5pv210 GPIO to drivers/gpio
        gpio: Move the exynos4 GPIO to drivers/gpio
        gpio: Move to Samsung common GPIO library to drivers/gpio
        gpio/nomadik: add function to read GPIO pull down status
        gpio/nomadik: show all pins in debug
        gpio: move Nomadik GPIO driver to drivers/gpio
        gpio: move U300 GPIO driver to drivers/gpio
        langwell_gpio: add runtime pm support
        gpio/pca953x: Add support for pca9574 and pca9575 devices
        gpio/cs5535: Show explicit dependency between gpio_cs5535 and mfd_cs5535
      04830fcc
    • L
      Merge branch 'setns' · 571503e1
      Linus Torvalds committed
      * setns:
        ns: Wire up the setns system call
      
      Done as a merge to make it easier to fix up conflicts in arm due to
      addition of sendmmsg system call
      571503e1
    • E
      ns: Wire up the setns system call · 7b21fddd
      Eric W. Biederman committed
      32bit and 64bit on x86 are tested and working.  The rest I have looked
      at closely and I can't find any problems.
      
      setns is an easy system call to wire up.  It just takes two ints so I
      don't expect any weird architecture porting problems.
      
      While doing this I have noticed that we have some architectures that are
      very slow to get new system calls.  cris seems to be the slowest where
      the last system calls wired up were preadv and pwritev.  avr32 is weird
      in that recvmmsg was wired up but never declared in unistd.h.  frv is
      behind with perf_event_open being the last syscall wired up.  On h8300
      the last system call wired up was epoll_wait.  On m32r the last system
      call wired up was fallocate.  mn10300 has recvmmsg as the last system
      call wired up.  The rest seem to at least have syncfs wired up which was
      new in the 2.6.39.
      
      v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
      v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
      v4: Moved wiring up of the system call to another patch
      v5: ported to v2.6.39-rc6
      v6: rebased onto parisc-next and net-next to avoid syscall  conflicts.
      v7: ported to Linus's latest post 2.6.39 tree.
      
      >  arch/blackfin/include/asm/unistd.h     |    3 ++-
      >  arch/blackfin/mach-common/entry.S      |    1 +
      Acked-by: Mike Frysinger <vapier@gentoo.org>
      
      Oh - ia64 wiring looks good.
      Acked-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b21fddd
    • A
      Cache xattr security drop check for write v2 · 69b45732
      Andi Kleen committed
      Some recent benchmarking on btrfs showed that a major scaling bottleneck
      on large systems on btrfs is currently the xattr lookup on every write.
      
      Why xattr lookup on every write I hear you ask?
      
      write wants to drop suid and security-related xattrs that could set
      capabilities for executables.  To do that it currently looks up
      security.capability on EVERY write (even for non-executables) to decide
      whether to drop it or not.
      
      In btrfs this causes an additional tree walk, hitting some per file system
      locks and quite bad scalability. In a simple read workload on a 8S
      system I saw over 90% CPU time in spinlocks related to that.
      
      Chris Mason tells me this is also a problem in ext4, where it hits
      the global mbcache lock.
      
      This patch adds a simple per-inode flag to avoid this problem.  We only
      do the lookup once per file and then, if there is no xattr, cache
      the decision. All xattr changes clear the flag.
      
      I also used the same flag to avoid the suid check, although
      that one is pretty cheap.
      
      A file system can also set this flag when it creates the inode,
      if it has a cheap way to do so.  This is done for some common file systems
      in followon patches.
      
      With this patch a major part of the lock contention disappears
      for btrfs. Some testing on smaller systems didn't show significant
      performance changes, but at least it helps the larger systems
      and is generally more efficient.
      
      v2: Rename is_sgid. add file system helper.
      Cc: chris.mason@oracle.com
      Cc: josef@redhat.com
      Cc: viro@zeniv.linux.org.uk
      Cc: agruen@linbit.com
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      69b45732
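The caching scheme described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the VFS implementation: the `Inode` class, the `nothing_to_drop` flag name, and the `lookups` counter are all hypothetical, and the suid part of the check is left out for brevity.

```python
class Inode:
    """Toy inode: a dict of xattrs plus a cached 'nothing to drop' flag."""

    def __init__(self, xattrs=None):
        self.xattrs = dict(xattrs or {})
        self.nothing_to_drop = False  # hypothetical name for the per-inode flag
        self.lookups = 0              # counts the expensive xattr lookups

    def set_xattr(self, name, value):
        self.xattrs[name] = value
        self.nothing_to_drop = False  # any xattr change clears the flag

    def _has_capability_xattr(self):
        self.lookups += 1  # stands in for the costly per-write lookup
        return "security.capability" in self.xattrs

    def write(self):
        if self.nothing_to_drop:
            return  # fast path: the decision was cached by an earlier write
        if self._has_capability_xattr():
            del self.xattrs["security.capability"]  # drop it on write
        else:
            self.nothing_to_drop = True  # cache only the negative result
```

In this model, repeated writes to a file without the xattr pay for a single lookup, and setting any xattr forces the next write to look again.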
  2. 28 May, 2011 9 commits
    • P
      rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state · cc3ce517
      Paul E. McKenney committed
      Upon creation, kthreads are in TASK_UNINTERRUPTIBLE state, which can
      result in softlockup warnings.  Because some of RCU's kthreads can
      legitimately be idle indefinitely, start them in TASK_INTERRUPTIBLE
      state in order to avoid those warnings.
      Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cc3ce517
    • P
      rcu: Remove waitqueue usage for cpu, node, and boost kthreads · 08bca60a
      Peter Zijlstra committed
      It is not necessary to use waitqueues for the RCU kthreads because
      we always know exactly which thread is to be awakened.  In addition,
      wake_up() only issues an actual wakeup when there is a thread waiting on
      the queue, which was why there was an extra explicit wake_up_process()
      to get the RCU kthreads started.
      
      Eliminating the waitqueues (and wake_up()) in favor of wake_up_process()
      eliminates the need for the initial wake_up_process() and also shrinks
      the data structure size a bit.  The wakeup logic is placed in a new
      rcu_wait() macro.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      08bca60a
    • P
      rcu: Avoid acquiring rcu_node locks in timer functions · 8826f3b0
      Paul E. McKenney committed
      This commit switches manipulations of the rcu_node ->wakemask field
      to atomic operations, which allows rcu_cpu_kthread_timer() to avoid
      acquiring the rcu_node lock.  This should avoid the following lockdep
      splat reported by Valdis Kletnieks:
      
      [   12.872150] usb 1-4: new high speed USB device number 3 using ehci_hcd
      [   12.986667] usb 1-4: New USB device found, idVendor=413c, idProduct=2513
      [   12.986679] usb 1-4: New USB device strings: Mfr=0, Product=0, SerialNumber=0
      [   12.987691] hub 1-4:1.0: USB hub found
      [   12.987877] hub 1-4:1.0: 3 ports detected
      [   12.996372] input: PS/2 Generic Mouse as /devices/platform/i8042/serio1/input/input10
      [   13.071471] udevadm used greatest stack depth: 3984 bytes left
      [   13.172129]
      [   13.172130] =======================================================
      [   13.172425] [ INFO: possible circular locking dependency detected ]
      [   13.172650] 2.6.39-rc6-mmotm0506 #1
      [   13.172773] -------------------------------------------------------
      [   13.172997] blkid/267 is trying to acquire lock:
      [   13.173009]  (&p->pi_lock){-.-.-.}, at: [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
      [   13.173009]
      [   13.173009] but task is already holding lock:
      [   13.173009]  (rcu_node_level_0){..-...}, at: [<ffffffff810901cc>] rcu_cpu_kthread_timer+0x27/0x58
      [   13.173009]
      [   13.173009] which lock already depends on the new lock.
      [   13.173009]
      [   13.173009]
      [   13.173009] the existing dependency chain (in reverse order) is:
      [   13.173009]
      [   13.173009] -> #2 (rcu_node_level_0){..-...}:
      [   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
      [   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
      [   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
      [   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
      [   13.173009]        [<ffffffff815697f1>] _raw_spin_lock+0x36/0x45
      [   13.173009]        [<ffffffff81090794>] rcu_read_unlock_special+0x8c/0x1d5
      [   13.173009]        [<ffffffff8109092c>] __rcu_read_unlock+0x4f/0xd7
      [   13.173009]        [<ffffffff81027bd3>] rcu_read_unlock+0x21/0x23
      [   13.173009]        [<ffffffff8102cc34>] cpuacct_charge+0x6c/0x75
      [   13.173009]        [<ffffffff81030cc6>] update_curr+0x101/0x12e
      [   13.173009]        [<ffffffff810311d0>] check_preempt_wakeup+0xf7/0x23b
      [   13.173009]        [<ffffffff8102acb3>] check_preempt_curr+0x2b/0x68
      [   13.173009]        [<ffffffff81031d40>] ttwu_do_wakeup+0x76/0x128
      [   13.173009]        [<ffffffff81031e49>] ttwu_do_activate.constprop.63+0x57/0x5c
      [   13.173009]        [<ffffffff81031e96>] scheduler_ipi+0x48/0x5d
      [   13.173009]        [<ffffffff810177d5>] smp_reschedule_interrupt+0x16/0x18
      [   13.173009]        [<ffffffff815710f3>] reschedule_interrupt+0x13/0x20
      [   13.173009]        [<ffffffff810b66d1>] rcu_read_unlock+0x21/0x23
      [   13.173009]        [<ffffffff810b739c>] find_get_page+0xa9/0xb9
      [   13.173009]        [<ffffffff810b8b48>] filemap_fault+0x6a/0x34d
      [   13.173009]        [<ffffffff810d1a25>] __do_fault+0x54/0x3e6
      [   13.173009]        [<ffffffff810d447a>] handle_pte_fault+0x12c/0x1ed
      [   13.173009]        [<ffffffff810d48f7>] handle_mm_fault+0x1cd/0x1e0
      [   13.173009]        [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
      [   13.173009]        [<ffffffff8156a75f>] page_fault+0x1f/0x30
      [   13.173009]
      [   13.173009] -> #1 (&rq->lock){-.-.-.}:
      [   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
      [   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
      [   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
      [   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
      [   13.173009]        [<ffffffff815697f1>] _raw_spin_lock+0x36/0x45
      [   13.173009]        [<ffffffff81027e19>] __task_rq_lock+0x8b/0xd3
      [   13.173009]        [<ffffffff81032f7f>] wake_up_new_task+0x41/0x108
      [   13.173009]        [<ffffffff810376c3>] do_fork+0x265/0x33f
      [   13.173009]        [<ffffffff81007d02>] kernel_thread+0x6b/0x6d
      [   13.173009]        [<ffffffff8153a9dd>] rest_init+0x21/0xd2
      [   13.173009]        [<ffffffff81b1db4f>] start_kernel+0x3bb/0x3c6
      [   13.173009]        [<ffffffff81b1d29f>] x86_64_start_reservations+0xaf/0xb3
      [   13.173009]        [<ffffffff81b1d393>] x86_64_start_kernel+0xf0/0xf7
      [   13.173009]
      [   13.173009] -> #0 (&p->pi_lock){-.-.-.}:
      [   13.173009]        [<ffffffff81067788>] check_prev_add+0x68/0x20e
      [   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
      [   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
      [   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
      [   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
      [   13.173009]        [<ffffffff815698ea>] _raw_spin_lock_irqsave+0x44/0x57
      [   13.173009]        [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
      [   13.173009]        [<ffffffff81032f3c>] wake_up_process+0x10/0x12
      [   13.173009]        [<ffffffff810901e9>] rcu_cpu_kthread_timer+0x44/0x58
      [   13.173009]        [<ffffffff81045286>] call_timer_fn+0xac/0x1e9
      [   13.173009]        [<ffffffff8104556d>] run_timer_softirq+0x1aa/0x1f2
      [   13.173009]        [<ffffffff8103e487>] __do_softirq+0x109/0x26a
      [   13.173009]        [<ffffffff8157144c>] call_softirq+0x1c/0x30
      [   13.173009]        [<ffffffff81003207>] do_softirq+0x44/0xf1
      [   13.173009]        [<ffffffff8103e8b9>] irq_exit+0x58/0xc8
      [   13.173009]        [<ffffffff81017f5a>] smp_apic_timer_interrupt+0x79/0x87
      [   13.173009]        [<ffffffff81570fd3>] apic_timer_interrupt+0x13/0x20
      [   13.173009]        [<ffffffff810bd51a>] get_page_from_freelist+0x2aa/0x310
      [   13.173009]        [<ffffffff810bdf03>] __alloc_pages_nodemask+0x178/0x243
      [   13.173009]        [<ffffffff8101fe2f>] pte_alloc_one+0x1e/0x3a
      [   13.173009]        [<ffffffff810d27fe>] __pte_alloc+0x22/0x14b
      [   13.173009]        [<ffffffff810d48a8>] handle_mm_fault+0x17e/0x1e0
      [   13.173009]        [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
      [   13.173009]        [<ffffffff8156a75f>] page_fault+0x1f/0x30
      [   13.173009]
      [   13.173009] other info that might help us debug this:
      [   13.173009]
      [   13.173009] Chain exists of:
      [   13.173009]   &p->pi_lock --> &rq->lock --> rcu_node_level_0
      [   13.173009]
      [   13.173009]  Possible unsafe locking scenario:
      [   13.173009]
      [   13.173009]        CPU0                    CPU1
      [   13.173009]        ----                    ----
      [   13.173009]   lock(rcu_node_level_0);
      [   13.173009]                                lock(&rq->lock);
      [   13.173009]                                lock(rcu_node_level_0);
      [   13.173009]   lock(&p->pi_lock);
      [   13.173009]
      [   13.173009]  *** DEADLOCK ***
      [   13.173009]
      [   13.173009] 3 locks held by blkid/267:
      [   13.173009]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8156cdb4>] do_page_fault+0x1f3/0x5de
      [   13.173009]  #1:  (&yield_timer){+.-...}, at: [<ffffffff810451da>] call_timer_fn+0x0/0x1e9
      [   13.173009]  #2:  (rcu_node_level_0){..-...}, at: [<ffffffff810901cc>] rcu_cpu_kthread_timer+0x27/0x58
      [   13.173009]
      [   13.173009] stack backtrace:
      [   13.173009] Pid: 267, comm: blkid Not tainted 2.6.39-rc6-mmotm0506 #1
      [   13.173009] Call Trace:
      [   13.173009]  <IRQ>  [<ffffffff8154a529>] print_circular_bug+0xc8/0xd9
      [   13.173009]  [<ffffffff81067788>] check_prev_add+0x68/0x20e
      [   13.173009]  [<ffffffff8100c861>] ? save_stack_trace+0x28/0x46
      [   13.173009]  [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
      [   13.173009]  [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
      [   13.173009]  [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
      [   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
      [   13.173009]  [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
      [   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
      [   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
      [   13.173009]  [<ffffffff815698ea>] _raw_spin_lock_irqsave+0x44/0x57
      [   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
      [   13.173009]  [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
      [   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
      [   13.173009]  [<ffffffff81032f3c>] wake_up_process+0x10/0x12
      [   13.173009]  [<ffffffff810901e9>] rcu_cpu_kthread_timer+0x44/0x58
      [   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
      [   13.173009]  [<ffffffff81045286>] call_timer_fn+0xac/0x1e9
      [   13.173009]  [<ffffffff810451da>] ? del_timer+0x75/0x75
      [   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
      [   13.173009]  [<ffffffff8104556d>] run_timer_softirq+0x1aa/0x1f2
      [   13.173009]  [<ffffffff8103e487>] __do_softirq+0x109/0x26a
      [   13.173009]  [<ffffffff8106365f>] ? tick_dev_program_event+0x37/0xf6
      [   13.173009]  [<ffffffff810a0e4a>] ? time_hardirqs_off+0x1b/0x2f
      [   13.173009]  [<ffffffff8157144c>] call_softirq+0x1c/0x30
      [   13.173009]  [<ffffffff81003207>] do_softirq+0x44/0xf1
      [   13.173009]  [<ffffffff8103e8b9>] irq_exit+0x58/0xc8
      [   13.173009]  [<ffffffff81017f5a>] smp_apic_timer_interrupt+0x79/0x87
      [   13.173009]  [<ffffffff81570fd3>] apic_timer_interrupt+0x13/0x20
      [   13.173009]  <EOI>  [<ffffffff810bd384>] ? get_page_from_freelist+0x114/0x310
      [   13.173009]  [<ffffffff810bd51a>] ? get_page_from_freelist+0x2aa/0x310
      [   13.173009]  [<ffffffff812220e7>] ? clear_page_c+0x7/0x10
      [   13.173009]  [<ffffffff810bd1ef>] ? prep_new_page+0x14c/0x1cd
      [   13.173009]  [<ffffffff810bd51a>] get_page_from_freelist+0x2aa/0x310
      [   13.173009]  [<ffffffff810bdf03>] __alloc_pages_nodemask+0x178/0x243
      [   13.173009]  [<ffffffff810d46b9>] ? __pmd_alloc+0x87/0x99
      [   13.173009]  [<ffffffff8101fe2f>] pte_alloc_one+0x1e/0x3a
      [   13.173009]  [<ffffffff810d46b9>] ? __pmd_alloc+0x87/0x99
      [   13.173009]  [<ffffffff810d27fe>] __pte_alloc+0x22/0x14b
      [   13.173009]  [<ffffffff810d48a8>] handle_mm_fault+0x17e/0x1e0
      [   13.173009]  [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
      [   13.173009]  [<ffffffff810d915f>] ? sys_brk+0x32/0x10c
      [   13.173009]  [<ffffffff810a0e4a>] ? time_hardirqs_off+0x1b/0x2f
      [   13.173009]  [<ffffffff81065c4f>] ? trace_hardirqs_off_caller+0x3f/0x9c
      [   13.173009]  [<ffffffff812235dd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
      [   13.173009]  [<ffffffff8156a75f>] page_fault+0x1f/0x30
      [   14.010075] usb 5-1: new full speed USB device number 2 using uhci_hcd
      Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8826f3b0
    • P
      atomic: Add atomic_or() · 55c2945a
      Paul E. McKenney committed
      An atomic_or() function is needed by TREE_RCU to avoid deadlock, so
      add a generic version.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      55c2945a
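
A generic atomic OR is commonly built from a compare-and-swap retry loop. The sketch below illustrates that technique in user space with C11 `<stdatomic.h>`; it is not the kernel patch itself, and `sketch_atomic_t` and `sketch_atomic_or()` are hypothetical names chosen for this illustration.

```c
#include <stdatomic.h>

/* Hypothetical user-space stand-in for the kernel's atomic_t. */
typedef struct {
    atomic_int counter;
} sketch_atomic_t;

/*
 * Atomically OR `mask` into `v->counter`.
 *
 * Read the current value, compute the new one, and try to swap it in.
 * If another thread changed the value in between, the compare-exchange
 * fails, reloads `old` with the current value, and the loop retries.
 */
static inline void sketch_atomic_or(int mask, sketch_atomic_t *v)
{
    int old = atomic_load(&v->counter);

    while (!atomic_compare_exchange_weak(&v->counter, &old, old | mask))
        ; /* `old` was refreshed by the failed exchange; try again */
}
```

Because the loop only commits when the value it read is still current, concurrent callers setting different bits cannot lose each other's updates.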
    • I
      Merge branch 'rcu/urgent' of... · 29f742f8
      Ingo Molnar committed
      Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/urgent
      29f742f8
    • P
      perf: Fix SIGIO handling · f506b3dc
      Peter Zijlstra committed
      Vince noticed that unless we mmap() a buffer, SIGIO gets lost. So
      explicitly push the wakeup (including signals) when requested.
      Reported-by: Vince Weaver <vweaver1@eecs.utk.edu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/n/tip-2euus3f3x3dyvdk52cjxw8zu@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f506b3dc
    • J
      Documentation: Add statistics about nested locks · f62508f6
      Juri Lelli committed
      Explain what the trailing "/1" on some lock class names in
      lock_stat output means.
      Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4DD4F6C1.5090701@gmail.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f62508f6
    • K
      cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed · 1e1b6c51
      KOSAKI Motohiro committed
      The rule is that we have to update tsk->rt.nr_cpus_allowed whenever we
      change tsk->cpus_allowed. Otherwise the RT scheduler may get confused.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1e1b6c51
    • P
      sched: Fix ->min_vruntime calculation in dequeue_entity() · 1e876231
      Peter Zijlstra committed
      Dima Zavin <dima@android.com> reported:
      
      "After pulling the thread off the run-queue during a cgroup change,
      the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
      then gets normalized to this new value. This can then lead to the thread
      getting an unfair boost in the new group if the vruntime of the next
      task in the old run-queue was way further ahead."
      Reported-by: Dima Zavin <dima@android.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Recalls-having-tested-once-upon-a-time-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1e876231
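
The effect described in the report can be checked with a little arithmetic. The toy model below is only an illustration under simplifying assumptions and is not scheduler code: `struct toy_rq` and the `toy_*` helpers are hypothetical, virtual runtimes are plain signed integers, and the queue minimum simply advances to the leftmost remaining task.

```c
#include <stdint.h>

/* Hypothetical, heavily simplified run queue: only the minimum is tracked. */
struct toy_rq {
    int64_t min_vruntime;
};

/* After a dequeue the minimum may only move forward, up to the
 * vruntime of the leftmost task that remains on the queue. */
static void toy_update_min_vruntime(struct toy_rq *rq, int64_t leftmost)
{
    if (leftmost > rq->min_vruntime)
        rq->min_vruntime = leftmost;
}

/* Order described in the report: recalculate the minimum first,
 * then make the task's vruntime relative to the *new* minimum. */
static int64_t toy_migrate_update_first(struct toy_rq *old_rq,
                                        const struct toy_rq *new_rq,
                                        int64_t task_vruntime,
                                        int64_t next_leftmost)
{
    toy_update_min_vruntime(old_rq, next_leftmost);
    int64_t rel = task_vruntime - old_rq->min_vruntime;
    return new_rq->min_vruntime + rel;
}

/* Alternative order: make the vruntime relative to the minimum the
 * task actually ran against, then recalculate the minimum. */
static int64_t toy_migrate_normalize_first(struct toy_rq *old_rq,
                                           const struct toy_rq *new_rq,
                                           int64_t task_vruntime,
                                           int64_t next_leftmost)
{
    int64_t rel = task_vruntime - old_rq->min_vruntime;
    toy_update_min_vruntime(old_rq, next_leftmost);
    return new_rq->min_vruntime + rel;
}
```

With an old queue minimum of 1000, a task at 1010, the next task at 5000, and a new queue minimum of 2000, the first order places the task at 2000 + (1010 - 5000) = -1990, far behind its new peers, while the second places it at 2000 + 10 = 2010.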