1. 20 6月, 2009 1 次提交
    • E
      inotify: inotify_destroy_mark_entry could get called twice · 528da3e9
      Eric Paris 提交于
      inotify_destroy_mark_entry could get called twice for the same mark since it
      is called directly in inotify_rm_watch and when the mark is being destroyed for
      another reason.  As an example assume that the file being watched was just
      deleted so inotify_destroy_mark_entry would get called from the path
      fsnotify_inoderemove() -> fsnotify_destroy_marks_by_inode() ->
      fsnotify_destroy_mark_entry() -> inotify_destroy_mark_entry().  If this
      happened at the same time as userspace tried to remove a watch via
      inotify_rm_watch we could attempt to remove the mark from the idr twice and
      could thus double dec the ref cnt and potentially could be in a use after
      free/double free situation.  The fix is to have inotify_rm_watch use the
      generic recursive safe fsnotify_destroy_mark_by_entry() so we are sure the
      inotify_destroy_mark_entry() function can only be called one.
      
      This patch also renames the function to inotify_ingored_remove_idr() so it is
      clear what is actually going on in the function.
      
      Hopefully this fixes:
      [   20.342058] idr_remove called for id=20 which is not allocated.
      [   20.348000] Pid: 1860, comm: udevd Not tainted 2.6.30-tip #1077
      [   20.353933] Call Trace:
      [   20.356410]  [<ffffffff811a82b7>] idr_remove+0x115/0x18f
      [   20.361737]  [<ffffffff8134259d>] ? _spin_lock+0x6d/0x75
      [   20.367061]  [<ffffffff8111640a>] ? inotify_destroy_mark_entry+0xa3/0xcf
      [   20.373771]  [<ffffffff8111641e>] inotify_destroy_mark_entry+0xb7/0xcf
      [   20.380306]  [<ffffffff81115913>] inotify_freeing_mark+0xe/0x10
      [   20.386238]  [<ffffffff8111410d>] fsnotify_destroy_mark_by_entry+0x143/0x170
      [   20.393293]  [<ffffffff811163a3>] inotify_destroy_mark_entry+0x3c/0xcf
      [   20.399829]  [<ffffffff811164d1>] sys_inotify_rm_watch+0x9b/0xc6
      [   20.405850]  [<ffffffff8100bcdb>] system_call_fastpath+0x16/0x1b
      Reported-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Tested-by: NPeter Ziljlstra <peterz@infradead.org>
      528da3e9
  2. 19 6月, 2009 15 次提交
  3. 18 6月, 2009 1 次提交
  4. 17 6月, 2009 23 次提交
    • D
      Hugetlbfs: Enable hugetlbfs for more systems in Kconfig. · 852969b2
      David Daney 提交于
      As part of adding hugetlbfs support for MIPS, I am adding a new
      kconfig variable 'SYS_SUPPORTS_HUGETLBFS'.  Since some mips cpu
      varients don't yet support it, we can enable selection of HUGETLBFS on
      a system by system basis from the arch/mips/Kconfig.
      Signed-off-by: NDavid Daney <ddaney@caviumnetworks.com>
      CC: William Irwin <wli@holomorphy.com>
      Signed-off-by: NRalf Baechle <ralf@linux-mips.org>
      852969b2
    • A
      get rid of BKL in fs/sysv · 5ac3455a
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5ac3455a
    • A
      get rid of BKL in fs/minix · cc46759a
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cc46759a
    • A
      get rid of BKL in fs/efs · e7ec952f
      Al Viro 提交于
      Only readdir() really needed it, and that's easily fixable by switch to
      generic_file_llseek()
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e7ec952f
    • A
      befs ->pust_super() doesn't need BKL · 536c9490
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      536c9490
    • A
      Cleanup of adfs headers · 608ba50b
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      608ba50b
    • A
      9P doesn't need BKL in ->umount_begin() · ee450f79
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ee450f79
    • A
      fuse doesn't need BKL in ->umount_begin() · 66c6af2e
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      66c6af2e
    • A
      No instance of ->bmap() needs BKL · fe36adf4
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fe36adf4
    • J
      remove unlock_kernel() left accidentally · b0895513
      J. R. Okajima 提交于
      commit 337eb00a
      Push BKL down into ->remount_fs()
      and
      commit 4aa98cf7
      Push BKL down into do_remount_sb()
      
      were uncorrectly merged.
      The former removes one pair of lock/unlock_kernel(), but the latter adds
      several unlock_kernel(). Finally a few unlock_kernel() calls left.
      Signed-off-by: NJ. R. Okajima <hooanon05@yahoo.co.jp>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b0895513
    • T
      ext4: avoid unnecessary spinlock in critical POSIX ACL path · 210ad6ae
      Theodore Ts'o 提交于
      If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem
      to do POSIX ACL checks on any files not owned by the caller, and it does
      this for every single pathname component that it looks up.
      
      That obviously can be pretty expensive if the filesystem isn't careful
      about it, especially with locking. That's doubly sad, since the common
      case tends to be that there are no ACL's associated with the files in
      question.
      
      ext4 already caches the ACL data so that it doesn't have to look it up
      over and over again, but it does so by taking the inode->i_lock spinlock
      on every lookup. Which is a noticeable overhead even if it's a private
      lock, especially on CPU's where the serialization is expensive (eg Intel
      Netburst aka 'P4').
      
      For the special case of not actually having any ACL's, all that locking is
      unnecessary. Even if somebody else were to be changing the ACL's on
      another CPU, we simply don't care - if we've seen a NULL ACL, we might as
      well use it.
      
      So just load the ACL speculatively without any locking, and if it was
      NULL, just use it. If it's non-NULL (either because we had a cached
      entry, or because the cache hasn't been filled in at all), it means that
      we'll need to get the lock and re-load it properly.
      
      (This commit was ported from a patch originally authored by Linus for
      ext3.)
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      210ad6ae
    • L
      ext3: avoid unnecessary spinlock in critical POSIX ACL path · 9c64daff
      Linus Torvalds 提交于
      If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem
      to do POSIX ACL checks on any files not owned by the caller, and it does
      this for every single pathname component that it looks up.
      
      That obviously can be pretty expensive if the filesystem isn't careful
      about it, especially with locking. That's doubly sad, since the common
      case tends to be that there are no ACL's associated with the files in
      question.
      
      ext3 already caches the ACL data so that it doesn't have to look it up
      over and over again, but it does so by taking the inode->i_lock spinlock
      on every lookup. Which is a noticeable overhead even if it's a private
      lock, especially on CPU's where the serialization is expensive (eg Intel
      Netburst aka 'P4').
      
      For the special case of not actually having any ACL's, all that locking is
      unnecessary. Even if somebody else were to be changing the ACL's on
      another CPU, we simply don't care - if we've seen a NULL ACL, we might as
      well use it.
      
      So just load the ACL speculatively without any locking, and if it was
      NULL, just use it. If it's non-NULL (either because we had a cached
      entry, or because the cache hasn't been filled in at all), it means that
      we'll need to get the lock and re-load it properly.
      
      This is noticeable even on Nehalem, which does locking quite well (much
      better than P4). From lmbench:
      
      	Processor, Processes - times in microseconds - smaller is better
      	--------------------------------------------------------------------
      	Host                 OS  Mhz null null      open slct fork exec sh
      	                             call  I/O stat clos TCP  proc proc proc
      	--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ----
       - before:
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.45 2.18 69.1 273. 1141
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.48 2.28 69.9 253. 1140
      	nehalem.l Linux 2.6.30- 3193 0.04 0.10 0.95 1.42 2.19 68.6 284. 1141
       - after:
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.44 2.12 68.3 282. 1094
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.20 67.0 308. 1123
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.36 67.4 293. 1148
      
      where you can see what appears to be a roughly 3% improvement in stat
      and open/close latencies from just the removal of the locking overhead.
      
      Of course, this only matters for files you don't own (the owner never
      needs to do the ACL checks), but that's the common case for libraries,
      header files, and executables. As well as for the base components of any
      absolute pathname, even if you are the owner of the final file.
      
      [ At some point we probably want to move this ACL caching logic entirely
        into the VFS layer (and only call down to the filesystem when
        uncached), but in the meantime this improves ext3 a bit.
      
        A similar fix to btrfs makes a much bigger difference (15x improvement
        in lmbench) due to broken caching. ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Acked-by: NJan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9c64daff
    • D
      AFS: Correctly translate auth error aborts and don't failover in such cases · 005411c3
      David Howells 提交于
      Authentication error abort codes should be translated to appropriate
      Linux error codes, rather than all being translated to EREMOTEIO - which
      indicates that the server had internal problems.
      
      Additionally, a server shouldn't be marked unavailable and the next
      server tried if an authentication error occurs.  This will quickly make
      all the servers unavailable to the client.  Instead the error should be
      returned straight to the user.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      005411c3
    • T
      CONFIG_FILE_LOCKING should not depend on CONFIG_BLOCK · 69050eee
      Tomas Szepe 提交于
      CONFIG_FILE_LOCKING should not depend on CONFIG_BLOCK.
      
      This makes it possible to run complete systems out of a CONFIG_BLOCK=n
      initramfs on current kernels again (this last worked on 2.6.27.*).
      
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69050eee
    • T
      remove put_cpu_no_resched() · 8b0b1db0
      Thomas Gleixner 提交于
      put_cpu_no_resched() is an optimization of put_cpu() which unfortunately
      can cause high latencies.
      
      The nfs iostats code uses put_cpu_no_resched() in a code sequence where a
      reschedule request caused by an interrupt between the get_cpu() and the
      put_cpu_no_resched() can delay the reschedule for at least HZ.
      
      The other users of put_cpu_no_resched() optimize correctly in interrupt
      code, but there is no real harm in using the put_cpu() function which is
      an alias for preempt_enable().  The extra check of the preemmpt count is
      not as critical as the potential source of missing a reschedule.
      
      Debugged in the preempt-rt tree and verified in mainline.
      
      Impact: remove a high latency source
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8b0b1db0
    • E
      poll: avoid extra wakeups in select/poll · 4938d7e0
      Eric Dumazet 提交于
      After introduction of keyed wakeups Davide Libenzi did on epoll, we are
      able to avoid spurious wakeups in poll()/select() code too.
      
      For example, typical use of poll()/select() is to wait for incoming
      network frames on many sockets.  But TX completion for UDP/TCP frames call
      sock_wfree() which in turn schedules thread.
      
      When scheduled, thread does a full scan of all polled fds and can sleep
      again, because nothing is really available.  If number of fds is large,
      this cause significant load.
      
      This patch makes select()/poll() aware of keyed wakeups and useless
      wakeups are avoided.  This reduces number of context switches by about 50%
      on some setups, and work performed by sofirq handlers.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NDavide Libenzi <davidel@xmailserver.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4938d7e0
    • R
      ntfs: use is_power_of_2() function for clarity. · 02d5341a
      Robert P. J. Day 提交于
      Signed-off-by: NRobert P. J. Day <rpjday@crashcourse.ca>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02d5341a
    • W
      writeback: skip new or to-be-freed inodes · 84a89245
      Wu Fengguang 提交于
      1) I_FREEING tests should be coupled with I_CLEAR
      
      The two I_FREEING tests are racy because clear_inode() can set i_state to
      I_CLEAR between the clear of I_SYNC and the test of I_FREEING.
      
      2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
         races with generic_forget_inode()
      
      generic_forget_inode() sets I_WILL_FREE call writeback on its own, so
      generic_sync_sb_inodes() shall not try to step in and create possible races:
      
        generic_forget_inode
          inode->i_state |= I_WILL_FREE;
          spin_unlock(&inode_lock);
                                             generic_sync_sb_inodes()
                                               spin_lock(&inode_lock);
                                               __iget(inode);
                                               __writeback_single_inode
                                                 // see non zero i_count
       may WARN here ==>                         WARN_ON(inode->i_state & I_WILL_FREE);
                                               spin_unlock(&inode_lock);
       may call generic_forget_inode again ==> iput(inode);
      
      The above race and warning didn't turn up because writeback_inodes() holds
      the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
      early.  But we are not sure the UBIFS calls and future callers will
      guarantee that.  So skip I_WILL_FREE inodes for the sake of safety.
      
      Cc: Eric Sandeen <sandeen@sandeen.net>
      Acked-by: NJeff Layton <jlayton@redhat.com>
      Cc: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: NJan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84a89245
    • M
      mm: remove __invalidate_mapping_pages variant · 28697355
      Mike Waychison 提交于
      Remove __invalidate_mapping_pages atomic variant now that its sole caller
      can sleep (fixed in eccb95ce ("vfs: fix
      lock inversion in drop_pagecache_sb()")).
      
      This fixes softlockups that can occur while in the drop_caches path.
      Signed-off-by: NMike Waychison <mikew@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28697355
    • D
      oom: move oom_adj value from task_struct to mm_struct · 2ff05b2b
      David Rientjes 提交于
      The per-task oom_adj value is a characteristic of its mm more than the
      task itself since it's not possible to oom kill any thread that shares the
      mm.  If a task were to be killed while attached to an mm that could not be
      freed because another thread were set to OOM_DISABLE, it would have
      needlessly been terminated since there is no potential for future memory
      freeing.
      
      This patch moves oomkilladj (now more appropriately named oom_adj) from
      struct task_struct to struct mm_struct.  This requires task_lock() on a
      task to check its oom_adj value to protect against exec, but it's already
      necessary to take the lock when dereferencing the mm to find the total VM
      size for the badness heuristic.
      
      This fixes a livelock if the oom killer chooses a task and another thread
      sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
      because oom_kill_task() repeatedly returns 1 and refuses to kill the
      chosen task while select_bad_process() will repeatedly choose the same
      task during the next retry.
      
      Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
      oom_kill_task() to check for threads sharing the same memory will be
      removed in the next patch in this series where it will no longer be
      necessary.
      
      Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
      these threads are immune from oom killing already.  They simply report an
      oom_adj value of OOM_DISABLE.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ff05b2b
    • K
      mm: remove CONFIG_UNEVICTABLE_LRU config option · 68377659
      KOSAKI Motohiro 提交于
      Currently, nobody wants to turn UNEVICTABLE_LRU off.  Thus this
      configurability is unnecessary.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68377659
    • W
      proc: export more page flags in /proc/kpageflags · 17797549
      Wu Fengguang 提交于
      Export all page flags faithfully in /proc/kpageflags.
      
      	11. KPF_MMAP		(pseudo flag) memory mapped page
      	12. KPF_ANON		(pseudo flag) memory mapped page (anonymous)
      	13. KPF_SWAPCACHE	page is in swap cache
      	14. KPF_SWAPBACKED	page is swap/RAM backed
      	15. KPF_COMPOUND_HEAD	(*)
      	16. KPF_COMPOUND_TAIL	(*)
      	17. KPF_HUGE		hugeTLB pages
      	18. KPF_UNEVICTABLE	page is in the unevictable LRU list
      	19. KPF_HWPOISON(TBD)	hardware detected corruption
      	20. KPF_NOPAGE		(pseudo flag) no page frame at the address
      	32-39.			more obscure flags for kernel developers
      
      	(*) For compound pages, exporting _both_ head/tail info enables
      	    users to tell where a compound page starts/ends, and its order.
      
      The accompanying page-types tool will handle the details like decoupling
      overloaded flags and hiding obscure flags to normal users.
      
      Thanks to KOSAKI and Andi for their valuable recommendations!
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      17797549
    • W
      proc: kpagecount/kpageflags code cleanup · ed7ce0f1
      Wu Fengguang 提交于
      Move increments of pfn/out to bottom of the loop.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: NMatt Mackall <mpm@selenic.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed7ce0f1