1. 13 Jun, 2009 1 commit
  2. 18 Jun, 2009 1 commit
  3. 28 Apr, 2009 2 commits
    • T
      ext4: avoid unnecessary spinlock in critical POSIX ACL path · 8b0f9e8f
      Theodore Ts'o committed
      If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem
      to do POSIX ACL checks on any files not owned by the caller, and it does
      this for every single pathname component that it looks up.
      
      That obviously can be pretty expensive if the filesystem isn't careful
      about it, especially with locking. That's doubly sad, since the common
      case tends to be that there are no ACL's associated with the files in
      question.
      
      ext4 already caches the ACL data so that it doesn't have to look it up
      over and over again, but it does so by taking the inode->i_lock spinlock
      on every lookup. Which is a noticeable overhead even if it's a private
      lock, especially on CPU's where the serialization is expensive (eg Intel
      Netburst aka 'P4').
      
      For the special case of not actually having any ACL's, all that locking is
      unnecessary. Even if somebody else were to be changing the ACL's on
      another CPU, we simply don't care - if we've seen a NULL ACL, we might as
      well use it.
      
      So just load the ACL speculatively without any locking, and if it was
      NULL, just use it. If it's non-NULL (either because we had a cached
      entry, or because the cache hasn't been filled in at all), it means that
      we'll need to get the lock and re-load it properly.
      
      (This commit was ported from a patch originally authored by Linus for
      ext3.)
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      8b0f9e8f
    • L
      ext3: avoid unnecessary spinlock in critical POSIX ACL path · 96159f25
      Linus Torvalds committed
      If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem 
      to do POSIX ACL checks on any files not owned by the caller, and it does 
      this for every single pathname component that it looks up.
      
      That obviously can be pretty expensive if the filesystem isn't careful 
      about it, especially with locking. That's doubly sad, since the common 
      case tends to be that there are no ACL's associated with the files in 
      question.
      
      ext3 already caches the ACL data so that it doesn't have to look it up 
      over and over again, but it does so by taking the inode->i_lock spinlock 
      on every lookup. Which is a noticeable overhead even if it's a private 
      lock, especially on CPU's where the serialization is expensive (eg Intel 
      Netburst aka 'P4').
      
      For the special case of not actually having any ACL's, all that locking is 
      unnecessary. Even if somebody else were to be changing the ACL's on 
      another CPU, we simply don't care - if we've seen a NULL ACL, we might as 
      well use it.
      
      So just load the ACL speculatively without any locking, and if it was 
      NULL, just use it. If it's non-NULL (either because we had a cached 
      entry, or because the cache hasn't been filled in at all), it means that 
      we'll need to get the lock and re-load it properly.
      
      This is noticeable even on Nehalem, which does locking quite well (much 
      better than P4). From lmbench:
      
      	Processor, Processes - times in microseconds - smaller is better
      	--------------------------------------------------------------------
      	Host                 OS  Mhz null null      open slct fork exec sh  
      	                             call  I/O stat clos TCP  proc proc proc
      	--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ----
       - before:
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.45 2.18 69.1 273. 1141
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.48 2.28 69.9 253. 1140
      	nehalem.l Linux 2.6.30- 3193 0.04 0.10 0.95 1.42 2.19 68.6 284. 1141
       - after:
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.44 2.12 68.3 282. 1094
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.20 67.0 308. 1123
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.36 67.4 293. 1148
      
      where you can see what appears to be a roughly 3% improvement in stat
      and open/close latencies from just the removal of the locking overhead. 
      
      Of course, this only matters for files you don't own (the owner never 
      needs to do the ACL checks), but that's the common case for libraries, 
      header files, and executables. As well as for the base components of any 
      absolute pathname, even if you are the owner of the final file.
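A minimal userspace sketch of the speculative load described above, not the ext3 code itself. The types `acl_sketch` and `inode_sketch`, the `ACL_NOT_CACHED` sentinel, and the `atomic_flag` used in place of `inode->i_lock` are all assumptions made for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not the kernel's types. */
struct acl_sketch {
    int refcount;
};

/* Sentinel for "cache not filled in yet" (an assumption of this sketch). */
#define ACL_NOT_CACHED ((struct acl_sketch *)-1)

struct inode_sketch {
    atomic_flag lock;                     /* plays the role of inode->i_lock */
    _Atomic(struct acl_sketch *) cached;  /* NULL, ACL_NOT_CACHED, or an ACL */
    int locked_lookups;                   /* counts how often the lock was taken */
};

static void inode_sketch_init(struct inode_sketch *inode, struct acl_sketch *acl)
{
    assert(inode != NULL);
    atomic_flag_clear(&inode->lock);
    atomic_init(&inode->cached, acl);
    inode->locked_lookups = 0;
}

static struct acl_sketch *get_cached_acl(struct inode_sketch *inode)
{
    /* Speculative, lockless load: a NULL result can be used as-is. */
    struct acl_sketch *acl =
        atomic_load_explicit(&inode->cached, memory_order_relaxed);
    if (acl == NULL)
        return NULL;

    /* Non-NULL: take the lock and re-load before touching the ACL. */
    while (atomic_flag_test_and_set_explicit(&inode->lock, memory_order_acquire))
        ;  /* spin */
    inode->locked_lookups++;
    acl = atomic_load_explicit(&inode->cached, memory_order_relaxed);
    if (acl != NULL && acl != ACL_NOT_CACHED)
        acl->refcount++;  /* the caller now holds a reference */
    atomic_flag_clear_explicit(&inode->lock, memory_order_release);
    return acl;
}
```

In the common case of a file with no ACL, `get_cached_acl()` returns after a single load and never touches the lock, which is the saving the message describes.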
      
      [ At some point we probably want to move this ACL caching logic entirely
        into the VFS layer (and only call down to the filesystem when
        uncached), but in the meantime this improves ext3 a bit.
      
        A similar fix to btrfs makes a much bigger difference (15x improvement
        in lmbench) due to broken caching. ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      96159f25
  4. 17 Jun, 2009 17 commits
    • T
      9bffad1e
    • T
      879c5e6b
    • D
      AFS: Correctly translate auth error aborts and don't failover in such cases · 005411c3
      David Howells committed
      Authentication error abort codes should be translated to appropriate
      Linux error codes, rather than all being translated to EREMOTEIO - which
      indicates that the server had internal problems.
      
      Additionally, a server shouldn't be marked unavailable and the next
      server tried if an authentication error occurs.  This will quickly make
      all the servers unavailable to the client.  Instead the error should be
      returned straight to the user.
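The policy above can be sketched roughly as follows. The abort code names are hypothetical placeholders rather than real AFS or rxkad values, and the particular errno chosen for each is an assumption of this sketch (the Linux-specific `EKEYEXPIRED`, `EKEYREJECTED` and `EREMOTEIO` come from `<errno.h>`):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical abort codes for illustration only. */
enum sketch_abort {
    SKETCH_ABORT_SERVER_FAULT = 1,
    SKETCH_ABORT_TICKET_EXPIRED,
    SKETCH_ABORT_TICKET_REJECTED,
    SKETCH_ABORT_NOT_AUTHORISED,
};

/* Translate an abort code to a negative errno instead of a blanket -EREMOTEIO. */
static int sketch_abort_to_errno(enum sketch_abort code)
{
    switch (code) {
    case SKETCH_ABORT_TICKET_EXPIRED:  return -EKEYEXPIRED;
    case SKETCH_ABORT_TICKET_REJECTED: return -EKEYREJECTED;
    case SKETCH_ABORT_NOT_AUTHORISED:  return -EACCES;
    default:                           return -EREMOTEIO;
    }
}

/* Authentication failures go straight back to the user; only other
 * errors mark the server unavailable and move on to the next one. */
static bool sketch_should_failover(int err)
{
    assert(err <= 0);
    switch (err) {
    case 0:
    case -EKEYEXPIRED:
    case -EKEYREJECTED:
    case -EACCES:
        return false;
    default:
        return true;
    }
}
```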
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      005411c3
    • T
      CONFIG_FILE_LOCKING should not depend on CONFIG_BLOCK · 69050eee
      Tomas Szepe committed
      CONFIG_FILE_LOCKING should not depend on CONFIG_BLOCK.
      
      This makes it possible to run complete systems out of a CONFIG_BLOCK=n
      initramfs on current kernels again (this last worked on 2.6.27.*).
      
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69050eee
    • T
      remove put_cpu_no_resched() · 8b0b1db0
      Thomas Gleixner committed
      put_cpu_no_resched() is an optimization of put_cpu() which unfortunately
      can cause high latencies.
      
      The nfs iostats code uses put_cpu_no_resched() in a code sequence where a
      reschedule request caused by an interrupt between the get_cpu() and the
      put_cpu_no_resched() can delay the reschedule for at least HZ.
      
      The other users of put_cpu_no_resched() optimize correctly in interrupt
      code, but there is no real harm in using the put_cpu() function which is
      an alias for preempt_enable().  The extra check of the preempt count is
      not as critical as the potential source of missing a reschedule.
      
      Debugged in the preempt-rt tree and verified in mainline.
      
      Impact: remove a high latency source
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b0b1db0
    • E
      poll: avoid extra wakeups in select/poll · 4938d7e0
      Eric Dumazet committed
      After the introduction of keyed wakeups, which Davide Libenzi did for
      epoll, we are able to avoid spurious wakeups in poll()/select() code too.
      
      For example, a typical use of poll()/select() is to wait for incoming
      network frames on many sockets.  But TX completion for UDP/TCP frames
      calls sock_wfree(), which in turn schedules the thread.
      
      When scheduled, the thread does a full scan of all polled fds and can
      sleep again, because nothing is really available.  If the number of fds
      is large, this causes significant load.
      
      This patch makes select()/poll() aware of keyed wakeups, so useless
      wakeups are avoided.  This reduces the number of context switches by
      about 50% on some setups, as well as the work performed by softirq
      handlers.
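A small userspace simulation of the keyed-wakeup idea, assuming a hypothetical `sketch_waiter` that records the events its thread is polling for. The event that triggered the wakeup is passed as a key, and a waiter is skipped when the key is set but does not overlap its events. This only models the filtering and is not the kernel's wait-queue code:

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical waiter: the events it polls for and how often it was woken. */
struct sketch_waiter {
    unsigned int wanted;
    int wakeups;
};

/* Wake every waiter interested in `key`. A key of 0 models an unkeyed
 * wakeup, which wakes everyone, as before keyed wakeups existed.
 * Returns the number of waiters woken. */
static size_t sketch_wake_keyed(struct sketch_waiter *w, size_t n,
                                unsigned int key)
{
    size_t woken = 0;

    assert(w != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (key != 0 && (key & w[i].wanted) == 0)
            continue;  /* event is irrelevant to this waiter: no wakeup */
        w[i].wakeups++;
        woken++;
    }
    return woken;
}
```

With this filter, a TX completion reported as `POLLOUT` leaves a thread that only polls for `POLLIN` asleep, so it never performs the full fd scan.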
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4938d7e0
    • R
      ntfs: use is_power_of_2() function for clarity. · 02d5341a
      Robert P. J. Day committed
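For reference, a userspace sketch of the test that a helper named is_power_of_2() performs. It is written from the usual bit trick and is believed to match the kernel helper, but the kernel's own definition is not reproduced in this log:

```c
#include <assert.h>
#include <stdbool.h>

/* A power of two has exactly one bit set, so clearing the lowest set
 * bit with n & (n - 1) must leave zero. Zero itself is excluded. */
static inline bool is_power_of_2(unsigned long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}
```

Compared with an open-coded `(n & (n - 1)) == 0`, the named helper states the intent and handles the zero case explicitly.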
      Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02d5341a
    • W
      writeback: skip new or to-be-freed inodes · 84a89245
      Wu Fengguang committed
      1) I_FREEING tests should be coupled with I_CLEAR
      
      The two I_FREEING tests are racy because clear_inode() can set i_state to
      I_CLEAR between the clear of I_SYNC and the test of I_FREEING.
      
      2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
         races with generic_forget_inode()
      
      generic_forget_inode() sets I_WILL_FREE and calls writeback on its own, so
      generic_sync_sb_inodes() shall not try to step in and create possible races:
      
        generic_forget_inode
          inode->i_state |= I_WILL_FREE;
          spin_unlock(&inode_lock);
                                             generic_sync_sb_inodes()
                                               spin_lock(&inode_lock);
                                               __iget(inode);
                                               __writeback_single_inode
                                                 // see non zero i_count
       may WARN here ==>                         WARN_ON(inode->i_state & I_WILL_FREE);
                                               spin_unlock(&inode_lock);
       may call generic_forget_inode again ==> iput(inode);
      
      The above race and warning didn't turn up because writeback_inodes() holds
      the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
      early.  But we are not sure the UBIFS calls and future callers will
      guarantee that.  So skip I_WILL_FREE inodes for the sake of safety.
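The two checks described above can be sketched as flag tests. The `SK_I_*` bit values are placeholders chosen for this sketch, not the kernel's actual `I_*` values, and both helper names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values; the kernel's real I_* flags are not reproduced. */
#define SK_I_NEW       (1u << 0)
#define SK_I_WILL_FREE (1u << 1)
#define SK_I_FREEING   (1u << 2)
#define SK_I_CLEAR     (1u << 3)
#define SK_I_DIRTY     (1u << 4)

static_assert((SK_I_NEW & SK_I_WILL_FREE) == 0, "flags must be distinct bits");
static_assert((SK_I_FREEING & SK_I_CLEAR) == 0, "flags must be distinct bits");

/* Point 2: the sync loop leaves new and about-to-be-freed inodes alone. */
static bool sketch_skip_in_sync(unsigned int i_state)
{
    return (i_state & (SK_I_NEW | SK_I_WILL_FREE)) != 0;
}

/* Point 1: test I_FREEING together with I_CLEAR, because the state can
 * move from one to the other between two separate checks. */
static bool sketch_inode_going_away(unsigned int i_state)
{
    return (i_state & (SK_I_FREEING | SK_I_CLEAR)) != 0;
}
```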
      
      Cc: Eric Sandeen <sandeen@sandeen.net>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      Cc: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84a89245
    • M
      mm: remove __invalidate_mapping_pages variant · 28697355
      Mike Waychison committed
      Remove __invalidate_mapping_pages atomic variant now that its sole caller
      can sleep (fixed in eccb95ce ("vfs: fix
      lock inversion in drop_pagecache_sb()")).
      
      This fixes softlockups that can occur while in the drop_caches path.
      Signed-off-by: Mike Waychison <mikew@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28697355
    • D
      oom: move oom_adj value from task_struct to mm_struct · 2ff05b2b
      David Rientjes committed
      The per-task oom_adj value is a characteristic of its mm more than the
      task itself since it's not possible to oom kill any thread that shares the
      mm.  If a task were to be killed while attached to an mm that could not be
      freed because another thread were set to OOM_DISABLE, it would have
      needlessly been terminated since there is no potential for future memory
      freeing.
      
      This patch moves oomkilladj (now more appropriately named oom_adj) from
      struct task_struct to struct mm_struct.  This requires task_lock() on a
      task to check its oom_adj value to protect against exec, but it's already
      necessary to take the lock when dereferencing the mm to find the total VM
      size for the badness heuristic.
      
      This fixes a livelock if the oom killer chooses a task and another thread
      sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
      because oom_kill_task() repeatedly returns 1 and refuses to kill the
      chosen task while select_bad_process() will repeatedly choose the same
      task during the next retry.
      
      Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
      oom_kill_task() to check for threads sharing the same memory will be
      removed in the next patch in this series where it will no longer be
      necessary.
      
      Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
      these threads are immune from oom killing already.  They simply report an
      oom_adj value of OOM_DISABLE.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ff05b2b
    • K
      mm: remove CONFIG_UNEVICTABLE_LRU config option · 68377659
      KOSAKI Motohiro committed
      Currently, nobody wants to turn UNEVICTABLE_LRU off.  Thus this
      configurability is unnecessary.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68377659
    • W
      proc: export more page flags in /proc/kpageflags · 17797549
      Wu Fengguang committed
      Export all page flags faithfully in /proc/kpageflags.
      
      	11. KPF_MMAP		(pseudo flag) memory mapped page
      	12. KPF_ANON		(pseudo flag) memory mapped page (anonymous)
      	13. KPF_SWAPCACHE	page is in swap cache
      	14. KPF_SWAPBACKED	page is swap/RAM backed
      	15. KPF_COMPOUND_HEAD	(*)
      	16. KPF_COMPOUND_TAIL	(*)
      	17. KPF_HUGE		hugeTLB pages
      	18. KPF_UNEVICTABLE	page is in the unevictable LRU list
      	19. KPF_HWPOISON(TBD)	hardware detected corruption
      	20. KPF_NOPAGE		(pseudo flag) no page frame at the address
      	32-39.			more obscure flags for kernel developers
      
      	(*) For compound pages, exporting _both_ head/tail info enables
      	    users to tell where a compound page starts/ends, and its order.
      
      The accompanying page-types tool will handle the details like decoupling
      overloaded flags and hiding obscure flags to normal users.
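A sketch of how userspace might turn one 64-bit flags word from /proc/kpageflags into names. Bits 11 to 20 follow the list above; bits 0 to 10 are inferred from the order of the bit-names printed by the page-types tool elsewhere in this log. The helper `kpf_decode()` is hypothetical and ignores the raw and overloaded bits above 20:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Names for bits 0..20, in bit order. */
static const char *const kpf_names[] = {
    "locked", "error", "referenced", "uptodate", "dirty", "lru", "active",
    "slab", "writeback", "reclaim", "buddy", "mmap", "anonymous",
    "swapcache", "swapbacked", "compound_head", "compound_tail", "huge",
    "unevictable", "hwpoison", "nopage",
};

/* Write a comma-separated list of the set flag names into buf and
 * return buf. Output is cut short if buf is too small. */
static char *kpf_decode(uint64_t flags, char *buf, size_t len)
{
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t bit = 0; bit < sizeof(kpf_names) / sizeof(kpf_names[0]); bit++) {
        if (!(flags & (UINT64_C(1) << bit)))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? "," : "", kpf_names[bit]);
        if (n < 0 || (size_t)n >= len - used)
            break;  /* no room left */
        used += (size_t)n;
    }
    return buf;
}
```

Decoding `0x5868` this way gives the same long-symbolic-flags string that page-types prints for that value in its summary table.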
      
      Thanks to KOSAKI and Andi for their valuable recommendations!
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17797549
    • W
      proc: kpagecount/kpageflags code cleanup · ed7ce0f1
      Wu Fengguang committed
      Move increments of pfn/out to bottom of the loop.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ed7ce0f1
    • W
      mm: introduce PageHuge() for testing huge/gigantic pages · 20a0307c
      Wu Fengguang committed
      A series of patches to enhance the /proc/pagemap interface and to add a
      userspace executable which can be used to present the pagemap data.
      
      Export 10 more flags to end users (and more for kernel developers):
      
              11. KPF_MMAP            (pseudo flag) memory mapped page
              12. KPF_ANON            (pseudo flag) memory mapped page (anonymous)
              13. KPF_SWAPCACHE       page is in swap cache
              14. KPF_SWAPBACKED      page is swap/RAM backed
              15. KPF_COMPOUND_HEAD   (*)
              16. KPF_COMPOUND_TAIL   (*)
              17. KPF_HUGE		hugeTLB pages
              18. KPF_UNEVICTABLE     page is in the unevictable LRU list
              19. KPF_HWPOISON        hardware detected corruption
              20. KPF_NOPAGE          (pseudo flag) no page frame at the address
      
              (*) For compound pages, exporting _both_ head/tail info enables
                  users to tell where a compound page starts/ends, and its order.
      
      a simple demo of the page-types tool
      
      # ./page-types -h
      page-types [options]
                  -r|--raw                  Raw mode, for kernel developers
                  -a|--addr    addr-spec    Walk a range of pages
                  -b|--bits    bits-spec    Walk pages with specified bits
                  -l|--list                 Show page details in ranges
                  -L|--list-each            Show page details one by one
                  -N|--no-summary           Don't show summay info
                  -h|--help                 Show this usage message
      addr-spec:
                  N                         one page at offset N (unit: pages)
                  N+M                       pages range from N to N+M-1
                  N,M                       pages range from N to M-1
                  N,                        pages range from N to end
                  ,M                        pages range from 0 to M
      bits-spec:
                  bit1,bit2                 (flags & (bit1|bit2)) != 0
                  bit1,bit2=bit1            (flags & (bit1|bit2)) == bit1
                  bit1,~bit2                (flags & (bit1|bit2)) == bit1
                  =bit1,bit2                flags == (bit1|bit2)
      bit-names:
                locked              error         referenced           uptodate
                 dirty                lru             active               slab
             writeback            reclaim              buddy               mmap
             anonymous          swapcache         swapbacked      compound_head
         compound_tail               huge        unevictable           hwpoison
                nopage           reserved(r)         mlocked(r)    mappedtodisk(r)
               private(r)       private_2(r)   owner_private(r)            arch(r)
              uncached(r)       readahead(o)       slob_free(o)     slub_frozen(o)
            slub_debug(o)
                                         (r) raw mode bits  (o) overloaded bits
      
      # ./page-types
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000000          487369     1903  _________________________________
      0x0000000000000014               5        0  __R_D____________________________  referenced,dirty
      0x0000000000000020               1        0  _____l___________________________  lru
      0x0000000000000024              34        0  __R__l___________________________  referenced,lru
      0x0000000000000028            3838       14  ___U_l___________________________  uptodate,lru
      0x0001000000000028              48        0  ___U_l_______________________I___  uptodate,lru,readahead
      0x000000000000002c            6478       25  __RU_l___________________________  referenced,uptodate,lru
      0x000100000000002c              47        0  __RU_l_______________________I___  referenced,uptodate,lru,readahead
      0x0000000000000040            8344       32  ______A__________________________  active
      0x0000000000000060               1        0  _____lA__________________________  lru,active
      0x0000000000000068             348        1  ___U_lA__________________________  uptodate,lru,active
      0x0001000000000068              12        0  ___U_lA______________________I___  uptodate,lru,active,readahead
      0x000000000000006c             988        3  __RU_lA__________________________  referenced,uptodate,lru,active
      0x000100000000006c              48        0  __RU_lA______________________I___  referenced,uptodate,lru,active,readahead
      0x0000000000004078               1        0  ___UDlA_______b__________________  uptodate,dirty,lru,active,swapbacked
      0x000000000000407c              34        0  __RUDlA_______b__________________  referenced,uptodate,dirty,lru,active,swapbacked
      0x0000000000000400             503        1  __________B______________________  buddy
      0x0000000000000804               1        0  __R________M_____________________  referenced,mmap
      0x0000000000000828            1029        4  ___U_l_____M_____________________  uptodate,lru,mmap
      0x0001000000000828              43        0  ___U_l_____M_________________I___  uptodate,lru,mmap,readahead
      0x000000000000082c             382        1  __RU_l_____M_____________________  referenced,uptodate,lru,mmap
      0x000100000000082c              12        0  __RU_l_____M_________________I___  referenced,uptodate,lru,mmap,readahead
      0x0000000000000868             192        0  ___U_lA____M_____________________  uptodate,lru,active,mmap
      0x0001000000000868              12        0  ___U_lA____M_________________I___  uptodate,lru,active,mmap,readahead
      0x000000000000086c             800        3  __RU_lA____M_____________________  referenced,uptodate,lru,active,mmap
      0x000100000000086c              31        0  __RU_lA____M_________________I___  referenced,uptodate,lru,active,mmap,readahead
      0x0000000000004878               2        0  ___UDlA____M__b__________________  uptodate,dirty,lru,active,mmap,swapbacked
      0x0000000000001000             492        1  ____________a____________________  anonymous
      0x0000000000005808               4        0  ___U_______Ma_b__________________  uptodate,mmap,anonymous,swapbacked
      0x0000000000005868            2839       11  ___U_lA____Ma_b__________________  uptodate,lru,active,mmap,anonymous,swapbacked
      0x000000000000586c              30        0  __RU_lA____Ma_b__________________  referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                   total          513968     2007
      
      # ./page-types -r
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000000          468002     1828  _________________________________
      0x0000000100000000           19102       74  _____________________r___________  reserved
      0x0000000000008000              41        0  _______________H_________________  compound_head
      0x0000000000010000             188        0  ________________T________________  compound_tail
      0x0000000000008014               1        0  __R_D__________H_________________  referenced,dirty,compound_head
      0x0000000000010014               4        0  __R_D___________T________________  referenced,dirty,compound_tail
      0x0000000000000020               1        0  _____l___________________________  lru
      0x0000000800000024              34        0  __R__l__________________P________  referenced,lru,private
      0x0000000000000028            3794       14  ___U_l___________________________  uptodate,lru
      0x0001000000000028              46        0  ___U_l_______________________I___  uptodate,lru,readahead
      0x0000000400000028              44        0  ___U_l_________________d_________  uptodate,lru,mappedtodisk
      0x0001000400000028               2        0  ___U_l_________________d_____I___  uptodate,lru,mappedtodisk,readahead
      0x000000000000002c            6434       25  __RU_l___________________________  referenced,uptodate,lru
      0x000100000000002c              47        0  __RU_l_______________________I___  referenced,uptodate,lru,readahead
      0x000000040000002c              14        0  __RU_l_________________d_________  referenced,uptodate,lru,mappedtodisk
      0x000000080000002c              30        0  __RU_l__________________P________  referenced,uptodate,lru,private
      0x0000000800000040            8124       31  ______A_________________P________  active,private
      0x0000000000000040             219        0  ______A__________________________  active
      0x0000000800000060               1        0  _____lA_________________P________  lru,active,private
      0x0000000000000068             322        1  ___U_lA__________________________  uptodate,lru,active
      0x0001000000000068              12        0  ___U_lA______________________I___  uptodate,lru,active,readahead
      0x0000000400000068              13        0  ___U_lA________________d_________  uptodate,lru,active,mappedtodisk
      0x0000000800000068              12        0  ___U_lA_________________P________  uptodate,lru,active,private
      0x000000000000006c             977        3  __RU_lA__________________________  referenced,uptodate,lru,active
      0x000100000000006c              48        0  __RU_lA______________________I___  referenced,uptodate,lru,active,readahead
      0x000000040000006c               5        0  __RU_lA________________d_________  referenced,uptodate,lru,active,mappedtodisk
      0x000000080000006c               3        0  __RU_lA_________________P________  referenced,uptodate,lru,active,private
      0x0000000c0000006c               3        0  __RU_lA________________dP________  referenced,uptodate,lru,active,mappedtodisk,private
      0x0000000c00000068               1        0  ___U_lA________________dP________  uptodate,lru,active,mappedtodisk,private
      0x0000000000004078               1        0  ___UDlA_______b__________________  uptodate,dirty,lru,active,swapbacked
      0x000000000000407c              34        0  __RUDlA_______b__________________  referenced,uptodate,dirty,lru,active,swapbacked
      0x0000000000000400             538        2  __________B______________________  buddy
      0x0000000000000804               1        0  __R________M_____________________  referenced,mmap
      0x0000000000000828            1029        4  ___U_l_____M_____________________  uptodate,lru,mmap
      0x0001000000000828              43        0  ___U_l_____M_________________I___  uptodate,lru,mmap,readahead
      0x000000000000082c             382        1  __RU_l_____M_____________________  referenced,uptodate,lru,mmap
      0x000100000000082c              12        0  __RU_l_____M_________________I___  referenced,uptodate,lru,mmap,readahead
      0x0000000000000868             192        0  ___U_lA____M_____________________  uptodate,lru,active,mmap
      0x0001000000000868              12        0  ___U_lA____M_________________I___  uptodate,lru,active,mmap,readahead
      0x000000000000086c             800        3  __RU_lA____M_____________________  referenced,uptodate,lru,active,mmap
      0x000100000000086c              31        0  __RU_lA____M_________________I___  referenced,uptodate,lru,active,mmap,readahead
      0x0000000000004878               2        0  ___UDlA____M__b__________________  uptodate,dirty,lru,active,mmap,swapbacked
      0x0000000000001000             492        1  ____________a____________________  anonymous
      0x0000000000005008               2        0  ___U________a_b__________________  uptodate,anonymous,swapbacked
      0x0000000000005808               4        0  ___U_______Ma_b__________________  uptodate,mmap,anonymous,swapbacked
      0x000000000000580c               1        0  __RU_______Ma_b__________________  referenced,uptodate,mmap,anonymous,swapbacked
      0x0000000000005868            2839       11  ___U_lA____Ma_b__________________  uptodate,lru,active,mmap,anonymous,swapbacked
      0x000000000000586c              29        0  __RU_lA____Ma_b__________________  referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                   total          513968     2007
      
      # ./page-types --raw --list --no-summary --bits reserved
      offset  count   flags
      0       15      _____________________r___________
      31      4       _____________________r___________
      159     97      _____________________r___________
      4096    2067    _____________________r___________
      6752    2390    _____________________r___________
      9355    3       _____________________r___________
      9728    14526   _____________________r___________
      
      This patch:
      
      Introduce PageHuge(), which identifies huge/gigantic pages by their
      dedicated compound destructor functions.
      
      Also move prep_compound_gigantic_page() to hugetlb.c and make
      __free_pages_ok() non-static.
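The idea of identifying a huge page by its compound destructor can be modelled in userspace as a function-pointer comparison. Everything below (`page_sketch`, the two destructors, `sketch_page_huge()`) is a hypothetical stand-in and does not reflect the kernel's struct page layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page_sketch;
typedef void (*sketch_dtor_t)(struct page_sketch *);

/* Hypothetical page: whether it is compound and how it gets freed. */
struct page_sketch {
    bool compound;
    sketch_dtor_t dtor;
};

static int sketch_huge_frees;

/* Destructor dedicated to hugetlb pages in this model. */
static void sketch_free_huge_page(struct page_sketch *page)
{
    (void)page;
    sketch_huge_frees++;
}

/* Destructor for ordinary compound pages in this model. */
static void sketch_free_compound_page(struct page_sketch *page)
{
    page->compound = false;
}

/* A page counts as huge only if it is compound and carries the
 * hugetlb destructor. */
static bool sketch_page_huge(const struct page_sketch *page)
{
    assert(page != NULL);
    return page->compound && page->dtor == sketch_free_huge_page;
}
```

No dedicated page flag is needed in this scheme, since the destructor pointer already distinguishes the two kinds of compound page.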
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20a0307c
    • O
      send_sigio_to_task: sanitize the usage of fown->signum · 8eeee4e2
      Oleg Nesterov committed
      send_sigio_to_task() reads fown->signum several times, we can race with
      F_SETSIG which changes ->signum lockless.  In theory, this can fool
      security checks or we can call group_send_sig_info() with the wrong
      ->si_signo which does not match "int sig".
      
      Change the code to cache ->signum.
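A userspace sketch of the "read once, then use the local copy" pattern the fix relies on. The structures and `SKETCH_DEFAULT_SIG` (a placeholder standing in for the default SIGIO case) are assumptions of this sketch, not the fcntl code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Placeholder for the signal used when no explicit signum is set. */
#define SKETCH_DEFAULT_SIG 29

/* Hypothetical owner record whose signum can change concurrently. */
struct fown_sketch {
    _Atomic int signum;
};

/* Hypothetical result: the signal to send and the si_signo reported. */
struct sigio_request {
    int sig;
    int si_signo;
};

static struct sigio_request sketch_prepare_sigio(struct fown_sketch *fown)
{
    struct sigio_request req;

    assert(fown != NULL);
    /* Single read: every later decision uses this cached value, so a
     * concurrent F_SETSIG cannot make sig and si_signo disagree. */
    int signum = atomic_load_explicit(&fown->signum, memory_order_relaxed);

    req.sig = signum ? signum : SKETCH_DEFAULT_SIG;
    req.si_signo = req.sig;
    return req;
}
```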
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8eeee4e2
    • O
      shift current_cred() from __f_setown() to f_modown() · 2f38d70f
      Oleg Nesterov committed
      Shift current_cred() from __f_setown() to f_modown(). This reduces
      the number of arguments and saves 48 bytes from fs/fcntl.o.
      
      [ Note: this doesn't clear euid/uid when pid is set to NULL.  But if
        f_owner.pid == NULL we never use f_owner.uid/euid.  Otherwise we'd
        have a bug anyway: we must not send signals if pid was reset to NULL.  ]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f38d70f
    • D
      jfs: fix regression preventing coalescing of extents · f7c52fd1
      Dave Kleikamp committed
      Commit fec1878f caused a regression in
      which contiguous blocks being allocated to the end of an extent were
      getting a new extent created.  This typically results in files entirely
      made up of 1-block extents even though the blocks are contiguous on
      disk.
      
      Apparently grub doesn't handle a jfs file being fragmented into too many
      extents, since it refuses to boot a kernel from jfs that was created by
      the 2.6.30 kernel.
      Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Reported-by: Alex <alevkovich@tut.by>
      f7c52fd1
  5. 16 Jun, 2009 13 commits
    • L
      block: remove some includings of blktrace_api.h · e212d6f2
      Li Zefan committed
      When porting blktrace to tracepoints, we changed to trace/block.h
      for trace prober declarations.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      e212d6f2
    • J
      ubifs: register backing_dev_info · a979eff1
      Jens Axboe committed
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      a979eff1
    • J
      btrfs: properly register fs backing device · ad081f14
      Jens Axboe committed
      btrfs assigns this bdi to all inodes on that file system, so make
      sure it's registered. This isn't really important now, but will be
      when we put dirty inodes there. Even now, we miss the stats when the
      bdi isn't visible.
      
      Also fixes a failure to check the bdi_init() return value, and bad
      inheritance of ->capabilities flags from the default bdi.
      Acked-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      ad081f14
    • A
      NLS: update handling of Unicode · 74675a58
      Alan Stern committed
      This patch (as1239) updates the kernel's treatment of Unicode.  The
      character-set conversion routines are well behind the current state of
      the Unicode specification: They don't recognize the existence of code
      points beyond plane 0 or of surrogate pairs in the UTF-16 encoding.
      
      The old wchar_t 16-bit type is retained because it's still used in
      lots of places.  This shouldn't cause any new problems; if a
      conversion now results in an invalid 16-bit code then before it must
      have yielded an undefined code.
      
      Difficult-to-read names like "utf_mbstowcs" are replaced with more
      transparent names like "utf8s_to_utf16s" and the ordering of the
      parameters is rationalized (buffer lengths come immediately after the
      pointers they refer to, and the inputs precede the outputs).
      Fortunately the low-level conversion routines are used in only a few
      places; the interfaces to the higher-level uni2char and char2uni
      methods have been left unchanged.
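      To illustrate what "code points beyond plane 0" and "surrogate pairs"
      mean, here is a small stand-alone sketch of encoding one code point as
      UTF-16. The function name utf32_to_utf16_sketch() is made up for this
      example and is not one of the kernel routines named above.

```c
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or -1 if the
 * code point is not encodable.
 */
int utf32_to_utf16_sketch(uint32_t cp, uint16_t out[2])
{
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return -1;	/* surrogate values are not valid code points */
	if (cp < 0x10000) {
		out[0] = (uint16_t)cp;	/* plane 0: a single unit */
		return 1;
	}
	if (cp > 0x10FFFF)
		return -1;	/* beyond the Unicode range */

	/* Planes 1..16: split the 20-bit offset into two 10-bit halves. */
	cp -= 0x10000;
	out[0] = (uint16_t)(0xD800 | (cp >> 10));	/* high surrogate */
	out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));	/* low surrogate */
	return 2;
}
```

      For example, U+1F600 is encoded as the pair 0xD83D 0xDE00, which a
      16-bit wchar_t can only carry as two separate units.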
      Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
      Acked-by: Clemens Ladisch <clemens@ladisch.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      74675a58
    • C
      nls: utf8_wcstombs: fix buffer overflow · 905c02ac
      Clemens Ladisch committed
      utf8_wcstombs forgot to include one-byte UTF-8 characters when
      calculating the output buffer size, i.e., theoretically, it was possible
      to overflow the output buffer with an input string that contains enough
      ASCII characters.
      
      In practice, this was no problem because the only user so far (VFAT)
      always uses a big enough output buffer.
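      The kind of accounting involved can be sketched as follows. This is a
      simplified stand-alone converter, not the kernel's utf8_wcstombs: the
      names are hypothetical, and surrogate handling is left out (every unit
      is treated as a plain BMP character). The point is that one-byte
      characters go through the same remaining-space check as longer ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-8 bytes needed for one 16-bit character (1, 2 or 3). */
static size_t utf8_len_sketch(uint16_t wc)
{
	if (wc < 0x80)
		return 1;
	if (wc < 0x800)
		return 2;
	return 3;
}

/*
 * Convert n 16-bit characters to UTF-8, writing at most maxlen bytes.
 * Stops before the first character that does not fit.
 * Returns the number of bytes written.
 */
size_t wcs_to_utf8_sketch(const uint16_t *in, size_t n,
			  unsigned char *out, size_t maxlen)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		uint16_t wc = in[i];
		size_t need = utf8_len_sketch(wc);

		/* Checked for every character, including ASCII ones. */
		if (need > maxlen - used)
			break;

		if (need == 1) {
			out[used++] = (unsigned char)wc;
		} else if (need == 2) {
			out[used++] = (unsigned char)(0xC0 | (wc >> 6));
			out[used++] = (unsigned char)(0x80 | (wc & 0x3F));
		} else {
			out[used++] = (unsigned char)(0xE0 | (wc >> 12));
			out[used++] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F));
			out[used++] = (unsigned char)(0x80 | (wc & 0x3F));
		}
	}
	return used;
}
```

      With a two-byte buffer and the input "ABCD", this sketch writes "AB" and
      stops instead of running past the end.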
      Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      905c02ac
    • C
      nls: utf8_wcstombs: use correct buffer size in error case · e27ecdd9
      Clemens Ladisch committed
      When utf8_wcstombs encounters a character that cannot be encoded, we
      must not decrease the remaining output buffer size because nothing has
      been written to the output buffer.
      Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      e27ecdd9
    • R
      debugfs: use specified mode to possibly mark files read/write only · e4792aa3
      Robin Getz committed
      In many SoC implementations there are hardware registers that can be
      read-only or write-only.  This extends debugfs to enforce the file
      permissions for
      these types of registers by providing a set of fops which are read or
      write only.  This assumes that the kernel developer knows more about the
      hardware than the user (even root users) -- which is normally true.
      Signed-off-by: Robin Getz <rgetz@blackfin.uclinux.org>
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: Bryan Wu <cooloney@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      e4792aa3
    • J
      debugfs: fix docbook error · 400ced61
      Jonathan Corbet committed
      Fix an error in debugfs_create_blob's docbook description.
      
      It cannot actually be used to write a binary blob.
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
      400ced61
    • S
      debugfs: dont stop on first failed recursive delete · 56a83cc9
      Steven Rostedt committed
      While running a while loop of removing a module that removes a debugfs
      directory with debugfs_remove_recursive, and at the same time doing a
      while loop of cat of a file in that directory, I would hit a point where
      somehow the cat of the file caused the remove to fail.
      
      The result is that other files did not get removed when the module
      was removed. A simple read of one of those files can oops the kernel
      because the operations for the file no longer exist (removed by module).
      
      The funny thing is that the file being cat'ed was removed. It was
      the siblings that were not. In the code of debugfs_remove_recursive
      there's a check that bails out of the loop if removing a child fails,
      to prevent an infinite loop.
      
      What this patch does is to still try any siblings in that directory.
      If all the siblings fail, or there are no more siblings, then we exit
      the loop.
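      The change in loop behavior can be modeled in a few lines of user-space
      C. This is only a toy model with hypothetical names (struct node_sketch,
      try_remove(), remove_children_sketch()); it does not reproduce the
      dentry handling of debugfs_remove_recursive.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a directory entry. */
struct node_sketch {
	bool removable;	/* false simulates a removal that fails */
	bool present;	/* true while the entry still exists */
};

/* Try to remove one entry; report whether it worked. */
static bool try_remove(struct node_sketch *n)
{
	if (!n->removable)
		return false;
	n->present = false;
	return true;
}

/* Remove as many children as possible; returns how many were removed. */
size_t remove_children_sketch(struct node_sketch *children, size_t n)
{
	size_t removed = 0;

	for (size_t i = 0; i < n; i++) {
		if (!children[i].present)
			continue;
		if (try_remove(&children[i]))
			removed++;
		/* On failure, move on to the next sibling instead of bailing out. */
	}
	return removed;
}
```

      The loop still terminates, because each sibling is visited once, but a
      single stubborn entry no longer leaves the rest behind.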
      
      This fixes the above symptom, but...
      
      This is not foolproof. It makes debugfs_remove_recursive a bit more
      robust, but it does not explain why the one file failed. There may
      be some kind of delayed deletion that makes debugfs think it did
      not succeed. So this patch is more of a fix for the symptom but not
      the disease.
      
      This patch still makes the debugfs_remove_recursive more robust and
      until I can find out why the bug exists, this patch will keep
      the kernel from oopsing in most cases.  Even after the cause is found
      I think this change can stand on its own and should be kept.
      
      [ Impact: prevent kernel oops on module unload and reading debugfs files ]
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      56a83cc9
    • A
      Sysfs: fix possible memleak in sysfs_follow_link · 557411eb
      Armin Kuster committed
      There is the possibility of a memory leak if a page is allocated and
      sysfs_getlink() then fails in sysfs_follow_link.
      Signed-off-by: Armin Kuster <akuster@mvista.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      557411eb
    • Y
      Btrfs: always update root items for fs trees at commit time · 978d910d
      Yan Zheng committed
      commit_fs_roots skips updating root items for fs trees that aren't modified.
      This is unsafe now that relocation code modifies root item's last_snapshot
      field without modifying corresponding fs tree.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      978d910d
    • S
      ocfs2/net: Use wait_event() in o2net_send_message_vec() · 9af0b38f
      Sunil Mushran committed
      Replace wait_event_interruptible() with wait_event() in o2net_send_message_vec().
      This is because this function is called by the dlm that expects signals to be
      blocked.
      
      Fixes oss bugzilla#1126
      http://oss.oracle.com/bugzilla/show_bug.cgi?id=1126
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      9af0b38f
    • T
      ocfs2: Adjust rightmost path in ocfs2_add_branch. · 6b791bcc
      Tao Ma committed
      In ocfs2_add_branch, we use the rightmost rec of the leaf extent block
      to generate the e_cpos for the newly added branch. In most cases this
      is OK, but if the parent extent block's rightmost rec covers more clusters
      than the leaf does, it will cause a kernel panic if we insert some clusters
      in it. The message is something like:
      (7445,1):ocfs2_insert_at_leaf:3775 ERROR: bug expression:
      le16_to_cpu(el->l_next_free_rec) >= le16_to_cpu(el->l_count)
      (7445,1):ocfs2_insert_at_leaf:3775 ERROR: inode 66053, depth 0, count 28,
      next free 28, rec.cpos 270, rec.clusters 1, insert.cpos 275, insert.clusters 1
       [<fa7ad565>] ? ocfs2_do_insert_extent+0xb58/0xda0 [ocfs2]
       [<fa7b08f2>] ? ocfs2_insert_extent+0x5bd/0x6ba [ocfs2]
       [<fa7b1b8b>] ? ocfs2_add_clusters_in_btree+0x37f/0x564 [ocfs2]
      ...
      
      The panic can be easily reproduced by the following small test case
      (with bs=512, cs=4K; all error handling is removed so that it is
      clear enough for reading).
      
      #include <fcntl.h>
      #include <unistd.h>
      
      int main(int argc, char **argv)
      {
      	int fd, i;
      	char buf[5] = "test";
      
      	/* O_CREAT requires a mode argument */
      	fd = open(argv[1], O_RDWR|O_CREAT, 0644);
      
      	for (i = 0; i < 30; i++) {
      		lseek(fd, 40960 * i, SEEK_SET);
      		write(fd, buf, 5);
      	}
      
      	ftruncate(fd, 1146880);
      
      	lseek(fd, 1126400, SEEK_SET);
      	write(fd, buf, 5);
      
      	close(fd);
      
      	return 0;
      }
      
      The reason for the panic is that the 30 writes and the ftruncate make
      the file's extent list look like:
      
      	Tree Depth: 1   Count: 19   Next Free Rec: 1
      	## Offset        Clusters       Block#
      	0  0             280            86183
      	SubAlloc Bit: 7   SubAlloc Slot: 0
      	Blknum: 86183   Next Leaf: 0
      	CRC32: 00000000   ECC: 0000
      	Tree Depth: 0   Count: 28   Next Free Rec: 28
      	## Offset        Clusters       Block#          Flags
      	0  0             1              143368          0x0
      	1  10            1              143376          0x0
      	...
      	26 260           1              143576          0x0
      	27 270           1              143584          0x0
      
      Now another write at 1126400 (cluster 275), which lands in the gap
      between 271 and 280, will trigger ocfs2_add_branch, but the result after
      the function looks like:
      	Tree Depth: 1   Count: 19   Next Free Rec: 2
      	## Offset        Clusters       Block#
      	0  0             280            86183
      	1  271           0             143592
      So the extent records intersect, which makes the following operation bug out.
      
      This patch just tries to remove the gap before we add the new branch, so that
      the root (branch) rightmost rec will cover the same right position. So in the
      above case, before adding the branch the tree will be changed to
      	Tree Depth: 1   Count: 19   Next Free Rec: 1
      	## Offset        Clusters       Block#
      	0  0             271            86183
      	SubAlloc Bit: 7   SubAlloc Slot: 0
      	Blknum: 86183   Next Leaf: 0
      	CRC32: 00000000   ECC: 0000
      	Tree Depth: 0   Count: 28   Next Free Rec: 28
      	## Offset        Clusters       Block#          Flags
      	0  0             1              143368          0x0
      	1  10            1              143376          0x0
      	...
      	26 260           1              143576          0x0
      	27 270           1              143584          0x0
      And after the branch is added, the tree looks like
      	Tree Depth: 1   Count: 19   Next Free Rec: 2
      	## Offset        Clusters       Block#
      	0  0             271            86183
      	1  271           0             143592
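      The numbers in this description can be checked with a little arithmetic.
      The sketch below assumes the 4K cluster size from the test case; struct
      rec_sketch, byte_to_cluster() and recs_overlap() are hypothetical helpers
      written for this illustration, not ocfs2 code.

```c
#include <stdint.h>

#define CLUSTER_SIZE 4096u	/* cs=4K, as in the test case */

/* Simplified extent record: starting cluster and length in clusters. */
struct rec_sketch {
	uint32_t cpos;
	uint32_t clusters;
};

/* Convert a byte offset in the file to a cluster offset. */
uint32_t byte_to_cluster(uint64_t off)
{
	return (uint32_t)(off / CLUSTER_SIZE);
}

/* True if the left record reaches past the start of the right record. */
int recs_overlap(const struct rec_sketch *left, const struct rec_sketch *right)
{
	return left->cpos + left->clusters > right->cpos;
}
```

      The writes at 40960 * i land on clusters 0, 10, ..., the ftruncate to
      1146880 bytes cuts the file at cluster 280, and the final write at
      1126400 hits cluster 275. A root record covering 280 clusters from 0
      overlaps a new branch starting at 271; after trimming it to 271 clusters
      the two records no longer intersect.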
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      6b791bcc
  6. 15 Jun, 2009 1 commit
    • M
      ramfs: ignore unknown mount options · 0a8eba9b
      Mike Frysinger committed
      On systems where CONFIG_SHMEM is disabled, mounting tmpfs filesystems can
      fail when tmpfs options are used.  This is because tmpfs creates a small
      wrapper around ramfs which rejects unknown options, and ramfs itself only
      supports a tiny subset of what tmpfs supports.  This makes it pretty hard
      to use the same userspace systems across different configuration systems.
      As such, ramfs should ignore the tmpfs options when tmpfs is merely a
      wrapper around ramfs.
      
      This used to work before commit c3b1b1cb as previously, ramfs would
      ignore all options.  But now, we get:
      ramfs: bad mount option: size=10M
      mount: mounting mdev on /dev failed: Invalid argument
      
      Another option might be to restore the previous behavior, where ramfs
      simply ignored all unknown mount options ... which is what Hugh prefers.
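      A lenient option parser of the kind discussed here might look like the
      following user-space sketch. It assumes "mode=<octal>" is the only
      option understood; parse_opts_sketch() is a hypothetical name, and the
      kernel's own parsing helpers are not used.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Parse comma-separated mount data such as "size=10M,mode=700".
 * Only "mode=<octal>" is understood; every other option is skipped.
 * Returns 0 on success, -1 if a mode value is malformed.
 */
int parse_opts_sketch(const char *data, unsigned int *mode)
{
	const char *p = data;

	while (*p) {
		const char *comma = strchr(p, ',');
		size_t len = comma ? (size_t)(comma - p) : strlen(p);

		if (len > 5 && strncmp(p, "mode=", 5) == 0) {
			char num[16];
			char *stop;
			size_t nlen = len - 5;
			unsigned long v;

			if (nlen >= sizeof(num))
				return -1;
			memcpy(num, p + 5, nlen);
			num[nlen] = '\0';
			v = strtoul(num, &stop, 8);
			if (*stop != '\0')
				return -1;	/* not a valid octal number */
			*mode = (unsigned int)(v & 07777);
		}
		/* Any other option, e.g. "size=10M", is silently skipped. */

		p += len;
		if (*p == ',')
			p++;
	}
	return 0;
}
```

      With this behavior, a mount line written for tmpfs (for instance with
      size=10M) still succeeds when only ramfs is available.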
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a8eba9b
  7. 13 Jun, 2009 5 commits