1. 08 Feb, 2008: 19 commits
  2. 07 Feb, 2008: 3 commits
  3. 06 Feb, 2008: 18 commits
    • nommu: add new vmalloc_user() and remap_vmalloc_range() interfaces. · f905bc44
      Authored by Paul Mundt
      This builds on top of the earlier vmalloc_32_user() work introduced by
      b5073173, as we now have places in the nommu
      allmodconfig that hit up against these missing APIs.
      
      As vmalloc_32_user() is already implemented, this is moved over to
      vmalloc_user() and simply made a wrapper.  As all current nommu platforms are
      32-bit addressable, there's no special casing we have to do for ZONE_DMA and
      things of that nature as per GFP_VMALLOC32.
      
      remap_vmalloc_range() needs to check VM_USERMAP in order to figure out whether
      we permit the remap or not, which means that we also have to rework the
      vmalloc_user() code to grovel for the VMA and set the flag.
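
      A minimal sketch of such a wrapper, based on the description above (the
      exact code in mm/nommu.c may differ slightly):

          void *vmalloc_user(unsigned long size)
          {
                  void *ret;

                  ret = __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO,
                                  PAGE_KERNEL);
                  if (ret) {
                          struct vm_area_struct *vma;

                          /* Grovel for the VMA backing the allocation and
                             mark it as safe to remap into userspace. */
                          down_write(&current->mm->mmap_sem);
                          vma = find_vma(current->mm, (unsigned long)ret);
                          if (vma)
                                  vma->vm_flags |= VM_USERMAP;
                          up_write(&current->mm->mmap_sem);
                  }
                  return ret;
          }

          /* remap_vmalloc_range() can then refuse any area lacking VM_USERMAP. */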
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Acked-by: David McCullough <david_mccullough@securecomputing.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Acked-by: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f905bc44
    • oom_kill: remove uid==0 checks · 97829955
      Authored by Serge E. Hallyn
      Root processes are considered more important when out of memory and killing
      processes.  The check for CAP_SYS_ADMIN was augmented with a check for
      uid==0 or euid==0.
      
      There are several possible ways to look at this:
      
      	1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN
      	   alone.  However CAP_SYS_RESOURCE is the one that really
      	   means "give me extra resources" so allow for that as
      	   well.
      	2. Any privileged code should be protected, but uid is not
      	   an indication of privilege.  So we should check whether
      	   any capabilities are raised.
      	3. uid==0 makes processes on the host as well as in containers
      	   more important, so we should keep the existing checks.
      	4. uid==0 makes processes only on the host more important,
      	   even without any capabilities.  So we should be keeping
      	   the (uid==0||euid==0) check but only when
      	   userns==&init_user_ns.
      
      I'm following number 1 here.
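
      Concretely, the uid/euid tests drop out of the badness heuristic and only
      the capability bits are consulted.  A rough sketch, using the __capable()
      helper of that era (the exact form in mm/oom_kill.c may differ):

          /*
           * Superuser processes are usually more important, so make it less
           * likely that we kill them.  Trust the capability bits, not uid/euid.
           */
          if (__capable(p, CAP_SYS_ADMIN) || __capable(p, CAP_SYS_RESOURCE))
                  points /= 4;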
      Signed-off-by: Serge Hallyn <serue@us.ibm.com>
      Cc: Andrew Morgan <morgan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97829955
    • Add 64-bit capability support to the kernel · e338d263
      Authored by Andrew Morgan
      The patch supports legacy (32-bit) capability userspace, and where possible
      translates 32-bit capabilities to/from userspace and the VFS to 64-bit
      kernel-space capabilities.  If a capability set cannot be compressed into
      32 bits for consumption by user space, the system call fails with -ERANGE.
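
      Conceptually, the kernel-side capability set becomes an array of 32-bit
      words, and exporting it through the legacy interface only succeeds when
      the upper word is empty.  A sketch of the idea (the helper below is
      illustrative, not the kernel's actual function):

          typedef struct kernel_cap_struct {
                  __u32 cap[2];            /* 64 bits = 2 x 32-bit words */
          } kernel_cap_t;

          /* Hypothetical helper: export to the old 32-bit format. */
          static int cap_to_legacy_u32(const kernel_cap_t *cap, __u32 *out)
          {
                  if (cap->cap[1] != 0)
                          return -ERANGE;  /* does not fit in 32 bits */
                  *out = cap->cap[0];
                  return 0;
          }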
      
      FWIW libcap-2.00 supports this change (and earlier capability formats)
      
       http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: use get_task_comm()]
      [ezk@cs.sunysb.edu: build fix]
      [akpm@linux-foundation.org: do not initialise statics to 0 or NULL]
      [akpm@linux-foundation.org: unused var]
      [serue@us.ibm.com: export __cap_ symbols]
      Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e338d263
    • VFS/Security: Rework inode_getsecurity and callers to return resulting buffer · 42492594
      Authored by David P. Quigley
      This patch modifies the interface to inode_getsecurity to have the function
      return a buffer containing the security blob and its length via parameters
      instead of relying on the calling function to give it an appropriately sized
      buffer.
      
      Security blobs obtained with this function should be freed using the
      release_secctx LSM hook.  This alleviates the problem of the caller having to
      guess a length and preallocate a buffer for this function allowing it to be
      used elsewhere for Labeled NFS.
      
      The patch also removes the unused err parameter.  The conversion is similar to
      the one performed by Al Viro for the security_getprocattr hook.
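
      A hedged sketch of how a caller might use the reworked interface (the
      helper below is hypothetical; check the security_inode_getsecurity()
      signature in the release's security headers):

          /* Fetch an inode's security blob for export, e.g. to Labeled NFS. */
          static int get_inode_secctx(struct inode *inode, const char *name,
                                      void **ctx, int *ctxlen)
          {
                  int len;

                  /* The LSM allocates *ctx; no caller-side sizing or guessing. */
                  len = security_inode_getsecurity(inode, name, ctx, true);
                  if (len < 0)
                          return len;
                  *ctxlen = len;
                  return 0;
          }

          /* When done: security_release_secctx(ctx, ctxlen); never plain kfree(). */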
      Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Acked-by: James Morris <jmorris@namei.org>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      42492594
    • slob: reduce external fragmentation by using three free lists · 20cecbae
      Authored by Matt Mackall
      By putting smaller objects on their own list, we greatly reduce overall
      external fragmentation and increase repeatability.  This reduces total SLOB
      overhead from > 50% to ~6% on a simple boot test.
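
      A sketch of the idea: keep a separate list head per size band and pick the
      list by allocation size, so small objects stop fragmenting pages that could
      satisfy larger requests (the break points and the helper are illustrative):

          #define SLOB_BREAK1 256
          #define SLOB_BREAK2 1024

          static LIST_HEAD(free_slob_small);   /* size < SLOB_BREAK1           */
          static LIST_HEAD(free_slob_medium);  /* SLOB_BREAK1 <= size < BREAK2 */
          static LIST_HEAD(free_slob_large);   /* everything else              */

          static struct list_head *slob_list_for(size_t size)
          {
                  if (size < SLOB_BREAK1)
                          return &free_slob_small;
                  if (size < SLOB_BREAK2)
                          return &free_slob_medium;
                  return &free_slob_large;
          }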
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20cecbae
    • slob: fix free block merging at head of subpage · 679299b3
      Authored by Matt Mackall
      We weren't merging freed blocks at the beginning of the free list.  Fixing
      this showed a 2.5% efficiency improvement in a userspace test harness.
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      679299b3
    • writeback: speed up writeback of big dirty files · 8bc3be27
      Authored by Fengguang Wu
      After dirtying a 100M file, the normal behavior is to start writeback for
      all of the data after a 30s delay.  But sometimes the following happens
      instead:
      
      	- after 30s:    ~4M
      	- after 5s:     ~4M
      	- after 5s:     all remaining 92M
      
      Some analysis shows that the internal io dispatch queues go like this:
      
      		s_io            s_more_io
      		-------------------------
      	1)	100M,1K         0
      	2)	1K              96M
      	3)	0               96M
      1) initial state with a 100M file and a 1K file
      
      2) 4M written, nr_to_write <= 0, so write more
      
      3) 1K written, nr_to_write > 0, no more writes (BUG)
      
      nr_to_write > 0 in (3) fools the upper layer into thinking that all the data
      has been written out.  The big dirty file is actually still sitting in
      s_more_io.  We cannot simply splice s_more_io back to s_io as soon as s_io
      becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
      may starve newly expired inodes in s_dirty.  It is also not an option to
      draw inodes from both s_more_io and s_dirty, and let the loop go on: this
      might lead to livelocks, and might also starve other superblocks in sync
      time (well, kupdate may still starve some superblocks; that's another bug).
      
      We have to return when a full scan of s_io completes.  So nr_to_write > 0
      does not necessarily mean that "all data are written".  This patch
      introduces a flag, writeback_control.more_io, to indicate that more io should
      be done.  With it the big dirty file no longer has to wait for the next
      kupdate invocation 5s later.
      
      In sync_sb_inodes() we only set more_io on super_blocks we actually
      visited.  This avoids the interaction between two pdflush daemons.
      
      Also in __sync_single_inode() we don't blindly keep requeuing the io if the
      filesystem cannot progress.  Failing to do so may lead to 100% iowait.
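
      The mechanism amounts to one extra flag; a simplified sketch of the two
      sides (the surrounding code in fs/fs-writeback.c and mm/page-writeback.c
      of that era is more involved):

          /* In sync_sb_inodes(): a full s_io scan is done, but work remains. */
          if (!list_empty(&sb->s_more_io))
                  wbc->more_io = 1;

          /* In the periodic writeback loop: nr_to_write > 0 alone no longer
             means "all data written"; keep going if more_io was set. */
          if (wbc.nr_to_write > 0 && !wbc.more_io)
                  break;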
      Tested-by: Mike Snitzer <snitzer@gmail.com>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8bc3be27
    • mm: fix section mismatch warning in sparse.c · a322f8ab
      Authored by Sam Ravnborg
      Fix the following warning:
      WARNING: mm/built-in.o(.text+0x22069): Section mismatch in reference from the function sparse_early_usemap_alloc() to the function .init.text:__alloc_bootmem_node()
      
      The static sparse_early_usemap_alloc() is used only by sparse_init(),
      and with sparse_init() annotated __init it is safe to
      annotate sparse_early_usemap_alloc() with __init too.
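
      The fix is just the annotation, so the reference from init text to init
      text is no longer a mismatch; roughly (diff-style sketch; the exact
      signature lives in mm/sparse.c):

          -static unsigned long *sparse_early_usemap_alloc(unsigned long pnum)
          +static unsigned long * __init sparse_early_usemap_alloc(unsigned long pnum)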
      Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a322f8ab
    • mm: fix PageUptodate data race · 0ed361de
      Authored by Nick Piggin
      After running SetPageUptodate, preceding stores to the page contents to
      actually bring it uptodate may not be ordered with the store to set the
      page uptodate.
      
      Therefore, another CPU which checks PageUptodate is true, then reads the
      page contents can get stale data.
      
      Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
      PageUptodate.
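
      In page-flags terms the pairing looks roughly like this (a sketch of the
      idea only; the real SetPageUptodate/PageUptodate macros in page-flags.h
      handle more cases, and the lower-case names here are illustrative):

          static inline void set_page_uptodate(struct page *page)
          {
                  /* Publish the page contents before PG_uptodate becomes visible. */
                  smp_wmb();
                  set_bit(PG_uptodate, &page->flags);
          }

          static inline int page_uptodate(struct page *page)
          {
                  int ret = test_bit(PG_uptodate, &page->flags);

                  /* Pairs with the smp_wmb() above: if we saw the flag, reads of
                     the page contents must not be reordered before this test. */
                  if (ret)
                          smp_rmb();
                  return ret;
          }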
      
      Many places that test PageUptodate, do so with the page locked, and this
      would be enough to ensure memory ordering in those places if
      SetPageUptodate were only called while the page is locked.  Unfortunately
      that is not always the case for some filesystems, but it could be an idea
      for the future.
      
      Also bring the handling of anonymous page uptodateness in line with that of
      file backed page management, by marking anon pages as uptodate when they
      _are_ uptodate, rather than when our implementation requires that they be
      marked as such.  Doing so allows us to get rid of the smp_wmb's in the page
      copying functions, which were added especially for anonymous pages for an
      analogous memory ordering problem.  Both file and anonymous pages are
      handled with the same barriers.
      
      FAQ:
      Q. Why not do this in flush_dcache_page?
      A. Firstly, flush_dcache_page handles only one side (the smp side) of the
      ordering protocol; we'd still need smp_rmb somewhere.  Secondly, hiding away
      memory barriers in a completely unrelated function is nasty; at least in the
      PageUptodate macros, they are located together with (half) the operations
      involved in the ordering.  Thirdly, the smp_wmb is only required when first
      bringing the page uptodate, whereas flush_dcache_page should be called each time
      it is written to through the kernel mapping.  It is logically the wrong place to
      put it.
      
      Q. Why does this increase my text size / reduce my performance / etc.
      A. Because it is adding the necessary instructions to eliminate the data-race.
      
      Q. Can it be improved?
      A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
      run under the page lock, we could avoid the smp_rmb places where PageUptodate
      is queried under the page lock. Requires audit of all filesystems and at least
      some would need reworking. That's great you're interested, I'm eagerly awaiting
      your patches.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ed361de
    • page migration: handle orphaned pages · 62e1c553
      Authored by Shaohua Li
      An orphaned page might still have fs-private metadata even though the page
      has been truncated.  As the page has no mapping, page migration refuses to
      migrate it.  It appears the page is only freed in page reclaim, and if the
      zone watermark is low the page is never freed, so migration always fails.
      Free the metadata so that such a page can be freed during migration, which
      makes migration more reliable.
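
      A simplified sketch of the resulting special case in the migration path
      (see mm/migrate.c for the real check, labels and surrounding locking):

          if (!page->mapping) {
                  if (!PageAnon(page) && PagePrivate(page)) {
                          /*
                           * The page was truncated but still carries
                           * fs-private buffers; drop them so this orphaned
                           * page can be freed instead of blocking migration.
                           */
                          try_to_free_buffers(page);
                  }
                  goto rcu_unlock;
          }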
      
      [akpm@linux-foundation.org: go direct to try_to_free_buffers()]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62e1c553
    • check ADVICE of fadvise64_64 even if get_xip_page is given · b5beb1ca
      Authored by Masatake YAMATO
      I've written some test programs for the LTP project.  While writing them I
      met a problem which I cannot solve in user land, so I wrote a patch for the
      Linux kernel.  Please include this patch if it is acceptable.
      
      The test program tests the 4th parameter of fadvise64_64:
      
          long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
      
      My test case calls fadvise64_64 with invalid advice value and checks errno is
      set to EINVAL.  About the advice parameter man page says:
      
          ...
          Permissible values for advice include:
      
      	   POSIX_FADV_NORMAL
                        ...
      	   POSIX_FADV_SEQUENTIAL
                        ...
      	   POSIX_FADV_RANDOM
      		  ...
      	   POSIX_FADV_NOREUSE
                        ...
      	   POSIX_FADV_WILLNEED
                        ...
      	   POSIX_FADV_DONTNEED
      		  ...
          ERRORS
                 ...
      	   EINVAL An invalid value was specified for advice.
      
      However, I got a bug report that the system call invocations
      in my test case returned 0 unexpectedly.
      
      I've inspected the kernel code:
      
          asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
          {
      	    struct file *file = fget(fd);
      	    struct address_space *mapping;
      	    struct backing_dev_info *bdi;
      	    loff_t endbyte;			/* inclusive */
      	    pgoff_t start_index;
      	    pgoff_t end_index;
      	    unsigned long nrpages;
      	    int ret = 0;
      
      	    if (!file)
      		    return -EBADF;
      
      	    if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) {
      		    ret = -ESPIPE;
      		    goto out;
      	    }
      
      	    mapping = file->f_mapping;
      	    if (!mapping || len < 0) {
      		    ret = -EINVAL;
      		    goto out;
      	    }
      
      	    if (mapping->a_ops->get_xip_page)
      		    /* no bad return value, but ignore advice */
      		    goto out;
          ...
          out:
      	    fput(file);
      	    return ret;
          }
      
      I found that the advice parameter is just ignored when
      mapping->a_ops->get_xip_page is set.  This behavior is different from
      what is written in the man page.  Is this o.k.?

      get_xip_page is set if CONFIG_EXT2_FS_XIP is enabled.
      Anyway, I cannot find an easy way to detect from user space whether the
      get_xip_page field is set or CONFIG_EXT2_FS_XIP is enabled.
      
      I propose the following patch which checks the advice parameter
      even if get_xip_page is given.
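
      The idea of the fix, sketched in the style of the function quoted above
      (validate advice before the XIP early-out so an invalid value still yields
      -EINVAL; the real patch may differ in detail):

          if (mapping->a_ops->get_xip_page) {
                  switch (advice) {
                  case POSIX_FADV_NORMAL:
                  case POSIX_FADV_RANDOM:
                  case POSIX_FADV_SEQUENTIAL:
                  case POSIX_FADV_WILLNEED:
                  case POSIX_FADV_NOREUSE:
                  case POSIX_FADV_DONTNEED:
                          /* no bad return value, but ignore advice */
                          break;
                  default:
                          ret = -EINVAL;
                  }
                  goto out;
          }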
      Signed-off-by: Masatake YAMATO <yamato@redhat.com>
      Acked-by: Carsten Otte <cotte@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b5beb1ca
    • Include count of pagecache pages in show_mem() output · e6f3602d
      Authored by Larry Woodman
      The show_mem() output does not include the total number of pagecache
      pages.  This would be helpful when analyzing the debug information in
      the /var/log/messages file after OOM kills occur.
      
      This patch includes the total pagecache pages in that output.
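
      A sketch of what the extra output could look like, using the global
      NR_FILE_PAGES counter (the exact printk wording is illustrative):

          printk("%lu pages of pagecache\n",
                 global_page_state(NR_FILE_PAGES));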
      Signed-off-by: Larry Woodman <lwoodman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6f3602d
    • Fix dirty page accounting leak with ext3 data=journal · a2b34564
      Authored by Bjorn Steinbrink
      In 46d2277c ("Clean up and make
      try_to_free_buffers() not race with dirty pages"), try_to_free_buffers
      was changed to bail out if the page was dirty.
      
      That in turn caused truncate_complete_page to leak massive amounts of
      memory, because the dirty bit was only cleared after the call to
      try_to_free_buffers.
      
      So the call to cancel_dirty_page was moved up to have the dirty bit
      cleared early in 3e67c098 ("truncate:
      clear page dirtiness before running try_to_free_buffers()").
      
      The problem with that fix is that the page can be redirtied after
      cancel_dirty_page was called, e.g. like this:
      
      truncate_complete_page()
        cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
        do_invalidatepage()
          ext3_invalidatepage()
            journal_invalidatepage()
              journal_unmap_buffer()
                __dispose_buffer()
                  __journal_unfile_buffer()
                    __journal_temp_unlink_buffer()
                      mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
      
      And then we end up with dirty pages being wrongly accounted.
      
      As a result, in ecdfc978 ("Resurrect
      'try_to_free_buffers()' VM hackery") the changes to try_to_free_buffers
      were reverted, so the original reason for the massive memory leak is
      gone, and we can also revert the move of the call to cancel_dirty_page
      from truncate_complete_page and get the accounting right again.
      
      I'm not sure if it matters, but as opposed to the final check in
      __remove_from_page_cache, this one also cares about the task io
      accounting, so maybe we want to use this instead, although it's not
      quite the clean fix either.
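
      The resulting ordering in truncate_complete_page, sketched (simplified;
      the exact code is in mm/truncate.c of the release):

          static void truncate_complete_page(struct address_space *mapping,
                                             struct page *page)
          {
                  if (page->mapping != mapping)
                          return;

                  if (PagePrivate(page))
                          do_invalidatepage(page, 0);     /* may redirty the page */

                  /*
                   * Clear the dirty bit only now, so a redirty done by the
                   * filesystem's invalidatepage above is accounted and then
                   * cancelled here, keeping the dirty page counts balanced.
                   */
                  cancel_dirty_page(page, PAGE_CACHE_SIZE);

                  remove_from_page_cache(page);
                  ClearPageUptodate(page);
                  ClearPageMappedToDisk(page);
                  page_cache_release(page);       /* pagecache ref */
          }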
      Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
      Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Osterried <osterried@jesse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2b34564
    • set_page_refcounted() VM_BUG_ON fix · ae1276b9
      Authored by Qi Yong
      The current PageTail semantic is that a PageTail page is first a
      PageCompound page.  So remove the redundant PageCompound test in
      set_page_refcounted().
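
      The simplification, as a diff-style sketch (the exact context is in
      include/linux/mm.h; shown here only to illustrate the redundancy):

           static inline void set_page_refcounted(struct page *page)
           {
          -        VM_BUG_ON(PageCompound(page) && PageTail(page));
          +        VM_BUG_ON(PageTail(page));
                   VM_BUG_ON(atomic_read(&page->_count));
                   set_page_count(page, 1);
           }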
      Signed-off-by: Qi Yong <qiyong@fc-cn.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae1276b9
    • mm: remove fastcall from mm/ · 920c7a5d
      Authored by Harvey Harrison
      fastcall is always defined to be empty, so remove it.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      920c7a5d
    • mm: don't waste swap on locked pages · 5a9bbdcd
      Authored by Hugh Dickins
      try_to_unmap always fails on a page found in a VM_LOCKED vma (unless
      migrating), and recycles it back to the active list.  But if it's an
      anonymous page, we've already allocated swap to it: just wasting swap.
      Spot locked pages in page_referenced_one and treat them as referenced.
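
      A simplified sketch of that check in page_referenced_one (see mm/rmap.c
      for the real code and the surrounding pte handling):

          /* A page in a VM_LOCKED vma cannot be reclaimed anyway; count it
             as referenced so reclaim keeps it active and does not allocate
             swap for it. */
          if (vma->vm_flags & VM_LOCKED)
                  referenced++;
          else if (ptep_clear_flush_young(vma, address, pte))
                  referenced++;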
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ethan Solomita <solo@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a9bbdcd
    • vmstat: remove prefetch · 9eccf2a8
      Authored by Christoph Lameter
      Remove the prefetch logic in order to avoid touching impossible per cpu
      areas.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Mike Travis <travis@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9eccf2a8