1. 25 Jul, 2008 5 commits
    • M
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE)... · 04f2cbe3
      Mel Gorman committed
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed
      
      After patch 2 in this series, a process that successfully calls mmap() for
      a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
      process calls fork().  At that point, the next write fault from the parent
      could fail due to COW if the child still has a reference.
      
      We only reserve pages for the parent but a copy must be made to avoid
      leaking data from the parent to the child after fork().  Reserves could be
      taken for both parent and child at fork time to guarantee faults but if
      the mapping is large it is highly likely we will not have sufficient pages
      for the reservation, and it is common to fork only to exec() immediately
      after.  A failure here would be very undesirable.
      
      Note that the current behaviour of mainline with MAP_PRIVATE pages is
      pretty bad.  The following situation is allowed to occur today.
      
      1. Process calls mmap(MAP_PRIVATE)
      2. Process calls mlock() to fault all pages and makes sure it succeeds
      3. Process forks()
      4. Process writes to MAP_PRIVATE mapping while child still exists
      5. If the COW fails at this point, the process gets SIGKILLed even though it
         had taken care to ensure the pages existed
      
      This patch improves the situation by guaranteeing the reliability of the
      process that successfully calls mmap().  When the parent performs COW, it
      will try to satisfy the allocation without using reserves.  If that fails
      the parent will steal the page leaving any children without a page.
      Faults from the child after that point will result in failure.  If the
      child COW happens first, an attempt will be made to allocate the page
      without reserves and the child will get SIGKILLed on failure.
      
      To summarise the new behaviour:
      
      1. If the original mapper performs COW on a private mapping with multiple
         references, it will attempt to allocate a hugepage from the pool or
         the buddy allocator without using the existing reserves. On fail, VMAs
         mapping the same area are traversed and the page being COW'd is unmapped
         where found. It will then steal the original page as the last mapper in
         the normal way.
      
      2. The VMAs the pages were unmapped from are flagged to note that pages
         with data no longer exist. Future no-page faults on those VMAs will
         terminate the process as otherwise it would appear that data was corrupted.
         A warning is printed to the console that this situation occurred.
      
      3. If the child performs COW first, it will attempt to satisfy the COW
         from the pool if there are enough pages or via the buddy allocator if
         overcommit is allowed and the buddy allocator can satisfy the request. If
         it fails, the child will be killed.
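The summary above reduces to a small decision table. The following is a toy model only: the enum and function names are invented for illustration and do not appear in the kernel, and the model ignores locking, reserves accounting, and the VMA flagging described in the second point.

```c
/* Outcome of a COW fault on a private hugetlbfs mapping (toy model). */
enum cow_result {
    COW_COPIED,      /* a fresh huge page was allocated and the data copied */
    COW_STOLE_PAGE,  /* original mapper unmapped the page elsewhere and kept it */
    COW_KILLED       /* the faulting process is killed */
};

/*
 * Hypothetical helper: is_original_mapper is nonzero for the process that
 * called mmap(); alloc_ok is nonzero if a huge page could be obtained
 * without touching the existing reserves.
 */
static enum cow_result hugetlb_cow_model(int is_original_mapper, int alloc_ok)
{
    if (alloc_ok)
        return COW_COPIED;
    /* Allocation failed: only the original mapper may take the page. */
    return is_original_mapper ? COW_STOLE_PAGE : COW_KILLED;
}
```

In this model the only fatal case for the faulting process is a child whose allocation fails; a child that later faults on a stolen page is the separate failure described in the second point.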
      
      If the pool is large enough, existing applications will not notice that
      the reserves were a factor.  Existing applications depending on the
      no-reserves being set are unlikely to exist as for much of the history of
      hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
      point or failing the mmap().
      
      [npiggin@suse.de: fix CONFIG_HUGETLB=n build]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      hugetlb: reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() · a1e78772
      Mel Gorman committed
      This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
      a similar manner to the reservations taken for MAP_SHARED mappings.  The
      reserve count is accounted both globally and on a per-VMA basis for
      private mappings.  This guarantees that a process that successfully calls
      mmap() will successfully fault all pages in the future unless fork() is
      called.
      
      The behaviour of private mappings of hugetlbfs files after this patch is
      as follows:
      
      1. The process calling mmap() is guaranteed to succeed all future faults until
         it forks().
      2. On fork(), the parent may die due to SIGKILL on writes to the private
         mapping if enough pages are not available for the COW. For reasonably
         reliable behaviour in the face of a small huge page pool, children of
         hugepage-aware processes should not reference the mappings; such as
         might occur when fork()ing to exec().
      3. On fork(), the child VMAs inherit no reserves. Reads on pages already
         faulted by the parent will succeed. Successful writes will depend on enough
         huge pages being free in the pool.
      4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
         and at fault time otherwise.
      
      Before this patch, all reads or writes in the child potentially need page
      allocations that can later lead to the death of the parent.  This applies
      to reads and writes of uninstantiated pages as well as COW.  After the
      patch it is only a write to an instantiated page that causes problems.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      fix soft lock up at NFS mount via per-SB LRU-list of unused dentries · da3bbdd4
      Kentaro Makita committed
      [Summary]
      
       Split LRU-list of unused dentries to one per superblock to avoid soft
       lock up during NFS mounts and remounting of any filesystem.
      
       Previously I posted here:
       http://lkml.org/lkml/2008/3/5/590
      
      [Descriptions]
      
      - background
      
        dentry_unused is a list of dentries which are not referenced.
        dentry_unused grows when references on directories or files are
        released.  This list can be very long if there is a lot of free memory.
      
      - the problem
      
        When shrink_dcache_sb() is called, it scans all of dentry_unused linearly
        under spin_lock(), and if dentry->d_sb is different from the given
        superblock, it moves on to the next dentry.  This scan is very costly if
        there are many entries, and very inefficient if there are many superblocks.
      
        IOW, when we need to shrink unused dentries on one superblock, we scan
        unused dentries on all superblocks in the system.  For example, we need to
        scan 500 dentries to unmount a filesystem, but we scan 1,000,000 or more
        unused dentries on other superblocks.
      
        In our case, when mounting NFS*, shrink_dcache_sb() is called to shrink
        unused dentries on NFS, but it scans 100,000,000 unused dentries on
        superblocks in the system such as local ext3 filesystems.  I hear NFS
        mounting took 1 min on some systems in use.
      
      * : NFS uses a virtual filesystem in the rpc layer, so NFS is affected by
        this problem.

        100,000,000 is a possible number on large systems.

        A per-superblock LRU of unused dentries can reduce the cost in a
        reasonable manner.
      
      - How to fix
      
        I found this problem is solved by David Chinner's "Per-superblock
        unused dentry LRU lists V3"(1), so I rebased it and added some fixes to
        reclaim with fairness, as suggested in Andrew Morton's comments(2).
      
        1) http://lkml.org/lkml/2006/5/25/318
        2) http://lkml.org/lkml/2006/5/25/320
      
        Split the LRU list of unused dentries into one per superblock.  Then, NFS
        mounting will check dentries under one superblock instead of all.  But
        this splitting will break the LRU of dentry_unused.  So, I've attempted to
        reclaim unused dentries fairly by calculating the number of
        dentries to scan on this sb in the following way:
      
        number of dentries to scan on this sb =
        count * (number of dentries on this sb / number of dentries in the machine)
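The proportional formula above can be written as a small helper. This is a sketch under assumptions, not the patch's code: the function name sb_scan_share() and its parameter names are made up here, and the integer arithmetic is one possible rounding choice.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the patch): share of a global scan
 * budget `count` given to one superblock, proportional to the number of
 * unused dentries it holds.
 *
 * The multiplication is done first so that small superblocks are not
 * rounded to zero earlier than necessary.  Assumption: count * nr_on_sb
 * fits in 64 bits, which holds for the magnitudes quoted in the text.
 */
static uint64_t sb_scan_share(uint64_t count, uint64_t nr_on_sb,
                              uint64_t nr_total)
{
    if (nr_total == 0)
        return 0;   /* nothing to scan anywhere */
    return count * nr_on_sb / nr_total;
}
```

Integer division truncates, so a superblock with very few dentries relative to the machine total can receive a share of zero for a small budget; a real implementation has to decide how to handle that case.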
      
      - ToDo
       - I have to measure performance numbers and do stress tests.

       - When an unmount occurs during prune_dcache() while scanning the same
        superblock, it is unable to reach the next superblock because it has gone
        away.  We restart scanning superblocks from the first one, which causes
        unfair reclaim of unused dentries on the first superblock.  But I think
        this happens very rarely.
      
      - Test Results
      
        Result on 6GB boxes with excessive unused dentries.
      
      Without patch:
      
      $ cat /proc/sys/fs/dentry-state
      10181835        10180203        45      0       0       0
      # mount -t nfs 10.124.60.70:/work/kernel-src nfs
      real    0m1.830s
      user    0m0.001s
      sys     0m1.653s
      
       With this patch:
      $ cat /proc/sys/fs/dentry-state
      10236610        10234751        45      0       0       0
      # mount -t nfs 10.124.60.70:/work/kernel-src nfs
      real    0m0.106s
      user    0m0.002s
      sys     0m0.032s
      
      [akpm@linux-foundation.org: fix comments]
      Signed-off-by: Kentaro Makita <k-makita@np.css.fujitsu.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: David Chinner <dgc@sgi.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: remove double indirection on tlb parameter to free_pgd_range() & Co · 42b77728
      Jan Beulich committed
      The double indirection here is not needed anywhere and hence (at least)
      confusing.
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm/vmstat.c: proper externs · c748e134
      Adrian Bunk committed
      This patch adds proper extern declarations for five variables in
      include/linux/vmstat.h
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 23 Jul, 2008 3 commits
    • A
      netns: make get_proc_net() static · 8086cd45
      Adrian Bunk committed
      get_proc_net() can now become static.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Acked-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      proc: fix /proc/*/pagemap some more · ee1e6ab6
      Alexey Dobriyan committed
      struct pagemap_walk was placed on stack, some hooks are initialized, the
      rest (->pgd_entry, ->pud_entry, ->pte_entry) are valid but junk.
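The class of bug described above (a callback struct on the stack with only some members assigned) is commonly avoided in C by initializing the whole struct at its declaration. The sketch below illustrates that idiom with an invented struct walk_hooks; it is not the kernel's walker structure and does not claim to reproduce the actual fix.

```c
#include <stddef.h>

/* Hypothetical stand-in for a page-table walker's callback table. */
struct walk_hooks {
    int (*pgd_entry)(void *priv);
    int (*pud_entry)(void *priv);
    int (*pmd_entry)(void *priv);
    int (*pte_entry)(void *priv);
};

/* Example callback; does nothing useful. */
static int count_pmd(void *priv)
{
    (void)priv;
    return 0;
}

/*
 * With a designated initializer, every member that is not named is
 * zero-initialized, so unset hooks are NULL rather than stack junk.
 * Returns 1 if that holds for this local variable.
 */
static int unset_hooks_are_null(void)
{
    struct walk_hooks w = { .pmd_entry = count_pmd };

    return w.pgd_entry == NULL &&
           w.pud_entry == NULL &&
           w.pte_entry == NULL &&
           w.pmd_entry == count_pmd;
}
```

Code that walks the table can then test each hook against NULL before calling it.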
      Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
      Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      execve filename: document and export via auxiliary vector · 65191087
      John Reiser committed
      The Linux kernel puts the filename argument of execve() into the new
      address space.  Many developers are surprised to learn this.  Those who
      know and could use it, object "But it's not documented."
      
      Those who want to use it dislike the expression
        (char *)(1+ strlen(env[-1+ n_env]) + env[-1+ n_env])
      because it requires locating the last original environment variable,
      and assumes that the filename follows the characters.
      
      This patch documents the insertion of the filename, and makes it easier
      to find by adding a new tag AT_EXECFN in the ElfXX_auxv_t; see <elf.h>.
      
      In many cases readlink("/proc/self/exe",) gives the same answer.  But if
      all the original pages get unmapped, then the kernel erases the symlink
      for /proc/self/exe.  This can happen when a program decompressor does a
      good job of cleaning up after uncompressing directly to memory, so that
      the address space of the target program looks the same as if compression
      had never happened.  One example is http://upx.sourceforge.net .
      
      One notable use of the underlying concept (what path containED the
      executable) is glibc expanding $ORIGIN in DT_RUNPATH.  In practice for
      the near term, it may be a good idea for user-mode code to use both
      /proc/self/exe and AT_EXECFN as fall-back methods for each other.
      /proc/self/exe can fail due to unmapping, AT_EXECFN can fail because it
      won't be present on non-new systems.  The auxvec or {AT_EXECFN}.d_val
      also can get overwritten, although in nearly all cases this would be the
      result of a bug.
      
      The runtime cost is one NEW_AUX_ENT using two words of stack space.  The
      underlying value is maintained already as bprm->exec; setup_arg_pages()
      in fs/exec.c slides it for stack_shift, etc.
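The suggestion to use /proc/self/exe and AT_EXECFN as fall-backs for each other might look like the following in user space. This is a sketch under assumptions: exec_path() is a name invented here, getauxval() is the glibc interface (available since glibc 2.16, so newer than this commit), and the order of the two lookups is an arbitrary choice.

```c
#include <elf.h>        /* AT_EXECFN */
#include <stddef.h>
#include <sys/auxv.h>   /* getauxval() */
#include <sys/types.h>
#include <unistd.h>     /* readlink() */

/*
 * Hypothetical helper: returns a path for the running executable, or NULL
 * if neither source is available.
 *
 * First tries the /proc/self/exe symlink, copying the target into buf.
 * readlink() does not NUL-terminate and silently truncates, so one byte
 * is kept back for the terminator.  If that fails, falls back to the
 * AT_EXECFN auxiliary vector entry, which points at the filename string
 * the kernel placed in the new address space at execve() time.
 */
static const char *exec_path(char *buf, size_t len)
{
    if (len > 1) {
        ssize_t n = readlink("/proc/self/exe", buf, len - 1);

        if (n > 0) {
            buf[n] = '\0';
            return buf;
        }
    }
    /* getauxval() returns 0 when the tag is absent; that maps to NULL. */
    return (const char *)getauxval(AT_EXECFN);
}
```

Note that the two sources need not agree textually: the symlink yields a resolved path, while AT_EXECFN holds the filename as it was passed to execve(), which may be relative.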
      Signed-off-by: John Reiser <jreiser@BitWagon.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 22 Jul, 2008 5 commits
  4. 21 Jul, 2008 1 commit
    • A
      tty: Ldisc revamp · a352def2
      Alan Cox committed
      Move the line disciplines towards a conventional ->ops arrangement.  For
      the moment the actual 'tty_ldisc' struct in the tty is kept as part of
      the tty struct but this can then be changed if it turns out that when it
      all settles down we want to refcount ldiscs separately to the tty.
      
      Pull the ldisc code out of /proc and put it with our ldisc code.
      Signed-off-by: Alan Cox <alan@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 19 Jul, 2008 2 commits
  6. 18 Jul, 2008 4 commits
  7. 17 Jul, 2008 2 commits
    • C
      [PATCH] ocfs2: fix oops in mmap_truncate testing · c0420ad2
      Coly Li committed
      This patch fixes a mmap_truncate bug which was found by the ocfs2 test suite.

      In an ocfs2 cluster with more than one node, run the program mmap_truncate,
      which races mmap writes and truncates from multiple processes. While the
      test is running, a stat from another node forces writeout, causing an oops in
      ocfs2_get_block() because it sees a buffer to write which isn't allocated.

      This patch fixes the bug by clearing the dirty and uptodate bits in the
      buffer, leaving the buffer unmapped, and returning.

      The fix was suggested by Mark Fasheh, and I coded up the patch.
      Signed-off-by: Coly Li <coyli@suse.de>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • R
      Fix compile issues in fs/compat_ioctl.c when CONFIG_BLOCK is disabled · 3c3622dc
      Randy Dunlap committed
      Fix fs/compat_ioctl.c to handle CONFIG_BLOCK=n, CONFIG_SCSI=n to avoid
      build errors:
      
      In file included from include/scsi/scsi.h:12,
                       from fs/compat_ioctl.c:71:
      include/scsi/scsi_cmnd.h:27:25: warning: "BLK_MAX_CDB" is not defined
      include/scsi/scsi_cmnd.h:28:3: error: #error MAX_COMMAND_SIZE can not be bigger than BLK_MAX_CDB
      In file included from include/scsi/scsi.h:12,
                       from fs/compat_ioctl.c:71:
      include/scsi/scsi_cmnd.h: In function 'scsi_bidi_cmnd':
      include/scsi/scsi_cmnd.h:182: error: implicit declaration of function 'blk_bidi_rq'
      include/scsi/scsi_cmnd.h:183: error: dereferencing pointer to incomplete type
      include/scsi/scsi_cmnd.h: In function 'scsi_in':
      include/scsi/scsi_cmnd.h:189: error: dereferencing pointer to incomplete type
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 16 Jul, 2008 18 commits