1. 20 7月, 2007 25 次提交
    • M
      fix ext4/JBD2 build warnings · b38bd33a
      Mingming Cao 提交于
      Looking at the current linus-git tree jbd_debug() define in
      include/linux/jbd2.h
      
      extern u8 journal_enable_debug;
      
      #define jbd_debug(n, f, a...)                                           \
              do {                                                            \
                      if ((n) <= journal_enable_debug) {                      \
                              printk (KERN_DEBUG "(%s, %d): %s: ",            \
                                      __FILE__, __LINE__, __FUNCTION__);      \
                              printk (f, ## a);                               \
                      }                                                       \
              } while (0)
      > fs/ext4/inode.c: In function ‘ext4_write_inode’:
      > fs/ext4/inode.c:2906: warning: comparison is always true due to limited
      > range of data type
      >
      > fs/jbd2/recovery.c: In function ‘jbd2_journal_recover’:
      > fs/jbd2/recovery.c:254: warning: comparison is always true due to
      > limited range of data type
      > fs/jbd2/recovery.c:257: warning: comparison is always true due to
      > limited range of data type
      >
      > fs/jbd2/recovery.c: In function ‘jbd2_journal_skip_recovery’:
      > fs/jbd2/recovery.c:301: warning: comparison is always true due to
      > limited range of data type
      >
      Noticed all warnings are occurs when the debug level is 0. Then found
      the "jbd2: Move jbd2-debug file to debugfs" patch
      http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0f49d5d019afa4e94253bfc92f0daca3badb990b
      
      changed the jbd2_journal_enable_debug from int type to u8, makes the
      jbd_debug comparision is always true when the debugging level is 0. Thus
      the compile warning occurs.
      
      Thought about changing the jbd2_journal_enable_debug data type back to
      int, but can't, because the jbd2-debug is moved to debug fs, where
      calling debugfs_create_u8() to create the debugfs entry needs the value
      to be u8 type.
      
      Even if we changed the data type back to int, the code is still buggy,
      kernel should not print jbd2 debug message if the
      jbd2_journal_enable_debug is set to 0. But this is not the case.
      
      The fix is change the level of debugging to 1. The same should fixed in
      ext3/JBD, but currently ext3 jbd-debug via /proc fs is broken, so we
      probably should fix it all together.
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Cc: Jeff Garzik <jeff@garzik.org>
      Cc: Theodore Tso <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b38bd33a
    • K
      coredump masking: ELF-FDPIC: enable core dump filtering · ee78b0a6
      Kawai, Hidehiro 提交于
      This patch enables core dump filtering for ELF-FDPIC-formatted core file.
      Signed-off-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee78b0a6
    • K
      coredump masking: ELF-FDPIC: remove an unused argument · e2e00906
      Kawai, Hidehiro 提交于
      This patch removes an unused argument from elf_fdpic_dump_segments().
      Signed-off-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2e00906
    • K
      coredump masking: ELF: enable core dump filtering · a1b59e80
      Kawai, Hidehiro 提交于
      This patch enables core dump filtering for ELF-formatted core file.
      Signed-off-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a1b59e80
    • K
      coredump masking: add an interface for core dump filter · 3cb4a0bb
      Kawai, Hidehiro 提交于
      This patch adds an interface to set/reset flags which determines each memory
      segment should be dumped or not when a core file is generated.
      
      /proc/<pid>/coredump_filter file is provided to access the flags.  You can
      change the flag status for a particular process by writing to or reading from
      the file.
      
      The flag status is inherited to the child process when it is created.
      Signed-off-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3cb4a0bb
    • K
      coredump masking: reimplementation of dumpable using two flags · 6c5d5238
      Kawai, Hidehiro 提交于
      This patch changes mm_struct.dumpable to a pair of bit flags.
      
      set_dumpable() converts three-value dumpable to two flags and stores it into
      lower two bits of mm_struct.flags instead of mm_struct.dumpable.
      get_dumpable() behaves in the opposite way.
      
      [akpm@linux-foundation.org: export set_dumpable]
      Signed-off-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6c5d5238
    • J
      fs: remove path_walk export · f79c20f5
      Josef 'Jeff' Sipek 提交于
      Signed-off-by: NJosef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f79c20f5
    • J
      fs: mark link_path_walk static · c4a7808f
      Josef 'Jeff' Sipek 提交于
      Signed-off-by: NJosef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4a7808f
    • J
      nfsctl: use vfs_path_lookup · 16b6287a
      Josef 'Jeff' Sipek 提交于
      use vfs_path_lookup instead of open-coding the necessary functionality.
      Signed-off-by: NJosef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
      Acked-by: NNeilBrown <neilb@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16b6287a
    • J
      fs: introduce vfs_path_lookup · 16f18200
      Josef 'Jeff' Sipek 提交于
      Stackable file systems, among others, frequently need to lookup paths or
      path components starting from an arbitrary point in the namespace
      (identified by a dentry and a vfsmount).  Currently, such file systems use
      lookup_one_len, which is frowned upon [1] as it does not pass the lookup
      intent along; not passing a lookup intent, for example, can trigger BUG_ON's
      when stacking on top of NFSv4.
      
      The first patch introduces a new lookup function to allow lookup starting
      from an arbitrary point in the namespace.  This approach has been suggested
      by Christoph Hellwig [2].
      
      The second patch changes sunrpc to use vfs_path_lookup.
      
      The third patch changes nfsctl.c to use vfs_path_lookup.
      
      The fourth patch marks link_path_walk static.
      
      The fifth, and last patch, unexports path_walk because it is no longer
      unnecessary to call it directly, and using the new vfs_path_lookup is
      cleaner.
      
      For example, the following snippet of code, looks up "some/path/component"
      in a directory pointed to by parent_{dentry,vfsmnt}:
      
      err = vfs_path_lookup(parent_dentry, parent_vfsmnt,
      		      "some/path/component", 0, &nd);
      if (!err) {
      	/* exits */
      
      	...
      
      	/* once done, release the references */
      	path_release(&nd);
      } else if (err == -ENOENT) {
      	/* doesn't exist */
      } else {
      	/* other error */
      }
      
      VFS functions such as lookup_create can be used on the nameidata structure
      to pass the create intent to the file system.
      Signed-off-by: NJosef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16f18200
    • O
      mm: variable length argument support · b6a2fea3
      Ollie Wild 提交于
      Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
      the old mm into the new mm.
      
      We create the new mm before the binfmt code runs, and place the new stack at
      the very top of the address space.  Once the binfmt code runs and figures out
      where the stack should be, we move it downwards.
      
      It is a bit peculiar in that we have one task with two mm's, one of which is
      inactive.
      
      [a.p.zijlstra@chello.nl: limit stack size]
      Signed-off-by: NOllie Wild <aaw@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      [bunk@stusta.de: unexport bprm_mm_init]
      Signed-off-by: NAdrian Bunk <bunk@stusta.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6a2fea3
    • P
      audit: rework execve audit · bdf4c48a
      Peter Zijlstra 提交于
      The purpose of audit_bprm() is to log the argv array to a userspace daemon at
      the end of the execve system call.  Since user-space hasn't had time to run,
      this array is still in pristine state on the process' stack; so no need to
      copy it, we can just grab it from there.
      
      In order to minimize the damage to audit_log_*() copy each string into a
      temporary kernel buffer first.
      
      Currently the audit code requires that the full argument vector fits in a
      single packet.  So currently it does clip the argv size to a (sysctl) limit,
      but only when execve auditing is enabled.
      
      If the audit protocol gets extended to allow for multiple packets this check
      can be removed.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NOllie Wild <aaw@google.com>
      Cc: <linux-audit@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bdf4c48a
    • R
      readahead: split ondemand readahead interface into two functions · cf914a7d
      Rusty Russell 提交于
      Split ondemand readahead interface into two functions.  I think this makes it
      a little clearer for non-readahead experts (like Rusty).
      
      Internally they both call ondemand_readahead(), but the page argument is
      changed to an obvious boolean flag.
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf914a7d
    • F
      readahead: pass real splice size · d8983910
      Fengguang Wu 提交于
      Pass real splice size to page_cache_readahead_ondemand().
      
      The splice code works in chunks of 16 pages internally.  The readahead code
      should be told of the overall splice size, instead of the internal chunk size.
       Otherwize bad things may happen.  Imagine some 17-page random splice reads.
      The code before this patch will result in two readahead calls: readahead(16);
      readahead(1); That leads to one 16-page I/O and one 32-page I/O: one extra I/O
      and 31 readahead miss pages.
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8983910
    • F
      readahead: move synchronous readahead call out of splice loop · 431a4820
      Fengguang Wu 提交于
      Move synchronous page_cache_readahead_ondemand() call out of splice loop.
      
      This avoids one pointless page allocation/insertion in case of non-zero
      ra_pages, or many pointless readahead calls in case of zero ra_pages.
      
      Note that if a user sets ra_pages to less than PIPE_BUFFERS=16 pages, he will
      not get expected readahead behavior anyway.  The splice code works in batches
      of 16 pages, which can be taken as another form of synchronous readahead.
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      431a4820
    • F
      readahead: convert ext3/ext4 invocations · dc7868fc
      Fengguang Wu 提交于
      Convert ext3/ext4 dir reads to use on-demand readahead.
      
      Readahead for dirs operates _not_ on file level, but on blockdev level.  This
      makes a difference when the data blocks are not continuous.  And the read
      routine is somehow opaque: there's no handy info about the status of current
      page.  So a simplified call scheme is employed: to call into readahead
      whenever the current page falls out of readahead windows.
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc7868fc
    • F
      readahead: convert splice invocations · a08a166f
      Fengguang Wu 提交于
      Convert splice reads to use on-demand readahead.
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Jens Axboe <axboe@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a08a166f
    • M
      eCryptfs: ecryptfs_setattr() bugfix · 64ee4808
      Michael Halcrow 提交于
      There is another bug recently introduced into the ecryptfs_setattr()
      function in 2.6.22.  eCryptfs will attempt to treat special files like
      regular eCryptfs files on chmod, chown, and so forth.  This leads to a NULL
      pointer dereference.  This patch validates that the file is a regular file
      before proceeding with operations related to the inode's crypt_stat.
      
      Thanks to Ryusuke Konishi for finding this bug and suggesting the fix.
      Signed-off-by: NMichael Halcrow <mhalcrow@us.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64ee4808
    • R
      Avoid too many remote cpu references due to /proc/stat · 4004c69a
      Ravikiran G Thirumalai 提交于
      Optimize show_stat to collect per-irq information just once.
      
      On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
      On every call to kstat_irqs, the process brings in per-cpu data from all
      online cpus.  Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS
      results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
      Considering the fact that we already compute this value per-cpu, we can
      save on the remote references as below.
      Signed-off-by: NAlok N Kataria <alok.kataria@calsoftinc.com>
      Signed-off-by: NRavikiran Thirumalai <kiran@scalex86.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4004c69a
    • A
      unregister_chrdev() return void · e53252d9
      Akinobu Mita 提交于
      unregister_chrdev() does not return meaningful value.  This patch makes it
      return void like most unregister_* functions.
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e53252d9
    • C
      UDF: coding style conversion - lindent · cb00ea35
      Cyrill Gorcunov 提交于
      This patch converts UDF coding style to kernel coding style using Lindent.
      Signed-off-by: NCyrill Gorcunov <gorcunov@gmail.com>
      Cc: Jan Kara <jack@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb00ea35
    • N
      mm: fault feedback #2 · 83c54070
      Nick Piggin 提交于
      This patch completes Linus's wish that the fault return codes be made into
      bit flags, which I agree makes everything nicer.  This requires requires
      all handle_mm_fault callers to be modified (possibly the modifications
      should go further and do things like fault accounting in handle_mm_fault --
      however that would be for another patch).
      
      [akpm@linux-foundation.org: fix alpha build]
      [akpm@linux-foundation.org: fix s390 build]
      [akpm@linux-foundation.org: fix sparc build]
      [akpm@linux-foundation.org: fix sparc64 build]
      [akpm@linux-foundation.org: fix ia64 build]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Bryan Wu <bryan.wu@analog.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Acked-by: NKyle McMartin <kyle@mcmartin.ca>
      Acked-by: NHaavard Skinnemoen <hskinnemoen@atmel.com>
      Acked-by: NRalf Baechle <ralf@linux-mips.org>
      Acked-by: NAndi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      [ Still apparently needs some ARM and PPC loving - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83c54070
    • N
      mm: fault feedback #1 · d0217ac0
      Nick Piggin 提交于
      Change ->fault prototype.  We now return an int, which contains
      VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
       FAULT_RET_ code tells the VM whether a page was found, whether it has been
      locked, and potentially other things.  This is not quite the way he wanted
      it yet, but that's changed in the next patch (which requires changes to
      arch code).
      
      This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
      that a page is locked which requires filemap_nopage to go away (because we
      can no longer remain backward compatible without that flag), but we were
      going to do that anyway.
      
      struct fault_data is renamed to struct vm_fault as Linus asked. address
      is now a void __user * that we should firmly encourage drivers not to use
      without really good reason.
      
      The page is now returned via a page pointer in the vm_fault struct.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0217ac0
    • N
      mm: merge populate and nopage into fault (fixes nonlinear) · 54cb8821
      Nick Piggin 提交于
      Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
      the virtual address -> file offset differently from linear mappings.
      
      ->populate is a layering violation because the filesystem/pagecache code
      should need to know anything about the virtual memory mapping.  The hitch here
      is that the ->nopage handler didn't pass down enough information (ie.  pgoff).
       But it is more logical to pass pgoff rather than have the ->nopage function
      calculate it itself anyway (because that's a similar layering violation).
      
      Having the populate handler install the pte itself is likewise a nasty thing
      to be doing.
      
      This patch introduces a new fault handler that replaces ->nopage and
      ->populate and (later) ->nopfn.  Most of the old mechanism is still in place
      so there is a lot of duplication and nice cleanups that can be removed if
      everyone switches over.
      
      The rationale for doing this in the first place is that nonlinear mappings are
      subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
      to duplicate the synchronisation logic rather than just consolidate the two.
      
      After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
      pagecache.  Seems like a fringe functionality anyway.
      
      NOPAGE_REFAULT is removed.  This should be implemented with ->fault, and no
      users have hit mainline yet.
      
      [akpm@linux-foundation.org: cleanup]
      [randy.dunlap@oracle.com: doc. fixes for readahead]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54cb8821
    • N
      mm: fix fault vs invalidate race for linear mappings · d00806b1
      Nick Piggin 提交于
      Fix the race between invalidate_inode_pages and do_no_page.
      
      Andrea Arcangeli identified a subtle race between invalidation of pages from
      pagecache with userspace mappings, and do_no_page.
      
      The issue is that invalidation has to shoot down all mappings to the page,
      before it can be discarded from the pagecache.  Between shooting down ptes to
      a particular page, and actually dropping the struct page from the pagecache,
      do_no_page from any process might fault on that page and establish a new
      mapping to the page just before it gets discarded from the pagecache.
      
      The most common case where such invalidation is used is in file truncation.
      This case was catered for by doing a sort of open-coded seqlock between the
      file's i_size, and its truncate_count.
      
      Truncation will decrease i_size, then increment truncate_count before
      unmapping userspace pages; do_no_page will read truncate_count, then find the
      page if it is within i_size, and then check truncate_count under the page
      table lock and back out and retry if it had subsequently been changed (ptl
      will serialise against unmapping, and ensure a potentially updated
      truncate_count is actually visible).
      
      Complexity and documentation issues aside, the locking protocol fails in the
      case where we would like to invalidate pagecache inside i_size.  do_no_page
      can come in anytime and filemap_nopage is not aware of the invalidation in
      progress (as it is when it is outside i_size).  The end result is that
      dangling (->mapping == NULL) pages that appear to be from a particular file
      may be mapped into userspace with nonsense data.  Valid mappings to the same
      place will see a different page.
      
      Andrea implemented two working fixes, one using a real seqlock, another using
      a page->flags bit.  He also proposed using the page lock in do_no_page, but
      that was initially considered too heavyweight.  However, it is not a global or
      per-file lock, and the page cacheline is modified in do_no_page to increment
      _count and _mapcount anyway, so a further modification should not be a large
      performance hit.  Scalability is not an issue.
      
      This patch implements this latter approach.  ->nopage implementations return
      with the page locked if it is possible for their underlying file to be
      invalidated (in that case, they must set a special vm_flags bit to indicate
      so).  do_no_page only unlocks the page after setting up the mapping
      completely.  invalidation is excluded because it holds the page lock during
      invalidation of each page (and ensures that the page is not mapped while
      holding the lock).
      
      This also allows significant simplifications in do_no_page, because we have
      the page locked in the right place in the pagecache from the start.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d00806b1
  2. 19 7月, 2007 15 次提交