1. 20 Jul, 2007 40 commits
    • user namespace: fix copy_user_ns return value · 626ac545
      Serge E. Hallyn committed
      When CONFIG_USER_NS=n and a user tries to unshare some namespace other
      than the user namespace, the dummy copy_user_ns returns NULL rather than
      the old_ns.
      
      This value then gets assigned to task->nsproxy->user_ns, so that a
      subsequent setuid, which uses task->nsproxy->user_ns, causes a NULL
      pointer deref.
      
      Fix this by returning old_ns.
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      626ac545
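The copy_user_ns fix above (626ac545) can be sketched in plain C as follows. This is a simplified userspace illustration, not the kernel source: the struct is a hypothetical stand-in, and the parameter list is an assumption made for the example.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's user namespace structure. */
struct user_namespace {
    int placeholder;
};

/* Buggy stub: with the feature compiled out it drops the caller's namespace. */
static inline struct user_namespace *
copy_user_ns_buggy(int flags, struct user_namespace *old_ns)
{
    (void)flags;
    (void)old_ns;
    return NULL; /* later dereferenced through task->nsproxy->user_ns */
}

/* Fixed stub: nothing is copied, so the old namespace is handed back. */
static inline struct user_namespace *
copy_user_ns_fixed(int flags, struct user_namespace *old_ns)
{
    (void)flags;
    return old_ns;
}
```

With the fixed stub, a caller that stores the return value keeps a valid pointer even when user namespaces are not configured.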
    • [PATCH] sched: implement cpu_clock(cpu) high-speed time source · e436d800
      Ingo Molnar committed
      Implement the cpu_clock(cpu) interface for kernel-internal use:
      high-speed (but slightly incorrect) per-cpu clock constructed from
      sched_clock().
      
      This API, unused at the moment, will be used in the future by blktrace,
      by the softlockup-watchdog, by printk and by lockstat.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e436d800
    • [PATCH] sched: sched_cacheflush is now unused · c41917df
      Ralf Baechle committed
      Since Ingo's recent scheduler rewrite, which was merged as commit
      0437e109, sched_cacheflush is unused.
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c41917df
    • include/linux/pci_id.h: add amd northbridge defines · 53078ca8
      Douglas Thompson committed
      pci_ids.h needs two of the AMD NB device IDs, namely the Addressmap and the
      Memory Controller devices.
      
      This patch adds those to the pci_ids.h include file.
      Signed-off-by: Douglas Thompson <dougthompson@xmission.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      53078ca8
    • drivers/edac: mod assert_error check · 66ee2f94
      Dave Jiang committed
      Change error check and clear variable from an atomic to an int.
      Signed-off-by: Dave Jiang <djiang@mvista.com>
      Signed-off-by: Douglas Thompson <dougthompson@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66ee2f94
    • drivers/edac: new Intel 30x0 MC driver · 535c6a53
      Jason Uhlenkott committed
      Here's a driver for the Intel 3000 and 3010 memory controllers,
      relative to today's Sourceforge code drop.  This has only had light
      testing (I've yet to actually see it handle a memory error) but it
      detects my hardware correctly.
      Signed-off-by: Jason Uhlenkott <juhlenko@akamai.com>
      Signed-off-by: Douglas Thompson <dougthompson@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      535c6a53
    • drivers/edac: add new nmi rescan · c0d12172
      Dave Jiang committed
      Provides a way for NMI reported errors on x86 to notify the EDAC
      subsystem of pending ECC errors by writing to a software state variable.
      
      Here's the reworked patch. I added an EDAC stub to the kernel so we can
      have variables that are in the kernel even if EDAC is a module. I also
      implemented the idea of using the chip driver to select error detection
      mode via module parameter and eliminate the kernel compile option.
      Please review/test. Thx!
      
      Also, I only made changes to some of the chipset drivers since I am
      unfamiliar with the other ones. We can add similar changes as we go.
      Signed-off-by: Dave Jiang <djiang@mvista.com>
      Signed-off-by: Douglas Thompson <dougthompson@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0d12172
    • lguest: the host code · d7e28ffe
      Rusty Russell committed
      This is the code for the "lg.ko" module, which allows lguest guests to
      be launched.
      
      [akpm@linux-foundation.org: update for futex-new-private-futexes]
      [akpm@linux-foundation.org: build fix]
      [jmorris@namei.org: lguest: use hrtimers]
      [akpm@linux-foundation.org: x86_64 build fix]
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7e28ffe
    • lguest: the guest code · 07ad157f
      Rusty Russell committed
      lguest is a simple hypervisor for Linux on Linux.  Unlike kvm it doesn't need
      VT/SVM hardware.  Unlike Xen it's simply "modprobe and go".  Unlike both, it's
      5000 lines and self-contained.
      
      Performance is ok, but not great (-30% on kernel compile).  But given its
      hackability, I expect this to improve, along with the paravirt_ops code which
      it supplies a complete example for.  There's also a 64-bit version being
      worked on and other craziness.
      
      But most of all, lguest is awesome fun!  Too much of the kernel is a big ball
      of hair.  lguest is simple enough to dive into and hack, plus has some warts
      which scream "fork me!".
      
      This patch:
      
      This is the code and headers required to make an i386 kernel an lguest guest.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07ad157f
    • knfsd: clean up EX_RDONLY · c7d51402
      J. Bruce Fields committed
      Share a little common code, reverse the arguments for consistency, drop the
      unnecessary "inline", and lowercase the name.
      Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
      Acked-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c7d51402
    • knfsd: move EX_RDONLY out of header · e22841c6
      J. Bruce Fields committed
      EX_RDONLY is only called in one place; just put it there.
      Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
      Acked-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e22841c6
    • move page writeback accounting out of macros · d688abf5
      Andrew Morton committed
      page-writeback accounting is presently performed in the page-flags macros.
      This is inconsistent and a bit ugly and makes it awkward to implement
      per-backing_dev under-writeback page accounting.
      
      So move this accounting down to the callsite(s).
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d688abf5
    • m68knommu: remove is_in_rom() function · 10146801
      Greg Ungerer committed
      Remove is_in_rom() function.  It doesn't actually serve the purpose it was
      intended to.  If you look at the use of it in _access_ok() (which is the only
      use of it) then it is obvious that most of memory is marked as access_ok.  No
      point having is_in_rom() then, so remove it.
      Signed-off-by: Greg Ungerer <gerg@uclinux.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10146801
    • m68knommu: generic irq handling · 2502b667
      Greg Ungerer committed
      Change the m68knommu irq handling to use the generic irq framework.
      Signed-off-by: Greg Ungerer <gerg@uclinux.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2502b667
    • stacktrace: fix header file for !CONFIG_STACKTRACE · 3b5ad079
      Johannes Berg committed
      The print_stack_trace macro in stacktrace.h has the wrong number of
      arguments; fix it.
      Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b5ad079
    • lockstat: measure lock bouncing · 96645678
      Peter Zijlstra committed
          __acquire
              |
             lock _____
              |        \
              |    __contended
              |         |
              |        wait
              | _______/
              |/
              |
         __acquired
              |
         __release
              |
           unlock
      
      We measure acquisition and contention bouncing.
      
      This is done by recording a cpu stamp in each lock instance.
      
      Contention bouncing requires the cpu stamp to be set on acquisition. Hence we
      move __acquired into the generic path.
      
      __acquired is then used to measure acquisition bouncing by comparing the
      current cpu with the old stamp before replacing it.
      
      __contended is used to measure contention bouncing (only useful for preemptable
      locks)
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96645678
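The cpu-stamp comparison described in the lock bouncing commit above (96645678) might look roughly like this in userspace C. The struct and function names are hypothetical and chosen for the illustration; only the idea of comparing the current cpu with the stored stamp before replacing it comes from the commit message.

```c
#include <stdbool.h>

/* Hypothetical per-lock-instance record. */
struct lock_stamp {
    int cpu;               /* cpu that last acquired the lock, -1 if none yet */
    unsigned long bounces; /* acquisitions that moved the lock between cpus */
};

static inline void lock_stamp_init(struct lock_stamp *s)
{
    s->cpu = -1;
    s->bounces = 0;
}

/*
 * Called on acquisition: compare the current cpu with the old stamp,
 * count a bounce if they differ, then replace the stamp.
 */
static inline bool note_acquired(struct lock_stamp *s, int this_cpu)
{
    bool bounced = (s->cpu != -1 && s->cpu != this_cpu);

    if (bounced)
        s->bounces++;
    s->cpu = this_cpu;
    return bounced;
}
```

The same comparison, made at the point where contention is detected instead of at acquisition, would give the contention bouncing count.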
    • lockdep: various fixes · 4b32d0a4
      Peter Zijlstra committed
       - update the copyright notices
       - use the default hash function
       - fix a thinko in a BUILD_BUG_ON
       - add a WARN_ON to spot inconsistent naming
       - fix a termination issue in /proc/lock_stat
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b32d0a4
    • lockstat: core infrastructure · f20786ff
      Peter Zijlstra committed
      Introduce the core lock statistics code.
      
      Lock statistics provides lock wait-time and hold-time (as well as the count
      of corresponding contention and acquisition events). Also, the first few
      call-sites that encounter contention are tracked.
      
      Lock wait-time is the time spent waiting on the lock. This provides insight
      into the locking scheme, that is, a heavily contended lock is indicative of
      a too coarse locking scheme.
      
      Lock hold-time is the duration the lock was held; this provides a reference
      for the wait-time numbers, so they can be put into perspective.
      
        1)
          lock
        2)
          ... do stuff ..
          unlock
        3)
      
      The time between 1 and 2 is the wait-time. The time between 2 and 3 is the
      hold-time.
      
      The lockdep held-lock tracking code is reused, because it already collects locks
      into meaningful groups (classes), and because it is an existing infrastructure
      for lock instrumentation.
      
      Currently lockdep tracks lock acquisition with two hooks:
      
        lock()
          lock_acquire()
          _lock()
      
       ... code protected by lock ...
      
        unlock()
          lock_release()
          _unlock()
      
      We need to extend this with two more hooks, in order to measure contention.
      
        lock_contended() - used to measure contention events
        lock_acquired()  - completion of the contention
      
      These are then placed the following way:
      
        lock()
          lock_acquire()
          if (!_try_lock())
            lock_contended()
            _lock()
            lock_acquired()
      
       ... do locked stuff ...
      
        unlock()
          lock_release()
          _unlock()
      
      (Note: the try_lock() 'trick' is used to avoid instrumenting all platform
             dependent lock primitive implementations.)
      
      It is also possible to toggle the two lockdep features at runtime using:
      
        /proc/sys/kernel/prove_locking
        /proc/sys/kernel/lock_stat
      
      (esp. turning off the O(n^2) prove_locking functionality can help)
      
      [akpm@linux-foundation.org: build fixes]
      [akpm@linux-foundation.org: nuke unneeded ifdefs]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Jason Baron <jbaron@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f20786ff
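The hook placement described in the lockstat core commit above (f20786ff) can be imitated with a toy lock. This is a single-threaded sketch under stated assumptions: every name is made up for the example, the counters stand in for the lock_acquire(), lock_contended() and lock_acquired() hooks, and the slow path simply marks the lock as taken where a real lock would block.

```c
#include <stdbool.h>

/* Toy lock with event counters. All names are illustrative, not kernel APIs. */
struct toy_lock {
    bool held;
    unsigned long acquire_events;   /* stands in for lock_acquire() */
    unsigned long contended_events; /* stands in for lock_contended() */
    unsigned long acquired_events;  /* stands in for lock_acquired() */
};

/* Fast path: take the lock only if it is free. */
static inline bool toy_trylock(struct toy_lock *l)
{
    if (l->held)
        return false;
    l->held = true;
    return true;
}

/*
 * Slow path stand-in. A real lock would wait here until the owner
 * releases it; this single-threaded sketch just marks the lock as taken.
 */
static inline void toy_lock_slow(struct toy_lock *l)
{
    l->held = true;
}

static inline void toy_lock(struct toy_lock *l)
{
    l->acquire_events++;
    if (!toy_trylock(l)) {
        l->contended_events++; /* contention begins: wait-time starts */
        toy_lock_slow(l);
        l->acquired_events++;  /* contention ends: hold-time starts */
    }
}

static inline void toy_unlock(struct toy_lock *l)
{
    l->held = false;
}
```

Because the contention hooks sit around a generic trylock attempt, the platform-specific lock primitive itself needs no instrumentation, which is the point of the try_lock() 'trick' mentioned in the commit message.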
    • fix raw_spinlock_t vs lockdep · 21f8ca3b
      Peter Zijlstra committed
      Use the lockdep infrastructure to track lock contention and other lock
      statistics.
      
      It tracks lock contention events, and the first four unique call-sites that
      encountered contention.
      
      It also measures lock wait-time and hold-time in nanoseconds. The minimum and
      maximum times are tracked, as well as a total (which together with the number
      of events can give the average).
      
      All statistics are done per lock class, per write (exclusive state) and per read
      (shared state).
      
      The statistics are collected per-cpu, so that the collection overhead is
      minimized via having no global cachemisses.
      
      This new lock statistics feature is independent of the lock dependency checking
      traditionally done by lockdep; it just shares the lock tracking code. It is
      also possible to enable both and disable either component at runtime, thereby
      avoiding the O(n^2) lock chain walks, for instance.
      
      This patch:
      
      raw_spinlock_t should not use lockdep (and doesn't) since lockdep itself
      relies on it.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21f8ca3b
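The min/max/total bookkeeping described in the commit message above (21f8ca3b) could be sketched as below. The struct layout and function names are assumptions made for the example; the sketch only shows how a total plus an event count yields the average.

```c
#include <stdint.h>

/* Hypothetical accumulator for one kind of time (wait or hold), in nanoseconds. */
struct lock_time {
    uint64_t min;
    uint64_t max;
    uint64_t total;
    unsigned long nr; /* number of recorded events */
};

/* Record one measurement, updating minimum, maximum and total. */
static inline void lock_time_add(struct lock_time *t, uint64_t ns)
{
    if (t->nr == 0 || ns < t->min)
        t->min = ns;
    if (ns > t->max)
        t->max = ns;
    t->total += ns;
    t->nr++;
}

/* Average derived from the total and the event count; 0 if nothing recorded. */
static inline uint64_t lock_time_avg(const struct lock_time *t)
{
    return t->nr ? t->total / t->nr : 0;
}
```

Keeping one such accumulator per cpu and summing them only when the statistics are read would match the per-cpu collection the message describes.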
    • coda: remove statistics counters from /proc/fs/coda · 3cf01f28
      Jan Harkes committed
      Similar information can easily be obtained with strace -c.
      Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3cf01f28
    • coda: remove struct coda_sb_info · a1b0aa87
      Jan Harkes committed
      The sb_info structure only contains a single pointer to the character device,
      so there is no need for the added indirection.
      Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1b0aa87
    • coda: block signals during upcall processing · d9664c95
      Jan Harkes committed
      We ignore signals for about 30 seconds to give userspace a chance to see the
      upcall.  As we did not block signals, we ended up in a busy loop for the
      remainder of the period when a signal was received.
      Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9664c95
    • i386: Put allocated ELF notes in read-only data segment · cbe87121
      Roland McGrath committed
      This changes the i386 linker script and the asm-generic macro it uses so that
      ELF note sections with SHF_ALLOC set are linked into the kernel image along
      with other read-only data.  The PT_NOTE also points to their location.
      
      This paves the way for putting useful build-time information into ELF notes
      that can be found easily later in a kernel memory dump.
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbe87121
    • coredump masking: add an interface for core dump filter · 3cb4a0bb
      Kawai, Hidehiro committed
      This patch adds an interface to set/reset flags that determine whether each
      memory segment should be dumped when a core file is generated.
      
      The /proc/<pid>/coredump_filter file is provided to access the flags.  You can
      read or change the flag status for a particular process by reading from or
      writing to the file.
      
      The flag status is inherited by the child process when it is created.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3cb4a0bb
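A userspace program could read the filter described in the commit above (3cb4a0bb) along these lines. This is a sketch, assuming the file prints the flags as a hexadecimal number; the helper names are made up for the example, and the meaning of the individual bits is not covered here.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse the text read from a coredump_filter file into a bitmask.
 * Assumption: the value is printed in hexadecimal.
 * Returns 0 on success, -1 if no hex digits were found.
 */
static inline int parse_coredump_filter(const char *text, unsigned long *mask)
{
    char *end;
    unsigned long value = strtoul(text, &end, 16);

    if (end == text)
        return -1;
    *mask = value;
    return 0;
}

/* Read the filter from a path such as "/proc/self/coredump_filter". */
static inline int read_coredump_filter(const char *path, unsigned long *mask)
{
    char buf[32];
    FILE *f = fopen(path, "r");
    int ret = -1;

    if (!f)
        return -1;
    if (fgets(buf, sizeof(buf), f))
        ret = parse_coredump_filter(buf, mask);
    fclose(f);
    return ret;
}
```

Changing the filter would be the mirror image: open the same file for writing and print the new mask into it. Since the flags are inherited, a parent can set them before creating the child whose dumps it wants to shape.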
    • coredump masking: reimplementation of dumpable using two flags · 6c5d5238
      Kawai, Hidehiro committed
      This patch changes mm_struct.dumpable to a pair of bit flags.
      
      set_dumpable() converts three-value dumpable to two flags and stores it into
      lower two bits of mm_struct.flags instead of mm_struct.dumpable.
      get_dumpable() behaves in the opposite way.
      
      [akpm@linux-foundation.org: export set_dumpable]
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c5d5238
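The conversion between the three-value dumpable setting and two flag bits, as described in the commit above (6c5d5238), might be sketched like this. The bit names and the exact encoding (0 as no bits, 1 as the low bit, 2 as both bits) are assumptions made for the illustration, and the functions operate on a plain word instead of mm_struct.flags.

```c
/* Assumed bit positions within the flags word. */
#define DUMPABLE_BIT      0
#define DUMP_SECURELY_BIT 1
#define DUMPABLE_MASK     ((1UL << DUMPABLE_BIT) | (1UL << DUMP_SECURELY_BIT))

/* Store a three-value setting (0, 1 or 2) in the lower two bits of flags. */
static inline unsigned long set_dumpable_bits(unsigned long flags, int value)
{
    flags &= ~DUMPABLE_MASK; /* leave all other flag bits untouched */
    switch (value) {
    case 1:
        flags |= 1UL << DUMPABLE_BIT;
        break;
    case 2:
        flags |= DUMPABLE_MASK;
        break;
    default: /* 0: both bits stay clear */
        break;
    }
    return flags;
}

/* Recover the three-value setting from the lower two bits. */
static inline int get_dumpable_value(unsigned long flags)
{
    unsigned long bits = flags & DUMPABLE_MASK;

    return bits >= 2 ? 2 : (int)bits;
}
```

Packing the setting into the low bits leaves the rest of the word free for other per-mm flags, such as the core dump filter bits.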
    • fs: remove path_walk export · f79c20f5
      Josef 'Jeff' Sipek committed
      Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f79c20f5
    • fs: mark link_path_walk static · c4a7808f
      Josef 'Jeff' Sipek committed
      Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4a7808f
    • fs: introduce vfs_path_lookup · 16f18200
      Josef 'Jeff' Sipek committed
      Stackable file systems, among others, frequently need to lookup paths or
      path components starting from an arbitrary point in the namespace
      (identified by a dentry and a vfsmount).  Currently, such file systems use
      lookup_one_len, which is frowned upon [1] as it does not pass the lookup
      intent along; not passing a lookup intent, for example, can trigger BUG_ON's
      when stacking on top of NFSv4.
      
      The first patch introduces a new lookup function to allow lookup starting
      from an arbitrary point in the namespace.  This approach has been suggested
      by Christoph Hellwig [2].
      
      The second patch changes sunrpc to use vfs_path_lookup.
      
      The third patch changes nfsctl.c to use vfs_path_lookup.
      
      The fourth patch marks link_path_walk static.
      
      The fifth and last patch unexports path_walk because it is no longer
      necessary to call it directly, and using the new vfs_path_lookup is
      cleaner.
      
      For example, the following snippet of code looks up "some/path/component"
      in a directory pointed to by parent_{dentry,vfsmnt}:
      
      err = vfs_path_lookup(parent_dentry, parent_vfsmnt,
      		      "some/path/component", 0, &nd);
      if (!err) {
      	/* exists */
      
      	...
      
      	/* once done, release the references */
      	path_release(&nd);
      } else if (err == -ENOENT) {
      	/* doesn't exist */
      } else {
      	/* other error */
      }
      
      VFS functions such as lookup_create can be used on the nameidata structure
      to pass the create intent to the file system.
      Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16f18200
    • mm: variable length argument support · b6a2fea3
      Ollie Wild committed
      Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
      the old mm into the new mm.
      
      We create the new mm before the binfmt code runs, and place the new stack at
      the very top of the address space.  Once the binfmt code runs and figures out
      where the stack should be, we move it downwards.
      
      It is a bit peculiar in that we have one task with two mm's, one of which is
      inactive.
      
      [a.p.zijlstra@chello.nl: limit stack size]
      Signed-off-by: Ollie Wild <aaw@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      [bunk@stusta.de: unexport bprm_mm_init]
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6a2fea3
    • audit: rework execve audit · bdf4c48a
      Peter Zijlstra committed
      The purpose of audit_bprm() is to log the argv array to a userspace daemon at
      the end of the execve system call.  Since user-space hasn't had time to run,
      this array is still in pristine state on the process' stack; so no need to
      copy it, we can just grab it from there.
      
      In order to minimize the damage to audit_log_*() copy each string into a
      temporary kernel buffer first.
      
      Currently the audit code requires that the full argument vector fits in a
      single packet.  So currently it does clip the argv size to a (sysctl) limit,
      but only when execve auditing is enabled.
      
      If the audit protocol gets extended to allow for multiple packets this check
      can be removed.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ollie Wild <aaw@google.com>
      Cc: <linux-audit@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bdf4c48a
    • arch: personality independent stack top · b111757c
      Peter Zijlstra committed
      A new arch macro, STACK_TOP_MAX, gives the largest valid stack address for
      the architecture in question.
      
      It differs from STACK_TOP in that it will not distinguish between
      personalities but will always return the largest possible address.
      
      This is used to create the initial stack on execve, which we will move down to
      the proper location once the binfmt code has figured out where that is.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ollie Wild <aaw@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b111757c
    • define new percpu interface for shared data · 5fb7dc37
      Fenghua Yu committed
      The per cpu data section contains two types of data: one set which is
      exclusively accessed by the local cpu, and another set which is per cpu
      but also shared by remote cpus.  In the current kernel, these two sets are
      not clearly separated out.  This can potentially cause the same data
      cacheline to be shared between the two sets of data, which will result in
      unnecessary bouncing of the cacheline between cpus.
      
      One way to fix the problem is to cacheline align the remotely accessed per
      cpu data, both at the beginning and at the end.  Because of the padding at
      both ends, this will likely cause some memory wastage and also the
      interface to achieve this is not clean.
      
      This patch:
      
      Moves the remotely accessed per cpu data (which is currently marked
      as ____cacheline_aligned_in_smp) into a different section, where all the data
      elements are cacheline aligned. And as such, this differentiates the local
      only data and remotely accessed data cleanly.
      Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
      Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5fb7dc37
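The alignment side of the percpu commit above (5fb7dc37) can be illustrated in userspace C11. This sketch does not reproduce the separate linker section the patch introduces; it only shows how cacheline-aligning the remotely accessed data keeps it off the lines used by local-only data. The 64-byte line size and all names are assumptions.

```c
#include <stddef.h>

/* Assumed cacheline size; real values depend on the cpu. */
#define ASSUMED_CACHELINE 64

/* Data touched only by the owning cpu: packed tightly, no padding needed. */
struct local_only_data {
    unsigned long ticks;
    unsigned long events;
};

/*
 * Data that other cpus also read or write: aligned to a cacheline, so the
 * compiler pads the struct to a whole line and no local-only variable
 * can end up sharing that line.
 */
struct shared_data {
    _Alignas(ASSUMED_CACHELINE) unsigned long value;
};
```

Grouping all such aligned objects together, as the patch does with its own section, avoids paying the padding cost around each one individually.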
    • jprobes: make jprobes a little safer for users · 3d7e3382
      Michael Ellerman committed
      I realise jprobes are a razor-blades-included type of interface, but that
      doesn't mean we can't try and make them safer to use.  This guy I know once
      wrote code like this:
      
      struct jprobe jp = { .kp.symbol_name = "foo", .entry = "jprobe_foo" };
      
      And then his kernel exploded. Oops.
      
      This patch adds an arch hook, arch_deref_entry_point() (I don't like it
      either) which takes the void * in a struct jprobe, and gives back the text
      address that it represents.
      
      We can then use that in register_jprobe() to check that the entry point we're
      passed is actually in the kernel text, rather than just some random value.
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d7e3382
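The check added by the jprobes commit above (3d7e3382) boils down to a range test on the resolved entry address. The sketch below is a userspace illustration with hypothetical names and caller-supplied bounds; it is not the kernel's own text-address helper.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if addr lies inside the half-open range [text_start, text_end).
 * The bounds are passed in by the caller for this illustration.
 */
static inline bool addr_in_text(uintptr_t addr, uintptr_t text_start,
                                uintptr_t text_end)
{
    return addr >= text_start && addr < text_end;
}

/*
 * Hypothetical registration check: refuse an entry point that does not
 * resolve to an address inside the text range. Returns 0 or -1.
 */
static inline int check_entry_point(uintptr_t entry, uintptr_t text_start,
                                    uintptr_t text_end)
{
    return addr_in_text(entry, text_start, text_end) ? 0 : -1;
}
```

A string literal such as "jprobe_foo" lives in data, not text, so its address would fail this kind of test and the registration could be rejected instead of crashing later.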
    • jprobes: remove JPROBE_ENTRY() · 9e367d85
      Michael Ellerman committed
      AFAICT now that jprobe.entry is a void *, JPROBE_ENTRY doesn't do anything
      useful - so remove it ..
      
      I've left a do-nothing version so that out-of-tree jprobes code will still
      compile without modifications.
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9e367d85
    • jprobes: make struct jprobe.entry a void * · 81eae375
      Michael Ellerman committed
      Currently jprobe.entry is a kprobe_opcode_t *, but that's a lie.  On some
      platforms it doesn't point to an opcode at all, it points to a function
      descriptor.
      
      It's really a pointer to something that the arch code can turn into a function
      entry point.  And that's what actually happens, none of the generic code ever
      looks at jprobe.entry, it's only ever dereferenced by arch code.
      
      So just make it a void *.
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81eae375
    • readahead: sanify file_ra_state names · f9acc8c7
      Fengguang Wu committed
      Rename some file_ra_state variables and remove some accessors.
      
      It results in much simpler code.
      Kudos to Rusty!
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9acc8c7
    • readahead: split ondemand readahead interface into two functions · cf914a7d
      Rusty Russell committed
      Split ondemand readahead interface into two functions.  I think this makes it
      a little clearer for non-readahead experts (like Rusty).
      
      Internally they both call ondemand_readahead(), but the page argument is
      changed to an obvious boolean flag.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf914a7d
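The shape of the interface split in the commit above (cf914a7d) is two thin wrappers around one worker that takes a boolean. The sketch below uses made-up names and a counter struct in place of real readahead state; only the wrapper-plus-flag structure reflects the commit message.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping so the sketch has something observable. */
struct ra_counts {
    unsigned int sync_reads;
    unsigned int async_reads;
};

/* Shared worker: the flag replaces what used to be a page argument. */
static inline void ondemand_readahead_sketch(struct ra_counts *c,
                                             bool hit_lookahead_page)
{
    if (hit_lookahead_page)
        c->async_reads++; /* reading ahead of the reader, in the background */
    else
        c->sync_reads++;  /* the reader is waiting for this page */
}

/* Entry point for a missing page. */
static inline void sync_readahead_sketch(struct ra_counts *c)
{
    ondemand_readahead_sketch(c, false);
}

/* Entry point for a page marked as a lookahead trigger. */
static inline void async_readahead_sketch(struct ra_counts *c)
{
    ondemand_readahead_sketch(c, true);
}
```

Callers then pick the entry point that matches their situation by name, and nobody outside the worker has to know what the flag means.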
    • mm: share PG_readahead and PG_reclaim · fe3cba17
      Fengguang Wu committed
      Share the same page flag bit for PG_readahead and PG_reclaim.
      
      One is used only on file reads, the other only for emergency writes.  One
      is used mostly for fresh/young pages, the other for old pages.
      
      Combinations of possible interactions are:
      
      a) clear PG_reclaim => implicit clear of PG_readahead
      	it will delay an asynchronous readahead into a synchronous one
      	it actually does _good_ for readahead:
      		the pages will be reclaimed soon, it's readahead thrashing!
      		in this case, synchronous readahead makes more sense.
      
      b) clear PG_readahead => implicit clear of PG_reclaim
      	one(and only one) page will not be reclaimed in time
      	it can be avoided by checking PageWriteback(page) in readahead first
      
      c) set PG_reclaim => implicit set of PG_readahead
      	will confuse readahead and make it restart the size rampup process
      	it's a trivial problem, and can mostly be avoided by checking
      	PageWriteback(page) first in readahead
      
      d) set PG_readahead => implicit set of PG_reclaim
      	PG_readahead will never be set on already cached pages.
      	PG_reclaim will always be cleared on dirtying a page.
      	so not a problem.
      
      In summary,
      	a)   we get better behavior
      	b,d) possible interactions can be avoided
      	c)   a race condition exists that might affect readahead, but the chance
      	     is _really_ low, and the harm to readahead is trivial.
      
      Compound pages also use PG_reclaim, but for now they do not interact with
      reclaim/readahead code.
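      
      The implicit set/clear interactions listed above follow directly from
      the two names aliasing one bit.  A minimal userspace sketch, assuming a
      made-up bit number and hypothetical helper names (the kernel's own
      page-flag macros are not reproduced here):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit number; only the aliasing matters for this sketch. */
enum {
	PG_reclaim   = 5,
	PG_readahead = PG_reclaim,   /* shares the same bit */
};

/* Simplified stand-in for struct page: just the flags word. */
struct page {
	unsigned long flags;
};

void set_page_reclaim(struct page *p)     { p->flags |=  (1UL << PG_reclaim); }
void clear_page_reclaim(struct page *p)   { p->flags &= ~(1UL << PG_reclaim); }
bool page_reclaim(const struct page *p)   { return (p->flags >> PG_reclaim) & 1UL; }

void set_page_readahead(struct page *p)   { p->flags |=  (1UL << PG_readahead); }
void clear_page_readahead(struct page *p) { p->flags &= ~(1UL << PG_readahead); }
bool page_readahead(const struct page *p) { return (p->flags >> PG_readahead) & 1UL; }
```

      Setting the bit through one name makes the test through the other name
      succeed (cases c and d), and clearing it through one name clears it for
      the other (cases a and b).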
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe3cba17
    • F
      readahead: remove the old algorithm · c743d96b
      Fengguang Wu committed
      Remove the old readahead algorithm.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c743d96b
    • F
      readahead: on-demand readahead logic · 122a21d1
      Fengguang Wu committed
      This is a minimal readahead algorithm that aims to replace the current one.
      It is more flexible and reliable, while maintaining almost the same behavior
      and performance.  It is also fully integrated with adaptive readahead.
      
      It is designed to be called on demand:
      	- on a missing page, to do synchronous readahead
      	- on a lookahead page, to do asynchronous readahead
      
      In this way it eliminates the awkward workarounds for cache hit/miss,
      readahead thrashing, retried reads, and unaligned reads.  It also adopts the
      data structure introduced by adaptive readahead, parameterizes readahead
      pipelining with `lookahead_index', and reduces the current/ahead windows to
      a single window.
      
      HEURISTICS
      
      The logic deals with four cases:
      
      	- sequential-next
      		found a consistent readahead window, so push it forward
      
      	- random
      		standalone small read, so read as is
      
      	- sequential-first
      		create a new readahead window for a sequential/oversize request
      
      	- lookahead-clueless
      		hit a lookahead page not associated with the readahead window,
      		so create a new readahead window and ramp it up
      
      In each case, three parameters are determined:
      
      	- readahead index: where the next readahead begins
      	- readahead size:  how much to readahead
      	- lookahead size:  when to do the next readahead (for pipelining)
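      
      The four cases and three parameters above can be sketched as a small
      classifier.  This is an illustration under stated assumptions, not the
      code of this patch: the struct layout, the `sequential' test and all
      function names are hypothetical, and all quantities are in pages.

```c
#include <assert.h>
#include <stdbool.h>

/* The four cases named in the HEURISTICS list. */
enum ra_case {
	RA_SEQUENTIAL_NEXT,
	RA_RANDOM,
	RA_SEQUENTIAL_FIRST,
	RA_LOOKAHEAD_CLUELESS,
};

/* Hypothetical single readahead window. */
struct ra_window {
	unsigned long start;          /* first page of the window */
	unsigned long size;           /* readahead size */
	unsigned long lookahead_size; /* pages left unread when the next readahead fires */
	unsigned long prev_index;     /* last page touched by the previous read */
};

/* Readahead index: where the next readahead begins. */
unsigned long readahead_index(const struct ra_window *ra)
{
	return ra->start + ra->size;
}

/* Lookahead index: the page whose access triggers the next readahead. */
unsigned long lookahead_index(const struct ra_window *ra)
{
	assert(ra->lookahead_size <= ra->size);
	return ra->start + ra->size - ra->lookahead_size;
}

enum ra_case classify_read(const struct ra_window *ra, unsigned long offset,
			   unsigned long req_size, unsigned long max_size,
			   bool hit_lookahead_page)
{
	/* Assumed test: the read continues the previous one, or is oversize. */
	bool sequential = (offset - ra->prev_index <= 1UL) || (req_size > max_size);

	/* A consistent window exists and the read lands where it was expected. */
	if (ra->size && (offset == lookahead_index(ra) ||
			 offset == readahead_index(ra)))
		return RA_SEQUENTIAL_NEXT;

	/* A lookahead page was hit, but it does not belong to this window. */
	if (hit_lookahead_page)
		return RA_LOOKAHEAD_CLUELESS;

	/* Sequential or oversize request with no usable window yet. */
	if (sequential)
		return RA_SEQUENTIAL_FIRST;

	/* Standalone small read: read as is. */
	return RA_RANDOM;
}
```

      Each case would then choose its own readahead index, readahead size
      and lookahead size; that part is left out of the sketch.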
      
      BEHAVIORS
      
      The old behaviors are maximally preserved for trivial sequential/random reads.
      Notable changes are:
      
      	- It no longer imposes strict sequential checks.
      	  It might help some interleaved cases, and clustered random reads.
      	  It does introduce risks of a random lookahead hit triggering an
      	  unexpected readahead. But in general it is more likely to do good
      	  than to do evil.
      
      	- Interleaved reads are supported in a minimal way.
      	  Their chances of being detected and properly handled are still low.
      
      	- Readahead thrashing is better handled.
      	  The current readahead leads to tiny average I/O sizes, because it
      	  never turns back for the thrashed pages.  They have to be faulted in
      	  by do_generic_mapping_read() one by one, whereas the on-demand
      	  readahead will redo readahead for them.
      
      OVERHEADS
      
      The new code reduces the overheads of
      
      	- calling the readahead routine excessively on small reads
      	  (the current readahead code insists on seeing all requests)
      
      	- doing a lot of pointless page-cache lookups for small cached files
      	  (the current readahead only turns itself off after 256 cache hits;
      	  unfortunately most files are < 1MB, so they never get that chance)
      
      That accounts for speedups of
      	- 0.3% on 1-page sequential reads on a sparse file
      	- 1.2% on 1-page cache-hot sequential reads
      	- 3.2% on 256-page cache-hot sequential reads
      	- 1.3% on cache-hot `tar /lib`
      
      However, it does introduce one extra page-cache lookup per cache miss, which
      slightly slows random reads: a 1% overhead for 1-page random reads on a
      sparse file.
      
      PERFORMANCE
      
      The basic benchmark setup is
      	- 2.6.20 kernel with on-demand readahead
      	- 1MB max readahead size
      	- 2.9GHz Intel Core 2 CPU
      	- 2GB memory
      	- 160G/8M Hitachi SATA II 7200 RPM disk
      
      The benchmarks show that
      	- it maintains the same performance for trivial sequential/random reads
      	- sysbench/OLTP performance on MySQL gains up to 8%
      	- performance on readahead thrashing gains up to 3 times
      
      iozone throughput (KB/s): roughly the same
      ==========================================
      iozone -c -t1 -s 4096m -r 64k
      
      			       2.6.20          on-demand      gain
      first run
      	  "  Initial write "   61437.27        64521.53      +5.0%
      	  "        Rewrite "   47893.02        48335.20      +0.9%
      	  "           Read "   62111.84        62141.49      +0.0%
      	  "        Re-read "   62242.66        62193.17      -0.1%
      	  "   Reverse Read "   50031.46        49989.79      -0.1%
      	  "    Stride read "    8657.61         8652.81      -0.1%
      	  "    Random read "   13914.28        13898.23      -0.1%
      	  " Mixed workload "   19069.27        19033.32      -0.2%
      	  "   Random write "   14849.80        14104.38      -5.0%
      	  "         Pwrite "   62955.30        65701.57      +4.4%
      	  "          Pread "   62209.99        62256.26      +0.1%
      
      second run
      	  "  Initial write "   60810.31        66258.69      +9.0%
      	  "        Rewrite "   49373.89        57833.66     +17.1%
      	  "           Read "   62059.39        62251.28      +0.3%
      	  "        Re-read "   62264.32        62256.82      -0.0%
      	  "   Reverse Read "   49970.96        50565.72      +1.2%
      	  "    Stride read "    8654.81         8638.45      -0.2%
      	  "    Random read "   13901.44        13949.91      +0.3%
      	  " Mixed workload "   19041.32        19092.04      +0.3%
      	  "   Random write "   14019.99        14161.72      +1.0%
      	  "         Pwrite "   64121.67        68224.17      +6.4%
      	  "          Pread "   62225.08        62274.28      +0.1%
      
      In summary, writes are unstable, while reads are very close on average
      (each row below is the mean of the two runs):
      
      			  access pattern  2.6.20  on-demand   gain
      				   Read  62085.61  62196.38  +0.2%
      				Re-read  62253.49  62224.99  -0.0%
      			   Reverse Read  50001.21  50277.75  +0.6%
      			    Stride read   8656.21   8645.63  -0.1%
      			    Random read  13907.86  13924.07  +0.1%
      	 		 Mixed workload  19055.29  19062.68  +0.0%
      				  Pread  62217.53  62265.27  +0.1%
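      
      The summary rows above are consistent with averaging the two runs and
      taking the relative difference.  A small sketch of that arithmetic,
      with hypothetical helper names, checked against the "Read" row:

```c
#include <assert.h>

/* Mean of the first-run and second-run throughput (KB/s). */
double mean_of_two(double first_run, double second_run)
{
	return (first_run + second_run) / 2.0;
}

/* Relative gain of the candidate over the baseline, in percent. */
double gain_percent(double baseline, double candidate)
{
	assert(baseline > 0.0);
	return (candidate - baseline) / baseline * 100.0;
}

/* "Read" row: 2.6.20 versus on-demand, using the figures from the two runs. */
double read_row_gain(void)
{
	double base = mean_of_two(62111.84, 62059.39);      /* 62085.615 */
	double ondemand = mean_of_two(62141.49, 62251.28);  /* 62196.385 */

	return gain_percent(base, ondemand);  /* about 0.18, shown as +0.2% */
}
```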
      
      aio-stress: roughly the same
      ============================
      aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
      aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso
      
      					2.6.20      on-demand  delta
      			sequential	 92.57s      92.54s    -0.0%
      			random		311.87s     312.15s    +0.1%
      
      sysbench fileio: roughly the same
      =================================
      sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
      	 --file-total-size=4G --file-block-size=64K \
      	 --num-threads=001 --max-requests=10000 --max-time=900 run
      
      				threads    2.6.20   on-demand    delta
      		first run
      				      1   59.1974s    59.2262s  +0.0%
      				      2   58.0575s    58.2269s  +0.3%
      				      4   48.0545s    47.1164s  -2.0%
      				      8   41.0684s    41.2229s  +0.4%
      				     16   35.8817s    36.4448s  +1.6%
      				     32   32.6614s    32.8240s  +0.5%
      				     64   23.7601s    24.1481s  +1.6%
      				    128   24.3719s    23.8225s  -2.3%
      				    256   23.2366s    22.0488s  -5.1%
      
      		second run
      				      1   59.6720s    59.5671s  -0.2%
      				      8   41.5158s    41.9541s  +1.1%
      				     64   25.0200s    23.9634s  -4.2%
      				    256   22.5491s    20.9486s  -7.1%
      
      Note that the numbers are not very stable because of the writes.
      The overall performance is close when we sum all seconds up:
      
                      sum all up               495.046s    491.514s   -0.7%
      
      sysbench oltp (trans/sec): up to 8% gain
      ========================================
      sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
      	 --mysql-socket=/var/run/mysqld/mysqld.sock \
      	 --mysql-user=root --mysql-password=readahead \
      	 --num-threads=064 --max-requests=10000 --max-time=900 run
      
      	10000-transactions run
      				threads    2.6.20   on-demand    gain
      				      1     62.81       64.56   +2.8%
      				      2     67.97       70.93   +4.4%
      				      4     81.81       85.87   +5.0%
      				      8     94.60       97.89   +3.5%
      				     16     99.07      104.68   +5.7%
      				     32     95.93      104.28   +8.7%
      				     64     96.48      103.68   +7.5%
      	5000-transactions run
      				      1     48.21       48.65   +0.9%
      				      8     68.60       70.19   +2.3%
      				     64     70.57       74.72   +5.9%
      	2000-transactions run
      				      1     37.57       38.04   +1.3%
      				      2     38.43       38.99   +1.5%
      				      4     45.39       46.45   +2.3%
      				      8     51.64       52.36   +1.4%
      				     16     54.39       55.18   +1.5%
      				     32     52.13       54.49   +4.5%
      				     64     54.13       54.61   +0.9%
      
      Those are interesting results. Some investigation shows that
      	- MySQL accesses the db file non-uniformly: some parts are
      	  hotter than others.
      	- It mostly does 4-page random reads, and sometimes does two
      	  reads in a row; the latter one triggers a 16-page readahead.
      	- The on-demand readahead leaves many lookahead pages (flagged
      	  PG_readahead) behind. Many of them will be hit and trigger
      	  more readahead, which might save more seeks.
      	- Naturally, the readahead windows tend to lie in hot areas,
      	  and the lookahead pages in hot areas are more likely to be hit.
      	- The higher the overall read density, the larger the possible gain.
      
      That also explains the adaptive readahead tricks for clustered random reads.
      
      readahead thrashing: 3 times better
      ===================================
      We boot the kernel with "mem=128m single" and start one new 100KB/s stream
      every second, until there are 200 streams.
      
      			      max throughput     min avg I/O size
      		2.6.20:            5MB/s               16KB
      		on-demand:        15MB/s              140KB
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      122a21d1