1. 13 3月, 2010 4 次提交
    • F
      pci-dma: add linux/pci-dma.h to linux/pci.h · f41b1771
      FUJITA Tomonori 提交于
      All the architectures properly set NEED_DMA_MAP_STATE now so we can safely
      add linux/pci-dma.h to linux/pci.h and remove the linux/pci-dma.h
      inclusion in arch's asm/pci.h
      Signed-off-by: NFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f41b1771
    • F
      pci-dma: ia64: use include/linux/pci-dma.h · 66ed5ef8
      FUJITA Tomonori 提交于
      Signed-off-by: NFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      66ed5ef8
    • C
      ptrace: move user_enable_single_step & co prototypes to linux/ptrace.h · dacbe41f
      Christoph Hellwig 提交于
      While in theory user_enable_single_step/user_disable_single_step/
      user_enable_blockstep could also be provided as an inline or macro there's
      no good reason to do so, and having the prototype in one places keeps code
      size and confusion down.
      
      Roland said:
      
        The original thought there was that user_enable_single_step() et al
        might well be only an instruction or three on a sane machine (as if we
        have any of those!), and since there is only one call site inlining
        would be beneficial.  But I agree that there is no strong reason to care
        about inlining it.
      
        As to the arch changes, there is only one thought I'd add to the
        record.  It was always my thinking that for an arch where
        PTRACE_SINGLESTEP does text-modifying breakpoint insertion,
        user_enable_single_step() should not be provided.  That is,
        arch_has_single_step()=>true means that there is an arch facility with
        "pure" semantics that does not have any unexpected side effects.
        Inserting a breakpoint might do very unexpected strange things in
        multi-threaded situations.  Aside from that, it is a peculiar side
        effect that user_{enable,disable}_single_step() should cause COW
        de-sharing of text pages and so forth.  For PTRACE_SINGLESTEP, all these
        peculiarities are the status quo ante for that arch, so having
        arch_ptrace() itself do those is one thing.  But for building other
        things in the future, it is nicer to have a uniform "pure" semantics
        that arch-independent code can expect.
      
        OTOH, all such arch issues are really up to the arch maintainer.  As
        of today, there is nothing but ptrace using user_enable_single_step() et
        al so it's a distinction without a practical difference.  If/when there
        are other facilities that use user_enable_single_step() and might care,
        the affected arch's can revisit the question when someone cares about
        the quality of the arch support for said new facility.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dacbe41f
    • C
      improve sys_newuname() for compat architectures · e28cbf22
      Christoph Hellwig 提交于
      On an architecture that supports 32-bit compat we need to override the
      reported machine in uname with the 32-bit value.  Instead of doing this
      separately in every architecture introduce a COMPAT_UTS_MACHINE define in
      <asm/compat.h> and apply it directly in sys_newuname().
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Andreas Schwab <schwab@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e28cbf22
  2. 08 3月, 2010 1 次提交
  3. 07 3月, 2010 3 次提交
    • D
      elf coredump: add extended numbering support · 8d9032bb
      Daisuke HATAYAMA 提交于
      The current ELF dumper implementation can produce broken corefiles if
      program headers exceed 65535.  This number is determined by the number of
      vmas which the process have.  In particular, some extreme programs may use
      more than 65535 vmas.  (If you google max_map_count, you can find some
      users facing this problem.) This kind of program never be able to generate
      correct coredumps.
      
      This patch implements ``extended numbering'' that uses sh_info field of
      the first section header instead of e_phnum field in order to represent
      upto 4294967295 vmas.
      
      This is supported by
      AMD64-ABI(http://www.x86-64.org/documentation.html) and
      Solaris(http://docs.sun.com/app/docs/doc/817-1984/).
      Of course, we are preparing patches for gdb and binutils.
      Signed-off-by: NDaisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d9032bb
    • D
      elf coredump: replace ELF_CORE_EXTRA_* macros by functions · 1fcccbac
      Daisuke HATAYAMA 提交于
      elf_core_dump() and elf_fdpic_core_dump() use #ifdef and the corresponding
      macro for hiding _multiline_ logics in functions.  This patch removes
      #ifdef and replaces ELF_CORE_EXTRA_* by corresponding functions.  For
      architectures not implemeonting ELF_CORE_EXTRA_*, we use weak functions in
      order to reduce a range of modification.
      
      This cleanup is for my next patches, but I think this cleanup itself is
      worth doing regardless of my firnal purpose.
      Signed-off-by: NDaisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1fcccbac
    • R
      mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
      Rik van Riel 提交于
      The old anon_vma code can lead to scalability issues with heavily forking
      workloads.  Specifically, each anon_vma will be shared between the parent
      process and all its child processes.
      
      In a workload with 1000 child processes and a VMA with 1000 anonymous
      pages per process that get COWed, this leads to a system with a million
      anonymous pages in the same anon_vma, each of which is mapped in just one
      of the 1000 processes.  However, the current rmap code needs to walk them
      all, leading to O(N) scanning complexity for each page.
      
      This can result in systems where one CPU is walking the page tables of
      1000 processes in page_referenced_one, while all other CPUs are stuck on
      the anon_vma lock.  This leads to catastrophic failure for a benchmark
      like AIM7, where the total number of processes can reach in the tens of
      thousands.  Real workloads are still a factor 10 less process intensive
      than AIM7, but they are catching up.
      
      This patch changes the way anon_vmas and VMAs are linked, which allows us
      to associate multiple anon_vmas with a VMA.  At fork time, each child
      process gets its own anon_vmas, in which its COWed pages will be
      instantiated.  The parents' anon_vma is also linked to the VMA, because
      non-COWed pages could be present in any of the children.
      
      This reduces rmap scanning complexity to O(1) for the pages of the 1000
      child processes, with O(N) complexity for at most 1/N pages in the system.
       This reduces the average scanning cost in heavily forking workloads from
      O(N) to 2.
      
      The only real complexity in this patch stems from the fact that linking a
      VMA to anon_vmas now involves memory allocations.  This means vma_adjust
      can fail, if it needs to attach a VMA to anon_vma structures.  This in
      turn means error handling needs to be added to the calling functions.
      
      A second source of complexity is that, because there can be multiple
      anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
      "the" anon_vma lock.  To prevent the rmap code from walking up an
      incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
      flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
      to make sure it is impossible to compile a kernel that needs both symbolic
      values for the same bitflag.
      
      Some test results:
      
      Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
      box with 16GB RAM and not quite enough IO), the system ends up running
      >99% in system time, with every CPU on the same anon_vma lock in the
      pageout code.
      
      With these changes, AIM7 hits the cross-over point around 29.7k users.
      This happens with ~99% IO wait time, there never seems to be any spike in
      system time.  The anon_vma lock contention appears to be resolved.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5beb4930
  4. 01 3月, 2010 11 次提交
  5. 27 2月, 2010 3 次提交
  6. 26 2月, 2010 2 次提交
    • A
      [IA64] build arch/ia64/kernel/acpi-ext.o when CONFIG_ACPI · e72aca30
      Alex Chiang 提交于
      Simplify the makefile slightly by always building acpi-ext.o when
      CONFIG_ACPI is turned on.
      
      Yes, this adds a little bloat to the other configs, but not much:
         text	   data	    bss	    dec	    hex	filename
          839	     41	      0	    880	    370	arch/ia64/kernel/acpi-ext.o
      
      Before:
         text	   data	    bss	    dec		    hex	filename
      10952753	1299212	1334241	13586206	 cf4f1e	vmlinux
      
      After:
         text	   data	    bss	    dec		    hex	filename
      10953739	1299084	1334241	13587064	 cf5278	vmlinux
      
      (gdb) p 13587064 - 13586206
      $2 = 858
      
      Seems like a small price to pay for the benefit of not having to think
      so hard about the multitude of ia64 configs when reading code/Makefiles.
      Signed-off-by: NAlex Chiang <achiang@hp.com>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      e72aca30
    • A
      [IA64] Only build arch/ia64/kernel/acpi.o when CONFIG_ACPI · d868080d
      Alex Chiang 提交于
      The following commit broke the ia64 sim_defconfig build:
      	3b2b84c0b81108a9a869a88bf2beeb5a95d81dd1
      	ACPI: processor: driver doesn't need to evaluate _PDC
      
      This is because it added:
      	+#include <acpi/processor.h>
      
      To arch/ia64/kernel/acpi.c. Unfortunately, the ia64_simdefconfig does
      not turn on CONFIG_ACPI, and we get build errors.
      
      The fix described in $subject seems to be the most sensible way to
      untangle the mess.
      
      The other issue is that acpi_get_sysname() is required for all configs,
      most of which define CONFIG_ACPI, but are not CONFIG_IA64_GENERIC. Turn
      it into an inline to cover the "non generic" ia64 configs; to prevent
      a duplicate definition build error, we need to wrap the definition in
      acpi.o inside an #ifdef.
      
      Finally, move the pm_idle and pm_power_off exports into process.c (which
      is always built), similar to other architectures, and allow the sim
      defconfig to link.
      Signed-off-by: NAlex Chiang <achiang@hp.com>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      d868080d
  7. 24 2月, 2010 4 次提交
  8. 23 2月, 2010 2 次提交
  9. 21 2月, 2010 1 次提交
    • R
      MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself · 4b3073e1
      Russell King 提交于
      On VIVT ARM, when we have multiple shared mappings of the same file
      in the same MM, we need to ensure that we have coherency across all
      copies.  We do this via make_coherent() by making the pages
      uncacheable.
      
      This used to work fine, until we allowed highmem with highpte - we
      now have a page table which is mapped as required, and is not available
      for modification via update_mmu_cache().
      
      Ralf Beache suggested getting rid of the PTE value passed to
      update_mmu_cache():
      
        On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
        to construct a pointer to the pte again.  Passing a pte_t * is much
        more elegant.  Maybe we might even replace the pte argument with the
        pte_t?
      
      Ben Herrenschmidt would also like the pte pointer for PowerPC:
      
        Passing the ptep in there is exactly what I want.  I want that
        -instead- of the PTE value, because I have issue on some ppc cases,
        for I$/D$ coherency, where set_pte_at() may decide to mask out the
        _PAGE_EXEC.
      
      So, pass in the mapped page table pointer into update_mmu_cache(), and
      remove the PTE value, updating all implementations and call sites to
      suit.
      
      Includes a fix from Stephen Rothwell:
      
        sparc: fix fallout from update_mmu_cache API change
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      4b3073e1
  10. 19 2月, 2010 1 次提交
  11. 18 2月, 2010 2 次提交
  12. 13 2月, 2010 1 次提交
    • T
      [IA64] preserve personality flag bits across exec · 22208ac5
      Tony Luck 提交于
      In its <asm/elf.h> ia64 defines SET_PERSONALITY in a way that unconditionally
      sets the personality of the current process to PER_LINUX, losing any flag bits
      from the upper 3 bytes of current->personality.  This is wrong. Those bits are
      intended to be inherited across exec (other code takes care of ensuring that
      security sensitive bits like ADDR_NO_RANDOMIZE are not passed to unsuspecting
      setuid/setgid applications).
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      22208ac5
  13. 09 2月, 2010 1 次提交
    • T
      [IA64] Remove COMPAT_IA32 support · 32974ad4
      Tony Luck 提交于
      This has been broken since May 2008 when Al Viro killed altroot support.
      Since nobody has complained, it would appear that there are no users of
      this code (A plausible theory since the main OSVs that support ia64 prefer
      to use the IA32-EL software emulation).
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      32974ad4
  14. 05 2月, 2010 1 次提交
  15. 04 2月, 2010 1 次提交
    • M
      kprobes: Disable booster when CONFIG_PREEMPT=y · 615d0ebb
      Masami Hiramatsu 提交于
      Disable kprobe booster when CONFIG_PREEMPT=y at this time,
      because it can't ensure that all kernel threads preempted on
      kprobe's boosted slot run out from the slot even using
      freeze_processes().
      
      The booster on preemptive kernel will be resumed if
      synchronize_tasks() or something like that is introduced.
      Signed-off-by: NMasami Hiramatsu <mhiramat@redhat.com>
      Cc: systemtap <systemtap@sources.redhat.com>
      Cc: DLE <dle-develop@lists.sourceforge.net>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      LKML-Reference: <20100202214904.4694.24330.stgit@dhcp-100-2-132.bos.redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      615d0ebb
  16. 28 1月, 2010 1 次提交
  17. 15 1月, 2010 1 次提交
    • M
      vhost_net: a kernel-level virtio server · 3a4d5c94
      Michael S. Tsirkin 提交于
      What it is: vhost net is a character device that can be used to reduce
      the number of system calls involved in virtio networking.
      Existing virtio net code is used in the guest without modification.
      
      There's similarity with vringfd, with some differences and reduced scope
      - uses eventfd for signalling
      - structures can be moved around in memory at any time (good for
        migration, bug work-arounds in userspace)
      - write logging is supported (good for migration)
      - support memory table and not just an offset (needed for kvm)
      
      common virtio related code has been put in a separate file vhost.c and
      can be made into a separate module if/when more backends appear.  I used
      Rusty's lguest.c as the source for developing this part : this supplied
      me with witty comments I wouldn't be able to write myself.
      
      What it is not: vhost net is not a bus, and not a generic new system
      call. No assumptions are made on how guest performs hypercalls.
      Userspace hypervisors are supported as well as kvm.
      
      How it works: Basically, we connect virtio frontend (configured by
      userspace) to a backend. The backend could be a network device, or a tap
      device.  Backend is also configured by userspace, including vlan/mac
      etc.
      
      Status: This works for me, and I haven't see any crashes.
      Compared to userspace, people reported improved latency (as I save up to
      4 system calls per packet), as well as better bandwidth and CPU
      utilization.
      
      Features that I plan to look at in the future:
      - mergeable buffers
      - zero copy
      - scalability tuning: figure out the best threading model to use
      
      Note on RCU usage (this is also documented in vhost.h, near
      private_pointer which is the value protected by this variant of RCU):
      what is happening is that the rcu_dereference() is being used in a
      workqueue item.  The role of rcu_read_lock() is taken on by the start of
      execution of the workqueue item, of rcu_read_unlock() by the end of
      execution of the workqueue item, and of synchronize_rcu() by
      flush_workqueue()/flush_work(). In the future we might need to apply
      some gcc attribute or sparse annotation to the function passed to
      INIT_WORK(). Paul's ack below is for this RCU usage.
      
      (Includes fixes by Alan Cox <alan@linux.intel.com>,
      David L Stevens <dlstevens@us.ibm.com>,
      Chris Wright <chrisw@redhat.com>)
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: N"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a4d5c94