1. 08 4月, 2014 16 次提交
    • M
      arm64: add early_ioremap support · bf4b558e
      Mark Salter 提交于
      Add support for early IO or memory mappings which are needed before the
      normal ioremap() is usable.  This also adds fixmap support for permanent
      fixed mappings such as that used by the earlyprintk device register
      region.
      Signed-off-by: NMark Salter <msalter@redhat.com>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf4b558e
    • M
      arm64: initialize pgprot info earlier in boot · 0bf757c7
      Mark Salter 提交于
      Presently, paging_init() calls init_mem_pgprot() to initialize pgprot
      values used by macros such as PAGE_KERNEL, PAGE_KERNEL_EXEC, etc.
      
      The new fixmap and early_ioremap support also needs to use these macros
      before paging_init() is called.  This patch moves the init_mem_pgprot()
      call out of paging_init() and into setup_arch() so that pgprot_default
      gets initialized in time for fixmap and early_ioremap.
      Signed-off-by: NMark Salter <msalter@redhat.com>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0bf757c7
    • M
      x86: use generic early_ioremap · 5b7c73e0
      Mark Salter 提交于
      Move x86 over to the generic early ioremap implementation.
      Signed-off-by: NMark Salter <msalter@redhat.com>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b7c73e0
    • D
      x86/mm: sparse warning fix for early_memremap · 6b550f6f
      Dave Young 提交于
      This patch series takes the common bits from the x86 early ioremap
      implementation and creates a generic implementation which may be used by
      other architectures.  The early ioremap interfaces are intended for
      situations where boot code needs to make temporary virtual mappings
      before the normal ioremap interfaces are available.  Typically, this
      means before paging_init() has run.
      
      This patch (of 6):
      
      There's a lot of sparse warnings for code like below: void *a =
      early_memremap(phys_addr, size);
      
      early_memremap intend to map kernel memory with ioremap facility, the
      return pointer should be a kernel ram pointer instead of iomem one.
      
      For making the function clearer and supressing sparse warnings this patch
      do below two things:
      1. cast to (__force void *) for the return value of early_memremap
      2. add early_memunmap function and pass (__force void __iomem *) to iounmap
      
      From Boris:
        "Ingo told me yesterday, it makes sense too.  I'd guess we can try it.
         FWIW, all callers of early_memremap use the memory they get remapped
         as normal memory so we should be safe"
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Signed-off-by: NMark Salter <msalter@redhat.com>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b550f6f
    • C
      percpu: add raw_cpu_ops · b3ca1c10
      Christoph Lameter 提交于
      The kernel has never been audited to ensure that this_cpu operations are
      consistently used throughout the kernel.  The code generated in many
      places can be improved through the use of this_cpu operations (which
      uses a segment register for relocation of per cpu offsets instead of
      performing address calculations).
      
      The patch set also addresses various consistency issues in general with
      the per cpu macros.
      
      A. The semantics of __this_cpu_ptr() differs from this_cpu_ptr only
         because checks are skipped. This is typically shown through a raw_
         prefix. So this patch set changes the places where __this_cpu_ptr()
         is used to raw_cpu_ptr().
      
      B. There has been the long term wish by some that __this_cpu operations
         would check for preemption. However, there are cases where preemption
         checks need to be skipped. This patch set adds raw_cpu operations that
         do not check for preemption and then adds preemption checks to the
         __this_cpu operations.
      
      C. The use of __get_cpu_var is always a reference to a percpu variable
         that can also be handled via a this_cpu operation. This patch set
         replaces all uses of __get_cpu_var with this_cpu operations.
      
      D. We can then use this_cpu RMW operations in various places replacing
         sequences of instructions by a single one.
      
      E. The use of this_cpu operations throughout will allow other arches than
         x86 to implement optimized references and RMV operations to work with
         per cpu local data.
      
      F. The use of this_cpu operations opens up the possibility to
         further optimize code that relies on synchronization through
         per cpu data.
      
      The patch set works in a couple of stages:
      
      I. Patch 1 adds the additional raw_cpu operations and raw_cpu_ptr().
          Also converts the existing __this_cpu_xx_# primitive in the x86
          code to raw_cpu_xx_#.
      
      II. Patch 2-4 use the raw_cpu operations in places that would give
           us false positives once they are enabled.
      
      III. Patch 5 adds preemption checks to __this_cpu operations to allow
          checking if preemption is properly disabled when these functions
          are used.
      
      IV. Patches 6-20 are patches that simply replace uses of __get_cpu_var
         with this_cpu_ptr. They do not depend on any changes to the percpu
         code. No preemption tests are skipped if they are applied.
      
      V. Patches 21-46 are conversion patches that use this_cpu operations
         in various kernel subsystems/drivers or arch code.
      
      VI.  Patches 47/48 (not included in this series) remove no longer used
          functions (__this_cpu_ptr and __get_cpu_var).  These should only be
          applied after all the conversion patches have made it and after we
          have done additional passes through the kernel to ensure that none of
          the uses of these functions remain.
      
      This patch (of 46):
      
      The patches following this one will add preemption checks to __this_cpu
      ops so we need to have an alternative way to use this_cpu operations
      without preemption checks.
      
      raw_cpu_ops will be the basis for all other ops since these will be the
      operations that do not implement any checks.
      
      Primitive operations are renamed by this patch from __this_cpu_xxx to
      raw_cpu_xxxx.
      
      Also change the uses of the x86 percpu primitives in preempt.h.
      These depend directly on asm/percpu.h (header #include nesting issue).
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Alex Shi <alex.shi@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bryan Wu <cooloney@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: David Daney <david.daney@cavium.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Hedi Berriche <hedi@sgi.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wim Van Sebroeck <wim@iguana.be>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3ca1c10
    • D
      asm/system.h: um: arch_align_stack() moved to asm/exec.h · cf7bc58f
      David Howells 提交于
      arch_align_stack() moved to asm/exec.h, so change the comment referring to
      asm/system.h which no longer exists.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf7bc58f
    • U
      Kconfig: rename HAS_IOPORT to HAS_IOPORT_MAP · ce816fa8
      Uwe Kleine-König 提交于
      If the renamed symbol is defined lib/iomap.c implements ioport_map and
      ioport_unmap and currently (nearly) all platforms define the port
      accessor functions outb/inb and friend unconditionally.  So
      HAS_IOPORT_MAP is the better name for this.
      
      Consequently NO_IOPORT is renamed to NO_IOPORT_MAP.
      
      The motivation for this change is to reintroduce a symbol HAS_IOPORT
      that signals if outb/int et al are available.  I will address that at
      least one merge window later though to keep surprises to a minimum and
      catch new introductions of (HAS|NO)_IOPORT.
      
      The changes in this commit were done using:
      
      	$ git grep -l -E '(NO|HAS)_IOPORT' | xargs perl -p -i -e 's/\b((?:CONFIG_)?(?:NO|HAS)_IOPORT)\b/$1_MAP/'
      Signed-off-by: NUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce816fa8
    • J
      x86: always define BUG() and HAVE_ARCH_BUG, even with !CONFIG_BUG · b06dd879
      Josh Triplett 提交于
      This ensures that BUG() always has a definition that causes a trap (via
      an undefined instruction), and that the compiler still recognizes the
      code following BUG() as unreachable, avoiding warnings that would
      otherwise appear (such as on non-void functions that don't return a
      value after BUG()).
      
      In addition to saving a few bytes over the generic infinite-loop
      implementation, this implementation traps rather than looping, which
      potentially allows for better error-recovery behavior (such as by
      rebooting).
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Reported-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b06dd879
    • J
      ppc: make PPC_BOOK3S_64 select IRQ_WORK · 527518f1
      Josh Triplett 提交于
      Fix breakage which will be exposed by the patch "kconfig: make allnoconfig
      disable options behind EMBEDDED and EXPERT".
      
      arch/powerpc/kernel/mce.c, compiled in for PPC_BOOK3S_64, calls
      functions only built when IRQ_WORK, so select it.  Fixes the following
      build error:
      
        arch/powerpc/kernel/built-in.o: In function `.machine_check_queue_event':
        (.text+0x11260): undefined reference to `.irq_work_queue'
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      527518f1
    • J
      ia64: select CONFIG_TTY for use of tty_write_message in unaligned · 6035d9db
      Josh Triplett 提交于
      Fix breakage which will be exposed by the patch "kconfig: make allnoconfig
      disable options behind EMBEDDED and EXPERT".
      
      arch/ia64/kernel/unaligned.c uses tty_write_message to print an
      unaligned access exception to the TTY of the current user process.
      Enable TTY to prevent a build error.
      
      Minimal fix, on the basis that few people on ia64 will care deeply about
      kernel size enough to turn off TTY.  Ideally, I'd instead suggest
      dropping the tty_write_message entirely, and just leaving the printk.
      Bonus: no need to sprintf first.
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6035d9db
    • G
      cris: cpuinfo_op should depend on CONFIG_PROC_FS · c638b107
      Geert Uytterhoeven 提交于
      Fix breakage which will be exposed by the patch "kconfig: make allnoconfig
      disable options behind EMBEDDED and EXPERT".
      
      Now allnoconfig started disabling CONFIG_PROC_FS:
      
          arch/cris/kernel/built-in.o:(.rodata+0xc): undefined reference to `show_cpuinfo'
          make: *** [vmlinux] Error 1
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c638b107
    • J
      cris: make ETRAX_ARCH_V10 select TTY for use in debugport · ae797bdf
      Josh Triplett 提交于
      Fix breakage which will be exposed by the patch "kconfig: make allnoconfig
      disable options behind EMBEDDED and EXPERT".
      
      arch/cris/arch-v10/kernel/debugport.c, compiled in unconditionally with
      ETRAX_ARCH_V10, requires TTY, so select TTY to avoid a build failure.
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae797bdf
    • A
      rapidio: rework device hierarchy and introduce mport class of devices · 2aaf308b
      Alexandre Bounine 提交于
      This patch removes an artificial RapidIO bus root device and establishes
      actual device hierarchy by providing reference to real parent devices.
      It also introduces device class for RapidIO controller devices (on-chip
      or an eternal bridge, known as "mport").
      
      Existing implementation was sufficient for SoC-based platforms that have
      a single RapidIO controller.  With introduction of devices using
      multiple RapidIO controllers and PCIe-to-RapidIO bridges the old scheme
      is very limiting or does not work at all.  The implemented changes allow
      to properly reference platform's local RapidIO mport devices and provide
      device details needed for upper layers.
      
      This change to RapidIO device hierarchy does not break any known
      existing kernel or user space interfaces.
      Signed-off-by: NAlexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
      Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
      Cc: Jerry Jacobs <jerry.jacobs@prodrive-technologies.com>
      Cc: Arno Tiemersma <arno.tiemersma@prodrive-technologies.com>
      Cc: Rob Landley <rob@landley.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2aaf308b
    • R
      include/linux/crash_dump.h: add vmcore_cleanup() prototype · 82e0703b
      Rashika Kheria 提交于
      Eliminate the following warning in proc/vmcore.c:
      
        fs/proc/vmcore.c:1088:6: warning: no previous prototype for `vmcore_cleanup' [-Wmissing-prototypes]
      
      [akpm@linux-foundation.org: clean up powerpc, remove unneeded EXPORT_SYMBOL]
      Signed-off-by: NRashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82e0703b
    • D
      mm: per-thread vma caching · 615d6e87
      Davidlohr Bueso 提交于
      This patch is a continuation of efforts trying to optimize find_vma(),
      avoiding potentially expensive rbtree walks to locate a vma upon faults.
      The original approach (https://lkml.org/lkml/2013/11/1/410), where the
      largest vma was also cached, ended up being too specific and random,
      thus further comparison with other approaches were needed.  There are
      two things to consider when dealing with this, the cache hit rate and
      the latency of find_vma().  Improving the hit-rate does not necessarily
      translate in finding the vma any faster, as the overhead of any fancy
      caching schemes can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question.  Concretely,
      the following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme does improve ~50% hit rate by just adding a few more slots to
         the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while baseline is just
         about non-existent.  The amounts of cycles can fluctuate between
         anywhere from ~60 to ~116 for the baseline scheme, but this approach
         reduces it considerably.  For instance, with 80 threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NMichel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      615d6e87
    • A
      mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags" · 1e1836e8
      Alex Thorlton 提交于
      The main motivation behind this patch is to provide a way to disable THP
      for jobs where the code cannot be modified, and using a malloc hook with
      madvise is not an option (i.e.  statically allocated data).  This patch
      allows us to do just that, without affecting other jobs running on the
      system.
      
      We need to do this sort of thing for jobs where THP hurts performance,
      due to the possibility of increased remote memory accesses that can be
      created by situations such as the following:
      
      When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
      be handed out, and the THP will be stuck on whatever node the chunk was
      originally referenced from.  If many remote nodes need to do work on
      that same chunk, they'll be making remote accesses.
      
      With THP disabled, 4K pages can be handed out to separate nodes as
      they're needed, greatly reducing the amount of remote accesses to
      memory.
      
      This patch is based on some of my work combined with some
      suggestions/patches given by Oleg Nesterov.  The main goal here is to
      add a prctl switch to allow us to disable to THP on a per mm_struct
      basis.
      
      Here's a bit of test data with the new patch in place...
      
      First with the flag unset:
      
        # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
        Setting thp_disabled for this task...
        thp_disable: 0
        Set thp_disabled state to 0
        Process pid = 18027
      
                                                                                                                             PF/
                                        MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
        TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
         512      1.120      0.060      0.000    0.110      0.110     0.000    28571428864 -9223372036854775808  55803572      23
      
         Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':
      
          273719072.841402 task-clock                #  641.026 CPUs utilized           [100.00%]
                 1,008,986 context-switches          #    0.000 M/sec                   [100.00%]
                     7,717 CPU-migrations            #    0.000 M/sec                   [100.00%]
                 1,698,932 page-faults               #    0.000 M/sec
        355,222,544,890,379 cycles                   #    1.298 GHz                     [100.00%]
        536,445,412,234,588 stalled-cycles-frontend  #  151.02% frontend cycles idle    [100.00%]
        409,110,531,310,223 stalled-cycles-backend   #  115.17% backend  cycles idle    [100.00%]
        148,286,797,266,411 instructions             #    0.42  insns per cycle
                                                     #    3.62  stalled cycles per insn [100.00%]
        27,061,793,159,503 branches                  #   98.867 M/sec                   [100.00%]
             1,188,655,196 branch-misses             #    0.00% of all branches
      
             427.001706337 seconds time elapsed
      
      Now with the flag set:
      
        # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
        Setting thp_disabled for this task...
        thp_disable: 1
        Set thp_disabled state to 1
        Process pid = 144957
      
                                                                                                                             PF/
                                        MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
        TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
         512      0.620      0.260      0.250    0.320      0.570     0.001    51612901376 128000000000 100806448      23
      
         Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':
      
          138789390.540183 task-clock                #  641.959 CPUs utilized           [100.00%]
                   534,205 context-switches          #    0.000 M/sec                   [100.00%]
                     4,595 CPU-migrations            #    0.000 M/sec                   [100.00%]
                63,133,119 page-faults               #    0.000 M/sec
        147,977,747,269,768 cycles                   #    1.066 GHz                     [100.00%]
        200,524,196,493,108 stalled-cycles-frontend  #  135.51% frontend cycles idle    [100.00%]
        105,175,163,716,388 stalled-cycles-backend   #   71.07% backend  cycles idle    [100.00%]
        180,916,213,503,160 instructions             #    1.22  insns per cycle
                                                     #    1.11  stalled cycles per insn [100.00%]
        26,999,511,005,868 branches                  #  194.536 M/sec                   [100.00%]
               714,066,351 branch-misses             #    0.00% of all branches
      
             216.196778807 seconds time elapsed
      
      As with previous versions of the patch, We're getting about a 2x
      performance increase here.  Here's a link to the test case I used, along
      with the little wrapper to activate the flag:
      
        http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz
      
      This patch (of 3):
      
      Revert commit 8e72033f and add in code to fix up any issues caused
      by the revert.
      
      The revert is necessary because hugepage_madvise would return -EINVAL
      when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
      patch set.
      
      Here's a snip of an e-mail from Gerald detailing the original purpose of
      this code, and providing justification for the revert:
      
        "The intent of commit 8e72033f was to guard against any future
         programming errors that may result in an madvice(MADV_HUGEPAGE) on
         guest mappings, which would crash the kernel.
      
         Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
         8e72033f was to be reverted, because that check will also prevent
         a kernel crash in the case described above, it will now send a
         SIGSEGV instead.
      
         This would now also allow to do the madvise on other parts, if
         needed, so it is a more flexible approach.  One could also say that
         it would have been better to do it this way right from the
         beginning..."
      Signed-off-by: NAlex Thorlton <athorlton@sgi.com>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Tested-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e1836e8
  2. 05 4月, 2014 22 次提交
  3. 04 4月, 2014 2 次提交