1. 31 May 2014, 1 commit
    • x86_64: expand kernel stack to 16K · 6538b8ea
      Minchan Kim authored
      While I was testing in-house patches under heavy memory pressure on
      qemu-kvm, the 3.14 kernel crashed randomly. The reason was kernel
      stack overflow.

      When I investigated the problem, the call stack was a little deeper
      because reclaim functions got involved, though not via the direct
      reclaim path.

      I tried to slim down the stack usage of some functions related to
      alloc/reclaim and shaved off a few hundred bytes, but the overflow
      didn't disappear; I just hit it again via another, deeper call stack
      on the reclaim/allocator path.

      Of course, we could sweep every site we have found to reduce stack
      usage, but I'm not sure how long that would save us (surely, lots of
      developers will keep adding nice features that use the stack again),
      and if we consider more complex features in the I/O layer and/or the
      reclaim path, it may be better to increase the stack size. (Meanwhile,
      stack usage on 64-bit machines has doubled compared to 32-bit while
      the stack size has stayed at 8K. That hardly seems fair, and arm64
      has already expanded to 16K.)

      So my admittedly blunt idea is: let's expand the stack size and keep
      an eye on the stack consumption of each kernel function via ftrace's
      stack tracing. For example, we could set a bar that each function
      shouldn't exceed, say, 200 bytes of stack, and emit a warning at
      runtime when some function consumes more. Of course, this could
      produce false positives, but at least it would give us a chance to
      think it over.
      
      I guess this topic has been discussed several times before, so there
      may be a strong reason not to increase the kernel stack size on
      x86_64 that I don't know about; hence I'm CCing the x86_64
      maintainers, other MM folks, and the virtio maintainers.
      
      Here's an example call trace using up the kernel stack:
      
               Depth    Size   Location    (51 entries)
               -----    ----   --------
         0)     7696      16   lookup_address
         1)     7680      16   _lookup_address_cpa.isra.3
         2)     7664      24   __change_page_attr_set_clr
         3)     7640     392   kernel_map_pages
         4)     7248     256   get_page_from_freelist
         5)     6992     352   __alloc_pages_nodemask
         6)     6640       8   alloc_pages_current
         7)     6632     168   new_slab
         8)     6464       8   __slab_alloc
         9)     6456      80   __kmalloc
        10)     6376     376   vring_add_indirect
        11)     6000     144   virtqueue_add_sgs
        12)     5856     288   __virtblk_add_req
        13)     5568      96   virtio_queue_rq
        14)     5472     128   __blk_mq_run_hw_queue
        15)     5344      16   blk_mq_run_hw_queue
        16)     5328      96   blk_mq_insert_requests
        17)     5232     112   blk_mq_flush_plug_list
        18)     5120     112   blk_flush_plug_list
        19)     5008      64   io_schedule_timeout
        20)     4944     128   mempool_alloc
        21)     4816      96   bio_alloc_bioset
        22)     4720      48   get_swap_bio
        23)     4672     160   __swap_writepage
        24)     4512      32   swap_writepage
        25)     4480     320   shrink_page_list
        26)     4160     208   shrink_inactive_list
        27)     3952     304   shrink_lruvec
        28)     3648      80   shrink_zone
        29)     3568     128   do_try_to_free_pages
        30)     3440     208   try_to_free_pages
        31)     3232     352   __alloc_pages_nodemask
        32)     2880       8   alloc_pages_current
        33)     2872     200   __page_cache_alloc
        34)     2672      80   find_or_create_page
        35)     2592      80   ext4_mb_load_buddy
        36)     2512     176   ext4_mb_regular_allocator
        37)     2336     128   ext4_mb_new_blocks
        38)     2208     256   ext4_ext_map_blocks
        39)     1952     160   ext4_map_blocks
        40)     1792     384   ext4_writepages
        41)     1408      16   do_writepages
        42)     1392      96   __writeback_single_inode
        43)     1296     176   writeback_sb_inodes
        44)     1120      80   __writeback_inodes_wb
        45)     1040     160   wb_writeback
        46)      880     208   bdi_writeback_workfn
        47)      672     144   process_one_work
        48)      528     112   worker_thread
        49)      416     240   kthread
        50)      176     176   ret_from_fork
      
      [ Note: the problem is exacerbated by certain gcc versions that seem to
        generate much bigger stack frames due to apparently bad coalescing of
        temporaries and generating too many spills.  Rusty saw gcc-4.6.4 using
        35% more stack on the virtio path than 4.8.2 does, for example.
      
        Minchan not only uses such a bad gcc version (4.6.3 in his case), but
        some of the stack use is due to debugging (CONFIG_DEBUG_PAGEALLOC is
        what causes that kernel_map_pages() frame, for example). But we're
        clearly getting too close.
      
        The VM code also seems to have excessive stack frames partly for the
        same compiler reason, triggered by excessive inlining and lots of
        function arguments.
      
        We need to improve on our stack use, but in the meantime let's do this
        simple stack increase too.  Unlike most earlier reports, there is
        nothing simple that stands out as being really horribly wrong here,
        apart from the fact that the stack frames are just bigger than they
        should need to be.        - Linus ]
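      The change itself is a one-liner. A sketch of the result, with the
      macro and file names taken from the x86_64 headers of this era (treat
      the exact hunk as an assumption rather than the literal diff):

        /* arch/x86/include/asm/page_64_types.h (sketch): the order goes
         * from 1 (two 4K pages, an 8K stack) to 2 (four 4K pages, 16K). */
        #define THREAD_SIZE_ORDER       2
        #define THREAD_SIZE     (PAGE_SIZE << THREAD_SIZE_ORDER)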
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michael S Tsirkin <mst@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: PJ Waskiewicz <pjwaskiewicz@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2. 14 May 2014, 1 commit
3. 08 May 2014, 1 commit
4. 24 April 2014, 1 commit
    • arm: xen: implement multicall hypercall support. · 5e40704e
      Ian Campbell authored
      As part of this, make the usual change to xen_ulong_t in place of
      unsigned long. This change has no impact on x86.
      
      The Linux definition of struct multicall_entry.result differs from the Xen
      definition, I think for good reasons, and used a long rather than an unsigned
      long. Therefore introduce a xen_long_t, which is a long on x86 architectures
      and a signed 64-bit integer on ARM.
      
      Use uint32_t nr_calls on x86 for consistency with the ARM definition.

      Build tested on amd64 and i386; runtime tested on ARM.
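      A sketch of the new type under the rules above (the conditional
      structure and the struct layout are assumptions; only the typedef
      rule itself comes from this description):

        /* xen_long_t: plain long on x86, signed 64-bit on ARM. */
        #if defined(CONFIG_X86)
        typedef long xen_long_t;
        #elif defined(CONFIG_ARM) || defined(CONFIG_ARM64)
        typedef int64_t xen_long_t;
        #endif

        /* Layout is an assumption; the point is the signed result field. */
        struct multicall_entry {
                xen_ulong_t op;
                xen_long_t result;
                xen_ulong_t args[6];
        };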
      Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
5. 15 April 2014, 1 commit
6. 08 April 2014, 4 commits
    • x86: use generic early_ioremap · 5b7c73e0
      Mark Salter authored
      Move x86 over to the generic early ioremap implementation.
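      For context, the interfaces being switched to exist for short-lived
      boot-time mappings. A usage sketch (not code from this patch;
      read_early_reg() is hypothetical):

        /* Hypothetical early-boot helper: map, read, unmap, all before
         * paging_init() makes the normal ioremap interfaces available. */
        static u32 __init read_early_reg(resource_size_t phys)
        {
                void __iomem *p = early_ioremap(phys, 4);
                u32 val = readl(p);

                early_iounmap(p, 4);
                return val;
        }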
      Signed-off-by: Mark Salter <msalter@redhat.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/mm: sparse warning fix for early_memremap · 6b550f6f
      Dave Young authored
      This patch series takes the common bits from the x86 early ioremap
      implementation and creates a generic implementation which may be used by
      other architectures.  The early ioremap interfaces are intended for
      situations where boot code needs to make temporary virtual mappings
      before the normal ioremap interfaces are available.  Typically, this
      means before paging_init() has run.
      
      This patch (of 6):
      
      There are a lot of sparse warnings for code like the below:

        void *a = early_memremap(phys_addr, size);

      early_memremap is intended to map kernel memory with the ioremap
      facility; the returned pointer should be a kernel RAM pointer, not
      an iomem one.
      
      To make the function clearer and suppress the sparse warnings, this
      patch does two things (sketched after the quote below):
      1. cast the return value of early_memremap to (__force void *)
      2. add an early_memunmap function and pass (__force void __iomem *)
         to iounmap
      
      From Boris:
        "Ingo told me yesterday, it makes sense too.  I'd guess we can try it.
         FWIW, all callers of early_memremap use the memory they get remapped
         as normal memory so we should be safe"
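      A sketch of both changes, assuming the unmap side goes through
      early_iounmap (the exact function bodies are an assumption based on
      the description above):

        /* 1: the mapping is normal RAM, so force away the __iomem marker. */
        void __init *early_memremap(resource_size_t phys_addr,
                                    unsigned long size)
        {
                return (__force void *)early_ioremap(phys_addr, size);
        }

        /* 2: cast back when handing the pointer to the iomem unmap path. */
        void __init early_memunmap(void *addr, unsigned long size)
        {
                early_iounmap((__force void __iomem *)addr, size);
        }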
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Signed-off-by: Mark Salter <msalter@redhat.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • percpu: add raw_cpu_ops · b3ca1c10
      Christoph Lameter authored
      The kernel has never been audited to ensure that this_cpu operations
      are consistently used throughout the kernel.  The code generated in
      many places can be improved through the use of this_cpu operations
      (which use a segment register for relocation of per-cpu offsets
      instead of performing address calculations).
      
      The patch set also addresses various consistency issues in general with
      the per cpu macros.
      
      A. The semantics of __this_cpu_ptr() differ from this_cpu_ptr() only
         in that checks are skipped. Skipped checks are conventionally
         indicated by a raw_ prefix, so this patch set changes the places
         where __this_cpu_ptr() is used to raw_cpu_ptr().
      
      B. There has been the long term wish by some that __this_cpu operations
         would check for preemption. However, there are cases where preemption
         checks need to be skipped. This patch set adds raw_cpu operations that
         do not check for preemption and then adds preemption checks to the
         __this_cpu operations.
      
      C. The use of __get_cpu_var is always a reference to a percpu variable
         that can also be handled via a this_cpu operation. This patch set
         replaces all uses of __get_cpu_var with this_cpu operations.
      
      D. We can then use this_cpu RMW operations in various places replacing
         sequences of instructions by a single one.
      
      E. The use of this_cpu operations throughout will allow arches other
         than x86 to implement optimized references and RMW operations to
         work with per-cpu local data.
      
      F. The use of this_cpu operations opens up the possibility to
         further optimize code that relies on synchronization through
         per cpu data.
      
      The patch set works in a couple of stages:
      
      I. Patch 1 adds the additional raw_cpu operations and raw_cpu_ptr().
          Also converts the existing __this_cpu_xx_# primitive in the x86
          code to raw_cpu_xx_#.
      
      II. Patch 2-4 use the raw_cpu operations in places that would give
           us false positives once they are enabled.
      
      III. Patch 5 adds preemption checks to __this_cpu operations to allow
          checking if preemption is properly disabled when these functions
          are used.
      
      IV. Patches 6-20 are patches that simply replace uses of __get_cpu_var
         with this_cpu_ptr. They do not depend on any changes to the percpu
         code. No preemption tests are skipped if they are applied.
      
      V. Patches 21-46 are conversion patches that use this_cpu operations
         in various kernel subsystems/drivers or arch code.
      
      VI.  Patches 47/48 (not included in this series) remove no longer used
          functions (__this_cpu_ptr and __get_cpu_var).  These should only be
          applied after all the conversion patches have made it and after we
          have done additional passes through the kernel to ensure that none of
          the uses of these functions remain.
      
      This patch (of 46):
      
      The patches following this one will add preemption checks to __this_cpu
      ops so we need to have an alternative way to use this_cpu operations
      without preemption checks.
      
      raw_cpu_ops will be the basis for all other ops since these will be the
      operations that do not implement any checks.
      
      Primitive operations are renamed by this patch from __this_cpu_xxx
      to raw_cpu_xxx.
      
      Also change the uses of the x86 percpu primitives in preempt.h.
      These depend directly on asm/percpu.h (header #include nesting issue).
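      A sketch of the resulting usage split, using an illustrative per-cpu
      counter (the variable and the two functions are hypothetical;
      raw_cpu_inc()/__this_cpu_inc() are the real primitives being
      distinguished):

        DEFINE_PER_CPU(int, hits);      /* hypothetical per-cpu counter */

        void caller_already_pinned(void)
        {
                /* No debug check: the caller guarantees it cannot migrate. */
                raw_cpu_inc(hits);
        }

        void ordinary_caller(void)
        {
                /* Once patch 5 lands, __this_cpu_inc() warns under
                 * CONFIG_DEBUG_PREEMPT if preemption is still enabled. */
                preempt_disable();
                __this_cpu_inc(hits);
                preempt_enable();
        }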
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Alex Shi <alex.shi@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bryan Wu <cooloney@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: David Daney <david.daney@cavium.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Hedi Berriche <hedi@sgi.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wim Van Sebroeck <wim@iguana.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: always define BUG() and HAVE_ARCH_BUG, even with !CONFIG_BUG · b06dd879
      Josh Triplett authored
      This ensures that BUG() always has a definition that causes a trap (via
      an undefined instruction), and that the compiler still recognizes the
      code following BUG() as unreachable, avoiding warnings that would
      otherwise appear (such as on non-void functions that don't return a
      value after BUG()).
      
      In addition to saving a few bytes over the generic infinite-loop
      implementation, this implementation traps rather than looping, which
      potentially allows for better error-recovery behavior (such as by
      rebooting).
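      A hedged sketch of such a definition (the trap instruction follows
      x86's ud2 convention; the exact macro body in the patch may differ):

        /* Trap via an undefined instruction, and tell the compiler that
         * execution cannot continue past this point. */
        #define HAVE_ARCH_BUG
        #define BUG()                                           \
        do {                                                    \
                asm volatile("ud2");                            \
                unreachable();                                  \
        } while (0)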
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7. 29 March 2014, 1 commit
    • x86: fix boot on uniprocessor systems · 825600c0
      Artem Fetishev authored
      On x86 uniprocessor systems, topology_physical_package_id() returns
      -1, which causes rapl_cpu_prepare() to leave the rapl_pmu variable
      uninitialized, which in turn leads to a GPF in rapl_pmu_init().
      
      See arch/x86/kernel/cpu/perf_event_intel_rapl.c.
      
      It turns out that physical_package_id and core_id can actually be
      retrieved for uniprocessor systems too.  Enabling them also fixes
      the rapl_pmu code.
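      A simplified sketch of the failure mode (names follow
      perf_event_intel_rapl.c as referenced above; the body is
      illustrative, not the literal code):

        /* Sketch: with phys_id == -1 on UP, the per-cpu pmu pointer is
         * never set, and rapl_pmu_init() later dereferences NULL. */
        static int rapl_cpu_prepare(int cpu)
        {
                struct rapl_pmu *pmu;
                int phys_id = topology_physical_package_id(cpu);

                if (phys_id < 0)        /* -1 on UP before this fix */
                        return -1;

                pmu = kzalloc(sizeof(*pmu), GFP_KERNEL);
                if (!pmu)
                        return -1;

                per_cpu(rapl_pmu, cpu) = pmu;
                return 0;
        }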
      Signed-off-by: Artem Fetishev <artem_fetishev@epam.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8. 25 March 2014, 1 commit
    • Revert "xen: properly account for _PAGE_NUMA during xen pte translations" · 5926f87f
      David Vrabel authored
      This reverts commit a9c8e4be.
      
      PTEs in Xen PV guests must contain machine addresses if _PAGE_PRESENT
      is set, and pseudo-physical addresses if _PAGE_PRESENT is clear.

      This is because during a domain save/restore (migration) the page
      table entries are "canonicalised" and "uncanonicalised"; i.e., MFNs
      are converted to PFNs during domain save so that on a restore the
      page table entries may be rewritten with the new MFNs on the
      destination.  This canonicalisation is only done for PTEs that are
      present.
      
      This change resulted in writing PTEs with MFNs if _PAGE_PROTNONE (or
      _PAGE_NUMA) was set but _PAGE_PRESENT was clear.  These PTEs would be
      migrated as-is which would result in unexpected behaviour in the
      destination domain.  Either a) the MFN would be translated to the
      wrong PFN/page; b) setting the _PAGE_PRESENT bit would clear the PTE
      because the MFN is no longer owned by the domain; or c) the present
      bit would not get set.
      
      Symptoms include "Bad page" reports when munmapping after migrating a
      domain.
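      A sketch of the invariant the revert restores (the helper name is
      hypothetical; pte_val()/pte_mfn()/mfn_to_pfn() are real Xen/x86
      helpers):

        /* Only present PTEs hold MFNs, so only they are canonicalised
         * for migration. */
        static unsigned long canonicalise_for_save(pte_t pte)
        {
                if (pte_val(pte) & _PAGE_PRESENT)
                        return mfn_to_pfn(pte_mfn(pte));

                return pte_val(pte);    /* already pseudo-physical */
        }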
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: <stable@vger.kernel.org>        [3.12+]
9. 21 March 2014, 3 commits
10. 20 March 2014, 4 commits
    • audit: use uapi/linux/audit.h for AUDIT_ARCH declarations · 579ec9e1
      Eric Paris authored
      The syscall.h headers were including linux/audit.h but really only
      needed uapi/linux/audit.h to get the requisite defines.  Switch to
      the uapi header (see the sketch below).
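      The shape of the change in each affected syscall.h (a sketch; the
      precise file list is implied by the Cc'd architectures rather than
      spelled out here):

        /* arch/<arch>/include/asm/syscall.h (sketch) */
        #include <uapi/linux/audit.h>   /* was <linux/audit.h>; only the
                                         * AUDIT_ARCH_* defines are needed */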
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-mips@linux-mips.org
      Cc: linux-s390@vger.kernel.org
      Cc: x86@kernel.org
    • syscall_get_arch: remove useless function arguments · 5e937a9a
      Eric Paris authored
      Every caller of syscall_get_arch() uses current for the task and no
      implementors of the function need args.  So just get rid of both of
      those things.  Admittedly, since these are inline functions we aren't
      wasting stack space, but it just makes the prototypes better.
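      A before/after sketch of the prototype as described (argument names
      are assumed):

        /* Before: every caller passed current, and no implementation used
         * the regs argument. */
        int syscall_get_arch(struct task_struct *task, struct pt_regs *regs);

        /* After: the task is implicitly current. */
        int syscall_get_arch(void);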
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-mips@linux-mips.org
      Cc: linux390@de.ibm.com
      Cc: x86@kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
    • random: Add arch_has_random[_seed]() · 7b878d4b
      H. Peter Anvin authored
      Add predicate functions for having arch_get_random[_seed]*().  The
      only current use is to avoid the loop in arch_random_refill() when
      arch_get_random_seed_long() is unavailable.
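      A sketch of the stated use (arch_random_refill() is paraphrased from
      the description above; use_seed() is a hypothetical consumer):

        static void arch_random_refill(void)
        {
                unsigned long seed;
                int i;

                /* The predicate avoids a loop of guaranteed failures when
                 * no seed instruction exists on this machine. */
                if (!arch_has_random_seed())
                        return;

                for (i = 0; i < 8; i++)
                        if (arch_get_random_seed_long(&seed))
                                use_seed(seed); /* hypothetical consumer */
        }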
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <michael@ellerman.id.au>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • x86, random: Enable the RDSEED instruction · d20f78d2
      H. Peter Anvin authored
      Upcoming Intel silicon adds a new RDSEED instruction, which is similar
      to RDRAND but provides a stronger guarantee: unlike RDRAND, RDSEED
      will always reseed the PRNG from the true random number source between
      each read.  Thus, the output of RDSEED is guaranteed to be 100%
      entropic, unlike RDRAND, which is only architecturally guaranteed to
      be 1/512 entropic (although in practice it is much more).
      
      The RDSEED instruction takes the same time to execute as RDRAND, but
      RDSEED unlike RDRAND can legitimately return failure (CF=0) due to
      entropy exhaustion if too many threads on too many cores are hammering
      the RDSEED instruction at the same time.  Therefore, we have to be
      more conservative and only use it in places where we can tolerate
      failures.
      
      This patch introduces the primitives arch_get_random_seed_{int,long}()
      but does not use it yet.
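      A hedged sketch of one such primitive (the real code gates this
      behind a CPU feature check; this shows only the CF-based
      success/failure protocol of the instruction):

        static inline bool arch_get_random_seed_long(unsigned long *v)
        {
                bool ok;

                /* RDSEED sets CF=1 on success and CF=0 on (transient)
                 * entropy exhaustion; callers must tolerate failure. */
                asm volatile("rdseed %0; setc %1"
                             : "=r" (*v), "=qm" (ok));
                return ok;
        }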
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <michael@ellerman.id.au>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
11. 19 March 2014, 5 commits
12. 18 March 2014, 1 commit
13. 14 March 2014, 3 commits
14. 13 March 2014, 1 commit
15. 12 March 2014, 1 commit
16. 11 March 2014, 5 commits
17. 07 March 2014, 4 commits
18. 06 March 2014, 2 commits