1. 11 Sep, 2015 1 commit
    • kexec: split kexec_load syscall from kexec core code · 2965faa5
      Dave Young committed
      There are two kexec load syscalls, kexec_load and kexec_file_load.
      kexec_file_load has already been split out into kernel/kexec_file.c.  In
      this patch I split the kexec_load syscall code out into kernel/kexec.c.
      
      Also add a new Kconfig option, KEXEC_CORE, so we can disable kexec_load and
      use kexec_file_load only, or vice versa.
      
      The original requirement is from Ted Ts'o, who wants the kexec kernel
      signature to be checked when CONFIG_KEXEC_VERIFY_SIG is enabled.  But
      kexec-tools can bypass the check by using the kexec_load syscall.
      
      Vivek Goyal proposed creating a common Kconfig option so that users can
      compile in only one syscall for loading a kexec kernel.  KEXEC/KEXEC_FILE
      select KEXEC_CORE so that old config files still work.
      
      Because general code needs CONFIG_KEXEC_CORE, I updated all the
      architecture Kconfig files with a new option KEXEC_CORE, and let KEXEC
      select KEXEC_CORE in the arch Kconfig.  I also updated the general kernel
      code with respect to the kexec_load syscall.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 09 Sep, 2015 5 commits
    • mm: rename alloc_pages_exact_node() to __alloc_pages_node() · 96db800f
      Vlastimil Babka committed
      alloc_pages_exact_node() was introduced in commit 6484eb3e ("page
      allocator: do not check NUMA node ID when the caller knows the node is
      valid") as an optimized variant of alloc_pages_node() that doesn't
      fall back to the current node for nid == NUMA_NO_NODE.  Unfortunately the
      name of the function can easily suggest that the allocation is
      restricted to the given node and fails otherwise.  In truth, the node is
      only preferred, unless __GFP_THISNODE is passed among the gfp flags.
      
      The misleading name has led to mistakes in the past, see for example
      commits 5265047a ("mm, thp: really limit transparent hugepage
      allocation to local node") and b360edb4 ("mm, mempolicy:
      migrate_to_node should only migrate to node").
      
      Another issue with the name is that there's a family of
      alloc_pages_exact*() functions where 'exact' means exact size (instead
      of page order), which leads to more confusion.
      
      To prevent further mistakes, this patch effectively renames
      alloc_pages_exact_node() to __alloc_pages_node() to better convey that
      it's an optimized variant of alloc_pages_node() not intended for general
      usage.  Both functions get described in comments.
      
      It has also been considered to really provide a convenience function for
      allocations restricted to a node, but the majority opinion seems to be that
      __GFP_THISNODE already provides that functionality and we shouldn't
      duplicate the API needlessly.  The number of users would be small
      anyway.
      
      Existing callers of alloc_pages_exact_node() are simply converted to
      call __alloc_pages_node(), with the exception of sba_alloc_coherent()
      which open-codes the check for NUMA_NO_NODE, so it is converted to use
      alloc_pages_node() instead.  This means it no longer performs some
      VM_BUG_ON checks, and since the current check for nid in
      alloc_pages_node() uses a 'nid < 0' comparison (which includes
      NUMA_NO_NODE), it may hide wrong values which would previously have been
      exposed.
      
      Both differences will be rectified by the next patch.
      
      To sum up, this patch makes no functional changes, except temporarily
      hiding potentially buggy callers.  Restricting the checks in
      alloc_pages_node() is left for the next patch which can in turn expose
      more existing buggy callers.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Robin Holt <robinmholt@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cliff Whickman <cpw@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
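The difference between the two allocator entry points described in this commit can be illustrated with a small userspace model. This is only a sketch under stated assumptions: it models node selection, not page allocation, and every `model_*`/`MODEL_*` name is hypothetical rather than kernel API. The only detail taken from the kernel is that NUMA_NO_NODE is -1 and that alloc_pages_node() at this point uses a `nid < 0` comparison, as the commit message states.

```c
#include <assert.h>

#define MODEL_NUMA_NO_NODE (-1)  /* same value as the kernel's NUMA_NO_NODE */
#define MODEL_MAX_NODES    8     /* hypothetical node count for this model */

/*
 * Models the node choice of alloc_pages_node(): any negative nid, which
 * includes NUMA_NO_NODE, silently falls back to the caller's current node.
 */
static int model_alloc_pages_node_nid(int nid, int current_node)
{
    assert(current_node >= 0 && current_node < MODEL_MAX_NODES);
    if (nid < 0)
        nid = current_node;
    return nid;
}

/*
 * Models the node choice of __alloc_pages_node(): the caller promises a
 * valid nid, so there is no fallback. Returning -1 here stands in for the
 * VM_BUG_ON checks mentioned in the commit message.
 */
static int model_exact_nid(int nid)
{
    if (nid < 0 || nid >= MODEL_MAX_NODES)
        return -1;
    return nid;
}
```

In this model, `model_alloc_pages_node_nid(-5, 2)` quietly returns node 2, which is the "hidden wrong value" case the commit message warns about, while `model_exact_nid(-5)` reports the problem.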
    • x86: use generic early mem copy · 5dd2c4bd
      Mark Salter committed
      The early_ioremap library now has a generic copy_from_early_mem()
      function.  Use the generic copy function for x86 relocate_initrd().
      
      [akpm@linux-foundation.org: remove MAX_MAP_CHUNK define, per Yinghai Lu]
      Signed-off-by: Mark Salter <msalter@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arm64: support initrd outside kernel linear map · 1570f0d7
      Mark Salter committed
      The use of mem= could leave part or all of the initrd outside of the
      kernel linear map.  This will lead to an error when unpacking the initrd
      and a probable failure to boot.  This patch catches that situation and
      relocates the initrd to be fully within the linear map.
      Signed-off-by: Mark Salter <msalter@redhat.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mem-hotplug: handle node hole when initializing numa_meminfo. · 95cf82ec
      Tang Chen committed
      When parsing SRAT, all memory ranges are added into numa_meminfo.  In
      numa_init(), before entering numa_cleanup_meminfo(), all possible memory
      ranges are in numa_meminfo.  And numa_cleanup_meminfo() removes all
      ranges over max_pfn or empty.
      
      But this only works if the nodes are contiguous.  Let's have a look at
      the following example:
      
      We have an SRAT like this:
      SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
      SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
      SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
      SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
      SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
      SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
      SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
      SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
      SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug
      
      On boot, only nodes 0, 1, 2 and 3 exist.
      
      And the numa_meminfo will look like this:
      numa_meminfo.nr_blks = 9
      1. on node 0: [0, 60000000]
      2. on node 0: [100000000, 20000000000]
      3. on node 1: [20000000000, 40000000000]
      4. on node 4: [40000000000, 60000000000]
      5. on node 5: [60000000000, 80000000000]
      6. on node 2: [80000000000, a0000000000]
      7. on node 3: [a0000000000, a0800000000]
      8. on node 6: [c0000000000, a0800000000]
      9. on node 7: [e0000000000, a0800000000]
      
      And numa_cleanup_meminfo() will merge 1 and 2, and remove 8 and 9 because
      their end addresses are over max_pfn, which is a0800000000.  But 4 and 5
      are not removed because their end addresses are less than max_pfn, even
      though nodes 4 and 5 don't in fact exist.
      
      In short, numa_cleanup_meminfo() is not able to handle holes between nodes.
      
      Since memory ranges in node 4 and 5 are in numa_meminfo, in
      numa_register_memblks(), node 4 and 5 will be mistakenly set to online.
      
      If you run lscpu, it will show:
      NUMA node0 CPU(s):     0-14,128-142
      NUMA node1 CPU(s):     15-29,143-157
      NUMA node2 CPU(s):
      NUMA node3 CPU(s):
      NUMA node4 CPU(s):     62-76,190-204
      NUMA node5 CPU(s):     78-92,206-220
      
      In this patch, we use memblock_overlaps_region() to check if ranges in
      numa_meminfo overlap with ranges in memory_block.  Since memory_block
      contains all available memory at boot time, if they overlap, it means the
      ranges exist.  If not, then remove them from numa_meminfo.
      
      After this patch, lscpu will show:
      NUMA node0 CPU(s):     0-14,128-142
      NUMA node1 CPU(s):     15-29,143-157
      NUMA node4 CPU(s):     62-76,190-204
      NUMA node5 CPU(s):     78-92,206-220
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
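The filtering idea in this commit (keep a numa_meminfo range only if it overlaps memory that is actually present at boot) can be sketched in plain C. This is a userspace illustration, not the kernel code: `struct model_range`, `model_ranges_overlap()` and `model_drop_absent_blocks()` are hypothetical names, ranges are treated as half-open `[start, end)`, and the "present" list stands in for whatever the kernel consults through memblock_overlaps_region().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one numa_meminfo block, covering [start, end). */
struct model_range {
    uint64_t start;
    uint64_t end;
    int nid;
};

/* Two half-open ranges overlap if each one starts before the other ends. */
static bool model_ranges_overlap(const struct model_range *a,
                                 const struct model_range *b)
{
    return a->start < b->end && b->start < a->end;
}

/*
 * Compact blks[] in place so that only blocks overlapping at least one
 * present region remain. Returns the new number of blocks.
 */
static size_t model_drop_absent_blocks(struct model_range *blks, size_t nr_blks,
                                       const struct model_range *present,
                                       size_t nr_present)
{
    size_t kept = 0;

    assert(blks != NULL || nr_blks == 0);
    for (size_t i = 0; i < nr_blks; i++) {
        bool found = false;

        for (size_t j = 0; j < nr_present && !found; j++)
            found = model_ranges_overlap(&blks[i], &present[j]);
        if (found)
            blks[kept++] = blks[i];
    }
    return kept;
}
```

Applied to the example above, a block for node 4 at `[0x40000000000, 0x60000000000)` would be dropped when no present region overlaps it, regardless of how its end address compares with max_pfn.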
    • sparc32: do not include swap.h from pgtable_32.h · b3d9ed3f
      Michal Hocko committed
      "memcg: export struct mem_cgroup" will add includes into
      linux/memcontrol.h, which leads to further header dependency issues as
      reported by Guenter Roeck:
      
        In file included from include/linux/highmem.h:7:0,
                         from include/linux/bio.h:23,
                         from include/linux/writeback.h:192,
                         from include/linux/memcontrol.h:30,
                         from include/linux/swap.h:8,
                         from ./arch/sparc/include/asm/pgtable_32.h:17,
                         from ./arch/sparc/include/asm/pgtable.h:6,
                         from arch/sparc/kernel/traps_32.c:23:
        include/linux/mm.h: In function 'is_vmalloc_addr':
        include/linux/mm.h:371:17: error: 'VMALLOC_START' undeclared (first use in this function)
        include/linux/mm.h:371:17: note: each undeclared identifier is reported only once for each function it appears in
        include/linux/mm.h:371:41: error: 'VMALLOC_END' undeclared (first use in this function)
        include/linux/mm.h: In function 'maybe_mkwrite':
        include/linux/mm.h:556:3: error: implicit declaration of function 'pte_mkwrite'
      
      The issue is that pgtable_32.h depends on swap.h to get swap_entry_t, but
      that goes all the way down to linux/mm.h, which wants VMALLOC_*, and those
      are only defined later in pgtable_32.h.
      
      swap_entry_t is defined in include/mm_types.h so it should be sufficient
      to include this header without more dependencies.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 08 Sep, 2015 10 commits
    • 6dc0dcde
    • parisc: Drop CONFIG_SMP around update_cr16_clocksource() · 72581cec
      Helge Deller committed
      No need to use CONFIG_SMP around update_cr16_clocksource(). It checks
      whether num_online_cpus() is greater than 1, and that value is always 1
      in UP builds.
      Signed-off-by: Helge Deller <deller@gmx.de>
    • xen: switch extra memory accounting to use pfns · 626d7508
      Juergen Gross committed
      Instead of using physical addresses for accounting of extra memory
      areas available for ballooning, switch to pfns, as this is much less
      error-prone regarding partial pages.
      Reported-by: Roger Pau Monné <roger.pau@citrix.com>
      Tested-by: Roger Pau Monné <roger.pau@citrix.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
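Why page frame numbers are less error-prone than byte addresses for this kind of accounting can be seen in a short sketch. It assumes 4 KiB pages and uses hypothetical `model_*` helpers that mirror the rounding done by the kernel's PFN_UP() and PFN_DOWN() macros; it is not the Xen setup code itself.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12                       /* assumes 4 KiB pages */
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)

/* Round an address up to the next page frame number. */
static uint64_t model_pfn_up(uint64_t addr)
{
    return (addr + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
}

/* Round an address down to the page frame number containing it. */
static uint64_t model_pfn_down(uint64_t addr)
{
    return addr >> MODEL_PAGE_SHIFT;
}

/*
 * Number of whole pages inside the byte range [start, end). Partial pages
 * at either edge are excluded, so they can never be counted as usable.
 */
static uint64_t model_whole_pages_in(uint64_t start, uint64_t end)
{
    assert(start <= end);
    uint64_t first = model_pfn_up(start);
    uint64_t last = model_pfn_down(end);

    return last > first ? last - first : 0;
}
```

For the byte range `[0x1800, 0x4800)` the sketch reports two whole pages (frames 2 and 3), whereas dividing the byte length by the page size would report three.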
    • xen: limit memory to architectural maximum · cb9e444b
      Juergen Gross committed
      When a pv-domain (including dom0) is started, it tries to size its
      p2m list according to the maximum possible memory amount it ever can
      achieve. Limit the initial maximum memory size to the architectural
      limit of the hardware in order to avoid overflows during remapping
      of memory.
      
      This problem will occur when dom0 is started with an initial memory
      size that is a multiple of 1GB, but without specifying its maximum
      memory size. The kernel must be configured without
      CONFIG_XEN_BALLOON_MEMORY_HOTPLUG for the problem to happen.
      Reported-by: Roger Pau Monné <roger.pau@citrix.com>
      Tested-by: Roger Pau Monné <roger.pau@citrix.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
    • xen: avoid another early crash of memory limited dom0 · ab24507c
      Juergen Gross committed
      Commit b1c9f169047b ("xen: split counting of extra memory pages...")
      introduced an error, occurring only on some hardware, when dom0 was
      started with limited memory.
      
      The problem arises in case dom0 is started with initial memory and
      maximum memory being the same. The kernel must be configured without
      CONFIG_XEN_BALLOON_MEMORY_HOTPLUG for the problem to happen. If all
      of this is true and the E820 map of the machine is sparse (some areas
      are not covered) then the machine might crash early in the boot
      process.
      
      An example E820 map triggering the problem looks like this:
      
      [    0.000000] e820: BIOS-provided physical RAM map:
      [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
      [    0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
      [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000cf7fafff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000cf7fb000-0x00000000cf95ffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000cf960000-0x00000000cfb62fff] ACPI NVS
      [    0.000000] BIOS-e820: [mem 0x00000000cfb63000-0x00000000cfd14fff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000cfd15000-0x00000000cfd61fff] ACPI NVS
      [    0.000000] BIOS-e820: [mem 0x00000000cfd62000-0x00000000cfd6cfff] ACPI data
      [    0.000000] BIOS-e820: [mem 0x00000000cfd6d000-0x00000000cfd6ffff] ACPI NVS
      [    0.000000] BIOS-e820: [mem 0x00000000cfd70000-0x00000000cfd70fff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000cfd71000-0x00000000cfea8fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000cfea9000-0x00000000cfeb9fff] ACPI NVS
      [    0.000000] BIOS-e820: [mem 0x00000000cfeba000-0x00000000cfecafff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000cfecb000-0x00000000cfecbfff] ACPI NVS
      [    0.000000] BIOS-e820: [mem 0x00000000cfecc000-0x00000000cfedbfff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000cfedc000-0x00000000cfedcfff] ACPI NVS
      [    0.000000] BIOS-e820: [mem 0x00000000cfedd000-0x00000000cfeddfff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000cfede000-0x00000000cfee3fff] ACPI NVS
      [    0.000000] BIOS-e820: [mem 0x00000000cfee4000-0x00000000cfef6fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000cfef7000-0x00000000cfefffff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fec10000-0x00000000fec10fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fed40000-0x00000000fed44fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fed61000-0x00000000fed70fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed8ffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
      [    0.000000] BIOS-e820: [mem 0x0000000100001000-0x000000020effffff] usable
      
      In this case the area a0000-dffff isn't present in the map. This will
      confuse the memory setup of the domain when remapping the memory from
      such holes to populated areas.
      
      To avoid the problem, the accounting of memory to be remapped has to
      count such holes in the E820 map as well.
      Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
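What "counting holes in the E820 map" means can be shown with a small sketch. It assumes a map sorted by start address and half-open `[start, end)` entries (the log above prints inclusive end addresses); `struct model_e820_entry` and `model_count_hole_bytes()` are hypothetical names, not the Xen or x86 setup code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one E820 entry, covering [start, end). */
struct model_e820_entry {
    uint64_t start;
    uint64_t end;
};

/*
 * Sum the bytes that no entry covers between the start of the first entry
 * and the end of the last one. The map must be sorted by start address.
 */
static uint64_t model_count_hole_bytes(const struct model_e820_entry *map,
                                       size_t n)
{
    uint64_t holes = 0;
    uint64_t cursor;

    if (n == 0)
        return 0;
    cursor = map[0].start;
    for (size_t i = 0; i < n; i++) {
        assert(map[i].start <= map[i].end);
        if (map[i].start > cursor)
            holes += map[i].start - cursor;  /* gap not covered by any entry */
        if (map[i].end > cursor)
            cursor = map[i].end;
    }
    return holes;
}
```

Fed the first entries of the map above, the sketch finds the uncovered area from 0xa0000 up to 0xe0000, which is the a0000-dffff hole the commit message points out.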
    • xen: avoid early crash of memory limited dom0 · eafd72e0
      Juergen Gross committed
      Commit b1c9f169047b ("xen: split counting of extra memory pages...")
      introduced an error when dom0 was started with limited memory.
      
      The problem arises in case dom0 is started with initial memory and
      maximum memory being the same and exactly a multiple of 1 GB. The
      kernel must be configured without CONFIG_XEN_BALLOON_MEMORY_HOTPLUG
      for the problem to happen. In this case it will crash very early
      during boot due to the virtual mapped p2m list not being large
      enough to be able to remap any memory:
      
      (XEN) Freed 304kB init memory.
      mapping kernel into physical memory
      about to get started...
      (XEN) traps.c:459:d0v0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
      (XEN) domain_crash_sync called from entry.S: fault at ffff82d080229a93 create_bounce_frame+0x12b/0x13a
      (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
      (XEN) ----[ Xen-4.5.2-pre  x86_64  debug=n Not tainted ]----
      (XEN) CPU:    0
      (XEN) RIP:    e033:[<ffffffff81d120cb>]
      (XEN) RFLAGS: 0000000000000206   EM: 1 CONTEXT: pv guest (d0v0)
      (XEN) rax: ffffffff81db2000   rbx: 000000004d000000   rcx: 0000000000000000
      (XEN) rdx: 000000004d000000   rsi: 0000000000063000   rdi: 000000004d063000
      (XEN) rbp: ffffffff81c03d78   rsp: ffffffff81c03d28   r8:  0000000000023000
      (XEN) r9:  00000001040ff000   r10: 0000000000007ff0   r11: 0000000000000000
      (XEN) r12: 0000000000063000   r13: 000000000004d000   r14: 0000000000000063
      (XEN) r15: 0000000000000063   cr0: 0000000080050033   cr4: 00000000000006f0
      (XEN) cr3: 0000000105c0f000   cr2: ffffc90000268000
      (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
      (XEN) Guest stack trace from rsp=ffffffff81c03d28:
      (XEN)   0000000000000000 0000000000000000 ffffffff81d120cb 000000010000e030
      (XEN)   0000000000010006 ffffffff81c03d68 000000000000e02b ffffffffffffffff
      (XEN)   0000000000000063 000000000004d063 ffffffff81c03de8 ffffffff81d130a7
      (XEN)   ffffffff81c03de8 000000000004d000 00000001040ff000 0000000000105db1
      (XEN)   00000001040ff001 000000000004d062 ffff8800092d6ff8 0000000002027000
      (XEN)   ffff8800094d8340 ffff8800092d6ff8 00003ffffffff000 ffff8800092d7ff8
      (XEN)   ffffffff81c03e48 ffffffff81d13c43 ffff8800094d8000 ffff8800094d9000
      (XEN)   0000000000000000 ffff8800092d6000 00000000092d6000 000000004cfbf000
      (XEN)   00000000092d6000 00000000052d5442 0000000000000000 0000000000000000
      (XEN)   ffffffff81c03ed8 ffffffff81d185c1 0000000000000000 0000000000000000
      (XEN)   ffffffff81c03e78 ffffffff810f8ca4 ffffffff81c03ed8 ffffffff8171a15d
      (XEN)   0000000000000010 ffffffff81c03ee8 0000000000000000 0000000000000000
      (XEN)   ffffffff81f0e402 ffffffffffffffff ffffffff81dae900 0000000000000000
      (XEN)   0000000000000000 0000000000000000 ffffffff81c03f28 ffffffff81d0cf0f
      (XEN)   0000000000000000 0000000000000000 0000000000000000 ffffffff81db82e0
      (XEN)   0000000000000000 0000000000000000 0000000000000000 0000000000000000
      (XEN)   ffffffff81c03f38 ffffffff81d0c603 ffffffff81c03ff8 ffffffff81d11c86
      (XEN)   0300000100000032 0000000000000005 0000000000000020 0000000000000000
      (XEN)   0000000000000000 0000000000000000 0000000000000000 0000000000000000
      (XEN)   0000000000000000 0000000000000000 0000000000000000 0000000000000000
      (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
      
      This can be avoided by allocating enough space for the p2m to cover
      the maximum memory of dom0 plus the identity mapped holes required
      for PCI space, BIOS etc.
      Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
    • parisc: Use double word condition in 64bit CAS operation · 1b59ddfc
      John David Anglin committed
      The attached change fixes the condition used in the "sub" instruction.
      A double word comparison is needed.  This fixes the 64-bit LWS CAS
      operation on 64-bit kernels.
      
      I can now enable 64-bit atomic support in GCC.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: John David Anglin <dave.anglin>
      Signed-off-by: Helge Deller <deller@gmx.de>
    • parisc: Filter out spurious interrupts in PA-RISC irq handler · b1b4e435
      Helge Deller committed
      When detecting a serial port on newer PA-RISC machines (with iosapic) we have a
      long way to go to find the right IRQ line, registering it, then registering the
      serial port and the irq handler for the serial port. During this phase spurious
      interrupts for the serial port may happen which then crashes the kernel because
      the action handler might not have been set up yet.
      
      So, basically it's a race condition between the serial port hardware and the
      CPU which sets up the necessary fields in the irq structs. The main reason for
      this race is that we unmask the serial port irqs too early, before everything
      has been set up properly (which isn't easily possible because we need the
      IRQ number to register the serial ports).
      
      This patch is a work-around for this problem. It adds checks to the CPU irq
      handler to verify if the IRQ action field has been initialized already. If not,
      we just skip this interrupt (which isn't critical for a serial port at bootup).
      The real fix would probably involve rewriting all PA-RISC specific IRQ code
      (for CPU, IOSAPIC, GSC and EISA) to use IRQ domains with proper parenting of
      the irq chips and proper irq enabling along this line.
      
      This bug has been in the PA-RISC port since the beginning, but the crashes
      happened very rarely with currently used hardware.  But on the latest machine
      which I bought (a C8000 workstation), which uses the fastest CPUs (4 x PA8900,
      1GHz) and which has the largest possible L1 cache size (64MB each), the kernel
      crashed at every boot because of this race. So, without this patch the machine
      would currently be unusable.
      
      For the record, here is the flow logic:
      1. serial_init_chip() in 8250_gsc.c calls iosapic_serial_irq().
      2. iosapic_serial_irq() calls txn_alloc_irq() to find the irq.
      3. iosapic_serial_irq() calls cpu_claim_irq() to register the CPU irq
      4. cpu_claim_irq() unmasks the CPU irq (which it shouldn't!)
      5. serial_init_chip() then registers the 8250 port.
      Problems:
      - In step 4 the CPU irq shouldn't have been registered yet, but only after step 5
      - If a serial irq happens after step 4 but before step 5 has finished, the kernel will crash
      Signed-off-by: Helge Deller <deller@gmx.de>
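The work-around described in this commit (skip an interrupt whose action handler has not been installed yet) can be modelled in a few lines of userspace C. Everything here is a hypothetical sketch: `struct model_irq_desc`, `model_irq_table` and `model_dispatch_irq()` are illustrative names and do not reproduce the PA-RISC CPU irq handler.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_IRQS 4  /* hypothetical table size for this model */

typedef void (*model_irq_action_t)(int irq);

/* Hypothetical per-IRQ state: the handler plus two counters. */
struct model_irq_desc {
    model_irq_action_t action;  /* NULL until the driver has registered */
    unsigned int handled;
    unsigned int spurious;
};

static struct model_irq_desc model_irq_table[MODEL_NR_IRQS];

/*
 * Dispatch one interrupt. If no action has been installed yet, count the
 * interrupt as spurious and skip it instead of calling through NULL.
 * Returns 1 if a handler ran, 0 if the interrupt was skipped.
 */
static int model_dispatch_irq(int irq)
{
    assert(irq >= 0 && irq < MODEL_NR_IRQS);
    struct model_irq_desc *desc = &model_irq_table[irq];

    if (desc->action == NULL) {
        desc->spurious++;
        return 0;
    }
    desc->action(irq);
    desc->handled++;
    return 1;
}

/* Placeholder handler standing in for the serial port driver's handler. */
static void model_serial_handler(int irq)
{
    (void)irq;
}
```

An interrupt arriving between steps 4 and 5 of the flow above corresponds to calling `model_dispatch_irq()` before `action` is set: the model counts it as spurious and returns, rather than crashing.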
    • parisc: Additionally check for in_atomic() in page fault handler · 699817c3
      Helge Deller committed
      Craig Estey noticed that we didn't check for in_atomic() in our page fault
      handler like other architectures. This commit adds this check by using
      faulthandler_disabled() which includes a check for pagefault_disabled() and
      in_atomic().
      Reported-by: Craig Estey <cae370@gmail.com>
      Signed-off-by: Helge Deller <deller@gmx.de>
    • parisc: Define ioremap_uc and ioremap_wc · 38d9029a
      Guenter Roeck committed
      Commit 3cc2dac5 ("drivers/video/fbdev/atyfb: Replace MTRR UC hole
      with strong UC") introduces calls to ioremap_wc and ioremap_uc. This
      causes build failures with parisc:allmodconfig. Map the missing
      functions to ioremap_nocache.
      
      Fixes: 3cc2dac5 ("drivers/video/fbdev/atyfb: Replace MTRR UC hole with strong UC")
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
  4. 06 Sep, 2015 6 commits
  5. 05 Sep, 2015 8 commits
    • genalloc: add name arg to gen_pool_get() and devm_gen_pool_create() · 73858173
      Vladimir Zapolskiy committed
      This change modifies the gen_pool_get() and devm_gen_pool_create() client
      interfaces, adding one more argument, "name", of a gen_pool object.
      
      Due to its implementation, gen_pool_get() is capable of retrieving only one
      gen_pool associated with a device even if multiple gen_pools are created.
      Fortunately, at the moment this is sufficient for the clients, hence
      provide NULL as a valid argument on both the producer devm_gen_pool_create()
      and consumer gen_pool_get() sides.
      
      Because only one created gen_pool per device is addressable, explicitly
      add a restriction to devm_gen_pool_create() to create only one gen_pool
      per device.  This implies two possible error codes returned by the
      function; account for them on the client side (only misc/sram).  This
      completes the client side changes related to genalloc updates.
      
      [akpm@linux-foundation.org: gen_pool_get() cleanup]
      Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
      Cc: Philipp Zabel <p.zabel@pengutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
      Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
      Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Sascha Hauer <kernel@pengutronix.de>
      Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: send one IPI per CPU to TLB flush all entries after unmapping pages · 72b252ae
      Mel Gorman committed
      An IPI is sent to flush remote TLBs when a page is unmapped that was
      potentially accessed by other CPUs.  There are many circumstances where
      this happens but the obvious one is kswapd reclaiming pages belonging to a
      running process as kswapd and the task are likely running on separate
      CPUs.
      
      On small machines, this is not a significant problem but as machines get
      larger with more cores and more memory, the cost of these IPIs can be
      high.  This patch uses a simple structure that tracks CPUs that
      potentially have TLB entries for pages being unmapped.  When the unmapping
      is complete, the full TLB is flushed on the assumption that a refill cost
      is lower than flushing individual entries.
      
      Architectures wishing to do this must give the following guarantee.
      
              If a clean page is unmapped and not immediately flushed, the
              architecture must guarantee that a write to that linear address
              from a CPU with a cached TLB entry will trap a page fault.
      
      This is essentially what the kernel already depends on but the window is
      much larger with this patch applied and is worth highlighting.  The
      architecture should consider whether the cost of the full TLB flush is
      higher than sending an IPI to flush each individual entry.  An additional
      architecture helper called flush_tlb_local is required.  It's a trivial
      wrapper with some accounting in the x86 case.
      
      The impact of this patch depends on the workload as measuring any benefit
      requires both mapped pages co-located on the LRU and memory pressure.  The
      case with the biggest impact is multiple processes reading mapped pages
      taken from the vm-scalability test suite.  The test case uses NR_CPU
      readers of mapped files that consume 10*RAM.
      
      Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed      159.62 (  0.00%)   120.68 ( 24.40%)
      Ops lru-file-mmap-read-time_range    30.59 (  0.00%)     2.80 ( 90.85%)
      Ops lru-file-mmap-read-time_stddv     6.70 (  0.00%)     0.64 ( 90.38%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User          581.00       611.43
      System       5804.93      4111.76
      Elapsed       161.03       122.12
      
      This is showing that the readers completed 24.40% faster with 29% less
      system CPU time.  From vmstats, it is known that the vanilla kernel was
      interrupted roughly 900K times per second during the steady phase of the
      test and the patched kernel was interrupted 180K times per second.
      
      The impact is lower on a single socket machine.
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed       25.33 (  0.00%)    20.38 ( 19.54%)
      Ops lru-file-mmap-read-time_range     0.91 (  0.00%)     1.44 (-58.24%)
      Ops lru-file-mmap-read-time_stddv     0.28 (  0.00%)     0.47 (-65.34%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User           58.09        57.64
      System        111.82        76.56
      Elapsed        27.29        22.55
      
      It's still a noticeable improvement with vmstat showing interrupts went
      from roughly 500K per second to 45K per second.
      
      The patch will have no impact on workloads that have no memory pressure
      or relatively few mapped pages.  It will have an unpredictable impact on
      the workload running on the CPU being flushed as it'll depend on how many
      TLB entries need to be refilled and how long that takes.  Worst case, the
      TLB will be completely cleared of active entries when the target PFNs
      were not resident at all.
      
      [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72b252ae
    • M
      x86, mm: trace when an IPI is about to be sent · 5b74283a
      Mel Gorman committed
      When unmapping pages it is necessary to flush the TLB.  If that page was
      accessed by another CPU then an IPI is used to flush the remote CPU.  That
      is a lot of IPIs if kswapd is scanning and unmapping >100K pages per
      second.
      
      There already is a window between when a page is unmapped and when it is
      TLB flushed.  This series increases the window so multiple pages can be
      flushed using a single IPI.  This should be safe or the kernel is hosed
      already.
      
      Patch 1 simply made the rest of the series easier to write as ftrace
              could identify all the senders of TLB flush IPIs.
      
      Patch 2 tracks what CPUs potentially map a PFN and then sends an IPI
              to flush the entire TLB.
      
      Patch 3 tracks when there potentially are writable TLB entries that
              need to be batched differently.
      
      Patch 4 increases SWAP_CLUSTER_MAX to further batch flushes
      
      The performance impact is documented in the changelogs but in the optimistic
      case on a 4-socket machine the full series reduces interrupts from 900K
      interrupts/second to 60K interrupts/second.
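The batching idea in the series overview can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel implementation: `struct tlb_batch`, `tlb_batch_add`, `tlb_batch_flush` and `flush_per_page` are hypothetical names, CPUs are modelled as bits in a 64-bit mask, and every CPU in the mask is assumed to cost one IPI.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: hypothetical names, not the kernel's data structures. */
struct tlb_batch {
    uint64_t cpumask;     /* one bit per CPU that may hold a stale entry */
    bool flush_required;  /* true once at least one page was unmapped */
};

/* Record that an unmapped page may still be cached by the CPUs in page_cpus. */
static void tlb_batch_add(struct tlb_batch *b, uint64_t page_cpus)
{
    b->cpumask |= page_cpus;
    b->flush_required = true;
}

/* Count the set bits in a CPU mask (one IPI per CPU in this model). */
static unsigned int count_cpus(uint64_t mask)
{
    unsigned int n = 0;

    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/* Flush the whole batch at once; returns the number of IPIs sent. */
static unsigned int tlb_batch_flush(struct tlb_batch *b)
{
    unsigned int ipis;

    if (!b->flush_required)
        return 0;
    ipis = count_cpus(b->cpumask);
    b->cpumask = 0;
    b->flush_required = false;
    return ipis;
}

/* Unbatched baseline: every page triggers its own round of IPIs. */
static unsigned int flush_per_page(const uint64_t *page_cpus, unsigned int n)
{
    unsigned int ipis = 0;

    for (unsigned int i = 0; i < n; i++)
        ipis += count_cpus(page_cpus[i]);
    return ipis;
}
```

In this model, three pages cached by CPU sets {0,1}, {1,2} and {0,1} cost six IPIs when flushed one page at a time but only three when batched, at the price of each targeted CPU losing its entire TLB rather than individual entries.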
      
      This patch (of 4):
      
      It is easy to trace when an IPI is received to flush a TLB but harder to
      detect what event sent it.  This patch makes it easy to identify the
      source of IPIs being transmitted for TLB flushes on x86.

      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b74283a
    • A
      userfaultfd: activate syscall · 1380fca0
      Andrea Arcangeli committed
      This activates the userfaultfd syscall.
      
      [sfr@canb.auug.org.au: activate syscall fix]
      [akpm@linux-foundation.org: don't enable userfaultfd on powerpc]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1380fca0
    • U
      watchdog: rename watchdog_suspend() and watchdog_resume() · ec6a9066
      Ulrich Obergfell committed
      Rename watchdog_suspend() to lockup_detector_suspend() and
      watchdog_resume() to lockup_detector_resume() to avoid confusion with the
      watchdog subsystem and to be consistent with the existing name
      lockup_detector_init().
      
      Also provide comment blocks to explain the watchdog_running and
      watchdog_suspended variables and their relationship.

      Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
      Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec6a9066
    • U
      watchdog: use suspend/resume interface in fixup_ht_bug() · 999bbe49
      Ulrich Obergfell committed
      Remove watchdog_nmi_disable_all() and watchdog_nmi_enable_all() since
      these functions are no longer needed.  If a subsystem has a need to
      deactivate the watchdog temporarily, it should utilize the
      watchdog_suspend() and watchdog_resume() functions.
      
      [akpm@linux-foundation.org: fix build with CONFIG_LOCKUP_DETECTOR=m]
      Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
      Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      999bbe49
    • G
      kernel/watchdog: move NMI function header declarations from watchdog.h to nmi.h · aacfbe6a
      Guenter Roeck committed
      The kernel's NMI watchdog has nothing to do with the watchdog subsystem.
      Its header declarations should be in linux/nmi.h, not linux/watchdog.h.
      
      The code provided two sets of dummy functions if HARDLOCKUP_DETECTOR is
      not configured, one in the include file and one in kernel/watchdog.c.
      Remove the dummy functions from kernel/watchdog.c and use those from the
      include file.

      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Don Zickus <dzickus@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aacfbe6a
    • A
      sh: use PFN_DOWN macro · 81cf09ed
      Alexander Kuleshov committed
      Replace ((x) >> PAGE_SHIFT) with the predefined PFN_DOWN macro.

      Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
      Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81cf09ed
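For readers unfamiliar with the macro used in the sh commit above, the following stand-alone sketch shows the kind of substitution involved. `PFN_DOWN` is defined here as in the kernel's include/linux/pfn.h; the `PAGE_SHIFT` value of 12 (4 KiB pages) is an assumption for illustration only, and `addr_to_pfn_open_coded` and `addr_to_pfn` are hypothetical helper names, not functions from the patch.

```c
/* Assumption for illustration: 4 KiB pages. The real value is per-architecture. */
#define PAGE_SHIFT 12

/* Same definition as in the kernel's include/linux/pfn.h. */
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/* Before: open-coded shift (hypothetical helper). */
static unsigned long addr_to_pfn_open_coded(unsigned long addr)
{
    return addr >> PAGE_SHIFT;
}

/* After: the same computation expressed with the predefined macro. */
static unsigned long addr_to_pfn(unsigned long addr)
{
    return PFN_DOWN(addr);
}
```

Both helpers return the page frame number containing the address, rounding down; the macro form simply states that intent by name.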
  6. 04 Sep, 2015 2 commits
  7. 03 Sep, 2015 8 commits