1. 07 July 2017, 3 commits
    • mm, memory_hotplug: replace for_device by want_memblock in arch_add_memory · 3d79a728
      Michal Hocko committed
      arch_add_memory gets a for_device argument which controls whether we
      want to create memblocks for the created memory sections.  Simplify the
      logic by stating directly whether we want memblocks, rather than going
      through a pointless negation.  This also makes the API easier to
      understand because it is clear what we want, instead of an opaque
      for_device flag which can mean anything.
      
      This shouldn't introduce any functional change.
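
      A minimal sketch of the interface change (signatures condensed from
      the description, not copied verbatim from the tree):

        /* before: callers pass a negated notion ("is this device memory?") */
        int arch_add_memory(int nid, u64 start, u64 size, bool for_device);

        /* after: callers state directly whether memblocks are wanted */
        int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock);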
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d79a728
    • mm, memory_hotplug: do not associate hotadded memory to zones until online · f1dd2cd1
      Michal Hocko committed
      The current memory hotplug implementation relies on having all the
      struct pages associated with a zone/node during the physical hotplug
      phase (arch_add_memory->__add_pages->__add_section->__add_zone).  In the
      vast majority of cases this means that they are added to ZONE_NORMAL.
      This has been so since 9d99aaa3 ("[PATCH] x86_64: Support memory
      hotadd without sparsemem") and it wasn't a big deal back then because
      movable onlining didn't exist yet.
      
      Much later, memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
      onlining in 511c2aba ("mm, memory-hotplug: dynamic configure movable
      memory and portion memory") and then things got more complicated.
      Rather than reconsidering the zone association, which was no longer
      needed (because memory hotplug already depended on SPARSEMEM), a
      convoluted semantic of zone shifting was developed.  Only the
      currently last memblock, or the one adjacent to zone_movable, can be
      onlined movable.  This essentially means that the online type changes
      as new memblocks are added.
      
      Let's simulate memory hot online manually:
        $ echo 0x100000000 > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory32/valid_zones
        Normal Movable
      
        $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      
        $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      
        $ echo online_movable > /sys/devices/system/memory/memory34/state
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable Normal
      
      This is an awkward semantic because a udev event is sent as soon as the
      block is onlined and a udev handler might want to online it based on
      some policy (e.g. association with a node), but it will inherently race
      with new blocks showing up.
      
      This patch changes the physical online phase to not associate pages with
      any zone at all.  All the pages are just marked reserved and wait for
      the onlining phase to be associated with the zone as per the online
      request.  There are only two requirements:

      	- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap

      	- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses

      The latter is not an inherent requirement and can be changed in the
      future; it preserves the current behavior and keeps the code slightly
      simpler.
      
      This means that the same physical online steps as above will lead to the
      following state:

        Normal Movable

        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable

        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable

        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
      
      Implementation:
      The current move_pfn_range is reimplemented to check the above
      requirements (allow_online_pfn_range) and then update the respective
      zone (move_pfn_range_to_zone) and the pgdat, and link all the pages in
      the pfn range with the zone/node.  __add_pages is updated to not require
      the zone and only initializes sections in the range.  This allowed the
      arch_add_memory code to be simplified (s390 could get rid of quite a
      bit of code).
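
      A simplified sketch of that reworked onlining path (names from the
      text; bodies condensed, not the verbatim kernel code):

        static struct zone *move_pfn_range(int online_type, int nid,
                                           unsigned long start_pfn,
                                           unsigned long nr_pages)
        {
                struct pglist_data *pgdat = NODE_DATA(nid);
                struct zone *zone = &pgdat->node_zones[ZONE_NORMAL];

                if (online_type == MMOP_ONLINE_MOVABLE)
                        zone = &pgdat->node_zones[ZONE_MOVABLE];

                if (!allow_online_pfn_range(nid, start_pfn, nr_pages,
                                            online_type))
                        return NULL;  /* would violate the two rules above */

                /* resize zone/pgdat spans, link pages to the zone/node */
                move_pfn_range_to_zone(zone, start_pfn, nr_pages);
                return zone;
        }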
      
      devm_memremap_pages is the only user of arch_add_memory which relies on
      the zone association, because it hooks into the memory hotplug path only
      half way.  It uses arch_add_memory to associate the new memory with
      ZONE_DEVICE but doesn't allow it to be {on,off}lined via sysfs.  This
      means that this particular code path has to call move_pfn_range_to_zone
      explicitly.
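
      That explicit call could look roughly like this (a sketch; argument
      handling simplified):

        /* hotplug the range without memblocks, then associate it with
         * ZONE_DEVICE by hand since it is never onlined via sysfs */
        arch_add_memory(nid, align_start, align_size, false);
        move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
                               align_start >> PAGE_SHIFT,
                               align_size >> PAGE_SHIFT);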
      
      The original zone shifting code is kept in place and will be removed in
      the follow up patch for an easier review.
      
      Please note that this patch also changes the original behavior:
      offlining a memory block adjacent to another zone (Normal vs. Movable)
      used to allow changing its movable type.  This will be handled later.
      
      [richard.weiyang@gmail.com: simplify zone_intersects()]
        Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
      [richard.weiyang@gmail.com: remove duplicate call for set_page_links]
        Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
      [akpm@linux-foundation.org: remove unused local `i']
      Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # For s390 bits
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1dd2cd1
    • mm, memory_hotplug: get rid of is_zone_device_section · 1b862aec
      Michal Hocko committed
      Device memory hotplug hooks into regular memory hotplug only half way.
      It needs memory sections to track struct pages but there is no
      need/desire to associate those sections with memory blocks and export
      them to the userspace via sysfs because they cannot be onlined anyway.
      
      This is currently expressed by the for_device argument to arch_add_memory
      which then makes sure to associate the given memory range with
      ZONE_DEVICE.  register_new_memory then relies on is_zone_device_section
      to distinguish special memory hotplug from the regular one.  While this
      works now, later patches in this series want to move __add_zone outside
      of the arch_add_memory path, so we have to come up with something else.
      
      Add want_memblock down the __add_pages path and use it to control
      whether the section->memblock association should be done.
      arch_add_memory then trivially wants a memblock for everything except
      for_device hotplug.
      
      remove_memory_section doesn't need is_zone_device_section either.  We
      can simply skip all the memblock-specific cleanup if there is no
      memblock for the given section.
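
      A sketch of that cleanup path (structure assumed from the description):

        static int remove_memory_section(unsigned long node_id,
                                         struct mem_section *section,
                                         int phys_device)
        {
                struct memory_block *mem = find_memory_block(section);

                /* device memory never had a memblock: nothing to tear down */
                if (!mem)
                        return 0;

                /* ... usual memblock-specific unregistration ... */
                return 0;
        }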
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b862aec
  2. 29 June 2017, 1 commit
  3. 28 June 2017, 1 commit
  4. 26 June 2017, 1 commit
    • powerpc/32: Avoid miscompilation w/GCC 4.6.3 - don't inline copy_to/from_user() · d6bd8194
      Michael Ellerman committed
      Larry Finger reported that his PowerBook G4 was no longer booting with
      v4.12-rc; userspace came up but gave weird errors such as:
      
        udevd[64]: starting version 175
        udevd[64]: Unable to receive ctrl message: Bad address.
        modprobe: chdir(4.12-rc1): No such file or directory
      
      He bisected the problem to commit 3448890c ("powerpc: get rid of zeroing,
      switch to RAW_COPY_USER").
      
      Al identified that the problem is actually a miscompilation by GCC 4.6.3, which
      is exposed by the above commit.
      
      Al also pointed out that inlining copy_to/from_user() is probably of
      little or no benefit, which is correct.  Using Anton's copy_to_user
      benchmark, with a pathological single-byte copy, we see a small
      performance improvement from *removing* inlining:
      
        Before (inlined):
        # time ./copy_to_user -w -l 1 -i 10000000	( x 3 )
        real	0m22.063s
        real	0m22.059s
        real	0m22.076s
      
        After:
        # time ./copy_to_user -w -l 1 -i 10000000	( x 3 )
        real	0m21.325s
        real	0m21.299s
        real	0m21.364s
      
      So as a small performance improvement and to avoid the miscompilation, drop
      inlining copy_to/from_user() on 32-bit.
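
      The shape of the change, sketched (the wrappers move from static
      inline in the uaccess header to out-of-line definitions; exact bodies
      assumed):

        /* previously: static inline ... in asm/uaccess.h */
        unsigned long raw_copy_to_user(void __user *to, const void *from,
                                       unsigned long n)
        {
                return __copy_tofrom_user(to,
                                (__force const void __user *)from, n);
        }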
      
      Fixes: 3448890c ("powerpc: get rid of zeroing, switch to RAW_COPY_USER")
      Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      d6bd8194
  5. 23 June 2017, 2 commits
  6. 22 June 2017, 1 commit
  7. 19 June 2017, 1 commit
    • mm: larger stack guard gap, between vmas · 1be7107f
      Hugh Dickins committed
      The stack guard page is a useful feature that reduces the risk of the
      stack smashing into a different mapping.  We have been using a single
      page gap, which is sufficient to prevent having the stack adjacent to a
      different mapping.  But this seems insufficient in light of the stack
      usage in userspace.  E.g. glibc uses alloca() allocations as large as
      64kB in many commonly used functions.  Others use constructs like
      gid_t buffer[NGROUPS_MAX], which is 256kB, or stack strings with
      MAX_ARG_STRLEN.
      
      This is especially dangerous for suid binaries with the default
      unlimited stack size, because such applications can be tricked into
      consuming a large portion of the stack so that a single glibc call
      jumps over the guard page.  These attacks are not theoretical,
      unfortunately.
      
      Make those attacks less probable by increasing the stack guard gap
      to 1MB (on systems with 4k pages; but make it depend on the page size,
      because systems with larger base pages might cap stack allocations in
      PAGE_SIZE units), which should cover larger alloca() and VLA stack
      allocations.  It is obviously not a full fix because the problem is
      somewhat inherent, but it should reduce the attack space a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
      
      Implementation-wise, first delete all the old code for the stack guard
      page: although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      
      Instead of keeping the gap inside the stack vma, maintain the stack
      guard gap as a gap between vmas: using vm_start_gap() in place of
      vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just
      those few places which need to respect the gap - mainly
      arch_get_unmapped_area(), and the vma tree's subtree_gap support for
      that.
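
      A sketch of the helper this introduces (name from the text; underflow
      handling assumed):

        static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
        {
                unsigned long vm_start = vma->vm_start;

                if (vma->vm_flags & VM_GROWSDOWN) {
                        vm_start -= stack_guard_gap;
                        if (vm_start > vma->vm_start)  /* underflow */
                                vm_start = 0;
                }
                return vm_start;
        }
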
      Original-patch-by: Oleg Nesterov <oleg@redhat.com>
      Original-patch-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1be7107f
  8. 16 June 2017, 8 commits
    • powerpc/perf: Fix oops when kthread execs user process · bf05fc25
      Ravi Bangoria committed
      When a kthread calls call_usermodehelper() the steps are:
        1. allocate current->mm
        2. load_elf_binary()
        3. populate current->thread.regs
      
      While doing this, interrupts are not disabled.  If a perf interrupt
      arrives in the middle of this process (i.e. step 1 has completed but
      step 3 has not yet been reached) and perf tries to read userspace
      regs, the kernel oopses with the following log:
      
        Unable to handle kernel paging request for data at address 0x00000000
        Faulting instruction address: 0xc0000000000da0fc
        ...
        Call Trace:
        perf_output_sample_regs+0x6c/0xd0
        perf_output_sample+0x4e4/0x830
        perf_event_output_forward+0x64/0x90
        __perf_event_overflow+0x8c/0x1e0
        record_and_restart+0x220/0x5c0
        perf_event_interrupt+0x2d8/0x4d0
        performance_monitor_exception+0x54/0x70
        performance_monitor_common+0x158/0x160
        --- interrupt: f01 at avtab_search_node+0x150/0x1a0
            LR = avtab_search_node+0x100/0x1a0
        ...
        load_elf_binary+0x6e8/0x15a0
        search_binary_handler+0xe8/0x290
        do_execveat_common.isra.14+0x5f4/0x840
        call_usermodehelper_exec_async+0x170/0x210
        ret_from_kernel_thread+0x5c/0x7c
      
      Fix it by setting abi to PERF_SAMPLE_REGS_ABI_NONE when userspace
      pt_regs are not set.
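
      A sketch of the fix (presumably in the powerpc perf_get_regs_user()
      path; simplified):

        void perf_get_regs_user(struct perf_regs *regs_user,
                                struct pt_regs *regs,
                                struct pt_regs *regs_user_copy)
        {
                /* current->thread.regs is NULL until step 3 above */
                regs_user->regs = task_pt_regs(current);
                regs_user->abi = regs_user->regs ? perf_reg_abi(current)
                                                 : PERF_SAMPLE_REGS_ABI_NONE;
        }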
      
      Fixes: ed4a4ef8 ("powerpc/perf: Add support for sampling interrupt register state")
      Cc: stable@vger.kernel.org # v4.7+
      Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
      Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bf05fc25
    • powerpc/64s: Handle data breakpoints in Radix mode · d89ba535
      Naveen N. Rao committed
      On Power9, trying to use data breakpoints throws the splat shown
      below. This is because the check for a data breakpoint in DSISR is in
      do_hash_page(), which is not called when in Radix mode.
      
        Unable to handle kernel paging request for data at address 0xc000000000e19218
        Faulting instruction address: 0xc0000000001155e8
        cpu 0x0: Vector: 300 (Data Access) at [c0000000ef1e7b20]
        pc: c0000000001155e8: find_pid_ns+0x48/0xe0
        lr: c000000000116ac4: find_task_by_vpid+0x44/0x90
        sp: c0000000ef1e7da0
        msr: 9000000000009033
        dar: c000000000e19218
        dsisr: 400000
      
      Move the check to handle_page_fault() so as to catch data breakpoints
      in both Hash and Radix MMU modes.
      
      We have to change the check in do_hash_page() against 0xa410 to use
      0xa450, so as to include the value of (DSISR_DABRMATCH >> 16).
      
      There are two sites that call handle_page_fault() when in Radix, both
      already pass DSISR in r4.
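
      In asm terms, the moved check might look like this (a sketch; r4
      holds DSISR at both call sites):

        handle_page_fault:
                andis.  r0,r4,DSISR_DABRMATCH@h
                bne-    handle_dabr_fault       /* data breakpoint hit */
                ...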
      
      Fixes: caca285e ("powerpc/mm/radix: Use STD_MMU_64 to properly isolate hash related code")
      Cc: stable@vger.kernel.org # v4.7+
      Reported-by: Shriya R. Kulkarni <shriykul@in.ibm.com>
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      [mpe: Fix the fall-through case on hash, we need to reload DSISR]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      d89ba535
    • powerpc/kprobes: Skip livepatch_handler() for jprobes · c05b8c44
      Naveen N. Rao committed
      ftrace_caller() depends on a modified regs->nip to detect if a certain
      function has been livepatched. However, with KPROBES_ON_FTRACE, it is
      possible for regs->nip to have been modified by the kprobes pre_handler
      (jprobes, for instance). In this case, we do not want to invoke the
      livepatch_handler so as not to consume the livepatch stack.
      
      To distinguish between the two (kprobes and livepatch), we check if
      there is an active kprobe on the current function. If there is, then we
      know for sure that it must have modified the NIP as we don't support
      livepatching a kprobe'd function. In this case, we simply skip the
      livepatch_handler and branch to the new NIP. Otherwise, the
      livepatch_handler is invoked.
      
      Fixes: ead514d5 ("powerpc/kprobes: Add support for KPROBES_ON_FTRACE")
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c05b8c44
    • powerpc/ftrace: Pass the correct stack pointer for DYNAMIC_FTRACE_WITH_REGS · a4979a7e
      Naveen N. Rao committed
      For DYNAMIC_FTRACE_WITH_REGS, we should be passing in the original set
      of registers in pt_regs, to capture the state _before_ ftrace_caller.
      However, we are instead passing the stack pointer *after* allocating a
      stack frame in ftrace_caller.  Fix this by saving the proper value of r1
      in pt_regs.  Also, use SAVE_10GPRS() to simplify the code.
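
      A sketch of the fix (powerpc asm; symbolic offsets assumed):

        /* pt_regs should record r1 as it was before ftrace_caller
         * allocated its own stack frame */
        addi    r8, r1, SWITCH_FRAME_SIZE       /* undo our frame */
        std     r8, GPR1(r1)                    /* store into pt_regs */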
      
      Fixes: 15308664 ("powerpc/ftrace: Add support for -mprofile-kernel ftrace ABI")
      Cc: stable@vger.kernel.org # v4.6+
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a4979a7e
    • powerpc/kprobes: Pause function_graph tracing during jprobes handling · a9f8553e
      Naveen N. Rao committed
      This fixes a crash when function_graph and jprobes are used together.
      This is essentially commit 237d28db ("ftrace/jprobes/x86: Fix
      conflict between jprobes and function graph tracing"), but for powerpc.
      
      Jprobes breaks function_graph tracing since the jprobe hook needs to use
      jprobe_return(), which never returns back to the hook, but instead to
      the original jprobe'd function. The solution is to momentarily pause
      function_graph tracing before invoking the jprobe hook and re-enable it
      when returning back to the original jprobe'd function.
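
      A simplified sketch of the fix, mirroring the x86 commit referenced
      above (handler names per the powerpc kprobes code; bodies condensed):

        static int setjmp_pre_handler(struct kprobe *p, struct pt_regs *regs)
        {
                /* ... save regs, redirect NIP to the jprobe hook ... */
                pause_graph_tracing();  /* hook exits via jprobe_return() */
                return 1;
        }

        static int longjmp_break_handler(struct kprobe *p,
                                         struct pt_regs *regs)
        {
                unpause_graph_tracing();  /* back in the jprobe'd function */
                /* ... restore the saved regs ... */
                return 1;
        }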
      
      Fixes: 6794c782 ("powerpc64: port of the function graph tracer")
      Cc: stable@vger.kernel.org # v2.6.30+
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a9f8553e
    • powerpc/debug: Add missing warn flag to WARN_ON's non-builtin path · a093c92d
      Alexey Kardashevskiy committed
      When trapped on WARN_ON(), report_bug() is expected to return
      BUG_TRAP_TYPE_WARN so the caller will increment NIP by 4 and continue.
      The __builtin_constant_p() path of PPC's WARN_ON()
      calls (indirectly) __WARN_FLAGS(), which has BUGFLAG_WARNING set;
      however, the other branch does not, which makes report_bug() report a
      bug rather than a warning.
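
      The essence of the fix, sketched (macro internals abbreviated):

        /* non-__builtin_constant_p() branch of powerpc's WARN_ON(): the
         * bug-table entry must carry BUGFLAG_WARNING too */
        BUG_ENTRY(PPC_TLNEI " %4, 0",
                  BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),
                  "r" (__ret_warn_on));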
      
      Fixes: f26dee15 ("debug: Avoid setting BUGFLAG_WARNING twice")
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a093c92d
    • KVM: PPC: Book3S HV: Ignore timebase offset on POWER9 DD1 · 3d3efb68
      Paul Mackerras committed
      POWER9 DD1 has an erratum where writing to the TBU40 register, which
      is used to apply an offset to the timebase, can cause the timebase to
      lose counts.  This results in the timebase on some CPUs getting out of
      sync with other CPUs, which then results in misbehaviour of the
      timekeeping code.
      
      To work around the problem, we make KVM ignore the timebase offset for
      all guests on POWER9 DD1 machines.  This means that live migration
      cannot be supported on POWER9 DD1 machines.
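
      A sketch of the workaround (placement and form assumed):

        /* POWER9 DD1: writing TBU40 can cause the timebase to lose
         * counts, so never apply a guest timebase offset here */
        if (cpu_has_feature(CPU_FTR_POWER9_DD1))
                vc->tb_offset = 0;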
      
      Cc: stable@vger.kernel.org # v4.10+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      3d3efb68
    • KVM: PPC: Book3S HV: Save/restore host values of debug registers · 7ceaa6dc
      Paul Mackerras committed
      At present, HV KVM on POWER8 and POWER9 machines loses any instruction
      or data breakpoint set in the host whenever a guest is run.
      Instruction breakpoints are currently only used by xmon, but ptrace
      and the perf_event subsystem can set data breakpoints as well as xmon.
      
      To fix this, we save the host values of the debug registers (CIABR,
      DAWR and DAWRX) before entering the guest and restore them on exit.
      To provide space to save them in the stack frame, we expand the stack
      frame allocated by kvmppc_hv_entry() from 112 to 144 bytes.
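
      The save side, sketched in asm (stack-slot names assumed):

        mfspr   r5, SPRN_CIABR
        mfspr   r6, SPRN_DAWR
        mfspr   r7, SPRN_DAWRX
        std     r5, STACK_SLOT_CIABR(r1)
        std     r6, STACK_SLOT_DAWR(r1)
        std     r7, STACK_SLOT_DAWRX(r1)
        /* ... with the reverse mtspr sequence on the guest-exit path */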
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      7ceaa6dc
  9. 15 June 2017, 3 commits
    • powerpc/xive: Fix offset for store EOI MMIOs · 25642705
      Benjamin Herrenschmidt committed
      Architecturally we should apply a 0x400 offset for these. Not doing
      it will break future HW implementations.
      
      The offset of 0 is supposed to remain for "triggers", though not all
      sources support both trigger and store EOI, and on P9 specifically
      some sources will treat 0 as a store EOI.  But future chips will not,
      so this makes us use the properly architected offset, which should
      always work.
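
      The architected page offsets described above, as a sketch:

        #define XIVE_ESB_TRIGGER        0x000   /* trigger page offset */
        #define XIVE_ESB_STORE_EOI      0x400   /* store EOI MMIO offset */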
      
      Fixes: 243e2511 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      25642705
    • KVM: PPC: Book3S HV: Preserve userspace HTM state properly · 46a704f8
      Paul Mackerras committed
      If userspace attempts to call the KVM_RUN ioctl when it has hardware
      transactional memory (HTM) enabled, the values that it has put in the
      HTM-related SPRs TFHAR, TFIAR and TEXASR will get overwritten by
      guest values.  To fix this, we detect this condition and save those
      SPR values in the thread struct, and disable HTM for the task.  If
      userspace goes to access those SPRs or the HTM facility in future,
      a TM-unavailable interrupt will occur and the handler will reload
      those SPRs and re-enable HTM.
      
      If userspace has started a transaction and suspended it, we would
      currently lose the transactional state in the guest entry path and
      would almost certainly get a "TM Bad Thing" interrupt, which would
      cause the host to crash.  To avoid this, we detect this case and
      return from the KVM_RUN ioctl with an EINVAL error, with the KVM
      exit reason set to KVM_EXIT_FAIL_ENTRY.
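
      A condensed sketch of the entry-path check being described (field
      names assumed):

        if (current->thread.regs->msr & MSR_TM) {
                if (MSR_TM_ACTIVE(current->thread.regs->msr))
                        return -EINVAL; /* suspended transaction: refuse */

                /* save the userspace TM SPRs, then disable TM for the
                 * task; a TM-unavailable interrupt restores them later */
                current->thread.tm_tfhar = mfspr(SPRN_TFHAR);
                current->thread.tm_tfiar = mfspr(SPRN_TFIAR);
                current->thread.tm_texasr = mfspr(SPRN_TEXASR);
                current->thread.regs->msr &= ~MSR_TM;
        }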
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      46a704f8
    • KVM: PPC: Book3S HV: Restore critical SPRs to host values on guest exit · 4c3bb4cc
      Paul Mackerras committed
      This restores several special-purpose registers (SPRs) to sane values
      on guest exit that were missed before.
      
      TAR and VRSAVE are readable and writable by userspace, and we need to
      save and restore them to prevent the guest from potentially affecting
      userspace execution (not that TAR or VRSAVE are used by any known
      program that uses the KVM_RUN ioctl).  We save/restore these
      in kvmppc_vcpu_run_hv() rather than on every guest entry/exit.
      
      FSCR affects userspace execution in that it can prohibit access to
      certain facilities by userspace.  We restore it to the normal value
      for the task on exit from the KVM_RUN ioctl.
      
      IAMR is normally 0, and is restored to 0 on guest exit.  However,
      with a radix host on POWER9, it is set to a value that prevents the
      kernel from executing user-accessible memory.  On POWER9, we save
      IAMR on guest entry and restore it on guest exit to the saved value
      rather than 0.  On POWER8 we continue to set it to 0 on guest exit.
      
      PSPB is normally 0.  We restore it to 0 on guest exit to prevent
      userspace taking advantage of the guest having set it non-zero
      (which would allow userspace to set its SMT priority to high).
      
      UAMOR is normally 0.  We restore it to 0 on guest exit to prevent
      the AMR from being used as a covert channel between userspace
      processes, since the AMR is not context-switched at present.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      4c3bb4cc
  10. 14 June 2017, 2 commits
    • powerpc/npu-dma: Remove spurious WARN_ON when a PCI device has no of_node · 377aa6b0
      Alistair Popple committed
      Commit 4c3b89ef ("powerpc/powernv: Add sanity checks to
      pnv_pci_get_{gpu|npu}_dev") introduced explicit warnings in
      pnv_pci_get_npu_dev() when a PCIe device has no associated device-tree
      node.  However, not all PCIe devices have an of_node, and
      pnv_pci_get_npu_dev() gets indirectly called at least once for every
      PCIe device in the system.  This results in spurious WARN_ON()s, so
      remove it.
      
      The same situation should not exist for pnv_pci_get_gpu_dev() as any
      NPU based PCIe device requires a device-tree node.
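
      The change amounts to roughly (sketch):

        /* in pnv_pci_get_npu_dev(): not every PCIe device has an of_node */
        if (!dn)
                return NULL;    /* was: if (WARN_ON(!dn)) return NULL; */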
      
      Fixes: 4c3b89ef ("powerpc/powernv: Add sanity checks to pnv_pci_get_{gpu|npu}_dev")
      Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      377aa6b0
    • ibmvnic: Client-initiated failover · 40c9db8a
      Thomas Falcon committed
      The IBM vNIC protocol provides support for the user to initiate
      a failover from the client LPAR in case the current backing infrastructure
      is deemed inadequate or in an error state.
      
      Support for two H_VIOCTL sub-commands for vNIC devices is required
      to implement this function.  These commands are H_GET_SESSION_TOKEN
      and H_SESSION_ERR_DETECTED.
      
      "[H_GET_SESSION_TOKEN] is used to obtain a session token from a VNIC client
      adapter.  This token is opaque to the caller and is intended to be used in
      tandem with the SESSION_ERROR_DETECTED vioctl subfunction."
      
      "[H_SESSION_ERR_DETECTED] is used to report that the currently active
      backing device for a VNIC client adapter is behaving poorly, and that
      the hypervisor should attempt to fail over to a different backing device,
      if one is available."
      
      To give tools access to this functionality, the vNIC driver creates a
      sysfs file that, when written to, sends a request to pHyp to fail over
      to a different backing device.
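
      Hypothetical usage of that sysfs file (the exact path depends on the
      adapter's device location):

        # ask the hypervisor to fail over to another backing device
        echo 1 > /sys/class/net/ethX/device/failover
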
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      40c9db8a
  11. 13 June 2017, 2 commits
    • x86/mm/gup: Switch GUP to the generic get_user_pages_fast() implementation · e585513b
      Kirill A. Shutemov committed
      This patch provides all the callbacks required by the generic
      get_user_pages_fast() code, switches x86 over to it, and removes
      the platform-specific implementation.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20170606113133.22974-2-kirill.shutemov@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e585513b
    • KVM: PPC: Book3S HV: Context-switch EBB registers properly · ca8efa1d
      Paul Mackerras committed
      This adds code to save the values of three SPRs (special-purpose
      registers) used by userspace to control event-based branches (EBBs),
      which are essentially interrupts that get delivered directly to
      userspace.  These registers are loaded up with guest values when
      entering the guest, and their values are saved when exiting the
      guest, but we were not saving the host values and restoring them
      before going back to userspace.
      
      On POWER8 this would only affect userspace programs which explicitly
      request the use of EBBs and also use the KVM_RUN ioctl, since the
      only source of EBBs on POWER8 is the PMU, and there is an explicit
      enable bit in the PMU registers (and those PMU registers do get
      properly context-switched between host and guest).  On POWER9 there
      is provision for externally-generated EBBs, and these are not subject
      to the control in the PMU registers.
      
      Since these registers only affect userspace, we can save them when
      we first come in from userspace and restore them before returning to
      userspace, rather than saving/restoring the host values on every
      guest entry/exit.  Similarly, we don't need to worry about their
      values on offline secondary threads since they execute in the context
      of the idle task, which never executes in userspace.
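
      Sketched (SPR names per the EBB facility; storage assumed):

        /* on first entry from userspace */
        ebb_regs[0] = mfspr(SPRN_EBBHR);
        ebb_regs[1] = mfspr(SPRN_EBBRR);
        ebb_regs[2] = mfspr(SPRN_BESCR);
        /* ... run the guest ... */
        /* before returning to userspace */
        mtspr(SPRN_EBBHR, ebb_regs[0]);
        mtspr(SPRN_EBBRR, ebb_regs[1]);
        mtspr(SPRN_BESCR, ebb_regs[2]);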
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      ca8efa1d
  12. 12 June 2017, 2 commits
    • powerpc: vio_cmo: use dev_groups and not dev_attrs for bus_type · 205a1ee1
      Greg Kroah-Hartman committed
      The dev_attrs field has long been deprecated and is finally being
      removed, so move the driver to use the "correct" dev_groups field
      instead for struct bus_type.
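
      The conversion pattern in this and the sibling commits below is
      roughly (a sketch with a hypothetical attribute):

        static DEVICE_ATTR_RO(name);            /* hypothetical attribute */

        static struct attribute *vio_cmo_dev_attrs[] = {
                &dev_attr_name.attr,
                NULL,
        };
        ATTRIBUTE_GROUPS(vio_cmo_dev);          /* emits vio_cmo_dev_groups */

        static struct bus_type vio_bus_type = {
                .name           = "vio",
                .dev_groups     = vio_cmo_dev_groups,   /* was: .dev_attrs */
        };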
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Johan Hovold <johan@kernel.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: <linuxppc-dev@lists.ozlabs.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      205a1ee1
    • powerpc: vio: use dev_groups and not dev_attrs for bus_type · 451e3f1a
      Greg Kroah-Hartman committed
      The dev_attrs field has long been deprecated and is finally being
      removed, so move the driver to use the "correct" dev_groups field
      instead for struct bus_type.
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Johan Hovold <johan@kernel.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: <linuxppc-dev@lists.ozlabs.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      451e3f1a
  13. 09 June 2017, 4 commits
    • tty: add TIOCGPTPEER ioctl · 54ebbfb1
      Aleksa Sarai committed
      When opening the slave end of a PTY, it is not possible for userspace to
      safely ensure that /dev/pts/$num is actually a slave (in cases where the
      mount namespace in which devpts was mounted is controlled by an
      untrusted process). In addition, there are several unresolvable
      race conditions if userspace were to attempt to detect attacks through
      stat(2) and other similar methods [and it is not clear how userspace
      could detect attacks involving FUSE].
      
      Resolve this by providing an interface for userspace to safely open the
      "peer" end of a PTY file descriptor by using the dentry cached by
      devpts.  Since it is not possible to have an open master PTY without
      having its slave exposed in /dev/pts, this interface is safe.  This
      interface currently does not provide a way to get the master pty (since
      it is not clear whether such an interface is safe or even useful).
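
      Hypothetical userspace usage of the new ioctl (flags as with open(2)):

        #include <fcntl.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>

        int master = posix_openpt(O_RDWR | O_NOCTTY);
        grantpt(master);
        unlockpt(master);
        /* open the slave via the master fd, never touching a /dev/pts path */
        int slave = ioctl(master, TIOCGPTPEER, O_RDWR | O_NOCTTY);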
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Valentin Rothberg <vrothberg@suse.com>
      Signed-off-by: Aleksa Sarai <asarai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      54ebbfb1
    • powerpc: ibmebus: use dev_groups and not dev_attrs for bus_type · 823a44ac
      Greg Kroah-Hartman committed
      The dev_attrs field has long been deprecated and is finally being
      removed, so move the driver to use the "correct" dev_groups field
      instead for struct bus_type.
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Johan Hovold <johan@kernel.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Lars-Peter Clausen <lars@metafoo.de>
      Cc: <linuxppc-dev@lists.ozlabs.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      823a44ac
    • powerpc: ps3: use dev_groups and not dev_attrs for bus_type · 892533e1
      Greg Kroah-Hartman committed
      The dev_attrs field has long been deprecated and is finally being
      removed, so move the driver to use the "correct" dev_groups field
      instead for struct bus_type.
      
      Seems-ok: Geoff Levand <geoff@infradead.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: <linuxppc-dev@lists.ozlabs.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      892533e1
    • security/keys: add CONFIG_KEYS_COMPAT to Kconfig · 47b2c3ff
      Bilal Amarni committed
      CONFIG_KEYS_COMPAT is defined in arch-specific Kconfigs and is missing
      for several 64-bit architectures: mips, parisc, tile.

      At the moment, on those architectures, calling the keyctl syscall from
      32-bit userspace returns an ENOSYS error.
      
      This patch moves the CONFIG_KEYS_COMPAT option to security/keys/Kconfig, to
      make sure the compatibility wrapper is registered by default for any 64-bit
      architecture as long as it is configured with CONFIG_COMPAT.
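
      The moved option, as described (a sketch of the Kconfig entry):

        config KEYS_COMPAT
                def_bool y
                depends on COMPAT && KEYS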
      
      [DH: Modified to remove arm64 compat enablement also as requested by Eric
       Biggers]
      Signed-off-by: Bilal Amarni <bilal.amarni@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      cc: Eric Biggers <ebiggers3@gmail.com>
      Signed-off-by: NJames Morris <james.l.morris@oracle.com>
      47b2c3ff
  14. 08 June 2017, 2 commits
  15. 07 June 2017, 2 commits
  16. 06 June 2017, 3 commits
    • powerpc/perf: Fix Power9 test_adder fields · 8c218578
      Madhavan Srinivasan committed
      Commit 8d911904 ("powerpc/perf: Add restrictions to PMC5 in power9 DD1")
      was added to restrict the use of PMC5 in Power9 DD1.  The intention was
      to disable the use of PMC5 via raw event codes.  But instead of updating
      the power9_isa207_pmu structure (used on DD1), the commit incorrectly
      updated the power9_pmu structure.  Fix it.
      
      Fixes: 8d911904 ("powerpc/perf: Add restrictions to PMC5 in power9 DD1")
      Reported-by: Shriya <shriyak@linux.vnet.ibm.com>
      Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Tested-by: Shriya <shriyak@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8c218578
    • powerpc/numa: Fix percpu allocations to be NUMA aware · ba4a648f
      Michael Ellerman committed
      In commit 8c272261 ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), we
      switched to the generic implementation of cpu_to_node(), which uses a percpu
      variable to hold the NUMA node for each CPU.
      
      Unfortunately we neglected to notice that we use cpu_to_node() in the
      allocation of our percpu areas, leading to a chicken-and-egg problem.
      In practice, what happens is that when we are setting up the percpu
      areas, cpu_to_node() reports that all CPUs are on node 0, so we
      allocate all percpu areas on node 0.
      
      This is visible in the dmesg output, with all pcpu allocs being in
      group 0:
      
        pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
        pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
        pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
        pcpu-alloc: [0] 24 25 26 27 [0] 28 29 30 31
        pcpu-alloc: [0] 32 33 34 35 [0] 36 37 38 39
        pcpu-alloc: [0] 40 41 42 43 [0] 44 45 46 47
      
      To fix it we need an early_cpu_to_node() which can run prior to percpu
      being set up.  We already have the numa_cpu_lookup_table we can use, so
      just plumb it in.  With the patch, dmesg output shows two groups, 0 and 1:
      
        pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
        pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
        pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
        pcpu-alloc: [1] 24 25 26 27 [1] 28 29 30 31
        pcpu-alloc: [1] 32 33 34 35 [1] 36 37 38 39
        pcpu-alloc: [1] 40 41 42 43 [1] 44 45 46 47
      
      We can also check the data_offset in the paca of various CPUs; with the
      fix we see:
      
        CPU 0:  data_offset = 0x0ffe8b0000
        CPU 24: data_offset = 0x1ffe5b0000
      
      And we can see from dmesg that CPU 24 has an allocation on node 1:
      
        node   0: [mem 0x0000000000000000-0x0000000fffffffff]
        node   1: [mem 0x0000001000000000-0x0000001fffffffff]
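
      A sketch of early_cpu_to_node(), built on the numa_cpu_lookup_table
      mentioned above:

        static inline int early_cpu_to_node(int cpu)
        {
                int nid = numa_cpu_lookup_table[cpu];

                /* fall back to node 0 so that callers can always do
                 * NODE_DATA(early_cpu_to_node(cpu)) safely */
                return (nid < 0) ? 0 : nid;
        }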
      
      Cc: stable@vger.kernel.org # v3.16+
      Fixes: 8c272261 ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ba4a648f
    • powerpc/kernel: Initialize load_tm on task creation · 7f22ced4
      Breno Leitao committed
      Currently tsk->thread.load_tm is not initialized at task creation
      and can contain garbage on a new task.

      This is undesired behaviour, since it affects the timing of enabling
      and disabling the transactional memory laziness (disabling and enabling
      the MSR TM bit, which affects TM reclaim and recheckpoint in the
      scheduling process).
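
      The fix amounts to zeroing the counter at task creation (a sketch,
      e.g. in copy_thread()):

        p->thread.load_tm = 0;  /* no garbage in the lazy-TM counter */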
      
      Fixes: 5d176f75 ("powerpc: tm: Enable transactional memory (TM) lazily for userspace")
      Cc: stable@vger.kernel.org # v4.9+
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      7f22ced4
  17. 05 June 2017, 1 commit
    • powerpc/kernel: Fix FP and vector register restoration · 1195892c
      Breno Leitao committed
      Currently tsk->thread.load_vec and load_fp are not initialized during
      task creation, which can lead to garbage (non-zero) values in these
      variables.

      These variables are checked later in restore_math() to decide whether
      the FP and vector registers are being used.  Since these values might be
      non-zero, restore_math() will continue to restore the FP and vector
      state even if it was never used by the userspace application.  The
      load_fp and load_vec counters will then overflow (they wrap at 255) and
      FP and Altivec will finally be disabled, but before that condition is
      reached (counter overflow) several context switches will have restored
      FP and vector registers without need, causing performance degradation.
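
      Analogously to the previous fix, a sketch (e.g. in start_thread()):

        current->thread.load_fp = 0;
        current->thread.load_vec = 0;   /* if CONFIG_ALTIVEC */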
      
      Fixes: 70fe3d98 ("powerpc: Restore FPU/VEC/VSX if previously used")
      Cc: stable@vger.kernel.org # v4.6+
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Gustavo Romero <gusbromero@gmail.com>
      Acked-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      1195892c
  18. 02 June 2017, 1 commit