1. 02 2月, 2015 1 次提交
    • R
      cxl: Fix device_node reference counting · 6f963ec2
      Ryan Grimm 提交于
      When unbinding and rebinding the driver on a system with a card in PHB0, this
      error condition is reached after a few attempts:
      
      ERROR: Bad of_node_put() on /pciex@3fffe40000000
      CPU: 0 PID: 3040 Comm: bash Not tainted 3.18.0-rc3-12545-g3627ffe #152
      Call Trace:
      [c000000721acb5c0] [c00000000086ef94] .dump_stack+0x84/0xb0 (unreliable)
      [c000000721acb640] [c00000000073a0a8] .of_node_release+0xd8/0xe0
      [c000000721acb6d0] [c00000000044bc44] .kobject_release+0x74/0xe0
      [c000000721acb760] [c0000000007394fc] .of_node_put+0x1c/0x30
      [c000000721acb7d0] [c000000000545cd8] .cxl_probe+0x1a98/0x1d50
      [c000000721acb900] [c0000000004845a0] .local_pci_probe+0x40/0xc0
      [c000000721acb980] [c000000000484998] .pci_device_probe+0x128/0x170
      [c000000721acba30] [c00000000052400c] .driver_probe_device+0xac/0x2a0
      [c000000721acbad0] [c000000000522468] .bind_store+0x108/0x160
      [c000000721acbb70] [c000000000521448] .drv_attr_store+0x38/0x60
      [c000000721acbbe0] [c000000000293840] .sysfs_kf_write+0x60/0xa0
      [c000000721acbc50] [c000000000292500] .kernfs_fop_write+0x140/0x1d0
      [c000000721acbcf0] [c000000000208648] .vfs_write+0xd8/0x260
      [c000000721acbd90] [c000000000208b18] .SyS_write+0x58/0x100
      [c000000721acbe30] [c000000000009258] syscall_exit+0x0/0x98
      
      We are missing a call to of_node_get(). pnv_pci_to_phb_node() should
      call of_node_get() otherwise np's reference count isn't incremented and
      it might go away. Rename pnv_pci_to_phb_node() to pnv_pci_get_phb_node()
      so it's clear it calls of_node_get().
      Signed-off-by: NRyan Grimm <grimm@linux.vnet.ibm.com>
      Acked-by: NIan Munsie <imunsie@au1.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      6f963ec2
  2. 30 1月, 2015 4 次提交
  3. 28 1月, 2015 1 次提交
    • M
      powerpc: Remove some unused functions · 8aa989b8
      Michael Ellerman 提交于
      Remove slice_set_psize() which is not used.
      
      It was added in 3a8247cc "powerpc: Only demote individual slices
      rather than whole process" but was never used.
      
      Remove vsx_assist_exception() which is not used.
      
      It was added in ce48b210 "powerpc: Add VSX context save/restore,
      ptrace and signal support" but was never used.
      
      Remove generic_mach_cpu_die() which is not used.
      
      Its last caller was removed in 375f561a "powerpc/powernv: Always go
      into nap mode when CPU is offline".
      
      Remove mpc7448_hpc2_power_off() and mpc7448_hpc2_halt() which are
      unused.
      
      These were introduced in c5d56332 "[POWERPC] Add general support for
      mpc7448hpc2 (Taiga) platform" but were never used.
      
      This was partially found by using a static code analysis program called
      cppcheck.
      Signed-off-by: NRickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      [mpe: Update changelog with details on when/why they are unused]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8aa989b8
  4. 27 1月, 2015 1 次提交
    • C
      powerpc/pseries: Fix endian problems with LE migration · 3df76a9d
      Cyril Bur 提交于
      RTAS events require arguments be passed in big endian while hypercalls
      have their arguments passed in registers and the values should therefore
      be in CPU endian.
      
      The "ibm,suspend_me" 'RTAS' call makes a sequence of hypercalls to setup
      one true RTAS call. This means that "ibm,suspend_me" is handled
      specially in the ppc_rtas() syscall.
      
      The ppc_rtas() syscall has its arguments in big endian and can therefore
      pass these arguments directly to the RTAS call. "ibm,suspend_me" is
      handled specially from within ppc_rtas() (by calling rtas_ibm_suspend_me())
      which has left an endian bug on little endian systems due to the
      requirement of hypercalls. The return value from rtas_ibm_suspend_me()
      gets returned in cpu endian, and is left unconverted, also a bug on
      little endian systems.
      
      rtas_ibm_suspend_me() does not actually make use of the rtas_args that
      it is passed. This patch removes the convoluted use of the rtas_args
      struct to pass params to rtas_ibm_suspend_me() in favour of passing what
      it needs as actual arguments. This patch also ensures the two callers of
      rtas_ibm_suspend_me() pass function parameters in cpu endian and in the
      case of ppc_rtas(), converts the return value.
      
      migrate_store() (the other caller of rtas_ibm_suspend_me()) is from a
      sysfs file which deals with everything in cpu endian so this function
      only underwent cleanup.
      
      This patch has been tested with KVM both LE and BE and on PowerVM both
      LE and BE. Under QEMU/KVM the migration happens without touching these
      code pathes.
      
      For PowerVM there is no obvious regression on BE and the LE code path
      now provides the correct parameters to the hypervisor.
      Signed-off-by: NCyril Bur <cyrilbur@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      3df76a9d
  5. 23 1月, 2015 6 次提交
    • G
      powerpc/eeh: Allow to set maximal frozen times · 1b28f170
      Gavin Shan 提交于
      When PE's frozen count hits maximal allowed frozen times, which is
      5 currently, it will be forced to be offline permanently. Once the
      PE is removed permanently, rebooting machine is required to bring
      the PE back. It's not convienent when testing EEH functionality.
      
      The patch exports the maximal allowed frozen times through debugfs
      entry (/sys/kernel/debug/powerpc/eeh_max_freezes).
      Requested-by: NRyan Grimm <grimm@linux.vnet.ibm.com>
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      1b28f170
    • G
      powerpc/eeh: Introduce flag EEH_PE_REMOVED · 432227e9
      Gavin Shan 提交于
      The conditions that one specific PE's frozen count exceeds the maximal
      allowed times (EEH_MAX_ALLOWED_FREEZES) and it's in isolated or recovery
      state indicate the PE was removed permanently implicitly. The patch
      introduces flag EEH_PE_REMOVED to indicate that explicitly so that we
      don't depend on the fixed maximal allowed times, which can be varied as
      we do in subsequent patch.
      
      Flag EEH_PE_REMOVED is expected to be marked for the PE whose frozen
      count exceeds the maximal allowed times, or just failed from recovery.
      Requested-by: NRyan Grimm <grimm@linux.vnet.ibm.com>
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      432227e9
    • G
      powerpc/eeh: Fix missed PE#0 on P7IOC · 2aa5cf9e
      Gavin Shan 提交于
      PE#0 should be regarded as valid for P7IOC, while it's invalid for
      PHB3. The patch adds flag EEH_VALID_PE_ZERO to differentiate those
      two cases. Without the patch, we possibly see frozen PE#0 state is
      cleared without EEH recovery taken on P7IOC as following kernel logs
      indicate:
      
      [root@ltcfbl8eb ~]# dmesg
             :
      pci 0000:00     : [PE# 000] Secondary bus 0 associated with PE#0
      pci 0000:01     : [PE# 001] Secondary bus 1 associated with PE#1
      pci 0001:00     : [PE# 000] Secondary bus 0 associated with PE#0
      pci 0001:01     : [PE# 001] Secondary bus 1 associated with PE#1
      pci 0002:00     : [PE# 000] Secondary bus 0 associated with PE#0
      pci 0002:01     : [PE# 001] Secondary bus 1 associated with PE#1
      pci 0003:00     : [PE# 000] Secondary bus 0 associated with PE#0
      pci 0003:01     : [PE# 001] Secondary bus 1 associated with PE#1
      pci 0003:20     : [PE# 002] Secondary bus 32..63 associated with PE#2
             :
      EEH: Clear non-existing PHB#3-PE#0
      EEH: PHB location: U78AE.001.WZS00M9-P1-002
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      2aa5cf9e
    • N
      powerpc/kprobes: Fix kallsyms lookup across powerpc ABIv1 and ABIv2 · bf794bf5
      Naveen N. Rao 提交于
      Currently, all non-dot symbols are being treated as function descriptors
      in ABIv1. This is incorrect and is resulting in perf probe not working:
      
        # perf probe do_fork
        Added new event:
        Failed to write event: Invalid argument
          Error: Failed to add events.
        # dmesg | tail -1
        [192268.073063] Could not insert probe at _text+768432: -22
      
      perf probe bases all kernel probes on _text and writes,
      for example, "p:probe/do_fork _text+768432" to
      /sys/kernel/debug/tracing/kprobe_events. In-kernel, _text is being
      considered to be a function descriptor and is resulting in the above
      error.
      
      Fix this by changing how we lookup symbol addresses on ppc64. We first
      check for the dot variant of a symbol and look at the non-dot variant
      only if that fails. In this manner, we avoid having to look at the
      function descriptor.
      
      While at it, also separate out how this works on ABIv2 where
      we don't have dot symbols, but need to use the local entry point.
      Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      bf794bf5
    • M
      powerpc: Rename _TIF_SYSCALL_T_OR_A to _TIF_SYSCALL_DOTRACE · 10ea8343
      Michael Ellerman 提交于
      Once upon a time, at least 9 years ago (< 2.6.12), _TIF_SYSCALL_T_OR_A
      meant "TRACE or AUDIT". But these days it means TRACE or AUDIT or
      SECCOMP or TRACEPOINT or NOHZ.
      
      All of those are implemented via syscall_dotrace() so rename the flag to
      that to try and clarify things.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      10ea8343
    • M
      powerpc: Remove unused CPU_FTR_IABR · ed77d418
      Michael Ellerman 提交于
      We removed the last usage of CPU_FTR_IABR in commit 1ad7d705
      "powerpc/xmon: Enable HW instruction breakpoint on POWER8".
      
      Mark it as free.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      ed77d418
  6. 22 1月, 2015 1 次提交
  7. 18 12月, 2014 1 次提交
    • M
      powerpc/uaccess: Allow get_user() with bitwise types · 505e4283
      Michael S. Tsirkin 提交于
      At the moment, if p and x are both of the same bitwise type
      (eg. __le32), get_user(x, p) produces a sparse warning.
      
      This is because *p is loaded into a long then cast back to typeof(*p).
      
      When typeof(*p) is a bitwise type (which is uncommon), such a cast needs
      __force, otherwise sparse produces a warning.
      
      For non-bitwise types __force should have no effect, and should not hide
      any legitimate errors.
      
      Note that we are casting to typeof(*p) not typeof(x). Even with the
      cast, if x and *p are of different types we should get the warning, so I
      think we are not loosing the ability to detect any actual errors.
      
      virtio would like to use bitwise types with get_user() so fix these
      spurious warnings by adding __force.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      [mpe: Fill in changelog with more details]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      505e4283
  8. 17 12月, 2014 4 次提交
    • S
      KVM: PPC: Book3S HV: Improve H_CONFER implementation · 90fd09f8
      Sam Bobroff 提交于
      Currently the H_CONFER hcall is implemented in kernel virtual mode,
      meaning that whenever a guest thread does an H_CONFER, all the threads
      in that virtual core have to exit the guest.  This is bad for
      performance because it interrupts the other threads even if they
      are doing useful work.
      
      The H_CONFER hcall is called by a guest VCPU when it is spinning on a
      spinlock and it detects that the spinlock is held by a guest VCPU that
      is currently not running on a physical CPU.  The idea is to give this
      VCPU's time slice to the holder VCPU so that it can make progress
      towards releasing the lock.
      
      To avoid having the other threads exit the guest unnecessarily,
      we add a real-mode implementation of H_CONFER that checks whether
      the other threads are doing anything.  If all the other threads
      are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER),
      it returns H_TOO_HARD which causes a guest exit and allows the
      H_CONFER to be handled in virtual mode.
      
      Otherwise it spins for a short time (up to 10 microseconds) to give
      other threads the chance to observe that this thread is trying to
      confer.  The spin loop also terminates when any thread exits the guest
      or when all other threads are idle or trying to confer.  If the
      timeout is reached, the H_CONFER returns H_SUCCESS.  In this case the
      guest VCPU will recheck the spinlock word and most likely call
      H_CONFER again.
      
      This also improves the implementation of the H_CONFER virtual mode
      handler.  If the VCPU is part of a virtual core (vcore) which is
      runnable, there will be a 'runner' VCPU which has taken responsibility
      for running the vcore.  In this case we yield to the runner VCPU
      rather than the target VCPU.
      
      We also introduce a check on the target VCPU's yield count: if it
      differs from the yield count passed to H_CONFER, the target VCPU
      has run since H_CONFER was called and may have already released
      the lock.  This check is required by PAPR.
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      90fd09f8
    • P
      KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register · 4a157d61
      Paul Mackerras 提交于
      There are two ways in which a guest instruction can be obtained from
      the guest in the guest exit code in book3s_hv_rmhandlers.S.  If the
      exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal
      instruction), the offending instruction is in the HEIR register
      (Hypervisor Emulation Instruction Register).  If the exit was caused
      by a load or store to an emulated MMIO device, we load the instruction
      from the guest by turning data relocation on and loading the instruction
      with an lwz instruction.
      
      Unfortunately, in the case where the guest has opposite endianness to
      the host, these two methods give results of different endianness, but
      both get put into vcpu->arch.last_inst.  The HEIR value has been loaded
      using guest endianness, whereas the lwz will load the instruction using
      host endianness.  The rest of the code that uses vcpu->arch.last_inst
      assumes it was loaded using host endianness.
      
      To fix this, we define a new vcpu field to store the HEIR value.  Then,
      in kvmppc_handle_exit_hv(), we transfer the value from this new field to
      vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness
      differ.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      4a157d61
    • P
      KVM: PPC: Book3S HV: Remove code for PPC970 processors · c17b98cf
      Paul Mackerras 提交于
      This removes the code that was added to enable HV KVM to work
      on PPC970 processors.  The PPC970 is an old CPU that doesn't
      support virtualizing guest memory.  Removing PPC970 support also
      lets us remove the code for allocating and managing contiguous
      real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
      case, the code for pinning pages of guest memory when first
      accessed and keeping track of which pages have been pinned, and
      the code for handling H_ENTER hypercalls in virtual mode.
      
      Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
      The KVM_CAP_PPC_RMA capability now always returns 0.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c17b98cf
    • P
      KVM: PPC: Book3S HV: Simplify locking around stolen time calculations · 2711e248
      Paul Mackerras 提交于
      Currently the calculations of stolen time for PPC Book3S HV guests
      uses fields in both the vcpu struct and the kvmppc_vcore struct.  The
      fields in the kvmppc_vcore struct are protected by the
      vcpu->arch.tbacct_lock of the vcpu that has taken responsibility for
      running the virtual core.  This works correctly but confuses lockdep,
      because it sees that the code takes the tbacct_lock for a vcpu in
      kvmppc_remove_runnable() and then takes another vcpu's tbacct_lock in
      vcore_stolen_time(), and it thinks there is a possibility of deadlock,
      causing it to print reports like this:
      
      =============================================
      [ INFO: possible recursive locking detected ]
      3.18.0-rc7-kvm-00016-g8db4bc6 #89 Not tainted
      ---------------------------------------------
      qemu-system-ppc/6188 is trying to acquire lock:
       (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb1fe8>] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
      
      but task is already holding lock:
       (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&(&vcpu->arch.tbacct_lock)->rlock);
        lock(&(&vcpu->arch.tbacct_lock)->rlock);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by qemu-system-ppc/6188:
       #0:  (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f98>] .vcpu_load+0x28/0xe0 [kvm]
       #1:  (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecb41b0>] .kvmppc_vcpu_run_hv+0x530/0x1530 [kvm_hv]
       #2:  (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
      
      stack backtrace:
      CPU: 40 PID: 6188 Comm: qemu-system-ppc Not tainted 3.18.0-rc7-kvm-00016-g8db4bc6 #89
      Call Trace:
      [c000000b2754f3f0] [c000000000b31b6c] .dump_stack+0x88/0xb4 (unreliable)
      [c000000b2754f470] [c0000000000faeb8] .__lock_acquire+0x1878/0x2190
      [c000000b2754f600] [c0000000000fbf0c] .lock_acquire+0xcc/0x1a0
      [c000000b2754f6d0] [c000000000b2954c] ._raw_spin_lock_irq+0x4c/0x70
      [c000000b2754f760] [d00000000ecb1fe8] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
      [c000000b2754f7f0] [d00000000ecb25b4] .kvmppc_remove_runnable.part.3+0x44/0xd0 [kvm_hv]
      [c000000b2754f880] [d00000000ecb43ec] .kvmppc_vcpu_run_hv+0x76c/0x1530 [kvm_hv]
      [c000000b2754f9f0] [d00000000eb9f46c] .kvmppc_vcpu_run+0x2c/0x40 [kvm]
      [c000000b2754fa60] [d00000000eb9c9a4] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm]
      [c000000b2754faf0] [d00000000eb94538] .kvm_vcpu_ioctl+0x498/0x760 [kvm]
      [c000000b2754fcb0] [c000000000267eb4] .do_vfs_ioctl+0x444/0x770
      [c000000b2754fd90] [c0000000002682a4] .SyS_ioctl+0xc4/0xe0
      [c000000b2754fe30] [c0000000000092e4] syscall_exit+0x0/0x98
      
      In order to make the locking easier to analyse, we change the code to
      use a spinlock in the kvmppc_vcore struct to protect the stolen_tb and
      preempt_tb fields.  This lock needs to be an irq-safe lock since it is
      used in the kvmppc_core_vcpu_load_hv() and kvmppc_core_vcpu_put_hv()
      functions, which are called with the scheduler rq lock held, which is
      an irq-safe lock.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      2711e248
  9. 15 12月, 2014 5 次提交
    • P
      KVM: PPC: Book3S HV: Fix computation of tlbie operand · d506735b
      Paul Mackerras 提交于
      The B (segment size) field in the RB operand for the tlbie
      instruction is two bits, which we get from the top two bits of
      the first doubleword of the HPT entry to be invalidated.  These
      bits go in bits 8 and 9 of the RB operand (bits 54 and 55 in IBM
      bit numbering).
      
      The compute_tlbie_rb() function gets these bits as v >> (62 - 8),
      which is not correct as it will bring in the top 10 bits, not
      just the top two.  These extra bits could corrupt the AP, AVAL
      and L fields in the RB value.  To fix this we shift right 62 bits
      and then shift left 8 bits, so we only get the two bits of the
      B field.
      
      The first doubleword of the HPT entry is under the control of the
      guest kernel.  In fact, Linux guests will always put zeroes in bits
      54 -- 61 (IBM bits 2 -- 9), but we should not rely on guests doing
      this.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      d506735b
    • S
      powernv/powerpc: Add winkle support for offline cpus · 77b54e9f
      Shreyas B. Prabhu 提交于
      Winkle is a deep idle state supported in power8 chips. A core enters
      winkle when all the threads of the core enter winkle. In this state
      power supply to the entire chiplet i.e core, private L2 and private L3
      is turned off. As a result it gives higher powersavings compared to
      sleep.
      
      But entering winkle results in a total hypervisor state loss. Hence the
      hypervisor context has to be preserved before entering winkle and
      restored upon wake up.
      
      Power-on Reset Engine (PORE) is a dedicated engine which is responsible
      for powering on the chiplet during wake up. It can be programmed to
      restore the register contests of a few specific registers. This patch
      uses PORE to restore register state wherever possible and uses stack to
      save and restore rest of the necessary registers.
      
      With hypervisor state restore things fall under three categories-
      per-core state, per-subcore state and per-thread state. To manage this,
      extend the infrastructure introduced for sleep. Mainly we add a paca
      variable subcore_sibling_mask. Using this and the core_idle_state we can
      distingush first thread in core and subcore.
      Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      77b54e9f
    • S
      powernv/cpuidle: Redesign idle states management · 7cba160a
      Shreyas B. Prabhu 提交于
      Deep idle states like sleep and winkle are per core idle states. A core
      enters these states only when all the threads enter either the
      particular idle state or a deeper one. There are tasks like fastsleep
      hardware bug workaround and hypervisor core state save which have to be
      done only by the last thread of the core entering deep idle state and
      similarly tasks like timebase resync, hypervisor core register restore
      that have to be done only by the first thread waking up from these
      state.
      
      The current idle state management does not have a way to distinguish the
      first/last thread of the core waking/entering idle states. Tasks like
      timebase resync are done for all the threads. This is not only is
      suboptimal, but can cause functionality issues when subcores and kvm is
      involved.
      
      This patch adds the necessary infrastructure to track idle states of
      threads in a per-core structure. It uses this info to perform tasks like
      fastsleep workaround and timebase resync only once per core.
      Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Originally-by: NPreeti U. Murthy <preeti@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7cba160a
    • S
      powerpc/powernv: Enable Offline CPUs to enter deep idle states · 8eb8ac89
      Shreyas B. Prabhu 提交于
      The secondary threads should enter deep idle states so as to gain maximum
      powersavings when the entire core is offline. To do so the offline path
      must be made aware of the available deepest idle state. Hence probe the
      device tree for the possible idle states in powernv core code and
      expose the deepest idle state through flags.
      
      Since the  device tree is probed by the cpuidle driver as well, move
      the parameters required to discover the idle states into an appropriate
      common place to both the driver and the powernv core code.
      
      Another point is that fastsleep idle state may require workarounds in
      the kernel to function properly. This workaround is introduced in the
      subsequent patches. However neither the cpuidle driver or the hotplug
      path need be bothered about this workaround.
      
      They will be taken care of by the core powernv code.
      Originally-by: NSrivatsa S. Bhat <srivatsa@mit.edu>
      Signed-off-by: NPreeti U. Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Reviewed-by: NPaul Mackerras <paulus@samba.org>
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8eb8ac89
    • P
      powerpc/powernv: Switch off MMU before entering nap/sleep/rvwinkle mode · 8117ac6a
      Paul Mackerras 提交于
      Currently, when going idle, we set the flag indicating that we are in
      nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap
      (or sleep or rvwinkle) instruction, all with the MMU on.  This is bad
      for two reasons: (a) the architecture specifies that those instructions
      must be executed with the MMU off, and in fact with only the SF, HV, ME
      and possibly RI bits set, and (b) this introduces a race, because as
      soon as we set the flag, another thread can switch the MMU to a guest
      context.  If the race is lost, this thread will typically start looping
      on relocation-on ISIs at 0xc...4400.
      
      This fixes it by setting the MSR as required by the architecture before
      setting the flag or executing the nap/sleep/rvwinkle instruction.
      
      Cc: stable@vger.kernel.org
      [ shreyas@linux.vnet.ibm.com: Edited to handle LE ]
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8117ac6a
  10. 14 12月, 2014 1 次提交
  11. 12 12月, 2014 3 次提交
    • R
      powerpc: add little endian flag to syscall_get_arch() · 63f13448
      Richard Guy Briggs 提交于
      Since both ppc and ppc64 have LE variants which are now reported by uname, add
      that flag (__AUDIT_ARCH_LE) to syscall_get_arch() and add AUDIT_ARCH_PPC64LE
      variant.
      
      Without this,  perf trace and auditctl fail.
      
      Mainline kernel reports ppc64le (per a0588015) but there is no matching
      AUDIT_ARCH_PPC64LE.
      
      Since 32-bit PPC LE is not supported by audit, don't advertise it in
      AUDIT_ARCH_PPC* variants.
      
      See:
      	https://www.redhat.com/archives/linux-audit/2014-August/msg00082.html
      	https://www.redhat.com/archives/linux-audit/2014-December/msg00004.htmlSigned-off-by: NRichard Guy Briggs <rgb@redhat.com>
      Acked-by: NPaul Moore <paul@paul-moore.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      63f13448
    • A
      arch: Add lightweight memory barriers dma_rmb() and dma_wmb() · 1077fa36
      Alexander Duyck 提交于
      There are a number of situations where the mandatory barriers rmb() and
      wmb() are used to order memory/memory operations in the device drivers
      and those barriers are much heavier than they actually need to be.  For
      example in the case of PowerPC wmb() calls the heavy-weight sync
      instruction when for coherent memory operations all that is really needed
      is an lsync or eieio instruction.
      
      This commit adds a coherent only version of the mandatory memory barriers
      rmb() and wmb().  In most cases this should result in the barrier being the
      same as the SMP barriers for the SMP case, however in some cases we use a
      barrier that is somewhere in between rmb() and smp_rmb().  For example on
      ARM the rmb barriers break down as follows:
      
        Barrier   Call     Explanation
        --------- -------- ----------------------------------
        rmb()     dsb()    Data synchronization barrier - system
        dma_rmb() dmb(osh) data memory barrier - outer sharable
        smp_rmb() dmb(ish) data memory barrier - inner sharable
      
      These new barriers are not as safe as the standard rmb() and wmb().
      Specifically they do not guarantee ordering between coherent and incoherent
      memories.  The primary use case for these would be to enforce ordering of
      reads and writes when accessing coherent memory that is shared between the
      CPU and a device.
      
      It may also be noted that there is no dma_mb().  Most architectures don't
      provide a good mechanism for performing a coherent only full barrier without
      resorting to the same mechanism used in mb().  As such there isn't much to
      be gained in trying to define such a function.
      
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Cc: Michael Ellerman <michael@ellerman.id.au>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1077fa36
    • A
      arch: Cleanup read_barrier_depends() and comments · 8a449718
      Alexander Duyck 提交于
      This patch is meant to cleanup the handling of read_barrier_depends and
      smp_read_barrier_depends.  In multiple spots in the kernel headers
      read_barrier_depends is defined as "do {} while (0)", however we then go
      into the SMP vs non-SMP sections and have the SMP version reference
      read_barrier_depends, and the non-SMP define it as yet another empty
      do/while.
      
      With this commit I went through and cleaned out the duplicate definitions
      and reduced the number of definitions down to 2 per header.  In addition I
      moved the 50 line comments for the macro from the x86 and mips headers that
      defined it as an empty do/while to those that were actually defining the
      macro, alpha and blackfin.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a449718
  12. 11 12月, 2014 2 次提交
  13. 08 12月, 2014 1 次提交
    • P
      powerpc/powernv: Return to cpu offline loop when finished in KVM guest · 56548fc0
      Paul Mackerras 提交于
      When a secondary hardware thread has finished running a KVM guest, we
      currently put that thread into nap mode using a nap instruction in
      the KVM code.  This changes the code so that instead of doing a nap
      instruction directly, we instead cause the call to power7_nap() that
      put the thread into nap mode to return.  The reason for doing this is
      to avoid having the KVM code having to know what low-power mode to
      put the thread into.
      
      In the case of a secondary thread used to run a KVM guest, the thread
      will be offline from the point of view of the host kernel, and the
      relevant power7_nap() call is the one in pnv_smp_cpu_disable().
      In this case we don't want to clear pending IPIs in the offline loop
      in that function, since that might cause us to miss the wakeup for
      the next time the thread needs to run a guest.  To tell whether or
      not to clear the interrupt, we use the SRR1 value returned from
      power7_nap(), and check if it indicates an external interrupt.  We
      arrange that the return from power7_nap() when we have finished running
      a guest returns 0, so pending interrupts don't get flushed in that
      case.
      
      Note that it is important a secondary thread that has finished
      executing in the guest, or that didn't have a guest to run, should
      not return to power7_nap's caller while the kvm_hstate.hwthread_req
      flag in the PACA is non-zero, because the return from power7_nap
      will reenable the MMU, and the MMU might still be in guest context.
      In this situation we spin at low priority in real mode waiting for
      hwthread_req to become zero.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      56548fc0
  14. 06 12月, 2014 1 次提交
    • A
      net: sock: allow eBPF programs to be attached to sockets · 89aa0758
      Alexei Starovoitov 提交于
      introduce new setsockopt() command:
      
      setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))
      
      where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
      and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER
      
      setsockopt() calls bpf_prog_get() which increments refcnt of the program,
      so it doesn't get unloaded while socket is using the program.
      
      The same eBPF program can be attached to multiple sockets.
      
      User task exit automatically closes socket which calls sk_filter_uncharge()
      which decrements refcnt of eBPF program
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      89aa0758
  15. 05 12月, 2014 1 次提交
    • A
      powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault · aefa5688
      Aneesh Kumar K.V 提交于
      upatepp can get called for a nohpte fault when we find from the linux
      page table that the translation was hashed before. In that case
      we are sure that there is no existing translation, hence we could
      avoid doing tlbie.
      
      We could possibly race with a parallel fault filling the TLB. But
      that should be ok because updatepp is only ever relaxing permissions.
      We also look at linux pte permission bits when filling hash pte
      permission bits. We also hold the linux pte busy bits while
      inserting/updating a hashpte entry, hence a paralle update of
      linux pte is not possible. On the other hand mprotect involves
      ptep_modify_prot_start which cause a hpte invalidate and not updatepp.
      
      Performance number:
      We use randbox_access_bench written by Anton.
      
      Kernel with THP disabled and smaller hash page table size.
      
          86.60%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_updatepp
           2.10%  random_access_b  random_access_bench              [.] doit
           1.99%  random_access_b  [kernel.kallsyms]                [k] .do_raw_spin_lock
           1.85%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           1.26%  random_access_b  [kernel.kallsyms]                [k] .native_flush_hash_range
           1.18%  random_access_b  [kernel.kallsyms]                [k] .__delay
           0.69%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           0.37%  random_access_b  [kernel.kallsyms]                [k] .clear_user_page
           0.34%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           0.32%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           0.30%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
      
      With Fix:
      
          27.54%  random_access_b  random_access_bench              [.] doit
          22.90%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           5.76%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           5.20%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           5.12%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           4.80%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
           3.31%  random_access_b  [kernel.kallsyms]                [k] data_access_common
           1.84%  random_access_b  [kernel.kallsyms]                [k] .trace_hardirqs_on_caller
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      aefa5688
  16. 02 12月, 2014 5 次提交
  17. 24 11月, 2014 1 次提交
  18. 18 11月, 2014 1 次提交