1. 03 Mar 2014, 1 commit
  2. 08 Feb 2014, 2 commits
    • arm64: asm: remove redundant "cc" clobbers · 95c41896
      Will Deacon authored
      cbnz/tbnz don't update the condition flags, so remove the "cc" clobbers
      from inline asm blocks that only use these instructions to implement
      conditional branches.
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
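      By way of illustration, here is a minimal sketch of the kind of inline
      asm block this commit touches; it is not taken from the kernel sources
      (the function name and the plain int pointer are illustrative). The only
      branch is a cbnz, which never writes NZCV, so no "cc" clobber is listed:

      	static inline void sketch_atomic_add(int i, int *counter)
      	{
      		unsigned long tmp;
      		int result;

      		asm volatile(
      		"1:	ldxr	%w0, %2\n"
      		"	add	%w0, %w0, %w3\n"
      		"	stxr	%w1, %w0, %2\n"
      		"	cbnz	%w1, 1b"	/* cbnz leaves the flags alone */
      		: "=&r" (result), "=&r" (tmp), "+Q" (*counter)
      		: "Ir" (i));		/* note: no "cc" clobber needed */
      	}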
    • arm64: atomics: fix use of acquire + release for full barrier semantics · 8e86f0b4
      Will Deacon authored
      Linux requires a number of atomic operations to provide full barrier
      semantics, that is, no memory accesses after the operation can be
      observed before any accesses up to and including the operation in
      program order.
      
      On arm64, these operations have been incorrectly implemented as follows:
      
      	// A, B, C are independent memory locations
      
      	<Access [A]>
      
      	// atomic_op (B)
      1:	ldaxr	x0, [B]		// Exclusive load with acquire
      	<op(B)>
      	stlxr	w1, x0, [B]	// Exclusive store with release
      	cbnz	w1, 1b
      
      	<Access [C]>
      
      The assumption here being that two half barriers are equivalent to a
      full barrier, so the only permitted ordering would be A -> B -> C
      (where B is the atomic operation involving both a load and a store).
      
      Unfortunately, this is not the case by the letter of the architecture
      and, in fact, the accesses to A and C are permitted to pass their
      nearest half barrier resulting in orderings such as Bl -> A -> C -> Bs
      or Bl -> C -> A -> Bs (where Bl is the load-acquire on B and Bs is the
      store-release on B). This is a clear violation of the full barrier
      requirement.
      
      The simple way to fix this is to implement the same algorithm as ARMv7
      using explicit barriers:
      
      	<Access [A]>
      
      	// atomic_op (B)
      	dmb	ish		// Full barrier
      1:	ldxr	x0, [B]		// Exclusive load
      	<op(B)>
      	stxr	w1, x0, [B]	// Exclusive store
      	cbnz	w1, 1b
      	dmb	ish		// Full barrier
      
      	<Access [C]>
      
      but this has the undesirable effect of introducing *two* full barrier
      instructions. A better approach is actually the following, non-intuitive
      sequence:
      
      	<Access [A]>
      
      	// atomic_op (B)
      1:	ldxr	x0, [B]		// Exclusive load
      	<op(B)>
      	stlxr	w1, x0, [B]	// Exclusive store with release
      	cbnz	w1, 1b
      	dmb	ish		// Full barrier
      
      	<Access [C]>
      
      The simple observations here are:
      
        - The dmb ensures that no subsequent accesses (e.g. the access to C)
          can enter or pass the atomic sequence.
      
        - The dmb also ensures that no prior accesses (e.g. the access to A)
          can pass the atomic sequence.
      
        - Therefore, no prior access can pass a subsequent access, or
          vice-versa (i.e. A is strictly ordered before C).
      
        - The stlxr ensures that no prior access can pass the store component
          of the atomic operation.
      
      The only tricky part remaining is the ordering between the ldxr and the
      access to A, since the absence of the first dmb means that we're now
      permitting re-ordering between the ldxr and any prior accesses.
      
      From an (arbitrary) observer's point of view, there are two scenarios:
      
        1. We have observed the ldxr. This means that if we perform a store to
           [B], the ldxr will still return older data. If we can observe the
           ldxr, then we can potentially observe the permitted re-ordering
           with the access to A, which is clearly an issue when compared to
           the dmb variant of the code. Thankfully, the exclusive monitor will
           save us here since it will be cleared as a result of the store and
           the ldxr will retry. Notice that any use of a later memory
           observation to imply observation of the ldxr will also imply
           observation of the access to A, since the stlxr/dmb ensure strict
           ordering.
      
        2. We have not observed the ldxr. This means we can perform a store
           and influence the later ldxr. However, that doesn't actually tell
           us anything about the access to [A], so we've not lost anything
           here either when compared to the dmb variant.
      
      This patch implements this solution for our barriered atomic operations,
      ensuring that we satisfy the full barrier requirements where they are
      needed.
      
      Cc: <stable@vger.kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
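      A hedged sketch of the resulting sequence for a fully ordered
      read-modify-write (illustrative code, not the literal kernel diff):
      a plain ldxr, an stlxr for the release half, and a trailing dmb ish,
      exactly the shape argued for above.

      	static inline int sketch_atomic_add_return(int i, int *counter)
      	{
      		unsigned long tmp;
      		int result;

      		asm volatile(
      		"1:	ldxr	%w0, %2\n"	/* exclusive load, no acquire */
      		"	add	%w0, %w0, %w3\n"
      		"	stlxr	%w1, %w0, %2\n"	/* exclusive store with release */
      		"	cbnz	%w1, 1b\n"
      		"	dmb	ish"		/* full barrier after the store */
      		: "=&r" (result), "=&r" (tmp), "+Q" (*counter)
      		: "Ir" (i)
      		: "memory");

      		return result;
      	}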
  3. 06 Feb 2014, 1 commit
  4. 05 Feb 2014, 3 commits
  5. 31 Jan 2014, 2 commits
  6. 27 Jan 2014, 1 commit
  7. 17 Jan 2014, 1 commit
  8. 12 Jan 2014, 1 commit
    • arch: Introduce smp_load_acquire(), smp_store_release() · 47933ad4
      Peter Zijlstra authored
      A number of situations currently require the heavyweight smp_mb(),
      even though there is no need to order prior stores against later
      loads.  Many architectures have much cheaper ways to handle these
      situations, but the Linux kernel currently has no portable way
      to make use of them.
      
      This commit therefore supplies smp_load_acquire() and
      smp_store_release() to remedy this situation.  The new
      smp_load_acquire() primitive orders the specified load against
      any subsequent reads or writes, while the new smp_store_release()
      primitive orders the specified store against any prior reads or
      writes.  These primitives allow array-based circular FIFOs to be
      implemented without an smp_mb(), and also allow a theoretical
      hole in rcu_assign_pointer() to be closed at no additional
      expense on most architectures.
      
      In addition, the RCU experience transitioning from explicit
      smp_read_barrier_depends() and smp_wmb() to rcu_dereference()
      and rcu_assign_pointer(), respectively, resulted in substantial
      improvements in readability.  It therefore seems likely that
      replacing other explicit barriers with smp_load_acquire() and
      smp_store_release() will provide similar benefits.  It appears
      that roughly half of the explicit barriers in core kernel code
      might be so replaced.
      
      [Changelog by PaulMck]
      Reviewed-by: N"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Cc: Michael Ellerman <michael@ellerman.id.au>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Victor Kaplansky <VICTORK@il.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Link: http://lkml.kernel.org/r/20131213150640.908486364@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
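      A hedged usage sketch of the new primitives in the classic
      message-passing pattern; the data/flag variables and function names are
      illustrative, only smp_store_release()/smp_load_acquire() come from the
      patch.

      	static int data;
      	static int flag;

      	/* producer */
      	void publish(int val)
      	{
      		data = val;
      		smp_store_release(&flag, 1);	/* data store ordered before flag store */
      	}

      	/* consumer */
      	int try_consume(int *out)
      	{
      		if (!smp_load_acquire(&flag))	/* flag load ordered before data load */
      			return 0;
      		*out = data;
      		return 1;
      	}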
  9. 08 Jan 2014, 5 commits
  10. 28 Dec 2013, 1 commit
  11. 20 Dec 2013, 9 commits
  12. 18 Dec 2013, 1 commit
  13. 17 Dec 2013, 5 commits
    • arm64: enable generic clockevent broadcast · 1f85008e
      Lorenzo Pieralisi authored
      On platforms with power management capabilities, timers that are shut
      down when a CPU enters deep C-states must be emulated using an always-on
      timer and a timer IPI to relay the timer IRQ to target CPUs on an SMP
      system.
      
      This patch enables the generic clockevents broadcast infrastructure for
      arm64, by providing the required Kconfig entries and adding the timer
      IPI infrastructure.
      Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
    • arm64: kernel: cpu_{suspend/resume} implementation · 95322526
      Lorenzo Pieralisi authored
      Kernel subsystems like CPU idle and suspend to RAM require a generic
      mechanism to suspend a processor, save its context and put it into
      a quiescent state. The cpu_{suspend}/{resume} implementation provides
      such a framework through a kernel interface that allows saving/restoring
      registers, flushing the context to DRAM and suspending/resuming to/from
      low-power states where processor context may be lost.
      
      The CPU suspend implementation relies on the suspend protocol registered
      in CPU operations to carry out a suspend request after context is
      saved and flushed to DRAM. The cpu_suspend interface:
      
      int cpu_suspend(unsigned long arg);
      
      allows callers to pass an opaque parameter that is handed over to the
      suspend CPU operations back-end so that it can take action according to
      the semantics attached to it. The arg parameter allows suspend to RAM and
      CPU idle drivers to communicate with suspend protocol back-ends; it
      requires standardization so that the interface can be reused seamlessly
      across systems, paving the way for generic drivers.
      
      Context memory is allocated on the stack, whose address is stashed in a
      per-cpu variable to keep track of it and passed to core functions that
      save/restore the registers required by the architecture.
      
      Even though, upon successful execution, the cpu_suspend function shuts
      down the suspending processor, the warm boot resume mechanism, based
      on the cpu_resume function, makes the resume path operate as a
      cpu_suspend function return, so that cpu_suspend can be treated as a C
      function by the caller, which simplifies coding the PM drivers that rely
      on the cpu_suspend API.
      
      Upon context save, the minimal amount of memory is flushed to DRAM so
      that it can be retrieved when the MMU is off and caches are not searched.
      
      The suspend CPU operation, depending on the required operations (e.g. CPU
      vs cluster shutdown), is in charge of flushing the cache hierarchy either
      implicitly (by calling firmware implementations like PSCI) or explicitly
      by executing the required cache maintenance functions.
      
      Debug exceptions are disabled during cpu_{suspend}/{resume} operations
      so that debug registers can be saved and restored properly, preventing
      preemption from debug agents enabled in the kernel.
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
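      A hedged sketch of how a PM back-end might sit on top of the interface
      quoted above; only the int cpu_suspend(unsigned long arg) prototype comes
      from the commit message, the surrounding driver code is hypothetical.

      	static int sketch_enter_idle_state(unsigned long state_arg)
      	{
      		int ret;

      		/*
      		 * On a successful suspend/resume cycle control returns here
      		 * through the warm-boot cpu_resume path, so cpu_suspend() can
      		 * be treated as an ordinary C call by the driver.
      		 */
      		ret = cpu_suspend(state_arg);	/* arg is handed to the suspend back-end */

      		return ret;
      	}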
    • arm64: kernel: suspend/resume registers save/restore · 6732bc65
      Lorenzo Pieralisi authored
      Power management software requires the kernel to save and restore
      CPU registers while going through suspend and resume operations
      triggered by kernel subsystems like CPU idle and suspend to RAM.
      
      This patch implements code that provides a save and restore mechanism
      for the ARMv8 implementation. Memory for the context is passed as a
      parameter to both the cpu_do_suspend and cpu_do_resume functions, which
      allows the callers to implement context allocation as they deem fit.
      
      The registers that are saved and restored correspond to the register set
      actually required by the kernel to be up and running, which represents a
      subset of the ARMv8 ISA.
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
    • arm64: kernel: build MPIDR_EL1 hash function data structure · 976d7d3f
      Lorenzo Pieralisi authored
      On ARM64 SMP systems, cores are identified by their MPIDR_EL1 register.
      The MPIDR_EL1 guidelines in the ARM ARM do not strictly enforce an
      MPIDR_EL1 layout, only recommendations that, if followed, split the
      MPIDR_EL1 on ARM 64-bit platforms into four affinity levels. In
      multi-cluster systems like big.LITTLE, if the affinity guidelines are
      followed, the MPIDR_EL1 cannot be considered a linear index. This means
      that the association between a logical CPU in the kernel and the HW CPU
      identifier becomes somewhat more complicated, requiring methods like
      hashing to associate a given MPIDR_EL1 with a CPU logical index so that
      the look-up can be carried out in an efficient and scalable way.
      
      This patch provides a kernel function that, starting from the
      cpu_logical_map, implements collision-free hashing of MPIDR_EL1 values by
      checking all significant bits of the MPIDR_EL1 affinity level bitfields.
      The hashing can then be carried out through bit shifting and ORing; the
      resulting hash algorithm is a collision-free, though not minimal, hash
      that can be executed with a few assembly instructions. The MPIDR_EL1 is
      filtered through an mpidr mask that is built by checking all bits that
      toggle in the set of MPIDR_EL1s corresponding to possible CPUs. Bits that
      do not toggle do not carry information, so they do not contribute to the
      resulting hash.
      
      Pseudo code:
      
      /* check all bits that toggle, so they are required */
      for (i = 1, mpidr_el1_mask = 0; i < num_possible_cpus(); i++)
      	mpidr_el1_mask |= (cpu_logical_map(i) ^ cpu_logical_map(0));
      
      /*
       * Build shifts to be applied to aff0, aff1, aff2, aff3 values to hash the
       * mpidr_el1
       * fls() returns the last bit set in a word, 0 if none
       * ffs() returns the first bit set in a word, 0 if none
       */
      fs0 = mpidr_el1_mask[7:0] ? ffs(mpidr_el1_mask[7:0]) - 1 : 0;
      fs1 = mpidr_el1_mask[15:8] ? ffs(mpidr_el1_mask[15:8]) - 1 : 0;
      fs2 = mpidr_el1_mask[23:16] ? ffs(mpidr_el1_mask[23:16]) - 1 : 0;
      fs3 = mpidr_el1_mask[39:32] ? ffs(mpidr_el1_mask[39:32]) - 1 : 0;
      ls0 = fls(mpidr_el1_mask[7:0]);
      ls1 = fls(mpidr_el1_mask[15:8]);
      ls2 = fls(mpidr_el1_mask[23:16]);
      ls3 = fls(mpidr_el1_mask[39:32]);
      bits0 = ls0 - fs0;
      bits1 = ls1 - fs1;
      bits2 = ls2 - fs2;
      bits3 = ls3 - fs3;
      aff0_shift = fs0;
      aff1_shift = 8 + fs1 - bits0;
      aff2_shift = 16 + fs2 - (bits0 + bits1);
      aff3_shift = 32 + fs3 - (bits0 + bits1 + bits2);
      u32 hash(u64 mpidr_el1) {
      	u32 l[4];
      	u64 mpidr_el1_masked = mpidr_el1 & mpidr_el1_mask;
      	l[0] = mpidr_el1_masked & 0xff;
      	l[1] = mpidr_el1_masked & 0xff00;
      	l[2] = mpidr_el1_masked & 0xff0000;
      	l[3] = mpidr_el1_masked & 0xff00000000;
      	return (l[0] >> aff0_shift | l[1] >> aff1_shift | l[2] >> aff2_shift |
      		l[3] >> aff3_shift);
      }
      
      The hashing algorithm relies on the inherent properties set in the ARM ARM
      recommendations for the MPIDR_EL1. Exotic configurations, where for instance
      the MPIDR_EL1 values at a given affinity level have large holes, can end up
      requiring big hash tables since the compression of values that can be achieved
      through shifting is somewhat crippled when holes are present. The kernel warns if
      the number of buckets of the resulting hash table exceeds the number of
      possible CPUs by a factor of 4, which is a symptom of a very sparse HW
      MPIDR_EL1 configuration.
      
      The hash algorithm is quite simple and can easily be implemented in assembly
      code, to be used in code paths where the kernel virtual address space is
      not set up (i.e. cpu_resume) and instruction and data fetches are strongly
      ordered, so the code must be compact and must carry out few data accesses.
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
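      A self-contained, compilable sketch of the pseudo code above, applied to
      a hypothetical two-cluster layout (MPIDR_EL1 values 0x0, 0x1, 0x100 and
      0x101); it is illustrative only and does not reuse kernel code.

      	#include <stdint.h>
      	#include <stdio.h>

      	/* ffs/fls as used in the pseudo code: 1-based bit index, 0 if no bit set */
      	static int bit_ffs(uint64_t v) { return v ? __builtin_ctzll(v) + 1 : 0; }
      	static int bit_fls(uint64_t v) { return v ? 64 - __builtin_clzll(v) : 0; }

      	int main(void)
      	{
      		/* hypothetical two-cluster system, two CPUs per cluster */
      		const uint64_t mpidr[] = { 0x000, 0x001, 0x100, 0x101 };
      		const int ncpus = sizeof(mpidr) / sizeof(mpidr[0]);
      		uint64_t mask = 0;

      		/* collect the bits that toggle across all possible CPUs */
      		for (int i = 1; i < ncpus; i++)
      			mask |= mpidr[i] ^ mpidr[0];

      		/* affinity fields of the mask: aff0/1/2 at bits 0/8/16, aff3 at bit 32 */
      		const uint64_t field[4] = { mask & 0xff, (mask >> 8) & 0xff,
      					    (mask >> 16) & 0xff, (mask >> 32) & 0xff };
      		const int base[4] = { 0, 8, 16, 32 };
      		int shift[4], carried = 0;

      		for (int l = 0; l < 4; l++) {
      			int fs = field[l] ? bit_ffs(field[l]) - 1 : 0;
      			int bits = bit_fls(field[l]) - fs;

      			shift[l] = base[l] + fs - carried;	/* pack the used bits together */
      			carried += bits;
      		}

      		for (int i = 0; i < ncpus; i++) {
      			uint64_t m = mpidr[i] & mask;
      			uint32_t hash = (uint32_t)(((m & 0xffULL) >> shift[0]) |
      						   ((m & 0xff00ULL) >> shift[1]) |
      						   ((m & 0xff0000ULL) >> shift[2]) |
      						   ((m & 0xff00000000ULL) >> shift[3]));

      			printf("MPIDR_EL1 0x%03llx -> hash %u\n",
      			       (unsigned long long)mpidr[i], hash);
      		}
      		return 0;
      	}

      For this layout the mask comes out as 0x101, the shifts as 0/7/14/30, and
      the four MPIDR_EL1 values hash collision-free to 0, 1, 2 and 3, packed
      into two bits.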
    • arm64: kernel: add MPIDR_EL1 accessors macros · b058450f
      Lorenzo Pieralisi authored
      In order to simplify access to different affinity levels within the
      MPIDR_EL1 register values, this patch implements some preprocessor
      macros that allow retrieving the MPIDR_EL1 affinity level value according
      to the level passed as an input parameter.
      Reviewed-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
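      A hedged sketch of the kind of accessor the commit describes, written
      from the commit text rather than copied from the patch, so the exact
      kernel macro names and definitions may differ. Affinity levels 0..2 sit
      at byte offsets 0/8/16 and level 3 at bit 32:

      	#define MPIDR_LEVEL_BITS		8
      	#define MPIDR_LEVEL_MASK		((1UL << MPIDR_LEVEL_BITS) - 1)

      	/* byte offset of each affinity level: 0, 8, 16 for levels 0-2, 32 for level 3 */
      	#define MPIDR_LEVEL_SHIFT(level)	(((1 << (level)) >> 1) << 3)

      	/* extract the requested affinity level from an MPIDR_EL1 value */
      	#define MPIDR_AFFINITY_LEVEL(mpidr, level) \
      		(((mpidr) >> MPIDR_LEVEL_SHIFT(level)) & MPIDR_LEVEL_MASK)

      Under this sketch, MPIDR_AFFINITY_LEVEL(0x80000102, 1) evaluates to 0x01.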
  14. 12 Dec 2013, 2 commits
    • arm/arm64: kvm: Use virt_to_idmap instead of virt_to_phys for idmap mappings · 4fda342c
      Santosh Shilimkar authored
      KVM initialisation fails on architectures implementing virt_to_idmap()
      because virt_to_phys() on such architectures does not return the correct
      idmap page.
      
      So update the KVM ARM code to use the virt_to_idmap() to fix the issue.
      Since the KVM code is shared between arm and arm64, we create
      kvm_virt_to_phys() and handle the redirection in respective headers.
      
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
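      A hedged sketch of the redirection the commit describes; the two
      definitions below paraphrase what each architecture's kvm_mmu.h would
      provide and are not quoted from the patch.

      	/* 32-bit ARM header: the idmap may be offset from the linear map,
      	 * so go through virt_to_idmap() */
      	#define kvm_virt_to_phys(x)	virt_to_idmap((unsigned long)(x))

      	/* arm64 header: the idmap is a plain 1:1 mapping of the linear map,
      	 * so an ordinary virtual-to-physical conversion already fits */
      	#define kvm_virt_to_phys(x)	__virt_to_phys((unsigned long)(x))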
    • xen/arm64: do not call the swiotlb functions twice · 02ab71cd
      Stefano Stabellini authored
      On arm64 the dma_map_ops implementation is based on the swiotlb.
      swiotlb-xen, used by default in dom0 on Xen, is also based on the
      swiotlb.
      
      Avoid calling into the default arm64 dma_map_ops functions from
      xen_dma_map_page, xen_dma_unmap_page, xen_dma_sync_single_for_cpu, and
      xen_dma_sync_single_for_device; otherwise we end up calling into the
      swiotlb twice.
      
      When arm64 gets a non-swiotlb based implementation of dma_map_ops, we'll
      probably have to reintroduce dma_map_ops calls in page-coherent.h.
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      CC: catalin.marinas@arm.com
      CC: Will.Deacon@arm.com
      CC: Ian.Campbell@citrix.com
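      A hedged sketch of what the arm64 hooks look like after this change;
      the prototypes are paraphrased from that era's asm/xen/page-coherent.h
      and may differ in detail. The point is that the bodies do nothing, so the
      swiotlb is entered only once, via swiotlb-xen.

      	static inline void xen_dma_map_page(struct device *hwdev, struct page *page,
      					    unsigned long offset, size_t size,
      					    enum dma_data_direction dir,
      					    struct dma_attrs *attrs)
      	{
      		/* intentionally empty: swiotlb-xen has already done the work */
      	}

      	static inline void xen_dma_sync_single_for_cpu(struct device *hwdev,
      						       dma_addr_t handle, size_t size,
      						       enum dma_data_direction dir)
      	{
      		/* likewise a no-op on arm64 */
      	}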
  15. 07 Dec 2013, 2 commits
  16. 29 Nov 2013, 2 commits
  17. 26 Nov 2013, 1 commit