1. 06 Nov, 2017 6 commits
    •
      KVM: arm/arm64: Rework kvm_timer_should_fire · 1c88ab7e
      Christoffer Dall committed
      kvm_timer_should_fire() can be called in two different situations from
      kvm_vcpu_block().
      
      The first case is before calling kvm_timer_schedule(), used for wait
      polling, and in this case the VCPU thread is running and the timer state
      is loaded onto the hardware so all we have to do is check if the virtual
      interrupt lines are asserted, because the timer interrupt handler
      functions will raise those lines as appropriate.
      
      The second case is inside the wait loop of kvm_vcpu_block(), where we
      have already called kvm_timer_schedule() and therefore the hardware will
      be disabled and the software view of the timer state is up to date
      (timer->loaded is false), and so we can simply check if the timer should
      fire by looking at the software state.
      Signed-off-by: Christoffer Dall <cdall@linaro.org>
      Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
      1c88ab7e
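
The two cases described in this commit can be sketched roughly as follows. This is a simplified illustration, not the kernel code: `struct sketch_timer`, its fields, and `sketch_timer_should_fire()` are hypothetical names, and the software-state check (enabled, not masked, compare value reached) is an assumption about what "looking at the software state" amounts to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-timer state; not the kernel's struct. */
struct sketch_timer {
    bool loaded;     /* true while the timer state lives in hardware */
    bool irq_level;  /* line level, raised by the timer interrupt handler */
    bool enabled;    /* software copy: timer enabled */
    bool masked;     /* software copy: timer interrupt masked */
    uint64_t cval;   /* software copy: compare value */
};

/* 'now' is the current counter value as seen by this timer. */
static bool sketch_timer_should_fire(const struct sketch_timer *t, uint64_t now)
{
    assert(t != NULL);

    /* Case 1: state is on hardware, so trust the interrupt line. */
    if (t->loaded)
        return t->irq_level;

    /* Case 2: hardware is disabled, so evaluate the saved software state. */
    return t->enabled && !t->masked && t->cval <= now;
}
```

The point of the split is that the loaded case never reads timer registers; it relies on the interrupt handler having already updated the line level.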
    •
      KVM: arm/arm64: Get rid of kvm_timer_flush_hwstate · 7e90c8e5
      Christoffer Dall committed
      Now that both the vtimer and the ptimer, with either the in-kernel
      vgic emulation or a userspace IRQ chip, are driven by the timer signals
      and at the vcpu load/put boundaries, instead of recomputing the timer
      state at every entry to and exit from the guest, we can get rid of
      the flush hwstate function entirely.
      Signed-off-by: Christoffer Dall <cdall@linaro.org>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      7e90c8e5
    •
      KVM: arm/arm64: Avoid timer save/restore in vcpu entry/exit · b103cc3f
      Christoffer Dall committed
      We don't need to save and restore the hardware timer state and examine
      if it generates interrupts on every entry/exit to the guest.  The
      timer hardware is perfectly capable of telling us when it has expired
      by signaling interrupts.
      
      When taking a vtimer interrupt in the host, we don't want to mess with
      the timer configuration; we just want to forward the physical interrupt
      to the guest as a virtual interrupt.  We can use the split priority drop
      and deactivate feature of the GIC to do this, which leaves an EOI'ed
      interrupt active on the physical distributor, making sure we don't keep
      taking timer interrupts which would prevent the guest from running.  We
      can then forward the physical interrupt to the VM using the HW bit in
      the LR of the GIC, like we do already, which lets the guest directly
      deactivate both the physical and virtual timer simultaneously, allowing
      the timer hardware to exit the VM and generate a new physical interrupt
      when the timer output is again asserted later on.
      
      We do need to capture this state when migrating VCPUs between physical
      CPUs, however.  We use the vcpu put/load functions for this; they are
      called through preempt notifiers whenever the thread is scheduled away
      from the CPU, or called directly if we return from the ioctl to
      userspace.
      
      One caveat is that we have to save and restore the timer state in both
      kvm_timer_vcpu_[put/load] and kvm_timer_[schedule/unschedule], because
      we can have the following flows:
      
        1. kvm_vcpu_block
        2. kvm_timer_schedule
        3. schedule
        4. kvm_timer_vcpu_put (preempt notifier)
        5. schedule (vcpu thread gets scheduled back)
        6. kvm_timer_vcpu_load (preempt notifier)
        7. kvm_timer_unschedule
      
      And a version where we don't actually call schedule:
      
        1. kvm_vcpu_block
        2. kvm_timer_schedule
        7. kvm_timer_unschedule
      
      Since kvm_timer_[schedule/unschedule] may not be followed by put/load,
      but put/load also may be called independently, we call the timer
      save/restore functions from both paths.  Since they rely on the loaded
      flag to never save/restore when unnecessary, this doesn't cause any
      harm, and we ensure that all invocations of either set of functions work
      as intended.
      
      An added benefit beyond not having to read and write the timer sysregs
      on every entry and exit is that we no longer have to actively write the
      active state to the physical distributor, because we configured the
      irq for the vtimer to only get a priority drop when handling the
      interrupt in the GIC driver (we called irq_set_vcpu_affinity()), and
      the interrupt stays active after firing on the host.
      Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Christoffer Dall <cdall@linaro.org>
      b103cc3f
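
The role of the loaded flag in the two flows above can be illustrated with a small sketch. It is a toy model under stated assumptions, not the kernel implementation: `struct sketch_vtimer`, `sketch_timer_save()` and `sketch_timer_restore()` are hypothetical, and the counters stand in for the actual sysreg accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-VCPU timer bookkeeping. */
struct sketch_vtimer {
    bool loaded;        /* true while state is on the (simulated) hardware */
    unsigned saves;     /* number of real save operations performed */
    unsigned restores;  /* number of real restore operations performed */
};

/* Called from both the schedule path and the put path. */
static void sketch_timer_save(struct sketch_vtimer *t)
{
    assert(t != NULL);
    if (!t->loaded)
        return;          /* the other path already saved the state */
    t->saves++;          /* stands in for reading the timer sysregs */
    t->loaded = false;
}

/* Called from both the load path and the unschedule path. */
static void sketch_timer_restore(struct sketch_vtimer *t)
{
    assert(t != NULL);
    if (t->loaded)
        return;          /* the other path already restored the state */
    t->restores++;       /* stands in for writing the timer sysregs */
    t->loaded = true;
}
```

Running the seven-step flow through this model (save, save, restore, restore) performs one real save and one real restore; the shorter flow without schedule performs the same amount of work with only one call on each side.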
    •
      KVM: arm/arm64: Use separate timer for phys timer emulation · f2a2129e
      Christoffer Dall committed
      We were using the same hrtimer for emulating the physical timer and for
      making sure a blocking VCPU thread would be eventually woken up.  That
      worked fine in the previous arch timer design, but as we are about to
      actually use the soft timer expire function for the physical timer
      emulation, change the logic to use a dedicated hrtimer.
      
      This has the added benefit of not having to cancel any work in the sync
      path, which in turn allows us to run the flush and sync with IRQs
      disabled.
      
      Note that the hrtimer used to program the host kernel's timer to
      generate an exit from the guest when the emulated physical timer fires
      never has to inject any work, and to share the soft_timer_cancel()
      function with the bg_timer, we change the function to only cancel any
      pending work if the pointer to the work struct is not null.
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Christoffer Dall <cdall@linaro.org>
      f2a2129e
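
The shared cancel helper with an optional work pointer might look roughly like this. The types here are hypothetical stand-ins for the hrtimer and work struct, and `sketch_soft_timer_cancel()` only models the behaviour described in the commit message (cancel pending work only when a work struct is supplied).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for an hrtimer and a work item. */
struct sketch_hrtimer { bool armed; };
struct sketch_work    { bool pending; };

/*
 * Cancel the soft timer. 'work' may be NULL for timers that never
 * queue any work, such as the one emulating the physical timer.
 */
static void sketch_soft_timer_cancel(struct sketch_hrtimer *hrt,
                                     struct sketch_work *work)
{
    assert(hrt != NULL);
    hrt->armed = false;
    if (work)
        work->pending = false;
}
```

Accepting a NULL pointer keeps one helper for both timers instead of two near-identical functions.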
    •
      KVM: arm/arm64: Rename soft timer to bg_timer · 14d61fa9
      Christoffer Dall committed
      As we are about to introduce a separate hrtimer for the physical timer,
      call this timer bg_timer, because we refer to this timer as the
      background timer in the code and comments elsewhere.
      Signed-off-by: Christoffer Dall <cdall@linaro.org>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      14d61fa9
    •
      KVM: arm/arm64: Make timer_arm and timer_disarm helpers more generic · 8409a06f
      Christoffer Dall committed
      We are about to add an additional soft timer to the arch timer state for
      a VCPU and would like to be able to reuse the functions to program and
      cancel a timer, so we make them slightly more generic and rename them to
      make it clearer that these functions work on soft timers and not the
      hardware resource that this code is managing.
      
      The armed flag on the timer state is only used to assert a condition,
      and we don't rely on this assertion in any meaningful way, so we can
      simply get rid of this flag and slightly reduce complexity.
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Christoffer Dall <cdall@linaro.org>
      8409a06f
  2. 02 Nov, 2017 3 commits
    •
      irqchip: mips-gic: Use irq_cpu_online to (un)mask all-VP(E) IRQs · da61fcf9
      Paul Burton committed
      The gic_all_vpes_local_irq_controller chip currently attempts to operate
      on all CPUs/VPs in the system when masking or unmasking an interrupt.
      This has a few drawbacks:
      
       - In multi-cluster systems we may not always have access to all CPUs in
         the system. When all CPUs in a cluster are powered down that
         cluster's GIC may also power down, in which case we cannot configure
         its state.
      
       - Relatedly, if we power down a cluster after having configured
         interrupts for CPUs within it then the cluster's GIC may lose state &
         we need to reconfigure it. The current approach doesn't take this
         into account.
      
       - It's wasteful if we run Linux on fewer VPs than are present in the
         system. For example if we run a uniprocessor kernel on CPU0 of a
         system with 16 CPUs then there's no point in us configuring CPUs
         1-15.
      
       - The implementation is also lacking in that it expects the range
         0..gic_vpes-1 to represent valid Linux CPU numbers which may not
         always be the case - for example if we run on a system with more VPs
         than the kernel is configured to support.
      
      Fix all of these issues by only configuring the affected interrupts for
      CPUs which are online at the time, and recording the configuration in a
      new struct gic_all_vpes_chip_data for later use by CPUs being brought
      online. We register a CPU hotplug state (reusing
      CPUHP_AP_IRQ_GIC_STARTING which the ARM GIC driver uses, and which seems
      suitably generic for reuse with the MIPS GIC) and execute
      irq_cpu_online() in order to configure the interrupts on the newly
      onlined CPU.
      Signed-off-by: Paul Burton <paul.burton@mips.com>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mips@linux-mips.org
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      da61fcf9
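
The "record the configuration, apply it to online CPUs now and to other CPUs when they come online" idea can be sketched as below. This is a toy model, not the driver: `struct sketch_gic`, `SKETCH_NR_CPUS`, and both functions are hypothetical, and a single boolean stands in for the per-interrupt configuration kept in the chip data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 4  /* hypothetical CPU count for the model */

struct sketch_gic {
    bool recorded_mask;              /* configuration kept for later CPUs */
    bool online[SKETCH_NR_CPUS];     /* which CPUs are currently online */
    bool hw_masked[SKETCH_NR_CPUS];  /* simulated per-CPU hardware state */
};

/* Mask or unmask: record the request, touch only online CPUs. */
static void sketch_set_mask(struct sketch_gic *gic, bool mask)
{
    assert(gic != NULL);
    gic->recorded_mask = mask;
    for (size_t cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        if (gic->online[cpu])
            gic->hw_masked[cpu] = mask;
}

/* Hotplug callback: replay the recorded configuration on the new CPU. */
static void sketch_cpu_online(struct sketch_gic *gic, size_t cpu)
{
    assert(gic != NULL && cpu < SKETCH_NR_CPUS);
    gic->online[cpu] = true;
    gic->hw_masked[cpu] = gic->recorded_mask;
}
```

Offline CPUs are never written, which matches the motivation in the list above: their GIC may be powered down or may not correspond to a valid Linux CPU number.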
    •
      irqdomain: Update the comments of fwnode field of irq_domain structure · 4b821300
      Dou Liyang committed
      Commit:
      
      f110711a ("irqdomain: Convert irqdomain->of_node to fwnode")
      
      converted of_node field to fwnode, but didn't update its comments.
      
      Update it.
      
      Fixes: f110711a ("irqdomain: Convert irqdomain->of_node to fwnode")
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      4b821300
    •
      irqchip/gic-v3-its: Setup VLPI properties at map time · d4d7b4ad
      Marc Zyngier committed
      So far, we require the hypervisor to update the VLPI properties
      once the VLPI mapping has been established. While this
      makes it easy for the ITS driver, it creates a window where
      an incoming interrupt can be delivered with an unknown set
      of properties. Not very nice.
      
      Instead, let's add a "properties" field to the mapping structure,
      and use that to configure the VLPI before it actually gets mapped.
      Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      d4d7b4ad
  3. 19 Oct, 2017 4 commits
    •
      irqchip/gic-v3-its: Limit scope of VPE mapping to be per ITS · 2247e1bf
      Marc Zyngier committed
      So far, we map all VPEs on all ITSs. While this is not wrong,
      this is quite a big hammer, as moving a VPE around requires
      all ITSs to be synchronized. Needless to say, this is an
      expensive proposition.
      
      Instead, let's switch to a mode where we issue VMAPP commands
      only on ITSs that are actually involved in reporting interrupts
      to the given VM.
      
      For that purpose, we refcount the number of interrupts that are
      mapped for this VM on each ITS, performing the map/unmap
      operations as required. It then allows us to use this refcount
      to only issue VMOVP to the ITSs that need to know about this
      VM.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      2247e1bf
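
The per-ITS refcounting described here can be sketched as follows. This is an illustrative model only: `struct sketch_vm`, `SKETCH_ITS_MAX`, and the three functions are hypothetical names, and the boolean return values merely indicate when a map or unmap command would be needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ITS_MAX 16  /* hypothetical bound on the number of ITSs */

/* Per-VM count of interrupts mapped on each ITS. */
struct sketch_vm {
    unsigned vlpi_count[SKETCH_ITS_MAX];
};

/* Returns true when this is the first interrupt on 'its' (map needed). */
static bool sketch_vm_get(struct sketch_vm *vm, size_t its)
{
    assert(vm != NULL && its < SKETCH_ITS_MAX);
    return vm->vlpi_count[its]++ == 0;
}

/* Returns true when the last interrupt on 'its' is gone (unmap needed). */
static bool sketch_vm_put(struct sketch_vm *vm, size_t its)
{
    assert(vm != NULL && its < SKETCH_ITS_MAX);
    assert(vm->vlpi_count[its] > 0);
    return --vm->vlpi_count[its] == 0;
}

/* Bitmask of ITSs that must be told when a VPE of this VM moves. */
static uint16_t sketch_vmovp_targets(const struct sketch_vm *vm)
{
    uint16_t mask = 0;

    assert(vm != NULL);
    for (size_t its = 0; its < SKETCH_ITS_MAX; its++)
        if (vm->vlpi_count[its])
            mask |= (uint16_t)(1u << its);
    return mask;
}
```

With this bookkeeping, moving a VPE only has to involve the ITSs whose count is non-zero rather than every ITS in the system.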
    •
      irqchip/gic-v3-its: Make GICv4_ITS_LIST_MAX globally available · ab60491e
      Marc Zyngier committed
      As we're about to make use of the maximum number of ITSs in
      a GICv4 system, let's make this value global (and rename it to
      GICv4_ITS_LIST_MAX).
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      ab60491e
    •
      irqchip/gic-v3: Add support for Range Selector (RS) feature · eda0d04a
      Shanker Donthineni committed
      A new Range Selector (RS) feature has been added to the GIC specification
      in order to support more than 16 CPUs at affinity level 0. New fields
      are introduced in SGI system registers (ICC_SGI0R_EL1, ICC_SGI1R_EL1
      and ICC_ASGI1R_EL1) to relax an artificial limit of 16 at level 0.
      
      - A new RSS field in ICC_CTLR_EL3, ICC_CTLR_EL1 and ICV_CTLR_EL1:
        [18] - Range Selector Support (RSS)
        0b0 = Targeted SGIs with affinity level 0 values of 0-15 are supported.
        0b1 = Targeted SGIs with affinity level 0 values of 0-255 are supported.
      
      - A new RS field in ICC_SGI0R_EL1, ICC_SGI1R_EL1 and ICC_ASGI1R_EL1:
        [47:44] - RangeSelector (RS) selects which group of 16 aff0 values the
                  TargetList field covers.
                  TargetList[n] represents aff0 value ((RS*16)+n)
                  When ICC_CTLR_EL3.RSS==0 or ICC_CTLR_EL1.RSS==0, RS is RES0.
      
      - A new RSS field in GICD_TYPER:
        [26] - Range Selector Support (RSS)
        0b0 = Targeted SGIs with affinity level 0 values of 0-15 are supported.
        0b1 = Targeted SGIs with affinity level 0 values of 0-255 are supported.
      Signed-off-by: Shanker Donthineni <shankerd@codeaurora.org>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      eda0d04a
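
As a worked example of the RS arithmetic, the sketch below turns an affinity level 0 value into the RS field and the TargetList bit, using the relation aff0 = (RS*16)+n given above. The function name and macro are hypothetical, and placing TargetList in bits [15:0] is an assumption taken from the SGI register layout rather than from this commit message. The other register fields (INTID, Aff1..Aff3) are left out.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SGI_RS_SHIFT 44  /* RS occupies bits [47:44] */

/*
 * Build only the RS and TargetList parts of an SGI register value
 * for a single target CPU with the given affinity level 0 value.
 */
static uint64_t sketch_sgi_rs_and_target(unsigned aff0)
{
    assert(aff0 <= 255);

    uint64_t rs = aff0 / 16;                 /* which group of 16 */
    uint64_t target = 1ULL << (aff0 % 16);   /* bit n within the group */

    return (rs << SKETCH_SGI_RS_SHIFT) | target;
}
```

For instance, aff0 = 17 falls in group RS = 1 at position n = 1, so the value has RS = 1 and TargetList bit 1 set. Without RSS support, RS is RES0 and only aff0 values 0-15 can be targeted.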
    •
      irqdomain: Move revmap_trees_mutex to struct irq_domain · f1d78358
      Masahiro Yamada committed
      The revmap_trees_mutex protects domain->revmap_tree.  There is no
      need to make it global because it is allowed to modify revmap_tree
      of two different domains concurrently.  Having said that, this would
      not be an actual bottleneck because the interrupt map/unmap does not
      occur very often.  Rather, the motivation is to tidy up the code
      from a data structure point of view.
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      f1d78358
  4. 17 Oct, 2017 1 commit
  5. 04 Oct, 2017 11 commits
  6. 03 Oct, 2017 2 commits
  7. 01 Oct, 2017 2 commits
    •
      udp: perform source validation for mcast early demux · bc044e8d
      Paolo Abeni committed
      The UDP early demux can leverage the rx dst cache even for
      multicast unconnected sockets.
      
      In such scenario the ipv4 source address is validated only on
      the first packet in the given flow. After that, when we fetch
      the dst entry from the socket rx cache, we stop enforcing
      the rp_filter and we even start accepting any kind of martian
      addresses.
      
      Disabling the dst cache for unconnected multicast sockets will
      cause a large performance regression, nearly reducing by half the
      max ingress tput.
      
      Instead we factor out a route helper to completely validate an
      skb source address for multicast packets and we call it from
      the UDP early demux for mcast packets landing on unconnected
      sockets, after successfully fetching the related cached dst entry.
      
      This still gives a measurable, but limited performance
      regression:
      
      		rp_filter = 0		rp_filter = 1
      edmux disabled:	1182 Kpps		1127 Kpps
      edmux before:	2238 Kpps		2238 Kpps
      edmux after:	2037 Kpps		2019 Kpps
      
      The above figures are on top of current net tree.
      Applying the net-next commit 6e617de8 ("net: avoid a full
      fib lookup when rp_filter is disabled.") the delta with
      rp_filter == 0 will decrease even more.
      
      Fixes: 421b3885 ("udp: ipv4: Add udp early demux")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc044e8d
    •
      IPv4: early demux can return an error code · 7487449c
      Paolo Abeni committed
      Currently no error is emitted, but this infrastructure will
      be used by the next patch to allow source address validation
      for mcast sockets.
      Since early demux can do a route lookup and an ipv4 route
      lookup can return an error code this is consistent with the
      current ipv4 route infrastructure.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7487449c
  8. 29 Sep, 2017 6 commits
  9. 28 Sep, 2017 3 commits
    •
      timer: Prepare to change timer callback argument type · 686fef92
      Kees Cook committed
      Modern kernel callback systems pass the structure associated with a
      given callback to the callback function. The timer callback remains one
      of the legacy cases where an arbitrary unsigned long argument continues
      to be passed as the callback argument. This has several problems:
      
      - This bloats the timer_list structure with a normally redundant
        .data field.
      
      - No type checking is being performed, forcing callbacks to do
        explicit type casts of the unsigned long argument into the object
        that was passed, rather than using container_of(), as done in most
        of the other callback infrastructure.
      
      - Neighboring buffer overflows can overwrite both the .function and
        the .data field, providing attackers with a way to elevate from a buffer
        overflow into a simplistic ROP-like mechanism that allows calling
        arbitrary functions with a controlled first argument.
      
      - For future Control Flow Integrity work, this creates a unique function
        prototype for timer callbacks, instead of allowing them to continue to
        be clustered with other void functions that take a single unsigned long
        argument.
      
      This adds a new timer initialization API, which will ultimately replace
      the existing setup_timer(), setup_{deferrable,pinned,etc}_timer() family,
      named timer_setup() (to mirror hrtimer_setup(), making instances of its
      use much easier to grep for).
      
      In order to support the migration of existing timers into the new
      callback arguments, timer_setup() casts its arguments to the existing
      legacy types, and explicitly passes the timer pointer as the legacy
      data argument. Once all setup_*timer() callers have been replaced with
      timer_setup(), the casts can be removed, and the data argument can be
      dropped with the timer expiration code changed to just pass the timer
      to the callback directly.
      
      Since the regular pattern of using container_of() during local variable
      declaration repeats the need for the variable type declaration
      to be included, this adds a helper modeled after other from_*()
      helpers that wrap container_of(), named from_timer(). This helper uses
      typeof(*variable), removing the type redundancy and minimizing the need
      for line wraps in forthcoming conversions from "unsigned long data" to
      "struct timer_list *" in the timer callbacks:
      
      -void callback(unsigned long data)
      +void callback(struct timer_list *t)
      {
      -   struct some_data_structure *local = (struct some_data_structure *)data;
      +   struct some_data_structure *local = from_timer(local, t, timer);
      
      Finally, in order to support the handful of timer users that perform
      open-coded assignments of the .function (and .data) fields, provide
      cast macros (TIMER_FUNC_TYPE and TIMER_DATA_TYPE) that can be used
      temporarily. Once conversion has been completed, these can be
      trivially removed everywhere.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20170928133817.GA113410@beast
      686fef92
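
A userspace sketch of the callback pattern this commit prepares for is shown below. Everything prefixed `sketch_` is hypothetical and only mimics the shape of the API described above: the callback receives the timer pointer, and a `from_timer()`-style helper recovers the enclosing structure via `container_of()`. It relies on the `__typeof__` extension available in GCC and Clang.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a timer object with a typed callback. */
struct sketch_timer_list {
    void (*function)(struct sketch_timer_list *t);
};

/* Recover the enclosing object from a pointer to one of its members. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Same idea, but the type is taken from the variable being declared. */
#define sketch_from_timer(var, callback_timer, timer_fieldname) \
    sketch_container_of(callback_timer, __typeof__(*var), timer_fieldname)

struct sketch_device {
    int ticks;
    struct sketch_timer_list timer;  /* embedded, so no .data field is needed */
};

static void sketch_timer_setup(struct sketch_timer_list *t,
                               void (*fn)(struct sketch_timer_list *))
{
    assert(t != NULL && fn != NULL);
    t->function = fn;
}

/* Callback: no cast from unsigned long, the type is checked by the compiler. */
static void sketch_device_tick(struct sketch_timer_list *t)
{
    struct sketch_device *dev = sketch_from_timer(dev, t, timer);

    dev->ticks++;
}
```

Because the device pointer is derived from the timer's address rather than stored next to the function pointer, there is no separate data word for an overflow to corrupt.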
    •
      net/mlx5: Check device capability for maximum flow counters · 16f1c5bb
      Raed Salem committed
      Add a check for the maximum number of flow counters attached
      to a rule (FTE).
      
      Fixes: bd5251db ('net/mlx5_core: Introduce flow steering destination of type counter')
      Signed-off-by: Raed Salem <raeds@mellanox.com>
      Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      16f1c5bb
    •
      net/mlx5: Fix FPGA capability location · 99d3cd27
      Inbar Karmy committed
      Currently, the FPGA capability is located in (mdev)->caps.hca_cur.
      Change the location to (mdev)->caps.fpga,
      since hca_cur is reserved for HCA device capabilities.
      
      Fixes: e29341fb ("net/mlx5: FPGA, Add basic support for Innova")
      Signed-off-by: Inbar Karmy <inbark@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      99d3cd27
  10. 27 Sep, 2017 1 commit
  11. 26 Sep, 2017 1 commit
    •
      percpu: make this_cpu_generic_read() atomic w.r.t. interrupts · e88d62cd
      Mark Rutland committed
      As raw_cpu_generic_read() is a plain read from a raw_cpu_ptr() address,
      it's possible (albeit unlikely) that the compiler will split the access
      across multiple instructions.
      
      In this_cpu_generic_read() we disable preemption but not interrupts
      before calling raw_cpu_generic_read(). Thus, an interrupt could be taken
      in the middle of the split load instructions. If a this_cpu_write() or
      RMW this_cpu_*() op is made to the same variable in the interrupt
      handling path, this_cpu_read() will return a torn value.
      
      For native word types, we can avoid tearing using READ_ONCE(), but this
      won't work in all cases (e.g. 64-bit types on most 32-bit platforms).
      This patch reworks this_cpu_generic_read() to use READ_ONCE() where
      possible, otherwise falling back to disabling interrupts.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e88d62cd
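
The "single volatile load where the type fits a native word, otherwise read with interrupts disabled" approach can be modelled in userspace as below. This is only a sketch under stated assumptions: `SKETCH_READ_ONCE`, the simulated interrupt helpers, and `sketch_read_u64()` are hypothetical, the native-word test is simplified to a size comparison against `long`, and the counter merely records how often the slow path was taken.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts simulated interrupt-disabled sections, for illustration only. */
static unsigned sketch_irq_off_count;

static void sketch_irq_save(void)    { sketch_irq_off_count++; }
static void sketch_irq_restore(void) { /* nothing to undo in this model */ }

/* Force a single access through a volatile-qualified lvalue. */
#define SKETCH_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

static uint64_t sketch_read_u64(uint64_t *p)
{
    uint64_t v;

    assert(p != NULL);
    if (sizeof(*p) <= sizeof(long)) {
        /* Native word: one load, so an interrupt cannot split it. */
        v = SKETCH_READ_ONCE(*p);
    } else {
        /* Wider than a word: keep interrupts out during the read. */
        sketch_irq_save();
        v = *p;
        sketch_irq_restore();
    }
    return v;
}
```

On a platform where `long` is 64 bits the fast path is taken and the counter stays at zero; on a 32-bit platform the same call goes through the interrupt-disabled path.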