1. 17 May 2016, 40 commits
    • bpf, arm64: add support for constant blinding · 26eb042e
      Committed by Daniel Borkmann
      This patch wires the recently added constant blinding helpers into the
      arm64 eBPF JIT. In the bpf_int_jit_compile() path, the requirements are
      to use the bpf_jit_blind_constants()/bpf_jit_prog_release_other()
      pair for rewriting the program into a blinded one, and to map the
      BPF_REG_AX register to a CPU register. BPF_REG_AX is mapped to x9.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Zi Shen Lim <zlim.lnx@gmail.com>
      Acked-by: Yang Shi <yang.shi@linaro.org>
      Tested-by: Yang Shi <yang.shi@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      26eb042e
    • bpf, x86: add support for constant blinding · 959a7579
      Committed by Daniel Borkmann
      This patch wires the recently added constant blinding helpers into the
      x86 eBPF JIT. In the bpf_int_jit_compile() path, the requirements are
      to use the bpf_jit_blind_constants()/bpf_jit_prog_release_other()
      pair for rewriting the program into a blinded one, and to map the
      BPF_REG_AX register to a CPU register. BPF_REG_AX is mapped to the
      non-callee-saved register r10, and is thus shared with the cached
      skb->data used for ld_abs/ind, which is not needed in every program
      type. When blinding is not used, there's zero additional overhead in
      the generated image.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      959a7579
    • bpf: add generic constant blinding for use in jits · 4f3446bb
      Committed by Daniel Borkmann
      This work adds a generic facility for use from eBPF JIT compilers
      that allows for further hardening of JIT generated images through
      blinding constants. In response to the original work on BPF JIT
      spraying published by Keegan McAllister [1], most BPF JITs were
      changed to make images read-only and start at a randomized offset
      in the page, where the rest was filled with trap instructions. We
      have this nowadays in x86, arm, arm64 and s390 JIT compilers.
      Additionally, later work also made eBPF interpreter images read-only
      for kernels supporting DEBUG_SET_MODULE_RONX, that is, currently the
      x86, arm, arm64 and s390 archs as well. This is done by default for
      the mentioned JITs when JITing is enabled. Furthermore, a generic and
      configurable constant blinding facility to make spraying even harder
      has been on our todo list for quite some time, with a first
      implementation around netconf 2016.
      
      We found that on systems where untrusted users can load cBPF/eBPF
      code and the JIT is enabled, start offset randomization helps a bit
      to make jumps into a crafted payload harder, but when larger
      programs that cross a page boundary are injected, we again have some
      part of the program opcodes at a page start offset. With improved
      guessing and more reliable payload injection, the chances of jumping
      into such a payload increase. Elena Reshetova recently wrote a test
      case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
      can leave some more room for payloads. Note that for all this,
      additional bugs in the kernel are still required to make the jump
      (and of course to guess right, so as not to jump into a trap), and
      naturally the JIT must be enabled, which it is not by default.
      
      To help mitigate this, the general idea is to provide an option
      bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
      that for cases where the JIT should be enabled for performance
      reasons, the generated image can be further hardened by blinding
      constants for unprivileged users (bpf_jit_harden == 1), trading off
      performance for those users, but not for privileged ones. We also
      added the option of blinding for all users (bpf_jit_harden == 2),
      which is quite helpful for testing, f.e. with test_bpf.ko. No further
      hardening levels of the bpf_jit_harden switch are intended; the
      rationale is to keep it dead simple to use as an on/off switch. Since
      this functionality would need to be duplicated over and over for JIT
      compilers, which are already complex enough, we provide a generic
      eBPF byte-code level blinding implementation that is then just
      transparently JITed. JIT compilers need to make only a few changes
      to integrate this facility and can be migrated one by one.
      
      This option is for eBPF JITs and will be used in the x86, arm64 and
      s390 JITs without too much effort, and soon in the ppc64 JIT, so that
      native eBPF as well as cBPF-to-eBPF migrations can be blinded, both
      covered by a single implementation. The rule for JITs is that
      bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
      and in case blinding is disabled, we proceed normally with JITing the
      passed program. In case blinding is enabled and we fail during the
      blinding process itself, we must fall back to the interpreter.
      Similarly, in case JITing fails after blinding, we fall back to the
      interpreter with the non-blinded code. Meaning, the interpreter
      doesn't change in any way and operates on eBPF code as usual. For
      doing this pre-JIT blinding step, we need to make use of a
      helper/auxiliary register, here BPF_REG_AX. This is strictly internal
      to the JIT and not in any way part of the eBPF architecture. Just as
      JITs internally make use of some helper registers when emitting code,
      here the helper register sits one abstraction level higher, in eBPF
      bytecode, but is nevertheless only used during the JIT phase. Such a
      helper register is needed since, f.e., a manually written program can
      issue loads to all registers of the eBPF architecture.
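
      A simplified sketch of that contract, not the verbatim arch code (the
      actual instruction emission of the individual JITs is elided here):

        struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
        {
                struct bpf_prog *tmp, *orig_prog = prog;
                bool tmp_blinded = false;

                tmp = bpf_jit_blind_constants(prog);
                if (IS_ERR(tmp))
                        return orig_prog;   /* blinding failed -> interpreter */
                if (tmp != prog) {          /* constants were rewritten */
                        tmp_blinded = true;
                        prog = tmp;
                }

                /* ... emit arch code for 'prog'; on JIT failure set
                 *     prog = orig_prog so we fall back to the interpreter ...
                 */

                if (tmp_blinded)            /* drop whichever copy we don't return */
                        bpf_jit_prog_release_other(prog, prog == orig_prog ?
                                                   tmp : orig_prog);
                return prog;
        }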
      
      The core concept with the additional register is: blind out all 32
      and 64 bit constants by converting BPF_K based instructions into a
      small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
      is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
      and REG <OP> BPF_REG_AX, so actual operation on the target register
      is translated from BPF_K into BPF_X one that is operating on
      BPF_REG_AX's content. During rewriting phase when blinding, RND is
      newly generated via prandom_u32() for each processed instruction.
      64 bit loads are split into two 32 bit loads to make translation and
      patching not too complex. Only basic thing required by JITs is to
      call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
      pair, and to map BPF_REG_AX into an unused register.
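
      A tiny stand-alone C demo of the underlying identity (the constant
      and random values below are arbitrary examples; the kernel draws RND
      via prandom_u32() per instruction):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint32_t k   = 0xa8909090; /* attacker-chosen immediate */
                uint32_t rnd = 0x4989b5f3; /* per-instruction random value */
                uint32_t ax;

                ax  = rnd ^ k;  /* BPF_REG_AX := RND ^ K (emitted immediate) */
                ax ^= rnd;      /* BPF_REG_AX ^= RND; K never appears as-is  */

                printf("reconstructed 0x%08x, matches: %d\n", ax, ax == k);
                return 0;
        }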
      
      Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
      
      echo 0 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f5e9 + <x>:
        [...]
        39:   mov    $0xa8909090,%eax
        3e:   mov    $0xa8909090,%eax
        43:   mov    $0xa8ff3148,%eax
        48:   mov    $0xa89081b4,%eax
        4d:   mov    $0xa8900bb0,%eax
        52:   mov    $0xa810e0c1,%eax
        57:   mov    $0xa8908eb4,%eax
        5c:   mov    $0xa89020b0,%eax
        [...]
      
      echo 1 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f1e5 + <x>:
        [...]
        39:   mov    $0xe1192563,%r10d
        3f:   xor    $0x4989b5f3,%r10d
        46:   mov    %r10d,%eax
        49:   mov    $0xb8296d93,%r10d
        4f:   xor    $0x10b9fd03,%r10d
        56:   mov    %r10d,%eax
        59:   mov    $0x8c381146,%r10d
        5f:   xor    $0x24c7200e,%r10d
        66:   mov    %r10d,%eax
        69:   mov    $0xeb2a830e,%r10d
        6f:   xor    $0x43ba02ba,%r10d
        76:   mov    %r10d,%eax
        79:   mov    $0xd9730af,%r10d
        7f:   xor    $0xa5073b1f,%r10d
        86:   mov    %r10d,%eax
        89:   mov    $0x9a45662b,%r10d
        8f:   xor    $0x325586ea,%r10d
        96:   mov    %r10d,%eax
        [...]
      
      As can be seen, the original constants that carry the payload are
      hidden when blinding is enabled, and the actual operations are
      transformed from constant-based to register-based ones, making jumps
      into constants ineffective. The extract above uses a single BPF load
      instruction over and over, but of course all instructions with
      constants are blinded.
      
      Performance-wise, JIT with blinding is a bit slower than JIT alone
      and faster than the interpreter. This is expected, since we still get
      all the performance benefits from JITing and in normal use-cases not
      every single instruction needs to be blinded. Summing up all 296 test
      cases from the test_bpf.ko suite, averaged over multiple runs, the
      interpreter was 55% slower than JIT only and JIT with blinding was 8%
      slower than JIT only. Since there are also some extremes in the test
      suite, I expect that for ordinary workloads the performance of the
      JIT with blinding case is even closer to the JIT-only case; f.e. the
      nmap test case from the suite has averaged timings in ns of 29 (JIT),
      35 (+ blinding), and 151 (interpreter).
      
      The BPF test suite, seccomp test suite, eBPF sample code and various
      bigger networking eBPF programs have been tested with this and were
      running fine. For testing purposes, I also adapted the interpreter
      and redirected the blinded eBPF image to the interpreter, and here
      too all tests pass.
      
        [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
        [2] https://github.com/01org/jit-spray-poc-for-ksp/
        [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4f3446bb
    • bpf: prepare bpf_int_jit_compile/bpf_prog_select_runtime apis · d1c55ab5
      Committed by Daniel Borkmann
      Since the blinding is strictly only called from inside eBPF JITs,
      we need to change the signatures of bpf_int_jit_compile() and
      bpf_prog_select_runtime() first, in order to prepare for the fact
      that the eBPF program we're dealing with can change underneath us.
      Hence, we need to return the latest prog to call sites. No functional
      change in this patch.
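
      The resulting call-site pattern looks roughly as follows (a sketch of
      the reworked prototypes; the label name is only for illustration):

        struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
        struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err);

        /* caller side, e.g. when finalizing a loaded program: */
        fp = bpf_prog_select_runtime(fp, &err);
        if (err)
                goto free_prog;  /* illustrative error path */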
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d1c55ab5
    • bpf: add bpf_patch_insn_single helper · c237ee5e
      Committed by Daniel Borkmann
      Move the functionality to patch instructions out of the verifier
      code and into the core as the new bpf_patch_insn_single() helper
      will be needed later on for blinding as well. No changes in
      functionality.
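
      For reference, the helper's interface is along these lines (shown as
      a sketch): it replaces the single instruction at offset off with the
      patch sequence and hands back the adjusted program.

        struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
                                               const struct bpf_insn *patch,
                                               u32 len);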
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c237ee5e
    • bpf, x86/arm64: remove useless checks on prog · 93a73d44
      Committed by Daniel Borkmann
      There is never a situation where bpf_int_jit_compile() is called
      with prog as NULL or len as 0, so the checks are unnecessary and
      confusing, as people would just copy them. s390 doesn't have them,
      so no change is needed there.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93a73d44
    • bpf: split HAVE_BPF_JIT into cBPF and eBPF variant · 6077776b
      Committed by Daniel Borkmann
      Split the HAVE_BPF_JIT into two for distinguishing cBPF and eBPF JITs.
      
      Current cBPF ones:
      
        # git grep -n HAVE_CBPF_JIT arch/
        arch/arm/Kconfig:44:    select HAVE_CBPF_JIT
        arch/mips/Kconfig:18:   select HAVE_CBPF_JIT if !CPU_MICROMIPS
        arch/powerpc/Kconfig:129:       select HAVE_CBPF_JIT
        arch/sparc/Kconfig:35:  select HAVE_CBPF_JIT
      
      Current eBPF ones:
      
        # git grep -n HAVE_EBPF_JIT arch/
        arch/arm64/Kconfig:61:  select HAVE_EBPF_JIT
        arch/s390/Kconfig:126:  select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
        arch/x86/Kconfig:94:    select HAVE_EBPF_JIT                    if X86_64
      
      Later code also needs this facility to check for eBPF JITs.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6077776b
    • bpf: move bpf_jit_enable declaration · c94987e4
      Committed by Daniel Borkmann
      Move the bpf_jit_enable declaration to the filter.h file where
      most other core code is declared, also since we're going to add
      a second knob there.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c94987e4
    • bpf: minor cleanups in ebpf code · 4936e352
      Committed by Daniel Borkmann
      Besides others, remove redundant comments where the code is self
      documenting enough, and properly indent various bpf_verifier_ops
      and bpf_prog_type_list declarations. Moreover, remove two exports
      that actually have no module user.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4936e352
    • net: dsa: mv88e6xxx: remove bridge work · 553eb544
      Committed by Vivien Didelot
      Now that the bridge code defers the switchdev port state setting, there
      is no need to defer the port STP state change within the mv88e6xxx code.
      Thus get rid of the driver's bridge work code.
      
      This also fixes a race condition where the DSA layer assumes that the
      bridge code already set the unbridged port's STP state to Disabled
      before restoring the Forwarding state.
      
      As a consequence, this also fixes the FDB flush for the unbridged port
      which now correctly occurs during the Forwarding to Disabled transition.
      
      Fixes: 0bc05d58 ("switchdev: allow caller to explicitly request attr_set as deferred")
      Reported-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      553eb544
    • net: vrf: protect changes to private data with rcu · b0e95ccd
      Committed by David Ahern
      One cpu can be processing packets, which includes using the cached route
      entries in the vrf device's private data, while on another cpu the device
      gets deleted, which releases the routes and sets the pointers in net_vrf
      to NULL. This results in the datapath dereferencing a NULL pointer.
      
      Fix by protecting access to dst's with rcu.
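
      The protection follows the usual RCU publish/read pattern; a minimal
      sketch (the struct and field names here are illustrative, not
      necessarily the driver's exact ones):

        /* datapath (reader), under rcu_read_lock(): */
        struct rtable *rth = rcu_dereference(vrf->rth);  /* may be NULL now */
        if (!rth)
                goto drop;      /* device is being dismantled */

        /* device delete (writer): publish NULL before releasing the route */
        RCU_INIT_POINTER(vrf->rth, NULL);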
      
      Fixes: 193125db ("net: Introduce VRF device driver")
      Fixes: 35402e31 ("net: Add IPv6 support to VRF device")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b0e95ccd
    • tcp: minor optimizations around tcp_hdr() usage · ea1627c2
      Committed by Eric Dumazet
      tcp_hdr() is slightly more expensive than using skb->data in contexts
      where we know they point to the same byte.
      
      In the receive path, tcp_v4_rcv() and tcp_v6_rcv() are in this situation,
      as the tcp header has not been pulled yet.
      
      In the output path, the same can be said right after we have pushed the
      tcp header into the skb, in tcp_transmit_skb() and tcp_make_synack().
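
      In other words, where the header position is known, the pattern is
      simply (sketch):

        /* rx path (tcp_v4_rcv/tcp_v6_rcv), header not pulled yet: */
        th = (const struct tcphdr *)skb->data;

        /* is equivalent here to, but cheaper than: */
        th = tcp_hdr(skb);      /* i.e. skb_transport_header(skb) */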
      
      Also factorize the two checks for tcb->tcp_flags & TCPHDR_SYN in
      tcp_transmit_skb() and pass the tcp header pointer to tcp_ecn_send(),
      so that the compiler can further optimize and avoid a reload.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ea1627c2
    • netlink: kill nla_put_u64() · 50225243
      Committed by Nicolas Dichtel
      This function is not used anymore. nla_put_u64_64bit() should be used
      instead.
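
      The conversion is mechanical; a sketch with placeholder attribute
      names (MYATTR_STAT and MYATTR_PAD are illustrative, not real
      attributes):

        /* old (removed): plain 64-bit attribute, possibly unaligned */
        nla_put_u64(skb, MYATTR_STAT, value);

        /* new: caller names a pad attribute so the u64 payload stays aligned */
        nla_put_u64_64bit(skb, MYATTR_STAT, value, MYATTR_PAD);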
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      50225243
    • sock: propagate __sock_cmsg_send() error · 2632616b
      Committed by Eric Dumazet
      __sock_cmsg_send() might return different error codes, not only -EINVAL.
      
      Fixes: 24025c46 ("ipv4: process socket-level control messages in IPv4")
      Fixes: ad1e46a8 ("ipv6: process socket-level control messages in IPv6")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2632616b
    • net: qrtr: fix build problems · a986a05d
      Committed by Arnd Bergmann
      Having multiple loadable modules with the same name cannot work
      with modprobe, and having both net/qrtr/smd.ko and drivers/soc/qcom/smd.ko
      results in a (somewhat cryptic) build error:
      
      ERROR: "qcom_smd_driver_unregister" [net/qrtr/smd.ko] undefined!
      ERROR: "qcom_smd_driver_register" [net/qrtr/smd.ko] undefined!
      ERROR: "qcom_smd_set_drvdata" [net/qrtr/smd.ko] undefined!
      ERROR: "qcom_smd_send" [net/qrtr/smd.ko] undefined!
      ERROR: "qcom_smd_get_drvdata" [net/qrtr/smd.ko] undefined!
      ERROR: "qcom_smd_driver_unregister" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
      ERROR: "qcom_smd_driver_register" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
      ERROR: "qcom_smd_set_drvdata" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
      ERROR: "qcom_smd_send" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
      ERROR: "qcom_smd_get_drvdata" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
      
      Also, the qrtr driver uses the SMD interface and has a Kconfig dependency,
      but also allows for compile-testing when SMD is disabled. However,
      with QCOM_SMD=m and COMPILE_TEST=y we can end up with QRTR_SMD=y, and
      that fails with a related link error.
      
      This changes the dependency so we can still compile-test the driver but
      not have it built-in if SMD is a module, to avoid running into the broken
      configuration, and changes the Makefile to provide the driver under
      a different module name.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: bdabad3e ("net: Add Qualcomm IPC router")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a986a05d
    • Merge branch 'tc_flower_offload' · 148bd3a3
      Committed by David S. Miller
      Amir Vadai says:
      
      ====================
      sched,mlx5: Offloaded TC flower filter statistics
      
      This patchset introduces counters support for offloaded cls_flower filters.
      When the user calls 'tc show -s ..', fl_dump is called.
      Before fl_dump() returns the statistics, it calls the NIC driver (using a new
      ndo_setup_tc() command - TC_CLSFLOWER_STATS) to read the hardware counters and
      update the statistics accordingly. A new TC action op was added (stats_update())
      to be used by the NIC driver to update the statistics.
      
      Patchset was applied and tested over commit ed7cbbce ("udp: Resolve NULL pointer
      dereference over flow-based vxlan device")
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      148bd3a3
    • net/mlx5e: Hardware offloaded flower filter statistics support · aad7e08d
      Committed by Amir Vadai
      Introduce support for updating statistics of offloaded TC flower
      classifiers. Currently only the DROP action is supported.
      Signed-off-by: Amir Vadai <amirva@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aad7e08d
    • net/mlx5_core: Flow counters infrastructure · 43a335e0
      Committed by Amir Vadai
      If a counter has the aging flag set when created, it is added to a list
      of counters that will be queried periodically from a workqueue. The
      query result and the last-use timestamp are cached.
      Adding/deleting a counter must be very efficient since thousands of
      such operations might be issued in a second.
      There is only a single reference to counters without aging, therefore
      there is no need for locks.
      Counters with aging enabled, however, are stored in a list. In order
      to make the code as lockless as possible, all the list manipulation
      and access to hardware is done from a single context - the periodic
      counters query thread.
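
      Conceptually, each aging counter caches its last hardware reading
      alongside a last-use timestamp; a hypothetical sketch of such a
      bookkeeping structure (field names are illustrative, not the
      driver's):

        struct flow_counter_cache {
                struct list_head list;  /* on the periodically-queried list */
                u64 packets;            /* last value read from hardware    */
                u64 bytes;
                unsigned long lastuse;  /* jiffies of last observed change  */
                bool aging;             /* set at creation time             */
        };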
      
      The hardware supports multiple counters per FTE; however, currently we
      are using one counter for each FTE.
      Signed-off-by: Amir Vadai <amirva@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      43a335e0
    • net/mlx5_core: Introduce flow steering destination of type counter · bd5251db
      Committed by Amir Vadai
      When adding a flow steering rule with a counter, one needs to supply a
      destination of type MLX5_FLOW_DESTINATION_TYPE_COUNTER, with a pointer
      to a struct mlx5_fc.
      Also, the MLX5_FLOW_CONTEXT_ACTION_COUNT bit should be set in the action.
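
      A hedged sketch of what a caller does (the name of the destination
      field carrying the struct mlx5_fc pointer is an assumption here):

        struct mlx5_flow_destination dest = {};

        dest.type    = MLX5_FLOW_DESTINATION_TYPE_COUNTER;
        dest.counter = counter;  /* struct mlx5_fc *; field name assumed */

        action |= MLX5_FLOW_CONTEXT_ACTION_COUNT;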
      Signed-off-by: Amir Vadai <amirva@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bd5251db
    • net/mlx5_core: Firmware commands to support flow counters · 9dc0b289
      Committed by Amir Vadai
      Getting packet/byte statistics on flows is done through flow counters.
      Implement the firmware commands to alloc, free and query flow counters.
      Signed-off-by: Amir Vadai <amirva@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9dc0b289
    • net/mlx5_core: Use a macro in mlx5_command_str() · 42ca502e
      Committed by Amir Vadai
      Use a macro instead of copying the OP name.
      Signed-off-by: Amir Vadai <amirva@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      42ca502e
    • net/sched: cls_flower: Hardware offloaded filters statistics support · 10cbc684
      Committed by Amir Vadai
      Introduce a new command in ndo_setup_tc() for hardware offloaded
      filters, to call the NIC driver, and make it update the statistics.
      This will be done before dumping the filter and its statistics.
      Signed-off-by: Amir Vadai <amirva@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10cbc684
    • net/sched: act_gact: Update statistics when offloaded to hardware · 9fea47d9
      Committed by Amir Vadai
      Implement the stats_update callback that will be called by NIC drivers
      for hardware offloaded filters.
      Signed-off-by: Amir Vadai <amirva@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9fea47d9
    • net/sched: Enable netdev drivers to update statistics of offloaded actions · 38040702
      Committed by Amir Vadai
      Introduce a stats_update callback. A netdev driver can call it for
      offloaded actions to update the basic statistics (packets, bytes and
      last use).
      Since bstats_update() and bstats_cpu_update() use an skb as an argument
      to get the counters, _bstats_update() and _bstats_cpu_update(), which
      take bytes and packets as arguments, were added.
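
      The new helpers are along these lines (shown here as plain prototypes,
      a sketch rather than the verbatim inline definitions):

        /* update basic stats directly from byte/packet totals instead of
         * deriving them from an skb */
        void _bstats_update(struct gnet_stats_basic_packed *bstats,
                            __u64 bytes, __u32 packets);
        void _bstats_cpu_update(struct gnet_stats_basic_cpu *bstats,
                                __u64 bytes, __u32 packets);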
      Signed-off-by: Amir Vadai <amirva@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      38040702
    • Merge branch 'pxa168_eth-perf' · 388665a9
      Committed by David S. Miller
      Jisheng Zhang says:
      
      ====================
      net: pxa168_eth: improve performance
      
      This series is to improve the pxa168_eth driver performance by using
      {readl|writel}_relaxed or appropriate memory barriers.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      388665a9
    • net: pxa168_eth: Use dma_wmb/rmb where appropriate · b17d1559
      Committed by Jisheng Zhang
      Update the pxa168_eth driver to use the dma_rmb/wmb calls instead of the
      full barriers in order to improve performance: this saves 97ns/39ns on
      average in the tx/rx path on the Marvell BG4CT platform.
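
      The underlying pattern is the usual descriptor-publishing one; a
      generic sketch (descriptor field and flag names are made up for
      illustration):

        desc->buf_ptr  = dma_handle;  /* fill in the descriptor fields first */
        desc->byte_cnt = len;
        dma_wmb();                    /* order the fills before the handover */
        desc->cmd_sts  = DESC_OWNED_BY_DMA;  /* device may now consume it */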
      Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b17d1559
    • net: pxa168_eth: use {readl|writel}_relaxed instead of readl/writel · 3ed68782
      Committed by Jisheng Zhang
      Since appropriate memory barriers are already there, use the relaxed
      version to improve performance a bit.
      Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3ed68782
    • vxlan: set mac_header correctly in GPE mode · 8be0cfa4
      Committed by Jiri Benc
      For VXLAN-GPE, the interface is ARPHRD_NONE, thus we need to reset
      mac_header after pulling the outer header.
      
      v2: Put the code to the existing conditional block as suggested by
          Shmulik Ladkani.
      
      Fixes: e1e5314d ("vxlan: implement GPE")
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8be0cfa4
    • Merge branch 'xen-netback-control-ring' · 41ae56ce
      Committed by David S. Miller
      Paul Durrant says:
      
      ====================
      xen-netback: support for control ring
      
      My recent patch to import an up-to-date include/xen/interface/io/netif.h
      from the Xen Project brought in the necessary definitions to support the
      new control shared ring and protocol. This patch series updates xen-netback
      to support the new ring.
      
      Patch #1 adds the necessary boilerplate to map the control ring and handle
      messages. No implementation of the new protocol is included in this patch
      so that it can be kept to a reasonable size.
      
      Patch #2 adds the protocol implementation.
      
      Patch #3 adds support for passing hash values calculated by xen-netback to
      capable frontends.
      
      Patch #4 adds support for accepting hash values calculated by capable
      frontends and using them to set the socket buffer hash.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      41ae56ce
    • xen-netback: use hash value from the frontend · c2d09fde
      Committed by Paul Durrant
      My recent patch to include/xen/interface/io/netif.h defines a new extra
      info type that can be used to pass hash values between backend and guest
      frontend.
      
      This patch adds code to xen-netback to use the value in a hash extra
      info fragment passed from the guest frontend in a transmit-side
      (i.e. netback receive side) packet to set the skb hash accordingly.
      Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
      Acked-by: Wei Liu <wei.liu2@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c2d09fde
    • xen-netback: pass hash value to the frontend · f07f9893
      Committed by Paul Durrant
      My recent patch to include/xen/interface/io/netif.h defines a new extra
      info type that can be used to pass hash values between backend and guest
      frontend.
      
      This patch adds code to xen-netback to pass hash values calculated for
      guest receive-side packets (i.e. netback transmit side) to the frontend.
      Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
      Acked-by: Wei Liu <wei.liu2@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f07f9893
    • xen-netback: add control protocol implementation · 40d8abde
      Committed by Paul Durrant
      My recent patch to include/xen/interface/io/netif.h defines a new shared
      ring (in addition to the rx and tx rings) for passing control messages
      from a VM frontend driver to a backend driver.
      
      A previous patch added the necessary boilerplate for mapping the control
      ring from the frontend, should it be created. This patch adds
      implementations for each of the defined protocol messages.
      Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
      Cc: Wei Liu <wei.liu2@citrix.com>
      Acked-by: Wei Liu <wei.liu2@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      40d8abde
    • xen-netback: add control ring boilerplate · 4e15ee2c
      Committed by Paul Durrant
      My recent patch to include/xen/interface/io/netif.h defines a new shared
      ring (in addition to the rx and tx rings) for passing control messages
      from a VM frontend driver to a backend driver.
      
      This patch adds the necessary code to xen-netback to map this new shared
      ring, should it be created by a frontend, but does not add implementations
      for any of the defined protocol messages. These are added in a subsequent
      patch for clarity.
      Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
      Acked-by: Wei Liu <wei.liu2@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4e15ee2c
    • Merge branch 'cls_u32_hw_sw' · 1ca46734
      Committed by David S. Miller
      Sridhar Samudrala says:
      
      ====================
      Enable SW only or HW only offloads with u32 classifier
      
      This set of patches exports TCA_CLS_FLAGS_SKIP_HW to userspace and also
      introduces another flag, TCA_CLS_FLAGS_SKIP_SW. These flags enable
      offloading u32 filters to either SW or HW only.
      
      The default semantics with no flags are to add the filter to HW if
      possible and also to SW.
      With the SKIP_HW flag, the filter is only added to SW.
      With the SKIP_SW flag, the filter is added to HW and an error is returned
      to the user on failure.
      These flags are mutually exclusive.
      There was an earlier discussion on these semantics in the following email
      thread.
      	http://thread.gmane.org/gmane.linux.network/401733
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ca46734
    • net: cls_u32: Add support for skip-sw flag to tc u32 classifier. · d34e3e18
      Committed by Samudrala, Sridhar
      On devices that support TC U32 offloads, this flag enables a filter to be
      added only to HW. skip-sw and skip-hw are mutually exclusive flags. By
      default without any flags, the filter is added to both HW and SW, but no
      error checks are done in case of failure to add to HW. With skip-sw,
      failure to add to HW is treated as an error.
      
      Here is a sample script that adds 2 filters, one with skip-sw and the other
      with skip-hw flag.
      
         # add ingress qdisc
         tc qdisc add dev p4p1 ingress
      
         # enable hw tc offload.
         ethtool -K p4p1 hw-tc-offload on
      
         # add u32 filter with skip-sw flag.
         tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
            handle 800:0:1 u32 ht 800: flowid 800:1 \
            skip-sw \
            match ip src 192.168.1.0/24 \
            action drop
      
         # add u32 filter with skip-hw flag.
         tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
            handle 800:0:2 u32 ht 800: flowid 800:2 \
            skip-hw \
            match ip src 192.168.2.0/24 \
            action drop
      Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d34e3e18
    • Merge branch 'hv_netvsc-races' · 860d7ef6
      Committed by David S. Miller
      Vitaly Kuznetsov says:
      
      ====================
      hv_netvsc: avoid races on mtu change/set channels
      
      Changes since v1:
      - Rebased to net-next [Haiyang Zhang]
      
      Original description:
      
      The MTU change and set channels operations are implemented as netvsc device
      re-creation, destroying internal structures (struct net_device stays). This
      is really unfortunate, but there is no support from the Hyper-V host to do
      it in a different way. Such re-creation is unsurprisingly racy: Haiyang
      reported a crash where netvsc_change_mtu() races with
      netvsc_link_change(), and I was able to identify additional races upon
      investigation. Both netvsc_set_channels() and netvsc_change_mtu() race
      against:
      1) netvsc_link_change()
      2) netvsc_remove()
      3) netvsc_send()
      
      To solve these issues without introducing new locks, some refactoring is
      required. We need to get rid of the very complex link graph in all the
      internal structures and avoid traversing structures which are being
      removed.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      860d7ef6
    • hv_netvsc: set nvdev link after populating chn_table · 88098834
      Committed by Vitaly Kuznetsov
      A crash in netvsc_send() is observed when the netvsc device is re-created
      on mtu change/set channels. The crash is caused by dereferencing a NULL
      channel pointer which comes from chn_table. The root cause is a mixture
      of two facts:
      - we set nvdev pointer in net_device_context in alloc_net_device()
        before we populate chn_table.
      - we populate chn_table[0] only.
      
      The issue could be papered over by checking channel != NULL in
      netvsc_send() but populating the whole chn_table and writing the
      nvdev pointer afterwards seems more appropriate.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88098834
    • hv_netvsc: synchronize netvsc_change_mtu()/netvsc_set_channels() with netvsc_remove() · 6da7225f
      Committed by Vitaly Kuznetsov
      When the netvsc device is removed during an mtu change or channels setup,
      we get into trouble as both paths try to remove the device. Synchronize
      them with the start_remove flag and the rtnl lock.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6da7225f
    • hv_netvsc: get rid of struct net_device pointer in struct netvsc_device · 0a1275ca
      Committed by Vitaly Kuznetsov
      Simplify the netvsc pointer graph by getting rid of the redundant ndev
      pointer. We can always get a pointer to struct net_device from somewhere
      else.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a1275ca