1. 17 Mar, 2017 39 commits
    • S
      netvsc: avoid race with callback · 6de38af6
      stephen hemminger committed
      Change the argument to the channel callback from the channel pointer
      to the internal data structure containing per-channel info.
      This avoids possible races when the callback runs during
      initialization and makes the IRQ code simpler.
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'bpf-inline-lookups' · 3a70418b
      David S. Miller committed
      Alexei Starovoitov says:
      
      ====================
      bpf: inline bpf_map_lookup_elem()
      
      bpf_map_lookup_elem() is one of the most frequently used helper functions.
      Improve JITed program performance by inlining this helper.
      
      bpf_map_type	before  after
      hash		58M	74M
      array		174M	280M
      
      The values are number of lookups per second in ideal conditions
      measured by micro-benchmark in patch 6.
      
      The 'perf report' for HASH map type:
      before:
          54.23%  map_perf_test  [kernel.kallsyms]  [k] __htab_map_lookup_elem
          14.24%  map_perf_test  [kernel.kallsyms]  [k] lookup_elem_raw
           8.84%  map_perf_test  [kernel.kallsyms]  [k] htab_map_lookup_elem
           5.93%  map_perf_test  [kernel.kallsyms]  [k] bpf_map_lookup_elem
           2.30%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
           1.49%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler
      
      after:
          60.03%  map_perf_test  [kernel.kallsyms]  [k] __htab_map_lookup_elem
          18.07%  map_perf_test  [kernel.kallsyms]  [k] lookup_elem_raw
           2.91%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
           1.94%  map_perf_test  [kernel.kallsyms]  [k] _einittext
           1.90%  map_perf_test  [kernel.kallsyms]  [k] __audit_syscall_exit
           1.72%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler
      
      so the cost of htab_map_lookup_elem() and bpf_map_lookup_elem()
      is gone after inlining.
      
      'per-cpu' and 'lru' map types can be optimized similarly in the future.
      
      Note that sparse will complain that bpf is addictive ;)
      kernel/bpf/hashtab.c:438:19: sparse: subtraction of functions? Share your drugs
      kernel/bpf/verifier.c:3342:38: sparse: subtraction of functions? Share your drugs
      This is not a new warning; it just shows up in new places.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      samples/bpf: add map_lookup microbenchmark · 95ff141e
      Alexei Starovoitov committed
      $ map_perf_test 128
      speed of HASH bpf_map_lookup_elem() in lookups per second
      	w/o JIT		w/JIT
      before	46M		58M
      after	42M		74M
      
      perf report
      before:
          54.23%  map_perf_test  [kernel.kallsyms]  [k] __htab_map_lookup_elem
          14.24%  map_perf_test  [kernel.kallsyms]  [k] lookup_elem_raw
           8.84%  map_perf_test  [kernel.kallsyms]  [k] htab_map_lookup_elem
           5.93%  map_perf_test  [kernel.kallsyms]  [k] bpf_map_lookup_elem
           2.30%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
           1.49%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler
      
      after:
          60.03%  map_perf_test  [kernel.kallsyms]  [k] __htab_map_lookup_elem
          18.07%  map_perf_test  [kernel.kallsyms]  [k] lookup_elem_raw
           2.91%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
           1.94%  map_perf_test  [kernel.kallsyms]  [k] _einittext
           1.90%  map_perf_test  [kernel.kallsyms]  [k] __audit_syscall_exit
           1.72%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler
      
      Notice that bpf_map_lookup_elem() and htab_map_lookup_elem() are trivial
      functions, yet they take a sizeable amount of cpu time.
      htab_map_gen_lookup() removes bpf_map_lookup_elem() and converts
      htab_map_lookup_elem() into three BPF insns, which causes the cpu time
      for bpf_prog_da4fc6a3f41761a2() to increase slightly.
      
      $ map_perf_test 256
      speed of ARRAY bpf_map_lookup_elem() in lookups per second
      	w/o JIT		w/JIT
      before	97M		174M
      after	64M		280M
      
      before:
          37.33%  map_perf_test  [kernel.kallsyms]  [k] array_map_lookup_elem
          13.95%  map_perf_test  [kernel.kallsyms]  [k] bpf_map_lookup_elem
           6.54%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
           4.57%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler
      
      after:
          32.86%  map_perf_test  [kernel.kallsyms]  [k] bpf_prog_da4fc6a3f41761a2
           6.54%  map_perf_test  [kernel.kallsyms]  [k] kprobe_ftrace_handler
      
      array_map_gen_lookup() removes calls to array_map_lookup_elem()
      and bpf_map_lookup_elem() and replaces them with 7 bpf insns.
      
      The performance without JIT is slower, since executing extra insns
      in the interpreter is slower than running native C code,
      but with JIT the performance gains are obvious,
      since native C->x86 code is replaced with fewer bpf->x86 instructions.
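As a worked check of the two tables above, the relative change in each configuration can be computed directly from the reported numbers. This is a small Python sketch for illustration only; the dictionary layout and function names are made up here and are not part of the benchmark.

```python
# Lookups per second (in millions), copied from the tables in this commit
# message as (before, after) pairs.
RESULTS = {
    "hash": {"interp": (46, 42), "jit": (58, 74)},
    "array": {"interp": (97, 64), "jit": (174, 280)},
}


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def summarize(results):
    """Map (map_type, mode) to the percent change, rounded to one decimal."""
    summary = {}
    for map_type, modes in results.items():
        for mode, (before, after) in modes.items():
            summary[(map_type, mode)] = round(percent_change(before, after), 1)
    return summary


SUMMARY = summarize(RESULTS)
for (map_type, mode), change in SUMMARY.items():
    print(f"{map_type:5s} {mode:6s} {change:+.1f}%")
```

With these inputs the JIT case gains roughly 28% for HASH and 61% for ARRAY, while the interpreter case loses roughly 9% and 34%, which matches the trade-off described in the text.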
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      bpf: inline htab_map_lookup_elem() · 9015d2f5
      Alexei Starovoitov committed
      Optimize:
      bpf_call
        bpf_map_lookup_elem
          map->ops->map_lookup_elem
            htab_map_lookup_elem
              __htab_map_lookup_elem
      into:
      bpf_call
        __htab_map_lookup_elem
      
      to improve performance of JITed programs.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      bpf: add helper inlining infra and optimize map_array lookup · 81ed18ab
      Alexei Starovoitov committed
      Optimize bpf_call -> bpf_map_lookup_elem() -> array_map_lookup_elem()
      into a sequence of bpf instructions.
      When the JIT is on, this sequence of bpf instructions becomes a sequence
      of native cpu instructions, which is significantly faster than an
      indirect call plus two functions' prologues/epilogues.
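The commit message does not list the emitted instructions. As a rough userspace model of what an array-map lookup has to compute, and therefore what an inlined sequence would need to express, the Python sketch below does a bounds check followed by an offset calculation. The function name and the return convention are assumptions made for illustration, not the kernel code.

```python
def array_lookup_offset(index, max_entries, elem_size):
    """Return the byte offset of element `index`, or None if out of range.

    Hypothetical model: the key is treated as an unsigned 32-bit value,
    so a negative Python int wraps around and fails the bounds check.
    """
    index &= 0xFFFFFFFF              # treat the key as u32
    if index >= max_entries:         # bounds check
        return None                  # models the helper returning NULL
    return index * elem_size         # offset from the start of the value area


# Example: 8 entries of 16 bytes each.
print(array_lookup_offset(3, 8, 16))
print(array_lookup_offset(8, 8, 16))
```

Expressed as a handful of compare, multiply and add steps, this kind of logic needs no function call at all once it is inlined into the program.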
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      bpf: adjust insn_aux_data when patching insns · 8041902d
      Alexei Starovoitov committed
      convert_ctx_accesses() replaces a single bpf instruction with a set of
      instructions. Adjust the corresponding insn_aux_data while patching.
      This is needed to make sure subsequent 'for(all insn)' loops
      have matching insn and insn_aux_data.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      bpf: refactor fixup_bpf_calls() · 79741b3b
      Alexei Starovoitov committed
      Reduce indentation and make it iterate over instructions in the same
      way as convert_ctx_accesses(). Also convert the hard BUG_ON into a soft
      verifier error.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      bpf: move fixup_bpf_calls() function · e245c5c6
      Alexei Starovoitov committed
      No functional change.
      Move fixup_bpf_calls() to verifier.c;
      it is refactored in the next patch.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      tcp: remove tcp_tw_recycle · 4396e461
      Soheil Hassas Yeganeh committed
      tcp_tw_recycle was already broken for connections
      behind NAT, since the per-destination timestamp is not
      monotonically increasing for multiple machines behind
      a single destination address.
      
      After the randomization of TCP timestamp offsets
      in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
      for each connection), tcp_tw_recycle is broken for all
      types of connections for the same reason: the timestamps
      received from a single machine are no longer monotonically
      increasing.
      
      Remove tcp_tw_recycle, since it is not functional. Also, remove
      the PAWSPassive SNMP counter since it is only used for
      tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
      since the strict argument is only set when tcp_tw_recycle is
      enabled.
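The reasoning above can be illustrated with a small Python model. It assumes two connections from the same remote host whose segments interleave on arrival; the offsets and clock values are hypothetical, and the comparison ignores 32-bit wraparound handling for simplicity.

```python
def observed_tsvals(segments, offsets):
    """Timestamp values seen by the receiver, in arrival order.

    segments: list of (connection_id, sender_clock) tuples.
    offsets:  per-connection timestamp offset added by the sender.
    """
    return [(clock + offsets[conn]) & 0xFFFFFFFF for conn, clock in segments]


def is_monotonic(values):
    """True if the sequence never decreases (wraparound ignored)."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Segments from two connections of one host, interleaved on arrival.
segments = [("conn_a", 1000), ("conn_b", 1001), ("conn_a", 1002), ("conn_b", 1003)]

shared_clock = {"conn_a": 0, "conn_b": 0}                # no per-connection offset
random_offsets = {"conn_a": 500_000, "conn_b": 20_000}   # hypothetical offsets

print(is_monotonic(observed_tsvals(segments, shared_clock)))
print(is_monotonic(observed_tsvals(segments, random_offsets)))
```

With a shared clock the per-host sequence stays ordered; with per-connection offsets it does not, so a check that assumes per-destination monotonic timestamps would reject valid segments.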
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Cc: Lutz Vieweg <lvml@5t9.de>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      tcp: remove per-destination timestamp cache · d82bae12
      Soheil Hassas Yeganeh committed
      Commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection)
      randomizes TCP timestamps per connection. After this commit,
      there is no guarantee that the timestamps received from the
      same destination are monotonically increasing. As a result,
      the per-destination timestamp cache in TCP metrics (i.e., tcpm_ts
      in struct tcp_metrics_block) is broken and cannot be relied upon.
      
      Remove the per-destination timestamp cache and all related code
      paths.
      
      Note that this cache was already broken for caching timestamps of
      multiple machines behind a NAT sharing the same address.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Cc: Lutz Vieweg <lvml@5t9.de>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'sunvnet-better-connection-management' · 8b705f52
      David S. Miller committed
      Shannon Nelson says:
      
      ====================
      sunvnet: better connection management
      
      These patches remove some problems in handling of carrier state
      with the ldmvsw vswitch, remove an xoff misuse in sunvnet, and
      add stats for debug and tracking of point-to-point connections
      between the ldom VMs.
      
      v2:
       - added ldmvsw ndo_open to reset the LDC channel
       - updated copyrights
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      sunvnet: xoff not needed when removing port link · 9c5a3a1f
      Shannon Nelson committed
      The sunvnet netdev is connected to the controlling ldom's vswitch
      for network bridging.  However, for higher performance between ldoms,
      there also is a channel between each client ldom.  These connections are
      represented in the sunvnet driver by a queue for each ldom.  The driver
      uses select_queue to tell the stack which queue to use by tracking the mac
      addresses on the other end of each port.  When a connected ldom shuts down,
      the driver receives an LDC_EVENT_RESET and the port is removed from the
      driver, thus a queue with no ldom on the other end will never be selected
      for Tx.
      
      The driver was trying to reinforce the "don't use this queue" notion with
      netif_tx_stop_queue() and netif_tx_wake_queue(), which really should only
      be used to signal a Tx queue is full (aka XOFF).  This misuse of queue
      state resulted in NETDEV WATCHDOG messages and lots of unnecessary calls
      into the driver's tx_timeout handler.  Simply removing these takes care
      of the problem.
      
      Orabug: 25190537
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      sunvnet: count multicast packets · b12a96f5
      Shannon Nelson committed
      Make sure multicast packets get counted in the device.
      
      Orabug: 25190537
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      sunvnet: track port queues correctly · e1f1e5f7
      Shannon Nelson committed
      Track our used and unused queue indices correctly.  Otherwise, as ports
      dropped out and returned, they all eventually ended up with the same
      queue index.
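The commit message does not show how the driver tracks indices. One common way to keep used and unused indices straight is a small in-use table, sketched below in Python; the class and method names are hypothetical and are not taken from the sunvnet driver.

```python
class QueueIndexPool:
    """Hand out queue indices so that no two live ports share one."""

    def __init__(self, size):
        self.in_use = [False] * size

    def acquire(self):
        """Return the lowest free index and mark it as used."""
        for idx, used in enumerate(self.in_use):
            if not used:
                self.in_use[idx] = True
                return idx
        raise RuntimeError("no free queue index")

    def release(self, idx):
        """Mark an index as free again when its port goes away."""
        self.in_use[idx] = False


pool = QueueIndexPool(4)
port_a = pool.acquire()
port_b = pool.acquire()
port_c = pool.acquire()
pool.release(port_b)          # port B drops out
port_d = pool.acquire()       # a returning port reuses the freed slot
live_indices = [port_a, port_c, port_d]
print(live_indices)
```

Because a released index is the only one marked free, a returning port reuses it instead of colliding with a port that is still live.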
      
      Orabug: 25190537
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      sunvnet: add stats to track ldom to ldom packets and bytes · 0f512c84
      Shannon Nelson committed
      In this driver, there is a "port" created for the connection to each of
      the other ldoms; a netdev queue is mapped to each port, and they are
      collected under a single netdev.  The generic netdev statistics show
      us all the traffic in and out of our network device, but don't show
      individual queue/port stats.  This patch breaks out the traffic counts
      for the individual ports and gives us a little view into the state of
      those connections.
      
      Orabug: 25190537
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      ldmvsw: better use of link up and down on ldom vswitch · 867fa150
      Shannon Nelson committed
      When an ldom VM is bound, the network vswitch infrastructure is set up for
      it, but was being forced 'UP' by the userland switch configuration script.
      When 'UP' but not actually connected to a running VM, the ipv6 neighbor
      probes fail (not a horrible thing) and start cluttering up the kernel logs.
      Funny thing: these are debug messages that never actually show up, but
      we do see the net_ratelimited messages that say N callbacks were
      suppressed.
      
      This patch defers the netif_carrier_on() until an actual link has been
      established with the VM, as indicated by receiving an LDC_EVENT_UP from
      the underlying LDC protocol.  Similarly, we take the link down when we
      see the LDC_EVENT_RESET.  Now when we see the ndo_open(), we reset the
      link to get things talking again.
      
      Orabug: 25525312
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      bonding: add 802.3ad support for 25G speeds · 19ddde1e
      Jarod Wilson committed
      Cut-n-paste enablement of 802.3ad bonding on 25G NICs, which currently
      report 0 as their bandwidth.
      
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Acked-by: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • C
      tcp_westwood: fix tcp_westwood_info() style mistakes · be7164cd
      chun Long committed
      Replace commas with semicolons in tcp_westwood_info().
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • R
      liquidio: use meaningful names for IRQs · 0c88a761
      Rick Farrington committed
      All IRQs owned by the PF and VF drivers share the same nondescript name
      "octeon"; this makes it difficult to set up interrupt affinity.
      
      Change the IRQ names to reflect their specific purpose:
      
          LiquidIO<id>-<func>-<type>-<queue pair num>
      
      Examples:
          LiquidIO0-pf0-rxtx-3
          LiquidIO1-vf1-rxtx-0
          LiquidIO0-pf0-aux
      
      We cannot use netdev->name for naming the IRQs because:
      
          1.  Early during init, the PF and VF drivers require interrupts to
              send/receive control data from the NIC firmware; so the PF and VF
              must request IRQs long before the netdev struct is registered.
      
          2.  The IRQ name can only be specified at the time it is requested.
              It cannot be changed after that.
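The naming pattern above can be modelled with a short Python helper. This is only an illustration of the format string; the function and parameter names are hypothetical, and the driver itself builds these names in C.

```python
def irq_name(device_id, func, irq_type, queue_pair=None):
    """Build a name of the form LiquidIO<id>-<func>-<type>[-<queue pair num>].

    The queue pair number is optional because the "aux" example in the
    commit message has none.
    """
    name = f"LiquidIO{device_id}-{func}-{irq_type}"
    if queue_pair is not None:
        name += f"-{queue_pair}"
    return name


print(irq_name(0, "pf0", "rxtx", 3))
print(irq_name(1, "vf1", "rxtx", 0))
print(irq_name(0, "pf0", "aux"))
```

Names built this way carry the device, function and queue in the string itself, which is what makes them easy to match when setting interrupt affinity.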
      Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com>
      Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
      Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • R
      liquidio: remove/replace invalid code · b229487b
      Rick Farrington committed
      Remove an invalid call to dma_sync_single_for_cpu(), because the previous
      DMA allocation was coherent, not streaming.  Remove code that references
      fields in struct list_head; replace it with calls to list_empty() and
      list_first_entry().  Also, add a comment to clarify a complicated if
      statement.
      Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com>
      Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
      Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      netem: apply correct delay when rate throttling · 5080f39e
      Nik Unger committed
      I recently reported on the netem list that iperf network benchmarks
      show unexpected results when a bandwidth throttling rate has been
      configured for netem. Specifically:
      
      1) The measured link bandwidth *increases* when a higher delay is added
      2) The measured link bandwidth appears higher than the specified limit
      3) The measured link bandwidth for the same very slow settings varies
         significantly across machines
      
      The issue can be reproduced by using tc to configure netem with a
      512kbit rate and various (none, 1us, 50ms, 100ms, 200ms) delays on a
      veth pair between network namespaces, and then using iperf (or any
      other network benchmarking tool) to test throughput. Complete detailed
      instructions are in the original email chain here:
      https://lists.linuxfoundation.org/pipermail/netem/2017-February/001672.html
      
      There appear to be two underlying bugs causing these effects:
      
      - The first issue causes long delays when the rate is slow and no
        delay is configured (e.g., "rate 512kbit"). This is because SKBs are
        not orphaned when no delay is configured, so orphaning does not
        occur until *after* the rate-induced delay has been applied. For
        this reason, adding a tiny delay (e.g., "rate 512kbit delay 1us")
        dramatically increases the measured bandwidth.
      
      - The second issue is that rate-induced delays are not correctly
        applied, allowing SKB delays to occur in parallel. The intended
        approach is to compute the delay for an SKB and to add this delay to
        the end of the current queue. However, the code does not detect
        existing SKBs in the queue due to improperly testing sch->q.qlen,
        which is nonzero even when packets exist only in the
        rbtree. Consequently, new SKBs do not wait for the current queue to
        empty. When packet delays vary significantly (e.g., if packet sizes
        are different), then this also causes unintended reordering.
      
      I modified the code to expect a delay (and orphan the SKB) when a rate
      is configured. I also added some defensive tests that correctly find
      the latest scheduled delivery time, even if it is (unexpectedly) for a
      packet in sch->q. I have tested these changes on the latest kernel
      (4.11.0-rc1+) and the iperf / ping test results are as expected.
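To make the intended behaviour concrete, the Python sketch below models rate-induced delay only: each packet's transmission time is its size divided by the rate, and that time is added after the latest delivery already scheduled. It is a simplified model, not the netem code; it ignores the configured delay, jitter and packet overhead, and it assumes tc's convention that 512kbit means 512,000 bits per second.

```python
NSEC_PER_SEC = 1_000_000_000


def tx_time_ns(length_bytes, rate_bps):
    """Time needed to send a packet of the given size at the given rate."""
    return length_bytes * 8 * NSEC_PER_SEC // rate_bps


def schedule(arrivals, rate_bps):
    """Return delivery times for packets given as (arrival_ns, length_bytes).

    Each packet waits behind the latest scheduled delivery, so the
    rate-induced delays add up instead of running in parallel.
    """
    deliveries = []
    last = 0
    for now, length in arrivals:
        start = max(now, last)                    # queue behind earlier packets
        last = start + tx_time_ns(length, rate_bps)
        deliveries.append(last)
    return deliveries


RATE_BPS = 512_000                                # "rate 512kbit"
burst = [(0, 1500), (0, 1500), (0, 1500)]         # three packets arriving together
print(tx_time_ns(1500, RATE_BPS))
print(schedule(burst, RATE_BPS))
```

A 1500-byte packet takes about 23.4 ms at this rate, so three packets arriving together should leave about 23.4, 46.9 and 70.3 ms later. If each delay were instead measured from the arrival time, all three would leave at 23.4 ms, which is the "in parallel" behaviour described in the second issue.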
      Signed-off-by: Nik Unger <njunger@uwaterloo.ca>
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'sched-cleanups' · cd918afd
      David S. Miller committed
      Or Gerlitz says:
      
      ====================
      small set of sched cleanups
      
      Just two cleanups -- but for the 2nd one I think we need ack from
      Cong Wang to make sure this isn't actually a bug report..
      
      changes from V1:
        - addressed comment from Sergei to use 12 hex digits etc
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • O
      net/sched: fq_codel: Avoid set-but-unused variable · a5e6a3b0
      Or Gerlitz committed
      The code introduced by commit 2ccccf5f ("net_sched: update
      hierarchical backlog too") sets prev_backlog in fq_codel_dequeue()
      but never uses it anywhere; remove that assignment.
      
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • O
      net/sched: act_ife: Staticfy find_decode_metaid() · 4dba87b0
      Or Gerlitz committed
      It is used only in that file.
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      net: ethernet: bgmac: Allow MAC address to be specified in DTB · 2f771399
      Steve Lin committed
      Allow the BCMA version of the bgmac driver to obtain the MAC address
      from the device tree.  If no MAC address is specified there, then
      the previous behavior (obtaining the MAC address from SPROM) is
      used.
      Signed-off-by: Steve Lin <steven.lin1@broadcom.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Jon Mason <jon.mason@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • C
      net: ethernet: fs_enet: Remove useless includes · 01ac2994
      Christophe Leroy committed
      CONFIG_8xx is being deprecated. Since the includes dependent on
      CONFIG_8xx are useless, just drop them.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • C
      isdn: hardware: mISDN: Remove reference to CONFIG_8xx · b79df0fc
      Christophe Leroy committed
      CONFIG_8xx is deprecated and should soon be removed in favor
      of CONFIG_PPC_8xx.
      Anyway, hfc_multi_8xx.h only uses 8xx I/O ports which are
      linked to the CPM1 communication processor included in the 8xx
      rather than the 8xx itself.
      
      This patch therefore makes it dependent on CONFIG_CPM1 instead,
      like several other drivers.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      net: mvneta: support suspend and resume · 9768b45c
      Jane Li committed
      Add basic support for handling suspend and resume.
      Signed-off-by: Jane Li <jiel@marvell.com>
      Reviewed-by: Jisheng Zhang <jszhang@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'mlxsw-vrf' · 7e3f4f3a
      David S. Miller committed
      Jiri Pirko says:
      
      ====================
      mlxsw: Enable VRF offload
      
      Ido says:
      
      Packets received from netdevs enslaved to different VRF devices are
      forwarded using different FIB tables. In the Spectrum ASIC this is
      achieved by binding different router interfaces (RIFs) to different
      virtual routers (VRs). Each RIF represents an enslaved netdev and each
      VR has its own FIB table according to which packets are forwarded.
      
      The first three patches add a helper to check if a FIB rule is a
      default rule and extend the FIB notification chain to include the rule's
      info as part of the RULE_{ADD,DEL} events. This allows offloading
      drivers to sanitize the rules they don't support and flush their tables.
      
      The fourth patch introduces a small change in the VRF driver to allow
      capable drivers to more easily offload VRFs.
      
      Finally, the last patches gradually add support for VRFs in the mlxsw
      driver. First, on top of port netdevs, stacked LAG and VLAN devices and
      then on top of bridges.
      
      Some limitations I would like to point out:
      
      1) The old model where 'oif' / 'iif' rules were programmed for each L3
      master device isn't supported. Upon insertion of these rules the driver
      will flush its tables and forwarding will be done by the kernel instead.
      It's inferior in every way to the single 'l3mdev' rule, so this shouldn't
      be an issue.
      
      2) Inter-VRF routes pointing to a VRF device aren't offloaded. Packets
      hitting these routes will be forwarded by the kernel. Inter-VRF routes
      pointing to netdevs enslaved to a different VRF are offloaded.
      
      3) There's a small discrepancy between the kernel's datapath and the
      device's. By default, packets forwarded by the kernel first do a lookup
      in the local table and then in the VRF's table (assuming no match). In
      the device, lookup is done only in the VRF's table, which is probably
      the intended behavior. Changes in v2 allow user to properly re-order the
      default rules without triggering the abort mechanism.
      
      Changes in v3:
      * Remove 'l3mdev' from the matchall list, as it's related to the action
        and not the selector (David Ahern).
      * Use container_of() instead of typecasting (David Ahern).
      * Add David's Acked-by to the second patch.
      * Add a helper in IPv4 code to check if a rule is a default rule (David
        Ahern).
      
      Changes in v2:
      * Drop default rule indication and allow re-ordering of default rules
        (David Ahern).
      * Remove ifdef around 'struct fib_rule_notifier_info' and drop redundant
        dependency on IP_MULTIPLE_TABLES from rocker and mlxsw.
      * Add David's Acked-by to the fourth patch.
      * Remove netif_is_vrf_master() and use netif_is_l3_master() instead
        (David Ahern).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      mlxsw: spectrum_router: Don't abort on l3mdev rules · c7f6e665
      Ido Schimmel committed
      Now that port netdevs can be enslaved to a VRF master we need to make
      sure the device's routing tables won't be flushed upon the insertion of
      a l3mdev rule.
      
      Note that we assume the notified l3mdev rule is a simple rule as used by
      the VRF master. We don't check for the presence of other selectors such
      as 'iif' and 'oif'.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      mlxsw: spectrum_router: Add support for VRFs on top of bridges · 3d70e458
      Ido Schimmel committed
      In a similar fashion to the previous patch, allow bridges and VLAN
      devices on top of bridges to be enslaved to a VRF master device.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      mlxsw: spectrum_router: Add support for VRFs · 7179eb5a
      Ido Schimmel committed
      Allow port netdevs, LAG and VLAN devices stacked on top of these to be
      enslaved to a VRF master device.
      
      Upon enslavement, create a router interface (RIF) for the enslaved
      netdev and associate it with a virtual router (VR) based on the VRF's
      table ID.
      
      If a RIF already exists for the netdev (e.g., due to the existence of an
      IP address), then it's deleted and a new one is created with the
      appropriate VR binding.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      mlxsw: spectrum_router: Don't destroy RIF if L3 slave · 9db032bb
      Ido Schimmel committed
      We usually destroy the netdev's router interface (RIF) when the last IP
      address is removed from it.
      
      However, we shouldn't do that if it's enslaved to an L3 master device.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      mlxsw: spectrum_router: Associate RIFs with correct VR · 57837885
      Ido Schimmel committed
      When a router interface (RIF) is created due to a netdev being enslaved
      to a VRF master, then it should be associated with the appropriate
      virtual router (VR) and not the default one.
      
      If the netdev is a VRF slave, look up the VR based on the VRF's table ID.
      Otherwise, default to the MAIN table.
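The selection rule in the last paragraph can be sketched in Python as follows. The sketch assumes the table ID is passed in as 0 when the netdev is not a VRF slave; the function name is hypothetical, while 254 is the kernel's RT_TABLE_MAIN value.

```python
RT_TABLE_MAIN = 254  # kernel ID of the main routing table


def vr_table_id(l3mdev_table_id):
    """Pick the FIB table whose virtual router the new RIF should bind to.

    l3mdev_table_id: the VRF's table ID, or 0 if the netdev is not
    enslaved to a VRF master (an assumption of this model).
    """
    return l3mdev_table_id or RT_TABLE_MAIN


print(vr_table_id(10))   # netdev enslaved to a VRF using table 10
print(vr_table_id(0))    # plain netdev, falls back to the main table
```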
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      net: vrf: Set slave's private flag before linking · fdeea7be
      Ido Schimmel committed
      Allow listeners of the subsequent CHANGEUPPER notification to retrieve
      the VRF's table ID by calling l3mdev_fib_table() with the slave netdev.
      Without this change, the netdev won't be considered an L3 slave and the
      function would return 0.
      
      This is consistent with other master devices such as bridge and bond that
      set the slave's private flag before linking. It also makes
      do_vrf_{add,del}_slave() symmetric.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      ipv4: fib_rules: Dump FIB rules when registering FIB notifier · 5d7bfd14
      Ido Schimmel committed
      In commit c3852ef7 ("ipv4: fib: Replay events when registering FIB
      notifier") we dumped the FIB tables and replayed the events to the
      passed notification block.
      
      However, we merely sent a RULE_ADD notification in case custom rules
      were in use. As explained in previous patches, this approach won't work
      anymore. Instead, we should notify the caller about all the FIB rules
      and let it act accordingly.
      
      Upon registration to the FIB notification chain, replay a RULE_ADD
      notification for each programmed FIB rule, custom or not. The integrity
      of the dump is ensured by the mechanism introduced in the above
      mentioned commit.
      
      Prevent regressions by making sure current listeners correctly sanitize
      the notified rules.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      ipv4: fib_rules: Add notifier info to FIB rules notifications · 6a003a5f
      Ido Schimmel committed
      Whenever a FIB rule is added or removed, a notification is sent in the
      FIB notification chain. However, listeners don't have a way to tell
      which rule was added or removed.
      
      This is problematic as we would like to give listeners the ability to
      decide which action to execute based on the notified rule. Specifically,
      offloading drivers should be able to determine if they support the
      reflection of the notified FIB rule and flush their LPM tables in case
      they don't.
      
      Do that by adding a notifier info to these notifications and embedding
      the common FIB rule struct in it.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      ipv4: fib_rules: Check if rule is a default rule · 3c71006d
      Ido Schimmel committed
      Currently, when non-default (custom) FIB rules are used, devices capable
      of layer 3 offloading flush their tables and let the kernel do the
      forwarding instead.
      
      When these devices' drivers are loaded they register to the FIB
      notification chain, which lets them know about the existence of any
      custom FIB rules. This is done by sending a RULE_ADD notification based
      on the value of 'net->ipv4.fib_has_custom_rules'.
      
      This approach is problematic when VRF offload is taken into account, as
      upon the creation of the first VRF netdev, a l3mdev rule is programmed
      to direct skbs to the VRF's table.
      
      Instead of merely reading the above value and sending a single RULE_ADD
      notification, we should iterate over all the FIB rules and send a
      detailed notification for each, thereby allowing offloading drivers to
      sanitize the rules they don't support and potentially flush their
      tables.
      
      While l3mdev rules are uniquely marked, the default rules are not.
      Therefore, when they are being notified they might invoke offloading
      drivers to unnecessarily flush their tables.
      
      Solve this by adding a helper to check if a FIB rule is a default rule.
      Namely, its selector should match all packets and its action should
      point to the local, main or default tables.
      
      As noted by David Ahern, uniquely marking the default rules is
      insufficient. When using VRFs, it's common to avoid false hits by moving
      the rule for the local table to just before the main table:
      
      Default configuration:
      $ ip rule show
      0:      from all lookup local
      32766:  from all lookup main
      32767:  from all lookup default
      
      Common configuration with VRFs:
      $ ip rule show
      1000:   from all lookup [l3mdev-table]
      32765:  from all lookup local
      32766:  from all lookup main
      32767:  from all lookup default
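The check described in this commit message (a selector that matches all packets, and an action that points to the local, main or default table) can be modelled in Python as below. The field names are hypothetical and only a few selector fields are shown; the table IDs 253, 254 and 255 are the kernel's RT_TABLE_DEFAULT, RT_TABLE_MAIN and RT_TABLE_LOCAL values. Priority is deliberately not examined, so the reordered configuration shown above still counts as default rules.

```python
from dataclasses import dataclass

RT_TABLE_DEFAULT, RT_TABLE_MAIN, RT_TABLE_LOCAL = 253, 254, 255


@dataclass
class FibRule:
    """Simplified stand-in for a FIB rule; field names are illustrative."""
    priority: int
    table: int
    action: str = "lookup"
    src_len: int = 0      # prefix length of the 'from' selector, 0 = all
    dst_len: int = 0      # prefix length of the 'to' selector, 0 = all
    iif: str = ""         # incoming interface selector, empty = any
    oif: str = ""         # outgoing interface selector, empty = any
    fwmark: int = 0       # mark selector, 0 = unset


def matches_all(rule):
    """True if none of the modelled selector fields restricts the match."""
    return (rule.src_len == 0 and rule.dst_len == 0
            and not rule.iif and not rule.oif and rule.fwmark == 0)


def is_default_rule(rule):
    """True for a match-all lookup in the local, main or default table."""
    return (matches_all(rule)
            and rule.action == "lookup"
            and rule.table in (RT_TABLE_LOCAL, RT_TABLE_MAIN, RT_TABLE_DEFAULT))


rules = [
    FibRule(priority=32765, table=RT_TABLE_LOCAL),   # local rule moved down
    FibRule(priority=32766, table=RT_TABLE_MAIN),
    FibRule(priority=32767, table=RT_TABLE_DEFAULT),
    FibRule(priority=100, table=RT_TABLE_MAIN, iif="eth0"),  # custom rule
]
print([is_default_rule(r) for r in rules])
```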
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • H
      r8152: simply the arguments · ce594e98
      hayeswang committed
      Replace &tp->napi with napi and tp->netdev with netdev.
      Signed-off-by: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 16 Mar, 2017 1 commit