1. 29 Oct, 2017 24 commits
    • D
      Merge branch 'ipvlan-private-vepa' · aad93c70
      David S. Miller committed
      Mahesh Bandewar says:
      
      ====================
      add 'private' and 'vepa' attributes to ipvlan modes
      
      IPvlan has always operated in bridge mode for its supported modes, i.e.
      if the packets are destined to the adjacent neighbor dev, then the IPvlan driver
      will switch the packet internally without needing the packets to hit the
      wire or get routed. However, there are situations where this bridge mode is
      not needed, e.g. two private processes running inside two namespaces which
      each have one IPvlan slave but share the master. These
      processes should reach the outside world through the master device but at
      the same time the bridge function should not work. Currently that's not
      possible, hence the private attribute for the selected mode comes into play.
      
      VEPA or 802.1Qbg on the other hand has limited appeal with IPvlan since IPvlan
      uses the mac-address of the lower device. So packets that are destined to
      the adjacent neighbor slave-dev will have the same src and dest mac. When these
      packets reach the external switch/router, it will send back a redirect
      message which the host will have to deal with. Having said that, this attribute
      will have appeal in debugging as IPvlan will not switch / short-circuit
      packets internally. e.g. using VEPA mode with lower-device in loopback mode
      will avoid some complicated set-ups that use non-local-bind with some route
      jugglery.
      
      This patch-set implements these attributes for the existing modes that
      IPvlan has. Please see individual patches for their detailed implementation.
      A subsequent ip-utils patch is needed and will be sent soon.
      ====================
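The difference between the default bridge behaviour and the two new attributes can be sketched as a small decision function. This is only an illustrative model, not the driver code: the names `Attr` and `forward` are hypothetical, and treating 'private' traffic between slaves as a drop is an assumption based on the statement that the bridge function should not work.

```python
from enum import Enum


class Attr(Enum):
    BRIDGE = "bridge"    # default: switch between slaves internally
    PRIVATE = "private"  # slaves may not talk to each other
    VEPA = "vepa"        # always hand the packet to the external switch


def forward(attr: Attr, dst_is_sibling_slave: bool) -> str:
    """Return what happens to a packet sent by one slave device."""
    if not dst_is_sibling_slave:
        return "wire"      # outside traffic always leaves via the master
    if attr is Attr.BRIDGE:
        return "internal"  # short-circuited, never hits the wire
    if attr is Attr.PRIVATE:
        return "drop"      # assumption: no path between sibling slaves
    return "wire"          # VEPA: rely on the external switch/router


if __name__ == "__main__":
    for attr in Attr:
        print(attr.value, "->", forward(attr, dst_is_sibling_slave=True))
```

In this model only the handling of slave-to-slave traffic differs; traffic to the outside world goes through the master in all three cases.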
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aad93c70
    • M
      ipvlan: implement VEPA mode · fe89aa6b
      Mahesh Bandewar committed
      This is very similar to the Macvlan VEPA mode; however, there are some
      differences. IPvlan uses the mac-address of the lower device, so the VEPA
      mode has implications of ICMP-redirects for packets destined for its
      immediate neighbors sharing the same master, since the packets will have the same
      source and dest mac. The external switch/router will send a redirect msg.
      
      Having said that, this will be a useful tool in terms of debugging
      since IPvlan will not switch packets within its slaves and will rely completely
      on the external entity as intended in 802.1Qbg.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fe89aa6b
    • M
      ipvlan: introduce 'private' attribute for all existing modes. · a190d04d
      Mahesh Bandewar committed
      IPvlan has always operated in bridge mode. However, there are scenarios
      where each slave should be able to talk through the master device but
      not necessarily to each other. Think of an environment where each
      namespace is a private and independent customer. In this scenario
      the machine hosting these namespaces does not want to reveal who
      their neighbors are, nor do the individual namespaces care to talk to a neighbor
      over a short-circuited network path.
      
      This patch implements the mode that is very similar to the 'private' mode
      in macvlan where individual slaves can send and receive traffic through
      the master device, just that they can not talk among slave devices.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a190d04d
    • Q
      tools: bpftool: add bash completion for bpftool · 995231c8
      Quentin Monnet committed
      Add a completion file for bash. The completion function runs bpftool
      when needed, making it smart enough to help users complete ids or tags
      for eBPF programs and maps currently on the system.
      
      Update Makefile to install completion file to
      /usr/share/bash-completion/completions when running `make install`.
      
      Emacs file mode and (at the end) Vim modeline have been added, to keep
      the style in use for most existing bash completion files. In this, it
      differs from tools/perf/perf-completion.sh, which seems to be the only
      other completion file among the kernel sources repository. This is also
      valid for indent style: 4-space indents, as in other completion files.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      995231c8
    • W
      net: aquantia: Make local functions static · 2660d226
      Wei Yongjun committed
      Fixes the following sparse warnings:
      
      drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c:224:5: warning:
       symbol 'aq_ethtool_get_coalesce' was not declared. Should it be static?
      drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c:245:5: warning:
       symbol 'aq_ethtool_set_coalesce' was not declared. Should it be static?
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2660d226
    • W
      ipv6: prevent user from adding cached routes · 2ea2352e
      Wei Wang committed
      Cached routes should only be created by the system when receiving pmtu
      discovery or ip redirect msg. Users should not be allowed to create
      cached routes.
      
      Furthermore, after the patch series to move cached routes into exception
      table, user added cached routes will trigger the following warning in
      fib6_add():
      
      WARNING: CPU: 0 PID: 2985 at net/ipv6/ip6_fib.c:1137
      fib6_add+0x20d9/0x2c10 net/ipv6/ip6_fib.c:1137
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 0 PID: 2985 Comm: syzkaller320388 Not tainted 4.14.0-rc3+ #74
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       panic+0x1e4/0x417 kernel/panic.c:181
       __warn+0x1c4/0x1d9 kernel/panic.c:542
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
       do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
       do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
       invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
      RIP: 0010:fib6_add+0x20d9/0x2c10 net/ipv6/ip6_fib.c:1137
      RSP: 0018:ffff8801cf09f6a0 EFLAGS: 00010297
      RAX: ffff8801ce45e340 RBX: 1ffff10039e13eec RCX: ffff8801d749c814
      RDX: 0000000000000000 RSI: ffff8801d749c700 RDI: ffff8801d749c780
      RBP: ffff8801cf09fa08 R08: 0000000000000000 R09: ffff8801cf09f360
      R10: ffff8801cf09f2d8 R11: 1ffff10039c8befb R12: 0000000000000001
      R13: dffffc0000000000 R14: ffff8801d749c700 R15: ffffffff860655c0
       __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1011
       ip6_route_add+0x148/0x1a0 net/ipv6/route.c:2782
       ipv6_route_ioctl+0x4d5/0x690 net/ipv6/route.c:3291
       inet6_ioctl+0xef/0x1e0 net/ipv6/af_inet6.c:521
       sock_do_ioctl+0x65/0xb0 net/socket.c:961
       sock_ioctl+0x2c2/0x440 net/socket.c:1058
       vfs_ioctl fs/ioctl.c:45 [inline]
       do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
       SYSC_ioctl fs/ioctl.c:700 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      So we fix this by failing the attempt to add cached routes from userspace
      and returning an EINVAL error.
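The check described above amounts to rejecting any user-supplied route whose flags mark it as a cache entry. A minimal sketch of that idea follows; `check_user_route_flags` is a hypothetical name, and the `RTF_CACHE` value is taken on the assumption that it matches the kernel's ipv6_route uapi header.

```python
import errno

# Assumption: value of the cache-entry flag in the ipv6_route uapi header.
RTF_CACHE = 0x01000000


def check_user_route_flags(flags: int) -> int:
    """Return 0 if userspace may add a route with these flags, else -EINVAL."""
    if flags & RTF_CACHE:
        # Cached routes are created only by the system (PMTU, redirects).
        return -errno.EINVAL
    return 0


if __name__ == "__main__":
    print(check_user_route_flags(RTF_CACHE))  # negative errno
    print(check_user_route_flags(0))          # 0, request accepted
```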
      
      Fixes: 2b760fcf ("ipv6: hook up exception table to store dst cache")
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2ea2352e
    • T
      samples/bpf: adjust rlimit RLIMIT_MEMLOCK for xdp_redirect_map · 21d72af7
      Tushar Dave committed
      The default RLIMIT_MEMLOCK rlimit is 64KB, which causes bpf map creation to fail,
      e.g.
      [root@lab bpf]# ./xdp_redirect_map $(</sys/class/net/eth2/ifindex) \
      > $(</sys/class/net/eth3/ifindex)
      failed to create a map: 1 Operation not permitted
      
      The failure occurs every time when multiple xdp programs are running. Fix it.
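The commit message does not show the change itself. One common way to avoid this kind of failure is to raise RLIMIT_MEMLOCK before creating maps, sketched below with Python's standard `resource` module. The helper name `raise_memlock_limit` is hypothetical, and raising the limit to unlimited is an assumption about the approach, not a quote from the patch.

```python
import resource


def raise_memlock_limit() -> bool:
    """Try to lift RLIMIT_MEMLOCK to unlimited; return True on success."""
    unlimited = (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
    try:
        resource.setrlimit(resource.RLIMIT_MEMLOCK, unlimited)
    except (ValueError, OSError):
        # Unprivileged processes cannot raise the hard limit.
        return False
    return True


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_MEMLOCK))
    print("raised:", raise_memlock_limit())
    print("after: ", resource.getrlimit(resource.RLIMIT_MEMLOCK))
```

Run as root, the limit should become unlimited; run unprivileged, the helper is expected to report failure and leave the limit unchanged.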
      Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21d72af7
    • T
      samples/bpf: adjust rlimit RLIMIT_MEMLOCK for xdp1 · 6dfca831
      Tushar Dave committed
      The default RLIMIT_MEMLOCK rlimit is 64KB, which causes bpf map creation to fail,
      e.g.
      [root@lab bpf]# ./xdp1 -N $(</sys/class/net/eth2/ifindex)
      failed to create a map: 1 Operation not permitted
      
      Fix it.
      Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6dfca831
    • F
      net: dsa: b53: Export b53_configure_vlan() · 5c1a6eaf
      Florian Fainelli committed
      bcm_sf2 and b53 replicate the same operations: clear all VLANs and set
      their ports to the default VLAN tag (1 for these devices) so export the
      b53 function doing just that.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5c1a6eaf
    • F
      liquidio: get rid of false alarm "Unknown cmd 27" in dmesg · 641da8ed
      Felix Manlunas committed
      Creating a macvtap interface with the liquidio VF driver as lower device
      causes this alarming message to show up in dmesg:
      
          liquidio_link_ctrl_cmd_completion Unknown cmd 27
      
      That's actually a false alarm because cmd 27 is the value of the macro
      OCTNET_CMD_SET_UC_LIST which is known.  It's a control command sent from
      host to NIC firmware to set the unicast MAC address list of the macvtap
      lower device.
      
      Make the false alarm go away by adding a case for OCTNET_CMD_SET_UC_LIST
      in liquidio_link_ctrl_cmd_completion().
      Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
      Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      641da8ed
    • H
      hv_netvsc: Set tx_table to equal weight after subchannels open · a6fb6aa3
      Haiyang Zhang committed
      In some cases, like internal vSwitch, the host doesn't provide
      send indirection table updates. This patch sets the table to be
      equal weight after subchannels are all open. Otherwise, all workload
      will be on one TX channel.
      
      As tested, this patch greatly increases the throughput over
      an internal vSwitch.
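An equal-weight send table can be pictured as a round-robin fill: each table slot maps to a channel index, and the slots are spread evenly over the open channels. The sketch below is illustrative only; `equal_weight_table` is a hypothetical helper and the table size of 16 in the usage lines is an assumed value.

```python
def equal_weight_table(table_size: int, num_channels: int) -> list:
    """Map each table slot to a channel so all channels get an equal share."""
    if table_size <= 0 or num_channels <= 0:
        raise ValueError("table_size and num_channels must be positive")
    return [slot % num_channels for slot in range(table_size)]


if __name__ == "__main__":
    # Assumed table size of 16 slots with 4 open channels.
    print(equal_weight_table(16, 4))
```

Without such a fill every slot would point at the same channel, which is the "all workload on one TX channel" situation the commit describes.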
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6fb6aa3
    • M
      ppp: allow usage in namespaces · 90e229ef
      Matteo Croce committed
      Check for CAP_NET_ADMIN with ns_capable() instead of capable()
      to allow usage of ppp in user namespace other than the init one.
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      90e229ef
    • D
      Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 87e3de1e
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      1GbE Intel Wired LAN Driver Updates 2017-10-27
      
      This patchset is a proposal of how the Traffic Control subsystem can
      be used to offload the configuration of the Credit Based Shaper
      (defined in the IEEE 802.1Q-2014 Section 8.6.8.2) into supported
      network devices.
      
      As part of this work, we've assessed previous public discussions
      related to TSN enabling: patches from Henrik Austad (Cisco), the
      presentation from Eric Mann at Linux Plumbers 2012, patches from
      Gangfeng Huang (National Instruments) and the current state of the
      OpenAVNU project (https://github.com/AVnu/OpenAvnu/).
      
      Overview
      ========
      
      Time-sensitive Networking (TSN) is a set of standards that aim to
      address resources availability for providing bandwidth reservation and
      bounded latency on Ethernet based LANs. The proposal described here
      aims to cover mainly what is needed to enable the following standards:
      802.1Qat and 802.1Qav.
      
      The initial target of this work is the Intel i210 NIC, but other
      controllers' datasheets were also taken into account, like the Renesas
      RZ/A1H RZ/A1M group and the Synopsys DesignWare Ethernet QoS
      controller.
      
      Proposal
      ========
      
      Feature-wise, what is covered here is the configuration interfaces for
      HW implementations of the Credit-Based shaper (CBS, 802.1Qav). CBS is
      a per-queue shaper. Given that this feature is related to traffic
      shaping, and that the traffic control subsystem already provides a
      queueing discipline that offloads config into the device driver (i.e.
      mqprio), designing a new qdisc for the specific purpose of offloading
      the config for the CBS shaper seemed like a good fit.
      
      For steering traffic into the correct queues, we use the socket option
      SO_PRIORITY and then a mechanism to map priority to traffic classes /
      Tx queues. The qdisc mqprio is currently used in our tests.
      
      As for the CBS config interface, this patchset is proposing a new
      qdisc called 'cbs'. Its 'tc' cmd line is:
      
      $ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S \
           idleslope I
      
         Note that the parameters for this qdisc are the ones defined by the
         802.1Q-2014 spec, so no hardware specific functionality is exposed here.
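To make the four parameters concrete, here is a small worked calculation. It rests on assumptions not stated in the text above: that slopes are given in kbit/s, credits in bytes, that sendslope is idleslope minus the port transmit rate, and that the credit bounds scale the maximum frame and interference sizes by the slope-to-rate ratios, as I understand the 802.1Q description. The function name `cbs_params` and the example numbers are hypothetical.

```python
def cbs_params(idleslope_kbps, port_rate_kbps, max_frame_bytes,
               max_interference_bytes):
    """Derive sendslope, hicredit and locredit from idleslope (assumed formulas)."""
    if not 0 < idleslope_kbps <= port_rate_kbps:
        raise ValueError("idleslope must be positive and at most the port rate")
    sendslope = idleslope_kbps - port_rate_kbps
    hicredit = max_interference_bytes * idleslope_kbps / port_rate_kbps
    locredit = max_frame_bytes * sendslope / port_rate_kbps
    return {"idleslope": idleslope_kbps, "sendslope": sendslope,
            "hicredit": hicredit, "locredit": locredit}


if __name__ == "__main__":
    # Hypothetical stream: 20 Mbit/s reserved on a 1 Gbit/s port, 1500-byte frames.
    print(cbs_params(20000, 1000000, 1500, 1500))
```

Any rounding of the credit values to integers is left to the caller, since the rounding convention is not covered here.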
      
      Per-stream shaping, as defined by IEEE 802.1Q-2014 Section 34.6.1, is
      not yet covered by this proposal.
      
      v2: Merged patch 6 of the original series into patch 4 based on feedback
          from David Miller.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      87e3de1e
    • D
      Merge branch 'l2tp-register-sessions-atomically' · 05ce8bd4
      David S. Miller committed
      Guillaume Nault says:
      
      ====================
      l2tp: register sessions atomically
      
      Currently l2tp_session_create() allocates a session, partially
      initialises it and finally registers it. It therefore exposes sessions
      that aren't fully initialised to the rest of the system, because
      pseudo-wire specific initialisation can only happen after
      l2tp_session_create() returns.
      This leads to several crashes when these sessions are used or deleted.
      
      This series starts by splitting session registration out of
      l2tp_session_create() (patch #1), thus allowing pseudo-wire code to
      complete the initialisation phase before registration.
      
      Then patch #2 fixes the eth pseudo-wire code. This requires protecting
      the session's netdevice pointer with RCU, because it still needs to be
      updated concurrently after the session got registered.
      
      Remaining patches take care of ppp pseudo-wires. RCU protection is
      needed there too, for the same reasons. This time it's the pppol2tp
      socket pointer that gets protected. For clarity, and since the
      conversion requires more modifications, introducing RCU is done in
      its own patch (#3). Then patch #4 only has to take care of fixing
      sessions initialisation and registration (and adapting part of the
      deletion process).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05ce8bd4
    • G
      l2tp: initialise PPP sessions before registering them · f98be6c6
      Guillaume Nault committed
      pppol2tp_connect() initialises L2TP sessions after they've been exposed
      to the rest of the system by l2tp_session_register(). This puts
      sessions into transient states that are the source of several races, in
      particular with session's deletion path.
      
      This patch centralises the initialisation code into
      pppol2tp_session_init(), which is called before the registration phase.
      The only field that can't be set before session registration is the
      pppol2tp socket pointer, which has already been converted to RCU. So
      pppol2tp_connect() should now be race-free.
      
      The session's .session_close() callback is now set before registration.
      Therefore, it's always called when l2tp_core deletes the session, even
      if it was created by pppol2tp_session_create() and hasn't been plugged
      to a pppol2tp socket yet. That'd prevent session free because the extra
      reference taken by pppol2tp_session_close() wouldn't be dropped by the
      socket's ->sk_destruct() callback (pppol2tp_session_destruct()).
      We could set .session_close() only while connecting a session to its
      pppol2tp socket, or teach pppol2tp_session_close() to avoid grabbing a
      reference when the session isn't connected, but that'd require adding
      some form of synchronisation to be race free.
      
      Instead of that, we can just let the pppol2tp socket hold a reference
      on the session as soon as it starts depending on it (that is, in
      pppol2tp_connect()). Then we don't need to utilise
      pppol2tp_session_close() to hold a reference at the last moment to
      prevent l2tp_core from dropping it.
      
      When releasing the socket, pppol2tp_release() now deletes the session
      using the standard l2tp_session_delete() function, instead of merely
      removing it from hash tables. l2tp_session_delete() drops the reference
      the session holds on itself, but also makes sure it doesn't remove a
      session twice. So it can safely be called, even if l2tp_core already
      tried, or is concurrently trying, to remove the session.
      Finally, pppol2tp_session_destruct() drops the reference held by the
      socket.
      
      Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f98be6c6
    • G
      l2tp: protect sock pointer of struct pppol2tp_session with RCU · ee40fb2e
      Guillaume Nault committed
      pppol2tp_session_create() registers sessions that can't have their
      corresponding socket initialised. This socket has to be created by
      userspace, then connected to the session by pppol2tp_connect().
      Therefore, we need to protect the pppol2tp socket pointer of L2TP
      sessions, so that it can safely be updated when userspace is connecting
      or closing the socket. This will eventually allow pppol2tp_connect()
      to avoid generating transient states while initialising its parts of the
      session.
      
      To this end, this patch protects the pppol2tp socket pointer using RCU.
      
      The pppol2tp socket pointer is still set in pppol2tp_connect(), but
      only once we know the function isn't going to fail. It's eventually
      reset by pppol2tp_release(), which now has to wait for a grace period
      to elapse before it can drop the last reference on the socket. This
      ensures that pppol2tp_session_get_sock() can safely grab a reference
      on the socket, even after ps->sk is reset to NULL but before this
      operation actually gets visible from pppol2tp_session_get_sock().
      
      The rest is standard RCU conversion: pppol2tp_recv(), which already
      runs in atomic context, is simply enclosed by rcu_read_lock() and
      rcu_read_unlock(), while other functions are converted to use
      pppol2tp_session_get_sock() followed by sock_put().
      pppol2tp_session_setsockopt() is a special case. It used to retrieve
      the pppol2tp socket from the L2TP session, which itself was retrieved
      from the pppol2tp socket. Therefore we can just avoid dereferencing
      ps->sk and directly use the original socket pointer instead.
      
      With all users of ps->sk now handling NULL and concurrent updates, the
      L2TP ->ref() and ->deref() callbacks aren't needed anymore. Therefore,
      rather than converting pppol2tp_session_sock_hold() and
      pppol2tp_session_sock_put(), we can just drop them.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee40fb2e
    • G
      l2tp: initialise l2tp_eth sessions before registering them · ee28de6b
      Guillaume Nault committed
      Sessions must be initialised before being made externally visible by
      l2tp_session_register(). Otherwise the session may be concurrently
      deleted before being initialised, which can confuse the deletion path
      and eventually lead to kernel oops.
      
      Therefore, we need to move l2tp_session_register() down in
      l2tp_eth_create(), but also handle the intermediate step where only the
      session or the netdevice has been registered.
      
      We can't just call l2tp_session_register() in ->ndo_init() because
      we'd have no way to properly undo this operation in ->ndo_uninit().
      Instead, let's register the session and the netdevice in two different
      steps and protect the session's device pointer with RCU.
      
      And now that we allow the session's .dev field to be NULL, we don't
      need to prevent the netdevice from being removed anymore. So we can
      drop the dev_hold() and dev_put() calls in l2tp_eth_create() and
      l2tp_eth_dev_uninit().
      
      Fixes: d9e31d17 ("l2tp: Add L2TP ethernet pseudowire support")
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee28de6b
    • G
      l2tp: don't register sessions in l2tp_session_create() · 3953ae7b
      Guillaume Nault committed
      Sessions created by l2tp_session_create() aren't fully initialised:
      some pseudo-wire specific operations need to be done before making the
      session usable. Therefore the PPP and Ethernet pseudo-wires continue
      working on the returned l2tp session while it's already been exposed to
      the rest of the system.
      This can lead to various issues. In particular, the session may enter
      the deletion process before having been fully initialised, which will
      confuse the session removal code.
      
      This patch moves session registration out of l2tp_session_create(), so
      that callers can control when the session is exposed to the rest of the
      system. This is done by the new l2tp_session_register() function.
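The ordering this patch enables (allocate, finish initialisation, then register) can be illustrated with a toy registry. Everything below is a hypothetical model in Python, not the l2tp code: `Session`, `Tunnel` and `create_session` are made-up names, and a lock-protected dict stands in for the kernel's session hash tables.

```python
import threading


class Session:
    def __init__(self, session_id):
        self.session_id = session_id
        self.initialised = False


class Tunnel:
    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}

    def register(self, session):
        """Expose a session; from this point other threads can find it."""
        with self._lock:
            if session.session_id in self._sessions:
                raise KeyError("session id already in use")
            self._sessions[session.session_id] = session

    def lookup(self, session_id):
        with self._lock:
            return self._sessions.get(session_id)


def create_session(tunnel, session_id):
    session = Session(session_id)  # allocate only, nothing is visible yet
    session.initialised = True     # pseudo-wire specific setup would go here
    tunnel.register(session)       # expose last, once fully initialised
    return session


if __name__ == "__main__":
    tunnel = Tunnel()
    create_session(tunnel, 1)
    print(tunnel.lookup(1).initialised)
```

Because registration is the final step, a concurrent lookup in this model either misses the session entirely or finds it fully initialised.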
      
      Only pppol2tp_session_create() can be easily converted to avoid
      modifying its session after registration (the debug message is dropped
      in order to avoid the need for holding a reference on the session).
      
      For pppol2tp_connect() and l2tp_eth_create(), more work is needed.
      That'll be done in followup patches. For now, let's just register the
      session right after its creation, like it was done before. The only
      difference is that we can easily take a reference on the session before
      registering it, so, at least, we're sure it's not going to be freed
      while we're working on it.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3953ae7b
    • D
      tcp: Remove "linux/unaligned/access_ok.h" include. · 949cf8b1
      David S. Miller committed
      This causes build failures:
      
      In file included from net/ipv4/tcp_input.c:79:0:
      ./include/linux/unaligned/access_ok.h:7:28: error: redefinition of
      'get_unaligned_le16'
      In file included from ./include/asm-generic/unaligned.h:17:0,
                       from ./arch/arm/include/generated/asm/unaligned.h:1,
                       from net/ipv4/tcp_input.c:76:
      ./include/linux/unaligned/le_struct.h:6:19: note: previous definition
      of 'get_unaligned_le16' was here
      In file included from net/ipv4/tcp_input.c:79:0:
      ./include/linux/unaligned/access_ok.h:12:28: error: redefinition of
      'get_unaligned_le32'
      
      Plain "asm/unaligned.h", which is already included, is
      sufficient.
      
      Fixes: 60e2a778 ("tcp: TCP experimental option for SMC")
      Reported-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      949cf8b1
    • A
      cxgb3: Check and handle the dma mapping errors · c69fe407
      Arjun Vynipadath committed
      This patch adds checks at appropriate places for whether the *dma_map*() call has
      succeeded or not.
      
      Original Work by: Santosh Rastapur <santosh@chelsio.com>
      Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
      Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c69fe407
    • F
      r8169: Add support for interrupt coalesce tuning (ethtool -C) · 50970831
      Francois Romieu committed
      Kirr: In particular with
      
      	ethtool -C <ifname> rx-usecs 0 rx-frames 0
      
      now it is possible to disable RX delays when NIC usage requires low-latency.
      
      See this thread for context:
      
      	https://www.spinics.net/lists/netdev/msg217665.html
      
      My specific case is that:
      
      We have many computers with gigabit Realtek NICs. For 2 such computers
      connected to a gigabit store-and-forward switch the minimum round-trip
      time for small pings (`ping -i 0 -w 3 -s 56 -q peer`) is ~ 30μs.
      
      However it turned out that when Ethernet frame length transitions 127 ->
      128 bytes (`ping -i 0 -w 3 -s {81 -> 82} -q peer`) the lowest RTT
      transitions step-wise to ~ 270μs.
      
      As David Laight said, this is RX interrupt mitigation done by the NIC, which creates
      the latency. For workloads when low-latency is required with e.g. Intel,
      BCM etc NIC drivers one just uses `ethtool -C rx-usecs ...` to reduce
      the time NIC delays before interrupting CPU, but it turned out
      `ethtool -C` is not supported by r8169 driver.
      
      Like Stéphane ANCELOT I've traced the problem down to IntrMitigate being
      hardcoded to != 0 for our chips (we have 8168 based NICs):
      
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/realtek/r8169.c#n5460
      static void rtl_hw_start_8169(struct net_device *dev) {
              ...
              /*
               * Undocumented corner. Supposedly:
               * (TxTimer << 12) | (TxPackets << 8) | (RxTimer << 4) | RxPackets
               */
              RTL_W16(IntrMitigate, 0x0000);
      
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/realtek/r8169.c#n6346
      static void rtl_hw_start_8168(struct net_device *dev) {
              ...
              RTL_W16(IntrMitigate, 0x5151);
      
      and then I've also found
      
      	https://www.spinics.net/lists/netdev/msg217665.html
      
      and Francois's original patch:
      
      	https://www.spinics.net/lists/netdev/msg217984.html
      	https://www.spinics.net/lists/netdev/msg218207.html
      
      So could we please finally get support for tuning r8169 interrupt
      coalescing in tree? (so that next poor soul who hits the problem does
      not need to go all the way to dig into driver sources and internet
      wildly and finally patch locally
      
              -RTL_W16(IntrMitigate, 0x5151);
              +RTL_W16(IntrMitigate, 0x5100);
      
      guessing whether it is right or not and also having to care to deploy
      the patch everywhere it needs to be used, etc...).
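The layout quoted in the driver comment, (TxTimer << 12) | (TxPackets << 8) | (RxTimer << 4) | RxPackets, can be decoded with a few shifts. The sketch below takes that comment at face value (the driver itself calls it an undocumented corner) and uses hypothetical helper names; it only shows why 0x5151 -> 0x5100 clears the RX fields.

```python
def pack_intr_mitigate(tx_timer, tx_packets, rx_timer, rx_packets):
    """Pack four 4-bit fields into a 16-bit value, per the quoted comment."""
    for field in (tx_timer, tx_packets, rx_timer, rx_packets):
        if not 0 <= field <= 0xF:
            raise ValueError("each field must fit in 4 bits")
    return (tx_timer << 12) | (tx_packets << 8) | (rx_timer << 4) | rx_packets


def unpack_intr_mitigate(value):
    """Return (tx_timer, tx_packets, rx_timer, rx_packets)."""
    return ((value >> 12) & 0xF, (value >> 8) & 0xF,
            (value >> 4) & 0xF, value & 0xF)


if __name__ == "__main__":
    print(unpack_intr_mitigate(0x5151))
    print(unpack_intr_mitigate(0x5100))
    print(hex(pack_intr_mitigate(5, 1, 0, 0)))
```

Under this reading, the local patch keeps the TX fields and zeroes both the RX timer and the RX packet count.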
      
      To do so I took Francois's original patch from 2012 and reworked it a bit:
      
      - updated to latest net-next.git;
      - adjusted scaling setup based on feedback from Hayes to pick up scaling
        vector depending not only on link speed but also on CPlusCmd[0:1] and to
        adjust CPlusCmd[0:1] correspondingly when setting timings;
      - improved error handling a bit (I think).
      
      I've tested the patch on "RTL8168d/8111d" (XID 083000c0) and with it and
      `ethtool -C rx-usecs 0 rx-frames 0` on both ends it improves:
      
      - minimum RTT latency:
      
              ~270μs ->  ~30μs (small packet),
              ~330μs -> ~110μs (full 1.5K ethernet frame)
      
      - average RTT latency:
      
              ~480μs ->  ~50μs (small packet),
              ~560μs -> ~125μs (full 1.5K ethernet frame)
      
      ( before:
      
              root@neo1:# ping -i 0 -w 3 -s 82 -q neo2
              PING neo2.kirr.nexedi.com (192.168.102.21) 82(110) bytes of data.
      
              --- neo2.kirr.nexedi.com ping statistics ---
              5906 packets transmitted, 5905 received, 0% packet loss, time 2999ms
              rtt min/avg/max/mdev = 0.274/0.485/0.607/0.026 ms, ipg/ewma 0.508/0.489 ms
      
              root@neo1:# ping -i 0 -w 3 -s 1472 -q neo2
              PING neo2.kirr.nexedi.com (192.168.102.21) 1472(1500) bytes of data.
      
              --- neo2.kirr.nexedi.com ping statistics ---
              5073 packets transmitted, 5073 received, 0% packet loss, time 2999ms
              rtt min/avg/max/mdev = 0.330/0.566/0.710/0.028 ms, ipg/ewma 0.591/0.544 ms
      
        after:
      
              root@neo1# ping -i 0 -w 3 -s 82 -q neo2
              PING neo2.kirr.nexedi.com (192.168.102.21) 82(110) bytes of data.
      
              --- neo2.kirr.nexedi.com ping statistics ---
              45815 packets transmitted, 45815 received, 0% packet loss, time 3000ms
              rtt min/avg/max/mdev = 0.036/0.051/0.368/0.010 ms, ipg/ewma 0.065/0.053 ms
      
              root@neo1:# ping -i 0 -w 3 -s 1472 -q neo2
              PING neo2.kirr.nexedi.com (192.168.102.21) 1472(1500) bytes of data.
      
              --- neo2.kirr.nexedi.com ping statistics ---
              21250 packets transmitted, 21250 received, 0% packet loss, time 3000ms
              rtt min/avg/max/mdev = 0.112/0.125/0.390/0.007 ms, ipg/ewma 0.141/0.125 ms
      
        the small -> 1.5K latency growth is understandable as it takes ~15μs
        to transmit 1.5K on 1Gbps on the wire and with 2 hosts and 1 switch
        and ICMP ECHO + ECHO reply the packet has to travel 4 ethernet
        segments which is already 60μs;
      
        probably something else is also at play, as e.g. on Linux, even
        with `cpupower frequency-set -g performance`, on some computers I've
        noticed the kernel can be spending more time in software-only mode
        when incoming packets go in less frequently. E.g. this program can
        demonstrate the effect for ICMP ECHO processing:
      
        https://lab.nexedi.com/kirr/bcc/blob/43cfc13b/tools/pinglat.py
      
        (later this was found to be partly due to C-states exit latencies) )
      
      We have had this patch running in our testing setup for 1 month already
      without any issues observed.
      
      It remains to be clarified whether RX and TX timers use the same base.
      For now I've set them equally, but Francois's original patch version
      suggests it could be not the same.
      
      I've received no feedback at all on my original posting of this patch and questions
      
      	https://www.spinics.net/lists/netdev/msg457173.html
      
      from either Francois or anyone at Realtek in one month.
      
      So I suggest we simply apply it to net-next.git now.
      
      Cc: Francois Romieu <romieu@fr.zoreil.com>
      Cc: Hayes Wang <hayeswang@realtek.com>
      Cc: Realtek linux nic maintainers <nic_swsd@realtek.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Stéphane ANCELOT <sancelot@free.fr>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      50970831
    • D
      Merge branch 'bridge-make-setlink-dellink-notifications-more-accurate' · 8ef2097e
      David S. Miller committed
      Nikolay Aleksandrov says:
      
      ====================
      bridge: make setlink/dellink notifications more accurate
      
      Before this set the bridge would generate a notification on vlan add or del
      even if nothing actually changed, which confuses listeners and
      is generally undesirable. We could also lose notifications on actual
      changes if one adds a range of vlans and there's an error in the middle.
      The problem with just breaking and returning an error is that we could
      break existing user-space scripts which rely on the vlan delete to clear
      all existing entries in the specified range and ignore the errors for
      non-existent vlans (typically used to clear the current vlan config).
      So in order to make the notifications more accurate while keeping backwards
      compatibility we add a boolean that tracks if anything actually changed
      during the config calls.
      
      The vlan add is more difficult to fix because it always returns 0 even if
      nothing changed. We cannot use a specific error code to signal this because
      the drivers can return anything and we might mask a real error; we would
      also need to update all places that directly return the add result. So, to
      signal that a vlan was created or updated without breaking overlapping vlan
      range adds, we pass the new boolean that tracks changes down to the add
      functions, which set it if anything was actually updated.
      
      v6: moved "changed" in else branch in br|nbp_vlan_add, thanks to
          Toshiaki Makita and retested everything again
      v5: fix br_vlan_add return (v1 leftover) spotted by Toshiaki Makita
      v4: set changed always to false in the non-vlan config case and retested
      v3: rebased to latest net-next and fixed non-vlan config functions reported
          by kbuild test bot
      v2: pass changed down to vlan add instead of masking errors
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ef2097e
    • N
      bridge: vlan: signal if anything changed on vlan add · f418af63
      Nikolay Aleksandrov committed
      Before this patch there was no way to tell if the vlan add operation
      actually changed anything, thus we would always generate a notification
      on adds. Let's make the notifications more precise and generate them
      only if anything changed, so use the new bool parameter to signal that the
      vlan was updated. We cannot return an error because there are valid use
      cases that will be broken (e.g. overlapping range add) and also we can't
      risk masking errors due to calls into drivers for vlan add which can
      potentially return anything.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f418af63
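
      The idea of reporting "changed" separately from the error code can be
      modelled in a few lines of Python. This is only an illustrative sketch:
      the patch passes a bool parameter down to the add functions, whereas this
      model returns the flag alongside the error code, and `vlan_add`,
      `vlan_add_range` and the dict-based vlan table are hypothetical names,
      not the kernel's API.

```python
def vlan_add(vlans, vid, flags):
    """Add or update one vlan entry and return (err, changed).

    err is always 0 in this model, mirroring the point that an add succeeds
    even when nothing changed. changed is True only if the entry was created
    or its flags were updated.
    """
    if vid in vlans and vlans[vid] == flags:
        return 0, False  # entry already present with the same flags
    vlans[vid] = flags
    return 0, True


def vlan_add_range(vlans, first, last, flags):
    """Add a vlan range; changed is True if at least one entry was touched."""
    changed = False
    for vid in range(first, last + 1):
        err, vlan_changed = vlan_add(vlans, vid, flags)
        if err:
            # Report the error, but keep what already changed before it.
            return err, changed
        changed = changed or vlan_changed
    return 0, changed


if __name__ == "__main__":
    table = {}
    print(vlan_add_range(table, 10, 20, 0))  # new entries are created
    print(vlan_add_range(table, 10, 20, 0))  # same range again, no change
    print(vlan_add_range(table, 15, 25, 0))  # overlapping range, partly new
```

      An overlapping range add still succeeds in this model, and the caller
      can decide whether to notify from the flag alone.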
    • N
      bridge: netlink: make setlink/dellink notifications more accurate · e19b42a1
      Nikolay Aleksandrov committed
      Before this patch we had cases that either sent notifications when there
      were in fact no changes (e.g. non-existent vlan delete) or didn't send
      notifications when there were changes (e.g. vlan add range with an error in
      the middle, port flags change + vlan update error). This patch sends down
      a boolean to the functions setlink/dellink use and if there is even a
      single configuration change (port flag, vlan add/del, port state) then
      we always send a notification. This is all done to keep backwards
      compatibility with the opportunistic vlan delete, where one can
      specify a vlan range that has missing vlans inside and everything
      in that range will still be cleared. This is mostly used to clear the
      whole vlan config with a single call, i.e. range 1-4094.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e19b42a1
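
      The opportunistic range delete described above can be sketched as a toy
      Python model. It is not the kernel code: `vlan_del`, `dellink_range`, the
      set-based vlan table and the `notify` callback are hypothetical names
      used only to show that missing vlans are skipped and that a notification
      is sent only when something was actually removed.

```python
import errno


def vlan_del(vlans, vid):
    """Delete one vlan; return 0 on success or -ENOENT if it is missing."""
    if vid not in vlans:
        return -errno.ENOENT
    vlans.remove(vid)
    return 0


def dellink_range(vlans, first, last, notify):
    """Clear every vlan in [first, last], ignoring the ones that are missing.

    notify() is called once, and only if at least one vlan was removed.
    Returns the changed flag.
    """
    changed = False
    for vid in range(first, last + 1):
        if vlan_del(vlans, vid) == 0:
            changed = True
        # A missing vlan is not treated as fatal: keep clearing the range.
    if changed:
        notify()
    return changed


if __name__ == "__main__":
    table = {10, 20, 30}
    sent = []
    dellink_range(table, 1, 4094, lambda: sent.append("notify"))
    dellink_range(table, 1, 4094, lambda: sent.append("notify"))
    print(table, sent)
```

      In this model the second call over the now-empty range removes nothing,
      so no further notification is generated.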
  2. 28 Oct, 2017 16 commits