1. 31 Dec, 2019 9 commits
    • net/ncsi: Fix gma flag setting after response · 9e860947
      Committed by Vijay Khemka
      gma_flag was set at the time of the GMA command request, but it should
      only be set after getting a successful response. Move this flag
      setting into the GMA response handler.

      This flag is used mainly to avoid repeating the GMA command once the
      MAC address has been received.
      Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
      Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add enabled check for path tracepoint loop. · f398efc1
      Committed by Kevin Kou
      sctp_outq_sack is the main function that handles SACK, and it is called
      very frequently. The commit "move trace_sctp_probe_path into
      sctp_outq_sack" added the code below to this function. The sctp
      tracepoint is disabled most of the time, but the loop over the transport
      list always runs even when the tracepoint is disabled, which is
      unnecessary.

      +	/* SCTP path tracepoint for congestion control debugging. */
      +	list_for_each_entry(transport, transport_list, transports) {
      +		trace_sctp_probe_path(transport, asoc);
      +	}

      This patch adds a tracepoint-enabled check outside the loop over the
      transport list, so the list is not traversed when tracing is disabled.
      It is a small optimization.
      Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
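The guard described in the sctp commit above can be sketched in plain userspace C. This is only an illustration of the pattern, not the kernel patch: `struct transport`, `probe_path_enabled`, `trace_probe_path()` and `probe_all_paths()` are hypothetical stand-ins for the kernel's transport list and tracepoint machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one path (transport) of an association. */
struct transport {
	struct transport *next;
	int probes;		/* how many times this path was traced */
};

/* Hypothetical stand-in for the tracepoint's "enabled" state. */
static bool probe_path_enabled;

/* Stand-in for the tracepoint call itself. */
static void trace_probe_path(struct transport *t)
{
	assert(t != NULL);
	t->probes++;
}

/*
 * Walk the transport list only when tracing is on.
 * Returns the number of list nodes visited, so a caller can see that the
 * disabled case costs one branch and no list traversal.
 */
static size_t probe_all_paths(struct transport *head)
{
	size_t visited = 0;

	if (!probe_path_enabled)
		return 0;

	for (struct transport *t = head; t != NULL; t = t->next) {
		trace_probe_path(t);
		visited++;
	}
	return visited;
}
```

With `probe_path_enabled` false, `probe_all_paths()` returns 0 without touching the list; with it true, every transport is visited once.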
    • Merge branch 'Improvements-to-SJA1105-DSA-RX-timestamping' · 9010ef57
      Committed by David S. Miller
      Vladimir Oltean says:

      ====================
      Improvements to SJA1105 DSA RX timestamping

      This series makes the sja1105 DSA driver use a dedicated kernel thread
      for RX timestamping, a process which is time-sensitive and otherwise a
      bit fragile. This allows users to customize their system (probably an
      embedded PTP switch) fully and allocate the CPU bandwidth for the driver
      to expedite the RX timestamps as quickly as possible.

      While doing this conversion, add a function to the PTP core for
      cancelling this kernel thread (function which I found rather strange to
      be missing).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: Empty the RX timestamping queue on PTP settings change · 19d1f0ed
      Committed by Vladimir Oltean
      When disabling PTP timestamping, don't reset the switch with the new
      static config until all existing PTP frames have been timestamped on the
      RX path or dropped. There's nothing we can do with these afterwards.
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: Use PTP core's dedicated kernel thread for RX timestamping · 1e762bd2
      Committed by Vladimir Oltean
      And move the queue of skb's waiting for RX timestamps into the ptp_data
      structure, since it isn't needed if PTP is not compiled.
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ptp: introduce ptp_cancel_worker_sync · 544fed47
      Committed by Vladimir Oltean
      In order to effectively use the PTP kernel thread for tasks such as
      timestamping packets, allow the user control over stopping it, which is
      needed e.g. when the timestamping queues must be drained.
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sfc: avoid duplicate error handling code in 'efx_ef10_sriov_set_vf_mac()' · db99d512
      Committed by Christophe JAILLET
      'eth_zero_addr()' is already called in the error handling path. This is
      harmless, but there is no point in calling it twice, so remove one.
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_cubic: refactor code to perform a divide only when needed · f278b99c
      Committed by Eric Dumazet
      Neal Cardwell suggested to not change ca->delay_min
      and apply the ack delay cushion only when Hystart ACK train
      is still under consideration. This should avoid a 64bit
      divide unless needed.
      
      Tested:
      
      40Gbit(mlx4) testbed (with sch_fq as packet scheduler)
      
      $ echo -n 'file tcp_cubic.c +p'  >/sys/kernel/debug/dynamic_debug/control
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        14815
        16280
        15293
        15563
        11574
        15145
        14789
        18548
        16972
        12520
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       1396               0.0
      $ dmesg | tail -10
      [ 4873.951350] hystart_ack_train (116 > 93) delay_min 24 (+ ack_delay 69) cwnd 80
      [ 4875.155379] hystart_ack_train (55 > 50) delay_min 21 (+ ack_delay 29) cwnd 160
      [ 4876.333921] hystart_ack_train (69 > 62) delay_min 23 (+ ack_delay 39) cwnd 130
      [ 4877.519037] hystart_ack_train (69 > 60) delay_min 22 (+ ack_delay 38) cwnd 130
      [ 4878.701559] hystart_ack_train (87 > 63) delay_min 24 (+ ack_delay 39) cwnd 160
      [ 4879.844597] hystart_ack_train (93 > 50) delay_min 21 (+ ack_delay 29) cwnd 216
      [ 4880.956650] hystart_ack_train (74 > 67) delay_min 20 (+ ack_delay 47) cwnd 108
      [ 4882.098500] hystart_ack_train (61 > 57) delay_min 23 (+ ack_delay 34) cwnd 130
      [ 4883.262056] hystart_ack_train (72 > 67) delay_min 21 (+ ack_delay 46) cwnd 130
      [ 4884.418760] hystart_ack_train (74 > 67) delay_min 29 (+ ack_delay 38) cwnd 152
      
      10Gbit(bnx2x) testbed (with sch_fq as packet scheduler)
      
      $ echo -n 'file tcp_cubic.c +p'  >/sys/kernel/debug/dynamic_debug/control
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpk52 -l -4000000; done;nstat|egrep "Hystart"
         7050
         7065
         7100
         6900
         7202
         7263
         7189
         6869
         7463
         7034
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       3199               0.0
      $ dmesg | tail -10
      [  176.920012] hystart_ack_train (161 > 141) delay_min 83 (+ ack_delay 58) cwnd 264
      [  179.144645] hystart_ack_train (164 > 159) delay_min 120 (+ ack_delay 39) cwnd 444
      [  181.354527] hystart_ack_train (214 > 168) delay_min 125 (+ ack_delay 43) cwnd 436
      [  183.539565] hystart_ack_train (170 > 147) delay_min 96 (+ ack_delay 51) cwnd 326
      [  185.727309] hystart_ack_train (177 > 160) delay_min 61 (+ ack_delay 99) cwnd 128
      [  187.947142] hystart_ack_train (184 > 167) delay_min 123 (+ ack_delay 44) cwnd 367
      [  190.166680] hystart_ack_train (230 > 153) delay_min 116 (+ ack_delay 37) cwnd 444
      [  192.327285] hystart_ack_train (210 > 206) delay_min 86 (+ ack_delay 120) cwnd 152
      [  194.511392] hystart_ack_train (173 > 151) delay_min 94 (+ ack_delay 57) cwnd 239
      [  196.736023] hystart_ack_train (149 > 146) delay_min 105 (+ ack_delay 41) cwnd 399
      
      Fixes: 42f3a8aa ("tcp_cubic: tweak Hystart detection for short RTT flows")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Neal Cardwell <ncardwell@google.com>
      Link: https://www.spinics.net/lists/netdev/msg621886.html
      Link: https://www.spinics.net/lists/netdev/msg621797.html
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
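The idea in the tcp_cubic commit above (leave delay_min untouched and compute the ACK delay cushion, with its 64-bit divide, only while the ACK train check is active) can be sketched as follows. This is a userspace illustration under stated assumptions, not the kernel code: the function names, the cushion formula (time to send a burst at the pacing rate, capped) and the microsecond units are assumptions. The only thing taken from the log above is the relation visible in lines such as "(116 > 93) delay_min 24 (+ ack_delay 69)", where the threshold 93 equals 24 + 69.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Counts how often the (comparatively costly) 64-bit divide runs. */
static unsigned int divide_count;

/*
 * Hypothetical ACK delay cushion: the time in microseconds needed to send
 * burst_bytes at pacing_rate bytes per second, capped at cap_us.
 */
static uint32_t ack_delay_us(uint64_t burst_bytes, uint64_t pacing_rate,
			     uint32_t cap_us)
{
	uint64_t us;

	assert(pacing_rate != 0);
	divide_count++;
	us = burst_bytes * 1000000ULL / pacing_rate;
	return us < cap_us ? (uint32_t)us : cap_us;
}

/*
 * Returns true when the time elapsed since the start of the round exceeds
 * delay_min plus the cushion. delay_min itself is never modified, and the
 * divide is skipped entirely when the ACK train check is not active.
 */
static bool ack_train_exceeded(bool train_check_active, uint32_t elapsed_us,
			       uint32_t delay_min_us, uint64_t burst_bytes,
			       uint64_t pacing_rate, uint32_t cap_us)
{
	uint32_t threshold;

	if (!train_check_active)
		return false;

	threshold = delay_min_us + ack_delay_us(burst_bytes, pacing_rate, cap_us);
	return elapsed_us > threshold;
}
```

With an invented burst of 69000 bytes at 1000000000 bytes per second the cushion is 69, so a delay_min of 24 gives the threshold 93 seen in the first log line, and an elapsed time of 116 exceeds it.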
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · ba402810
      Committed by David S. Miller
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains Netfilter updates for net-next:
      
      1) Remove #ifdef pollution around nf_ingress(), from Lukas Wunner.
      
      2) Document ingress hook in netdevice, also from Lukas.
      
      3) Remove htons() in tunnel metadata port netlink attributes,
         from Xin Long.
      
      4) Missing erspan netlink attribute validation also from Xin Long.
      
      5) Missing erspan version in tunnel, from Xin Long.
      
      6) Missing attribute nest in NFTA_TUNNEL_KEY_OPTS_{VXLAN,ERSPAN}
         Patch from Xin Long.
      
      7) Missing nla_nest_cancel() in tunnel netlink dump path,
         from Xin Long.
      
      8) Remove two exported conntrack symbols with no clients,
         from Florian Westphal.
      
      9) Add nft_meta_get_eval_time() helper to nft_meta, from Florian.
      
      10) Add nft_meta_pkttype helper for loopback, also from Florian.
      
      11) Add nft_meta_socket uid helper, from Florian Westphal.
      
      12) Add nft_meta_cgroup helper, from Florian.
      
      13) Add nft_meta_ifkind helper, from Florian.
      
      14) Group all interface related meta selector, from Florian.
      
      15) Add nft_prandom_u32() helper, from Florian.
      
      16) Add nft_meta_rtclassid helper, from Florian.
      
      17) Add support for matching on the slave device index,
          from Florian.
      
      This batch, among other things, contains updates for the netfilter
      tunnel netlink interface. This extension is still incomplete and lacks
      proper userspace support, which is actually my fault: I did not find the
      time to go back and finish it. This update breaks the tunnel UAPI in
      some respects in order to fix it, but better to do it sooner than never.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 29 Dec, 2019 4 commits
    • Merge branch 'DSA-TX-tstamp' · 1a1fda57
      Committed by David S. Miller
      Vladimir Oltean says:
      
      ====================
      The DSA TX timestamping situation
      
      This series is the moral v2 of "[PATCH net] net: dsa: sja1105: Fix
      double delivery of TX timestamps to socket error queue" [0] which did
      not manage to convince public opinion (actually it didn't convince me
      either).
      
      This fixes PTP timestamping on one particular board, where the DSA
      switch is sja1105 and the master is gianfar. Unfortunately there is no
      way to make the fix more general without committing logical
      inaccuracies: the SKBTX_IN_PROGRESS flag does serve a purpose, even if
      the sja1105 driver is not using it now: it prevents delivering a SW
      timestamp to the app socket when the HW timestamp will be provided. So
      not setting this flag (the approach from v1) might create avoidable
      complications in the future (not to mention that there isn't any
      satisfactory explanation on why that would be the correct solution).
      
      So the goal of this change set is to create a more strict framework for
      DSA master devices when attached to PTP switches, and to fix the first
      master driver that is overstepping its duties and is delivering
      unsolicited TX timestamps.
      
      [0]: https://www.spinics.net/lists/netdev/msg619699.html
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Deny PTP on master if switch supports it · f685e609
      Committed by Vladimir Oltean
      It is possible to kill PTP on a DSA switch completely and absolutely,
      until a reboot, with a simple command:
      
      tcpdump -i eth2 -j adapter_unsynced
      
      where eth2 is the switch's DSA master.
      
      Why? Well, in short, the PTP API in place today is a bit rudimentary and
      relies on applications to retrieve the TX timestamps by polling the
      error queue and looking at the cmsg structure. But there is no timestamp
      identification of any sort (except whether it's HW or SW), you don't
      know how many more timestamps are there to come, which one is this one,
      from whom it is, etc. In other words, the SO_TIMESTAMPING API is
      fundamentally limited in that you can get a single HW timestamp from the
      stack.
      
      And the "-j adapter_unsynced" flag of tcpdump enables hardware
      timestamping.
      
      So let's imagine what happens when the DSA master decides it wants to
      deliver TX timestamps to the skb's socket too:
      - The timestamp that the user space sees is taken by the DSA master.
        Whereas the RX timestamp will eventually be overwritten by the DSA
        switch. So the RX and TX timestamps will be in different time bases
        (aka garbage).
      - The user space applications have no way to deal with the second (real)
        TX timestamp finally delivered by the DSA switch, or even to know to
        wait for it.
      
      Take ptp4l from the linuxptp project, for example. This is its behavior
      after running tcpdump, before the patch:
      
      ptp4l[172]: [6469.594] Unexpected data on socket err queue:
      ptp4l[172]: [6469.693] rms    8 max   16 freq -21257 +/-  11 delay   748 +/-   0
      ptp4l[172]: [6469.711] Unexpected data on socket err queue:
      ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 05 00 fd
      ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: [6469.721] Unexpected data on socket err queue:
      ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
      ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 01 c6 b1 00 fd
      ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: [6469.838] Unexpected data on socket err queue:
      ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
      ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 06 00 fd
      ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: [6469.848] Unexpected data on socket err queue:
      ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 13 02
      ptp4l[172]: 0010 00 36 00 00 02 00 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 04 1a 45 05 7f
      ptp4l[172]: 0030 00 00 5e 05 41 32 27 c2 1a 68 00 04 9f ff fe 05
      ptp4l[172]: 0040 de 06 00 01
      ptp4l[172]: [6469.855] Unexpected data on socket err queue:
      ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
      ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 01 c6 b2 00 fd
      ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: [6469.974] Unexpected data on socket err queue:
      ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
      ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
      ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 07 00 fd
      ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
      
      The ptp4l program itself is heavily patched to show this (more details
      here [0]). Otherwise, by default it just hangs.
      
      On the other hand, with the DSA patch to disallow HW timestamping
      applied:
      
      tcpdump -i eth2 -j adapter_unsynced
      tcpdump: SIOCSHWTSTAMP failed: Device or resource busy
      
      So it is a fact of life that PTP timestamping on the DSA master is
      incompatible with timestamping on the switch MAC, at least with the
      current API. And if the switch supports PTP, taking the timestamps from
      the switch MAC is highly preferable anyway, due to the fact that those
      don't contain the queuing latencies of the switch. So just disallow PTP
      on the DSA master if there is any PTP-capable switch attached.
      
      [0]: https://sourceforge.net/p/linuxptp/mailman/message/36880648/
      
      Fixes: 0336369d ("net: dsa: forward hardware timestamping ioctls to switch driver")
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
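The policy in the commit above (refuse hardware timestamping on the DSA master whenever a PTP-capable switch is attached, which userspace sees as "Device or resource busy") can be modeled in a few lines. This is a minimal sketch, assuming a simplified data model: `struct switch_model`, `struct master_model` and `master_hwtstamp_request()` are hypothetical names, not kernel structures or functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one switch attached to a DSA master. */
struct switch_model {
	bool supports_ptp;	/* switch can timestamp in its own MAC */
};

/* Hypothetical model of the master and the switches attached to it. */
struct master_model {
	const struct switch_model *switches;
	size_t n_switches;
};

/*
 * Decide whether a hardware timestamping request on the master may proceed.
 * Returns 0 if allowed, -EBUSY if any attached switch is PTP capable.
 */
static int master_hwtstamp_request(const struct master_model *m)
{
	assert(m != NULL);

	for (size_t i = 0; i < m->n_switches; i++) {
		if (m->switches[i].supports_ptp)
			return -EBUSY;
	}
	return 0;
}
```

Returning an error up front matches the reasoning in the commit message: since applications cannot tell two TX timestamps apart, the request is rejected rather than delivering a second timestamp later.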
    • gianfar: Fix TX timestamping with a stacked DSA driver · c26a2c2d
      Committed by Vladimir Oltean
      The driver wrongly assumes that it is the only entity that can set the
      SKBTX_IN_PROGRESS bit of the current skb. Therefore, in the
      gfar_clean_tx_ring function, where the TX timestamp is collected if
      necessary, the aforementioned bit is used to discriminate whether or not
      the TX timestamp should be delivered to the socket's error queue.
      
      But a stacked driver such as a DSA switch can also set the
      SKBTX_IN_PROGRESS bit, which is actually exactly what it should do in
      order to denote that the hardware timestamping process is under way.
      
      Therefore, gianfar would misinterpret the "in progress" bit as being its
      own, and deliver a second skb clone in the socket's error queue,
      completely throwing off a PTP process which is not expecting to receive
      it, _even though_ TX timestamping is not enabled for gianfar.
      
      There have been discussions [0] as to whether non-MAC drivers need or
      not to set SKBTX_IN_PROGRESS at all (whose purpose is to avoid sending 2
      timestamps, a sw and a hw one, to applications which only expect one).
      But as of this patch, there are at least 2 PTP drivers that would break
      in conjunction with gianfar: the sja1105 DSA switch and the felix
      switch, by way of its ocelot core driver.
      
      So regardless of that conclusion, fix the gianfar driver to not do stuff
      based on flags set by others and not intended for it.
      
      [0]: https://www.spinics.net/lists/netdev/msg619699.html
      
      Fixes: f0ee7acf ("gianfar: Add hardware TX timestamping support")
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/wan/fsl_ucc_hdlc: remove set but not used variables 'ut_info' and 'ret' · 270fe2ce
      Committed by Chen Zhou
      Fixes gcc '-Wunused-but-set-variable' warning:

      drivers/net/wan/fsl_ucc_hdlc.c: In function ucc_hdlc_irq_handler:
      drivers/net/wan/fsl_ucc_hdlc.c:643:23:
      	warning: variable ut_info set but not used [-Wunused-but-set-variable]
      drivers/net/wan/fsl_ucc_hdlc.c: In function uhdlc_suspend:
      drivers/net/wan/fsl_ucc_hdlc.c:880:23:
      	warning: variable ut_info set but not used [-Wunused-but-set-variable]
      drivers/net/wan/fsl_ucc_hdlc.c: In function uhdlc_resume:
      drivers/net/wan/fsl_ucc_hdlc.c:925:6:
      	warning: variable ret set but not used [-Wunused-but-set-variable]
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 28 Dec, 2019 27 commits
    • Merge branch 'ethtool-netlink-part-one' · 1b3b289f
      Committed by David S. Miller
      Michal Kubecek says:
      
      ====================
      ethtool netlink interface, part 1
      
      This is the first part of a netlink-based alternative userspace interface
      for ethtool. It aims to address some long-known issues with the ioctl
      interface, mainly lack of extensibility, raciness, limited error reporting
      and absence of notifications. The goal is to allow the userspace ethtool
      utility to provide all the features it currently does but without using
      the ioctl interface. However, some features provided by the ethtool ioctl
      API will be available through other netlink interfaces (rtnetlink,
      devlink) if that is more appropriate.
      
      The interface uses generic netlink family "ethtool" and provides multicast
      group "monitor" which is used for notifications. Documentation for the
      interface is in Documentation/networking/ethtool-netlink.rst file. The
      netlink interface is optional, it is built when CONFIG_ETHTOOL_NETLINK
      (bool) option is enabled.
      
      There are three types of request messages distinguished by suffix "_GET"
      (query for information), "_SET" (modify parameters) and "_ACT" (perform an
      action). Kernel reply messages have name with additional suffix "_REPLY"
      (e.g. ETHTOOL_MSG_SETTINGS_GET_REPLY). Most "_SET" and "_ACT" message types
      do not have matching reply type as only some of them need additional reply
      data beyond numeric error code and extack. Kernel also broadcasts
      notification messages ("_NTF" suffix) on changes.
      
      Basic concepts:
      
      - make extensions easier not only by allowing new attributes but also by
        imposing as few artificial limits as possible, e.g. by using arbitrary
        size bit sets for most bitmap attributes or by not using fixed size
        strings
      - use extack for error reporting and warnings
      - send netlink notifications on changes (even if they were done using the
        ioctl interface) and actions
      - avoid the racy read/modify/write cycle between kernel and userspace by
        sending only attributes which userspace wants to change; there is still
        a read/modify/write cycle between generic kernel code and ethtool_ops
        handler in NIC driver but it is only in kernel and under RTNL lock
      - reduce the number of name lists that need to be kept in sync between
        kernel and userspace (e.g. recognized link modes)
      - where feasible, allow dump requests to query specific information for all
        network devices
      - as parsing and generating netlink messages is more complicated than
        simply copying data structures between userspace API and ethtool_ops
        handlers (which most ioctl commands do), split the code into multiple
        files in net/ethtool directory; move net/core/ethtool.c also to this
        directory and rename it to ioctl.c
      
      Changes between v8 and v9:
      
      - fix ethnl_update_u8()
      - fix description of ETHTOOL_A_LINKSTATE_LINK in rst file
      - add explanation of verbose vs. compact bitset usage to documentation
      - link ethtool-netlink.rst into toctree
      
      Main changes between v7 and v8:
      
      - preliminary patches sent as a separate series (already in net-next)
      - split notification related changes out of _SET patches
      - drop request specific flags from common header
      - use FLAG/flag rather than GFLAG/gflag for global flags (as there are
        only global flags now)
      - allow device names up to ALTIFNAMSIZ characters
      - rename ETHTOOL_A_BITSET_LIST to ETHTOOL_A_BITSET_NOMASK
      - rename ETHTOOL_A_BIT{,S}_* to ETHTOOL_A_BITSET_BIT{,S}_*
      - use standard bitset helpers for link modes (rather than in-place
        conversion)
      - use "default" rather than "standard" for unified _GET handlers
      - fixed 64-bit big endian bitset code
      
      Main changes between v6 and v7:
      
      - split complex messages into small single purpose ones (drop info and
        request masks and one level of nesting)
      - separate request information and reply data into two structures
      - refactor bitset handling (no simultaneous u32/ulong handling but avoid
        kmalloc() except for long bitmaps on 64-bit big endian architectures)
      - use only fixed size strings internally (will be replaced by char *
        eventually but that will require rewriting also existing ioctl code)
      - rework ethnl_update_* helpers to return error code
      - rename request flag constants (to ETHTOOL_[GR]FLAG_ prefix)
      - convert documentation to rst
      
      Main changes between v5 and v6:
      
      - use ETHTOOL_MSG_ prefix for message types
      - replace ETHA_ prefix for netlink attributes by ETHTOOL_A_
      - replace ETH_x_IM_y for infomask bits by ETHTOOL_IM_x_y
      - split GET reply types from SET requests and notifications
      - split kernel and userspace message types into different enums
      - remove INFO_GET requests from submitted part
      - drop EVENT notifications (use rtnetlink and on-demand string set load)
      - reorganize patches to reduce the number of intermittent warnings
      - unify request/reply header and its processing
      - another nest around strings in a string set for consistency
      - more consistent identifier naming
      - coding style cleanup
      - get rid of some of the helpers
      - set bad attribute in extack where applicable
      - various bug fixes
      - improve documentation and code comments, more kerneldoc comments
      - more verbose commit messages
      
      Changes between v4 and v5:
      
      - do not panic on failed initialization, only WARN()
      
      Main changes between RFC v3 and v4:
      
      - use more kerneldoc style comments
      - strict attribute policy checking
      - use macros for tables of link mode names and parameters
      - provide permanent hardware address in rtnetlink
      - coding style cleanup
      - split too long patches, reorder
      - wrap more ETHA_SETTINGS_* attributes in nests
      - add also some SET_* implementation into submitted part
      
      Main changes between RFC v2 and RFC v3:
      
      - do not allow building as a module (no netdev notifiers needed)
      - drop some obsolete fields
      - add permanent hw address, timestamping and private flags support
      - rework bitset handling to get rid of variable length arrays
      - notify monitor on device renames
      - restructure GET_SETTINGS/SET_SETTINGS messages
      - split too long patches and submit only first part of the series
      
      Main changes between RFC v1 and RFC v2:
      
      - support dumps for all "get" requests
      - provide notifications for changes related to supported request types
      - support getting string sets (both global and per device)
      - support getting/setting device features
      - get rid of family specific header, everything passed as attributes
      - split netlink code into multiple files in net/ethtool/ directory
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: provide link state with LINKSTATE_GET request · 3d2b847f
      Committed by Michal Kubecek
      Implement LINKSTATE_GET netlink request to get link state information.
      
      At the moment, only link up flag as provided by ETHTOOL_GLINK ioctl command
      is returned.
      
      LINKSTATE_GET request can be used with NLM_F_DUMP (without device
      identification) to request the information for all devices in current
      network namespace providing the data.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: add LINKMODES_NTF notification · 1b1b1847
      Committed by Michal Kubecek
      Send ETHTOOL_MSG_LINKMODES_NTF notification message whenever device link
      settings or advertised modes are modified using ETHTOOL_MSG_LINKMODES_SET
      netlink message or ETHTOOL_SLINKSETTINGS or ETHTOOL_SSET ioctl commands.
      
      The notification message has the same format as reply to LINKMODES_GET
      request. ETHTOOL_MSG_LINKMODES_SET netlink request only triggers the
      notification if there is a change but the ioctl command handlers do not
      check if there is an actual change and trigger the notification whenever
      the commands are executed.
      
      As all work is done by ethnl_default_notify() handler and callback
      functions introduced to handle LINKMODES_GET requests, all that remains is
      adding entries for ETHTOOL_MSG_LINKMODES_NTF into ethnl_notify_handlers and
      ethnl_default_notify_ops lookup tables and calls to ethtool_notify() where
      needed.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
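The commit above distinguishes two behaviours: the netlink SET request notifies only when something actually changed, while the ioctl handlers notify on every call. A minimal sketch of that difference, assuming a single u8 setting: `struct link_state`, `update_u8()`, `netlink_set_port()` and `ioctl_set_port()` are hypothetical helpers written for illustration, and the notification is modeled as a simple counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device state; the counter stands in for sent notifications. */
struct link_state {
	uint8_t port;
	unsigned int notifications;
};

/*
 * Update *dst from an optional attribute. A NULL attribute means the
 * request did not carry it. *mod is set only when the value really changes.
 */
static void update_u8(uint8_t *dst, const uint8_t *attr, bool *mod)
{
	assert(dst != NULL && mod != NULL);
	if (attr == NULL || *attr == *dst)
		return;
	*dst = *attr;
	*mod = true;
}

/* Netlink-style set: notify only if something was modified. */
static void netlink_set_port(struct link_state *st, const uint8_t *port_attr)
{
	bool mod = false;

	update_u8(&st->port, port_attr, &mod);
	if (mod)
		st->notifications++;
}

/* ioctl-style set: no change tracking, so every call notifies. */
static void ioctl_set_port(struct link_state *st, uint8_t port)
{
	st->port = port;
	st->notifications++;
}
```

Setting the port to its current value through `netlink_set_port()` leaves the counter unchanged, whereas the same call through `ioctl_set_port()` increments it.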
    • ethtool: set link modes related data with LINKMODES_SET request · bfbcfe20
      Committed by Michal Kubecek
      Implement LINKMODES_SET netlink request to set advertised linkmodes and
      related attributes as ETHTOOL_SLINKSETTINGS and ETHTOOL_SSET commands do.
      
      The request allows setting autonegotiation flag, speed, duplex and
      advertised link modes.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: provide link mode information with LINKMODES_GET request · f625aa9b
      Committed by Michal Kubecek
      Implement LINKMODES_GET netlink request to get link modes related
      information provided by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl
      commands.
      
      This request provides supported, advertised and peer advertised link modes,
      autonegotiation flag, speed and duplex.
      
      LINKMODES_GET request can be used with NLM_F_DUMP (without device
      identification) to request the information for all devices in current
      network namespace providing the data.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: add LINKINFO_NTF notification · 73286734
      Committed by Michal Kubecek
      Send ETHTOOL_MSG_LINKINFO_NTF notification message whenever device link
      settings are modified using ETHTOOL_MSG_LINKINFO_SET netlink message or
      ETHTOOL_SLINKSETTINGS or ETHTOOL_SSET ioctl commands.
      
      The notification message has the same format as reply to LINKINFO_GET
      request. ETHTOOL_MSG_LINKINFO_SET netlink request only triggers the
      notification if there is a change but the ioctl command handlers do not
      check if there is an actual change and trigger the notification whenever
      the commands are executed.
      
      As all work is done by ethnl_default_notify() handler and callback
      functions introduced to handle LINKINFO_GET requests, all that remains is
      adding entries for ETHTOOL_MSG_LINKINFO_NTF into ethnl_notify_handlers and
      ethnl_default_notify_ops lookup tables and calls to ethtool_notify() where
      needed.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: add default notification handler · 5cf2a548
      Committed by Michal Kubecek
      The ethtool netlink notifications have the same format as related GET
      replies so that if generic GET handling framework is used to process GET
      requests, its callbacks and instance of struct get_request_ops can be
      also used to compose corresponding notification message.
      
      Provide function ethnl_std_notify() to be used as notification handler in
      ethnl_notify_handlers table.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: set link settings with LINKINFO_SET request · a53f3d41
      Committed by Michal Kubecek
      Implement LINKINFO_SET netlink request to set link settings queried by
      LINKINFO_GET message.
      
      Only physical port, phy MDIO address and MDI(-X) control can be set,
      attempt to modify MDI(-X) status and transceiver is rejected.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a53f3d41
    • M
      ethtool: provide link settings with LINKINFO_GET request · 459e0b81
      Michal Kubecek committed
      Implement LINKINFO_GET netlink request to get basic link settings provided
      by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl commands.
      
      This request provides settings not directly related to autonegotiation and
      link mode selection: physical port, phy MDIO address, MDI(-X) status,
      MDI(-X) control and transceiver.
      
      LINKINFO_GET request can be used with NLM_F_DUMP (without device
      identification) to request the information for all devices in current
      network namespace providing the data.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      459e0b81
    • M
      ethtool: provide string sets with STRSET_GET request · 71921690
      Michal Kubecek committed
      Requests the contents of one or more string sets, i.e. indexed arrays of
      strings; this information is provided by ETHTOOL_GSSET_INFO and
      ETHTOOL_GSTRINGS commands of the ioctl interface. Unlike the ioctl interface,
      all information can be retrieved with one request and multiple string sets
      can be requested at once.
      
      There are three types of requests:
      
        - no NLM_F_DUMP, no device: get "global" stringsets
        - no NLM_F_DUMP, with device: get string sets related to the device
        - NLM_F_DUMP, no device: get device related string sets for all devices
      
      Client can request either all string sets of given type (global or device
      related) or only specific sets. With ETHTOOL_A_STRSET_COUNTS flag set, only
      set sizes (numbers of strings) are returned.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      71921690
    • M
      ethtool: default handlers for GET requests · 728480f1
      Michal Kubecek committed
      Significant part of GET request processing is common for most request
      types but unfortunately it cannot be easily separated from type specific
      code as we need to alternate between common actions (parsing common request
      header, allocating message and filling netlink/genetlink headers etc.) and
      specific actions (querying the device, composing the reply). The processing
      also happens in three different situations: "do" request, "dump" request
      and notification, each doing things in slightly different way.
      
      The request specific code is implemented in four or five callbacks defined
      in an instance of struct get_request_ops:
      
        parse_request() - parse incoming message
        prepare_data()  - retrieve data from driver or NIC
        reply_size()    - estimate reply message size
        fill_reply()    - compose reply message
        cleanup_data()  - (optional) clean up additional data
      
      Other members of struct get_request_ops describe the data structure holding
      information from client request and data used to compose the message. The
      default handlers ethnl_default_doit(), ethnl_default_dumpit(),
      ethnl_default_start() and ethnl_default_done() can be then used in genl_ops
      handler. Notification handler will be introduced in a later patch.
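
      The callback split described above can be sketched in plain userspace C.
      This is only an illustration under stated assumptions: the names
      get_ops_sketch, req_sketch, reply_sketch, default_doit_sketch() and the
      demo_* callbacks are hypothetical and are not the kernel's definitions;
      the real struct get_request_ops has more members and works on netlink
      messages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace model, not the kernel's struct get_request_ops. */
struct req_sketch {
	const char *dev_name;		/* filled in by parse_request() */
};

struct reply_sketch {
	char buf[64];
	size_t len;
};

struct get_ops_sketch {
	int (*parse_request)(struct req_sketch *req, const char *msg);
	int (*prepare_data)(const struct req_sketch *req, int *data);
	size_t (*reply_size)(int data);
	int (*fill_reply)(struct reply_sketch *rep, int data);
	void (*cleanup_data)(int *data);	/* optional, may be NULL */
};

/* Common "do" flow: generic steps interleaved with type-specific callbacks. */
static int default_doit_sketch(const struct get_ops_sketch *ops,
			       const char *msg, struct reply_sketch *rep)
{
	struct req_sketch req = { NULL };
	int data = 0;
	int ret;

	assert(ops && rep);
	ret = ops->parse_request(&req, msg);
	if (ret < 0)
		return ret;
	ret = ops->prepare_data(&req, &data);
	if (ret < 0)
		return ret;
	if (ops->reply_size(data) > sizeof(rep->buf))
		ret = -1;		/* reply would not fit */
	else
		ret = ops->fill_reply(rep, data);
	if (ops->cleanup_data)
		ops->cleanup_data(&data);
	return ret;
}

/* One request type: reports the length of the device name. */
static int demo_parse(struct req_sketch *req, const char *msg)
{
	if (!msg || !*msg)
		return -1;
	req->dev_name = msg;
	return 0;
}

static int demo_prepare(const struct req_sketch *req, int *data)
{
	*data = (int)strlen(req->dev_name);
	return 0;
}

static size_t demo_reply_size(int data)
{
	(void)data;
	return 16;			/* upper bound for "len=<int>" */
}

static int demo_fill(struct reply_sketch *rep, int data)
{
	int n = snprintf(rep->buf, sizeof(rep->buf), "len=%d", data);

	if (n < 0 || (size_t)n >= sizeof(rep->buf))
		return -1;
	rep->len = (size_t)n;
	return 0;
}

static const struct get_ops_sketch demo_ops = {
	.parse_request = demo_parse,
	.prepare_data = demo_prepare,
	.reply_size = demo_reply_size,
	.fill_reply = demo_fill,
	/* .cleanup_data left NULL: nothing to release */
};
```

      Calling default_doit_sketch(&demo_ops, "eth0", &rep) walks all four
      mandatory callbacks in order; adding a new request type only means
      providing another ops instance, which is the point of the design.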
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      728480f1
    • M
      ethtool: support for netlink notifications · 6b08d6c1
      Michal Kubecek committed
      Add infrastructure for ethtool netlink notifications. There is only one
      multicast group "monitor" which is used to notify userspace about changes
      and actions performed. Notification messages (types using suffix _NTF)
      share the format with replies to GET requests.
      
      Notifications are supposed to be broadcast on every configuration change,
      whether it is done using the netlink interface or the ioctl one. Netlink SET
      requests only trigger a notification if some data is actually changed.
      
      To trigger an ethtool notification, both ethtool netlink and external code
      use ethtool_notify() helper. This helper requires RTNL to be held and may
      sleep. Handlers sending messages for specific notification message types
      are registered in ethnl_notify_handlers array. As notifications can be
      triggered from other code, ethnl_ok flag is used to prevent an attempt to
      send notification before genetlink family is registered.
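
      A minimal userspace sketch of this dispatch scheme, assuming a handler
      table indexed by message type and a boolean guard playing the role of
      ethnl_ok. All identifiers here (notify_sketch(), notify_handlers,
      family_registered, MSG_*_SKETCH) are hypothetical; RTNL locking and the
      actual multicast send are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message type numbers for the sketch. */
enum {
	MSG_LINKINFO_NTF_SKETCH = 1,
	MSG_MAX_SKETCH = 2,
};

typedef void (*notify_handler_t)(const char *dev, unsigned int cmd);

static bool family_registered;	/* plays the role of the ethnl_ok flag */
static unsigned int sent_count;	/* counts notifications "sent" */

static void linkinfo_notify_sketch(const char *dev, unsigned int cmd)
{
	(void)dev;
	(void)cmd;
	sent_count++;		/* a real handler would compose and send a message */
}

/* Handlers for specific notification types; unset slots stay NULL. */
static const notify_handler_t notify_handlers[MSG_MAX_SKETCH + 1] = {
	[MSG_LINKINFO_NTF_SKETCH] = linkinfo_notify_sketch,
};

/* Entry point used by both the netlink code and external callers. */
static void notify_sketch(const char *dev, unsigned int cmd)
{
	assert(dev);
	if (!family_registered)
		return;		/* too early: family not registered yet */
	if (cmd > MSG_MAX_SKETCH || !notify_handlers[cmd])
		return;		/* no handler for this message type */
	notify_handlers[cmd](dev, cmd);
}
```

      Because the guard is checked first, code that runs before registration
      can call notify_sketch() unconditionally and the call is simply dropped.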
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b08d6c1
    • M
      ethtool: netlink bitset handling · 10b518d4
      Michal Kubecek committed
      The ethtool netlink code uses common framework for passing arbitrary
      length bit sets to allow future extensions. A bitset can be a list (only
      one bitmap) or can consist of a value and mask pair (used e.g. when a client
      wants to modify only some bits). A bitset can use one of two formats:
      verbose (bit by bit) or compact.
      
      Verbose format consists of bitset size (number of bits), list flag and
      an array of bit nests, telling which bits are part of the list or which
      bits are in the mask and which of them are to be set. In requests, bits
      can be identified by index (position) or by name. In replies, kernel
      provides both index and name. Verbose format is suitable for "one shot"
      applications like standard ethtool command as it avoids the need to
      either keep bit names (e.g. link modes) in sync with kernel or having to
      add an extra roundtrip for string set request (e.g. for private flags).
      
      Compact format uses one (list) or two (value/mask) arrays of 32-bit
      words to store the bitmap(s). It is more suitable for long running
      applications (ethtool in monitor mode or network management daemons)
      which can retrieve the names once and then pass only compact bitmaps to
      save space.
      
      Userspace requests can use either format; ETHTOOL_FLAG_COMPACT_BITSETS
      flag in request header tells kernel which format to use in reply.
      Notifications always use compact format.
      
      As some code uses arrays of unsigned long for internal representation and
      some arrays of u32 (or even a single u32), two sets of parse/compose
      helpers are introduced. To avoid code duplication, helpers for unsigned
      long arrays are implemented as wrappers around helpers for u32 arrays.
      There are two reasons for this choice: (1) u32 arrays are more frequent in
      ethtool code and (2) an unsigned long array can always be interpreted as a
      u32 array on little endian 64-bit and all 32-bit architectures while we
      would need special handling for odd number of u32 words in the opposite
      direction.
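
      To make the compact value/mask form concrete, here is a small sketch of
      applying such a pair to a bitmap stored as 32-bit words. The helper
      names bitmap32_words() and bitmap32_update() are hypothetical and this
      is not the kernel's parser; it only shows the arithmetic
      new = (old & ~mask) | (value & mask), with bits beyond the bitset size
      ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 32-bit words needed to hold nbits bits. */
static size_t bitmap32_words(size_t nbits)
{
	return (nbits + 31) / 32;
}

/*
 * Apply a value/mask pair to bitmap: only bits set in mask are modified.
 * Returns true if at least one bit actually changed.
 */
static bool bitmap32_update(uint32_t *bitmap, size_t nbits,
			    const uint32_t *value, const uint32_t *mask)
{
	size_t nwords = bitmap32_words(nbits);
	bool mod = false;
	size_t i;

	assert(bitmap && value && mask);
	for (i = 0; i < nwords; i++) {
		uint32_t word_mask = mask[i];
		uint32_t updated;

		/* Ignore mask bits past the end of the bitset. */
		if (i == nwords - 1 && nbits % 32)
			word_mask &= (UINT32_C(1) << (nbits % 32)) - 1;
		updated = (bitmap[i] & ~word_mask) | (value[i] & word_mask);
		if (updated != bitmap[i]) {
			bitmap[i] = updated;
			mod = true;
		}
	}
	return mod;
}
```

      The boolean result is what lets a SET-style caller decide whether
      anything changed at all, for example before sending a notification.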
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10b518d4
    • M
      ethtool: helper functions for netlink interface · 041b1c5d
      Michal Kubecek committed
      Add common request/reply header definition and helpers to parse request
      header and fill reply header. Provide ethnl_update_* helpers to update
      structure members from request attributes (to be used for *_SET requests).
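
      The idea behind such update helpers can be sketched as follows. This is
      an assumption-laden illustration, not the kernel code: the name
      update_u8_sketch() is hypothetical, and a plain pointer (NULL meaning
      "attribute absent") stands in for a netlink attribute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Update *dst from an optional request attribute.
 * attr == NULL models an attribute that was not present in the request.
 * *mod is set to true only when the stored value really changes; it is
 * never cleared, so one flag can accumulate over several members.
 */
static void update_u8_sketch(uint8_t *dst, const uint8_t *attr, bool *mod)
{
	assert(dst && mod);
	if (!attr)
		return;		/* attribute absent: leave member untouched */
	if (*dst == *attr)
		return;		/* same value: nothing to do */
	*dst = *attr;
	*mod = true;
}
```

      A *_SET handler written this way can call one helper per member and
      then write settings back to the device only if the accumulated flag is
      true.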
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      041b1c5d
    • M
      ethtool: introduce ethtool netlink interface · 2b4a8990
      Michal Kubecek committed
      Basic genetlink and init infrastructure for the netlink interface, register
      genetlink family "ethtool". Add CONFIG_ETHTOOL_NETLINK Kconfig option to
      make the build optional. Add initial overall interface description into
      Documentation/networking/ethtool-netlink.rst, further patches will add more
      detailed information.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2b4a8990
    • K
      sctp: do trace_sctp_probe after SACK validation and check · 356b23c0
      Kevin Kou committed
      The function sctp_sf_eat_sack_6_2 now performs the Verification
      Tag validation, Chunk length validation, Bogus check, and also
      the detection of out-of-order SACK based on the RFC2960
      Section 6.2 at the beginning, and finally performs the further
      processing of SACK. The trace_sctp_probe is currently triggered
      before the above necessary validation and checks.
      
      This patch moves trace_sctp_probe after the chunk sanity
      tests, but keeps tracing if the received SACK is out of order,
      because an out-of-order SACK is valuable for congestion control
      debugging.
      
      v1->v2:
       - keep doing SCTP trace if the SACK is out of order, per Marcelo's
         suggestion.
      v2->v3:
       - regenerate the patch, as v2 was generated on top of v1, and add
         the 'net-next' tag to the new one, per Marcelo's comments.
      Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      356b23c0
    • N
      mv88e6xxx: Add serdes Rx statistics · 0df95287
      Nikita Yushchenko committed
      If packet checker is enabled in the serdes, then Rx counter registers
      start working, and no side effects have been detected.
      
      This patch enables packet checker automatically when powering serdes on,
      and exposes Rx counter registers via ethtool statistics interface.
      
      Code partially based on an older attempt by Andrew Lunn.
      Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0df95287
    • Y
      net: ena: remove set but not used variable 'rx_ring' · cad451dd
      YueHaibing committed
      drivers/net/ethernet/amazon/ena/ena_netdev.c: In function ena_xdp_xmit_buff:
      drivers/net/ethernet/amazon/ena/ena_netdev.c:316:19: warning:
       variable rx_ring set but not used [-Wunused-but-set-variable]
      
      commit 548c4940 ("net: ena: Implement XDP_TX action")
      left behind this unused variable.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cad451dd
    • M
      net: dsa: qca: ar9331: drop pointless static qualifier in ar9331_sw_mbus_init · c8f957df
      Mao Wenan committed
      There is no need to make the variable 'mbus' static,
      since a new value is always assigned before it is used.
      Signed-off-by: Mao Wenan <maowenan@huawei.com>
      Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c8f957df
    • X
      ppp: Remove redundant BUG_ON() check in ppp_pernet · 8a3f44a0
      Xu Wang committed
      Passing NULL to ppp_pernet causes a crash via BUG_ON.
      Dereferencing net in net_generic() also has the same effect.
      This patch removes the redundant BUG_ON check on the same parameter.
      Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8a3f44a0
    • D
      Merge branch 'tcp_cubic-various-fixes' · 36a78867
      David S. Miller committed
      Eric Dumazet says:
      
      ====================
      tcp_cubic: various fixes
      
      This patch series converts tcp_cubic to usec clock resolution
      for Hystart logic.
      
      This makes Hystart more relevant for data-center flows.
      Prior to this series, Hystart was not kicking, or was
      kicking without good reason, since the 1ms clock was too coarse.
      
      Last patch also fixes an issue with Hystart vs TCP pacing.
      
      v2: removed a last-minute debug chunk from last patch
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36a78867
    • E
      tcp_cubic: make Hystart aware of pacing · ede656e8
      Eric Dumazet committed
      For years we disabled Hystart ACK train detection at Google
      because it was fooled by TCP pacing.
      
      ACK train detection uses a simple heuristic, detecting if
      we receive ACK past half the RTT, to exit slow start before
      hitting the bottleneck and experiencing massive drops.
      
      But pacing by design might delay packets up to RTT/2,
      so we need to tweak the Hystart logic to be aware of this
      extra delay.
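
      One way to express this adjustment is sketched below: keep the half-RTT
      threshold when the flow is not paced, and use the full minimum RTT when
      pacing may already have spread packets over up to RTT/2. The function
      name ack_train_exit_sketch() and its parameters are hypothetical, and
      this is an illustration of the reasoning rather than the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether the ACK train has lasted long enough to leave slow start.
 * All times are in microseconds; pacing_active is an assumed input that
 * says whether the sender paces its packets.
 */
static bool ack_train_exit_sketch(uint32_t now_us, uint32_t round_start_us,
				  uint32_t delay_min_us, bool pacing_active)
{
	uint32_t threshold = delay_min_us;

	assert(delay_min_us > 0 && delay_min_us <= INT32_MAX);
	/* Without pacing, the classic heuristic triggers past half the RTT. */
	if (!pacing_active)
		threshold >>= 1;
	/* Signed difference tolerates wrap-around of the 32-bit clock. */
	return (int32_t)(now_us - round_start_us) > (int32_t)threshold;
}
```

      With a 1000 usec minimum RTT, an ACK arriving 600 usec into the round
      ends slow start for an unpaced flow but not for a paced one.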
      
      Tested:
       Added a 100 usec delay at receiver.
      
      Before:
      nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
         9117
         7057
         9553
         8300
         7030
         6849
         9533
        10126
         6876
         8473
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       1230               0.0
      
      After :
      nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
         9845
        10103
        10866
        11096
        11936
        11487
        11773
        12188
        11066
        11894
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       6462               0.0
      
      Disabling Hystart ACK Train detection gives similar numbers
      
      echo 2 >/sys/module/tcp_cubic/parameters/hystart_detect
      nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        11173
        10954
        12455
        10627
        11578
        11583
        11222
        10880
        10665
        11366
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ede656e8
    • E
      tcp_cubic: tweak Hystart detection for short RTT flows · 42f3a8aa
      Eric Dumazet committed
      After switching ca->delay_min to usec resolution, we exit
      slow start prematurely for very low RTT flows, setting
      snd_ssthresh to 20.
      
      The reason is that delay_min is fed with RTT of small packet
      trains. Then as cwnd is increased, TCP sends bigger TSO packets.
      
      LRO/GRO aggregation and/or interrupt mitigation strategies
      on receiver tend to inflate RTT samples.
      
      Fix this by adding to delay_min the expected delay of
      two TSO packets, given current pacing rate.
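
      As a worked example of that allowance: the time needed to send two TSO
      packets at a given pacing rate is 2 * size * 10^6 / rate microseconds.
      The helper below is a sketch with hypothetical names
      (tso_pair_delay_us(), pacing_rate_Bps, tso_bytes); the kernel's own
      helper and any clamping it applies are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Expected time, in microseconds, to transmit two TSO packets of
 * tso_bytes each at pacing_rate_Bps bytes per second.
 * Returns 0 when no pacing rate is known.
 */
static uint32_t tso_pair_delay_us(uint64_t pacing_rate_Bps, uint32_t tso_bytes)
{
	uint64_t delay;

	assert(tso_bytes > 0);
	if (!pacing_rate_Bps)
		return 0;
	delay = (2ULL * tso_bytes * 1000000ULL) / pacing_rate_Bps;
	return delay > UINT32_MAX ? UINT32_MAX : (uint32_t)delay;
}
```

      For 64 KB packets paced at 1.25 GB/s (10 Gbit/s) this gives about
      104 usec, which is comparable to a data-center RTT and shows why the
      raw delay_min was too tight.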
      
      Tested:
      
      Sender uses pfifo_fast qdisc
      
      Before :
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        11348
        11707
        11562
        11428
        11773
        11534
         9878
        11693
        10597
        10968
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       200                0.0
      
      After :
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        14877
        14517
        15797
        18466
        17376
        14833
        17558
        17933
        16039
        18059
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       1670               0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      42f3a8aa
    • E
      tcp_cubic: switch bictcp_clock() to usec resolution · cff04e2d
      Eric Dumazet committed
      Current 1ms clock feeds ca->round_start, ca->delay_min,
      ca->last_ack.
      
      This is quite problematic for data-center flows, where delay_min
      is way below 1 ms.
      
      This means Hystart Train detection triggers every time jiffies value
      is updated, since "((s32)(now - ca->round_start) > ca->delay_min >> 4)"
      expression becomes true.
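
      The effect can be reproduced in isolation. The sketch below is modeled
      on the quoted expression, using a hypothetical function name and plain
      integers instead of the kernel's structures. Whatever fixed-point
      scaling delay_min uses, any value below 16 makes the right-hand side
      zero, so the first clock tick after round_start already satisfies the
      comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Modeled on: (s32)(now - ca->round_start) > ca->delay_min >> 4
 * The threshold is cast to signed here to keep the comparison free of
 * sign-compare warnings; the sketch assumes it fits in 31 bits.
 */
static bool train_expr_sketch(uint32_t now, uint32_t round_start,
			      uint32_t delay_min)
{
	assert(delay_min <= INT32_MAX);
	return (int32_t)(now - round_start) > (int32_t)(delay_min >> 4);
}
```

      With a coarse clock, now - round_start stays at 0 and then jumps to 1,
      which is why detection fires on the tick itself rather than on any
      property of the ACK train.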
      
      This kind of random behavior can be solved by reusing the existing
      usec timestamp that TCP keeps in tp->tcp_mstamp
      
      Note that a followup patch will tweak things a bit, because
      during slow start, GRO aggregation on receivers naturally
      increases the RTT as TSO packets gradually come to ~64KB size.
      
      To recap, right after this patch CUBIC Hystart train detection
      is more aggressive, since short RTT flows might exit slow start at
      cwnd = 20, instead of being possibly unbounded.
      
      Following patch will address this problem.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cff04e2d
    • E
      tcp_cubic: remove one conditional from hystart_update() · 35821fc2
      Eric Dumazet committed
      If we initialize ca->curr_rtt to ~0U, we do not need to test
      for zero value in hystart_update()
      
      We only read ca->curr_rtt if at least HYSTART_MIN_SAMPLES have
      been processed, and thus ca->curr_rtt will have a sane value.
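
      The sentinel trick can be shown with a small stand-alone sketch. The
      struct and function names (rtt_sketch, rtt_reset(), rtt_sample()) are
      hypothetical; only the idea of starting the running minimum at ~0U is
      taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rtt_sketch {
	uint32_t curr_rtt;	/* minimum RTT seen in the current round */
	uint32_t sample_cnt;	/* number of samples in the current round */
};

/* Start a round: ~0U is larger than or equal to any real sample. */
static void rtt_reset(struct rtt_sketch *s)
{
	assert(s != NULL);
	s->curr_rtt = ~0U;
	s->sample_cnt = 0;
}

/* Track the minimum with a single comparison, no "== 0" special case. */
static void rtt_sample(struct rtt_sketch *s, uint32_t delay)
{
	assert(s != NULL);
	if (s->curr_rtt > delay)
		s->curr_rtt = delay;
	s->sample_cnt++;
}
```

      The value is only meaningful once enough samples have been taken, so a
      reader of curr_rtt should check sample_cnt first, as the commit message
      notes for HYSTART_MIN_SAMPLES.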
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35821fc2
    • E
      tcp_cubic: optimize hystart_update() · 473900a5
      Eric Dumazet committed
      We do not care which bit in ca->found is set.
      
      We avoid accessing hystart and hystart_detect unless really needed,
      possibly avoiding one cache line miss.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      473900a5
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 2bbc078f
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2019-12-27
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      We've added 127 non-merge commits during the last 17 day(s) which contain
      a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).
      
      There are three merge conflicts. Conflicts and resolutions look as follows:
      
      1) Merge conflict in net/bpf/test_run.c:
      
      There was a tree-wide cleanup c593642c ("treewide: Use sizeof_field() macro")
      which gets in the way with b590cb5f ("bpf: Switch to offsetofend in
      BPF_PROG_TEST_RUN"):
      
        <<<<<<< HEAD
                if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
                                   sizeof_field(struct __sk_buff, priority),
        =======
                if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
        >>>>>>> 7c8dce4b
      
      There are a few occasions that look similar to this. Always take the chunk with
      offsetofend(). Note that there is one where the fields differ in here:
      
        <<<<<<< HEAD
                if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
                                   sizeof_field(struct __sk_buff, tstamp),
        =======
                if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
        >>>>>>> 7c8dce4b
      
      Just take the one with offsetofend() /and/ gso_segs. Latter is correct due to
      850a88cc ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").
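
      Both sides of the first conflict compute the same offset, which the
      following stand-alone sketch illustrates. The macros carry a _sketch
      suffix because they are local stand-ins, and struct skb_sketch is a
      hypothetical layout loosely modeled on a few __sk_buff fields, not the
      real structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the kernel macros of similar names. */
#define sizeof_field_sketch(TYPE, MEMBER)	sizeof(((TYPE *)0)->MEMBER)
#define offsetofend_sketch(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof_field_sketch(TYPE, MEMBER))

/* Hypothetical layout, for illustration only. */
struct skb_sketch {
	uint32_t len;
	uint32_t priority;
	uint64_t tstamp;
	uint32_t wire_len;
	uint32_t gso_segs;
};

/* True if bytes [from, to) of buf are all zero. */
static bool range_is_zero_sketch(const void *buf, size_t from, size_t to)
{
	const unsigned char *p = buf;
	size_t i;

	assert(from <= to);
	for (i = from; i < to; i++)
		if (p[i])
			return false;
	return true;
}
```

      In this layout offsetofend_sketch(struct skb_sketch, priority) equals
      offsetof(struct skb_sketch, tstamp), so a range check that starts there
      covers exactly the bytes after priority.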
      
      2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:
      
      (I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)
      
        <<<<<<< HEAD
                if (is_13b_check(off, insn))
                        return -1;
                emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
        =======
                emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
        >>>>>>> 7c8dce4b
      
      Result should look like:
      
                emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);
      
      3) Merge conflict in arch/riscv/include/asm/pgtable.h:
      
        <<<<<<< HEAD
        =======
        #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
        #define VMALLOC_END      (PAGE_OFFSET - 1)
        #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
      
        #define BPF_JIT_REGION_SIZE     (SZ_128M)
        #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
        #define BPF_JIT_REGION_END      (VMALLOC_END)
      
        /*
         * Roughly size the vmemmap space to be large enough to fit enough
         * struct pages to map half the virtual address space. Then
         * position vmemmap directly below the VMALLOC region.
         */
        #define VMEMMAP_SHIFT \
                (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
        #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
        #define VMEMMAP_END     (VMALLOC_START - 1)
        #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)
      
        #define vmemmap         ((struct page *)VMEMMAP_START)
      
        >>>>>>> 7c8dce4b
      
      Only take the BPF_* defines from there and move them higher up in the
      same file. Remove the rest from the chunk. The VMALLOC_* etc defines
      got moved via 01f52e16 ("riscv: define vmemmap before pfn_to_page
      calls"). Result:
      
        [...]
        #define __S101  PAGE_READ_EXEC
        #define __S110  PAGE_SHARED_EXEC
        #define __S111  PAGE_SHARED_EXEC
      
        #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
        #define VMALLOC_END      (PAGE_OFFSET - 1)
        #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
      
        #define BPF_JIT_REGION_SIZE     (SZ_128M)
        #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
        #define BPF_JIT_REGION_END      (VMALLOC_END)
      
        /*
         * Roughly size the vmemmap space to be large enough to fit enough
         * struct pages to map half the virtual address space. Then
         * position vmemmap directly below the VMALLOC region.
         */
        #define VMEMMAP_SHIFT \
                (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
        #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
        #define VMEMMAP_END     (VMALLOC_START - 1)
        #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)
      
        [...]
      
      Let me know if there are any other issues.
      
      Anyway, the main changes are:
      
      1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
         to a provided BPF object file. This provides an alternative, simplified API
         compared to standard libbpf interaction. Also, add libbpf extern variable
         resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.
      
      2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
         generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
         add various BPF riscv JIT improvements, from Björn Töpel.
      
      3) Extend bpftool to allow matching BPF programs and maps by name,
         from Paul Chaignon.
      
      4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
         flag for allowing updates without service interruption, from Andrey Ignatov.
      
      5) Cleanup and simplification of ring access functions for AF_XDP with a
         bonus of 0-5% performance improvement, from Magnus Karlsson.
      
      6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
         audit support for BPF, from Daniel Borkmann and latter with Jiri Olsa.
      
      7) Move and extend test_select_reuseport into BPF program tests under
         BPF selftests, from Jakub Sitnicki.
      
      8) Various BPF sample improvements for xdpsock for customizing parameters
         to set up and benchmark AF_XDP, from Jay Jayatheerthan.
      
      9) Improve libbpf to provide a ulimit hint on permission denied errors.
         Also change XDP sample programs to attach in driver mode by default,
         from Toke Høiland-Jørgensen.
      
      10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
          programs, from Nikita V. Shirokov.
      
      11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.
      
      12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
          libbpf conversion, from Jesper Dangaard Brouer.
      
      13) Minor misc improvements from various others.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2bbc078f