1. 07 5月, 2020 8 次提交
    • P
      net: flow_offload: skip hw stats check for FLOW_ACTION_HW_STATS_DONT_CARE · 16f80360
      Pablo Neira Ayuso 提交于
      This patch adds FLOW_ACTION_HW_STATS_DONT_CARE which tells the driver
      that the frontend does not need counters, this hw stats type request
      never fails. The FLOW_ACTION_HW_STATS_DISABLED type explicitly requests
      the driver to disable the stats, however, if the driver cannot disable
      counters, it bails out.
      
      TCA_ACT_HW_STATS_* maintains the 1:1 mapping with FLOW_ACTION_HW_STATS_*
      except by disabled which is mapped to FLOW_ACTION_HW_STATS_DISABLED
      (this is 0 in tc). Add tc_act_hw_stats() to perform the mapping between
      TCA_ACT_HW_STATS_* and FLOW_ACTION_HW_STATS_*.
      
      Fixes: 319a1d19 ("flow_offload: check for basic action hw stats type")
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      16f80360
    • O
      ethtool: provide UAPI for PHY master/slave configuration. · bdbdac76
      Oleksij Rempel 提交于
      This UAPI is needed for BroadR-Reach 100BASE-T1 devices. Due to lack of
      auto-negotiation support, we needed to be able to configure the
      MASTER-SLAVE role of the port manually or from an application in user
      space.
      
      The same UAPI can be used for 1000BASE-T or MultiGBASE-T devices to
      force MASTER or SLAVE role. See IEEE 802.3-2018:
      22.2.4.3.7 MASTER-SLAVE control register (Register 9)
      22.2.4.3.8 MASTER-SLAVE status register (Register 10)
      40.5.2 MASTER-SLAVE configuration resolution
      45.2.1.185.1 MASTER-SLAVE config value (1.2100.14)
      45.2.7.10 MultiGBASE-T AN control 1 register (Register 7.32)
      
      The MASTER-SLAVE role affects the clock configuration:
      
      -------------------------------------------------------------------------------
      When the  PHY is configured as MASTER, the PMA Transmit function shall
      source TX_TCLK from a local clock source. When configured as SLAVE, the
      PMA Transmit function shall source TX_TCLK from the clock recovered from
      data stream provided by MASTER.
      
      iMX6Q                     KSZ9031                XXX
      ------\                /-----------\        /------------\
            |                |           |        |            |
       MAC  |<----RGMII----->| PHY Slave |<------>| PHY Master |
            |<--- 125 MHz ---+-<------/  |        | \          |
      ------/                \-----------/        \------------/
                                                     ^
                                                      \-TX_TCLK
      
      -------------------------------------------------------------------------------
      
      Since some clock or link related issues are only reproducible in a
      specific MASTER-SLAVE-role, MAC and PHY configuration, it is beneficial
      to provide generic (not 100BASE-T1 specific) interface to the user space
      for configuration flexibility and trouble shooting.
      Signed-off-by: NOleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bdbdac76
    • E
      tcp: refine tcp_pacing_delay() for very low pacing rates · 8dc242ad
      Eric Dumazet 提交于
      With the addition of horizon feature to sch_fq, we noticed some
      suboptimal behavior of extremely low pacing rate TCP flows, especially
      when TCP is not aware of a drop happening in lower stacks.
      
      Back in commit 3f80e08f ("tcp: add tcp_reset_xmit_timer() helper"),
      tcp_pacing_delay() was added to estimate an extra delay to add to standard
      rto timers.
      
      This patch removes the skb argument from this helper and
      tcp_reset_xmit_timer() because it makes more sense to simply
      consider the time at which next packet is allowed to be sent,
      instead of the time of whatever packet has been sent.
      
      This avoids arming RTO timer too soon and removes
      spurious horizon drops.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8dc242ad
    • W
      net: stricter validation of untrusted gso packets · 9274124f
      Willem de Bruijn 提交于
      Syzkaller again found a path to a kernel crash through bad gso input:
      a packet with transport header extending beyond skb_headlen(skb).
      
      Tighten validation at kernel entry:
      
      - Verify that the transport header lies within the linear section.
      
          To avoid pulling linux/tcp.h, verify just sizeof tcphdr.
          tcp_gso_segment will call pskb_may_pull (th->doff * 4) before use.
      
      - Match the gso_type against the ip_proto found by the flow dissector.
      
      Fixes: bfd5f4a3 ("packet: Add GSO/csum offload support.")
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9274124f
    • V
      net: dsa: ocelot: the MAC table on Felix is twice as large · 21ce7f3e
      Vladimir Oltean 提交于
      When running 'bridge fdb dump' on Felix, sometimes learnt and static MAC
      addresses would appear, sometimes they wouldn't.
      
      Turns out, the MAC table has 4096 entries on VSC7514 (Ocelot) and 8192
      entries on VSC9959 (Felix), so the existing code from the Ocelot common
      library only dumped half of Felix's MAC table. They are both organized
      as a 4-way set-associative TCAM, so we just need a single variable
      indicating the correct number of rows.
      
      Fixes: 56051948 ("net: dsa: ocelot: add driver for Felix switch family")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      21ce7f3e
    • H
      timer: add fsleep for flexible sleeping · c6af13d3
      Heiner Kallweit 提交于
      Sleeping for a certain amount of time requires use of different
      functions, depending on the time period.
      Documentation/timers/timers-howto.rst explains when to use which
      function, and also checkpatch checks for some potentially
      problematic cases.
      
      So let's create a helper that automatically chooses the appropriate
      sleep function -> fsleep(), for flexible sleeping
      
      If the delay is a constant, then the compiler should be able to ensure
      that the new helper doesn't create overhead. If the delay is not
      constant, then the new helper can save some code.
      Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c6af13d3
    • F
      ipv6: Implement draft-ietf-6man-rfc4941bis · 969c5464
      Fernando Gont 提交于
      Implement the upcoming rev of RFC4941 (IPv6 temporary addresses):
      https://tools.ietf.org/html/draft-ietf-6man-rfc4941bis-09
      
      * Reduces the default Valid Lifetime to 2 days
        The number of extra addresses employed when Valid Lifetime was
        7 days exacerbated the stress caused on network
        elements/devices. Additionally, the motivation for temporary
        addresses is indeed privacy and reduced exposure. With a
        default Valid Lifetime of 7 days, an address that becomes
        revealed by active communication is reachable and exposed for
        one whole week. The only use case for a Valid Lifetime of 7
        days could be some application that is expecting to have long
        lived connections. But if you want to have a long lived
        connections, you shouldn't be using a temporary address in the
        first place. Additionally, in the era of mobile devices, general
        applications should nevertheless be prepared and robust to
        address changes (e.g. nodes swap wifi <-> 4G, etc.)
      
      * Employs different IIDs for different prefixes
        To avoid network activity correlation among addresses configured
        for different prefixes
      
      * Uses a simpler algorithm for IID generation
        No need to store "history" anywhere
      Signed-off-by: NFernando Gont <fgont@si6networks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      969c5464
    • M
      net: phy: add concept of shared storage for PHYs · 63490847
      Michael Walle 提交于
      There are packages which contain multiple PHY devices, eg. a quad PHY
      transceiver. Provide functions to allocate and free shared storage.
      
      Usually, a quad PHY contains global registers, which don't belong to any
      PHY. Provide convenience functions to access these registers.
      Signed-off-by: NMichael Walle <michael@walle.cc>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63490847
  2. 06 5月, 2020 2 次提交
  3. 05 5月, 2020 7 次提交
    • C
      bonding: remove useless stats_lock_key · e7511f56
      Cong Wang 提交于
      After commit b3e80d44
      ("bonding: fix lockdep warning in bond_get_stats()") the dynamic
      key is no longer necessary, as we compute nest level at run-time.
      So, we can just remove it to save some lockdep key entries.
      
      Test commands:
       ip link add bond0 type bond
       ip link add bond1 type bond
       ip link set bond0 master bond1
       ip link set bond0 nomaster
       ip link set bond1 master bond0
      
      Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7511f56
    • C
      net: partially revert dynamic lockdep key changes · 1a33e10e
      Cong Wang 提交于
      This patch reverts the folowing commits:
      
      commit 064ff66e
      "bonding: add missing netdev_update_lockdep_key()"
      
      commit 53d37497
      "net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()"
      
      commit 1f26c0d3
      "net: fix kernel-doc warning in <linux/netdevice.h>"
      
      commit ab92d68f
      "net: core: add generic lockdep keys"
      
      but keeps the addr_list_lock_key because we still lock
      addr_list_lock nestedly on stack devices, unlikely xmit_lock
      this is safe because we don't take addr_list_lock on any fast
      path.
      
      Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1a33e10e
    • E
      net_sched: sch_fq: add horizon attribute · 39d01050
      Eric Dumazet 提交于
      QUIC servers would like to use SO_TXTIME, without having CAP_NET_ADMIN,
      to efficiently pace UDP packets.
      
      As far as sch_fq is concerned, we need to add safety checks, so
      that a buggy application does not fill the qdisc with packets
      having delivery time far in the future.
      
      This patch adds a configurable horizon (default: 10 seconds),
      and a configurable policy when a packet is beyond the horizon
      at enqueue() time:
      - either drop the packet (default policy)
      - or cap its delivery time to the horizon.
      
      $ tc -s -d qd sh dev eth0
      qdisc fq 8022: root refcnt 257 limit 10000p flow_limit 100p buckets 1024
       orphan_mask 1023 quantum 10Kb initial_quantum 51160b low_rate_threshold 550Kbit
       refill_delay 40.0ms timer_slack 10.000us horizon 10.000s
       Sent 1234215879 bytes 837099 pkt (dropped 21, overlimits 0 requeues 6)
       backlog 0b 0p requeues 6
        flows 1191 (inactive 1177 throttled 0)
        gc 0 highprio 0 throttled 692 latency 11.480us
        pkts_too_long 0 alloc_errors 0 horizon_drops 21 horizon_caps 0
      
      v2: fixed an overflow on 32bit kernels in fq_init(), reported
          by kbuild test robot <lkp@intel.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      39d01050
    • C
      net_sched: fix tcm_parent in tc filter dump · a7df4870
      Cong Wang 提交于
      When we tell kernel to dump filters from root (ffff:ffff),
      those filters on ingress (ffff:0000) are matched, but their
      true parents must be dumped as they are. However, kernel
      dumps just whatever we tell it, that is either ffff:ffff
      or ffff:0000:
      
       $ nl-cls-list --dev=dummy0 --parent=root
       cls basic dev dummy0 id none parent root prio 49152 protocol ip match-all
       cls basic dev dummy0 id :1 parent root prio 49152 protocol ip match-all
       $ nl-cls-list --dev=dummy0 --parent=ffff:
       cls basic dev dummy0 id none parent ffff: prio 49152 protocol ip match-all
       cls basic dev dummy0 id :1 parent ffff: prio 49152 protocol ip match-all
      
      This is confusing and misleading, more importantly this is
      a regression since 4.15, so the old behavior must be restored.
      
      And, when tc filters are installed on a tc class, the parent
      should be the classid, rather than the qdisc handle. Commit
      edf6711c ("net: sched: remove classid and q fields from tcf_proto")
      removed the classid we save for filters, we can just restore
      this classid in tcf_block.
      
      Steps to reproduce this:
       ip li set dev dummy0 up
       tc qd add dev dummy0 ingress
       tc filter add dev dummy0 parent ffff: protocol arp basic action pass
       tc filter show dev dummy0 root
      
      Before this patch:
       filter protocol arp pref 49152 basic
       filter protocol arp pref 49152 basic handle 0x1
      	action order 1: gact action pass
      	 random type none pass val 0
      	 index 1 ref 1 bind 1
      
      After this patch:
       filter parent ffff: protocol arp pref 49152 basic
       filter parent ffff: protocol arp pref 49152 basic handle 0x1
       	action order 1: gact action pass
       	 random type none pass val 0
      	 index 1 ref 1 bind 1
      
      Fixes: a10fa201 ("net: sched: propagate q and parent from caller down to tcf_fill_node")
      Fixes: edf6711c ("net: sched: remove classid and q fields from tcf_proto")
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a7df4870
    • H
      net: add helper eth_hw_addr_crc · b86cd700
      Heiner Kallweit 提交于
      Several drivers use the same code as basis for filter hashes. Therefore
      let's factor it out to a helper. This way drivers don't have to access
      struct netdev_hw_addr internals.
      Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b86cd700
    • G
      uapi: revert flexible-array conversions · 1e6e9d0f
      Gustavo A. R. Silva 提交于
      These structures can get embedded in other structures in user-space
      and cause all sorts of warnings and problems. So, we better don't take
      any chances and keep the zero-length arrays in place for now.
      Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      1e6e9d0f
    • L
      gcc-10 warnings: fix low-hanging fruit · 9d82973e
      Linus Torvalds 提交于
      Due to a bug-report that was compiler-dependent, I updated one of my
      machines to gcc-10.  That shows a lot of new warnings.  Happily they
      seem to be mostly the valid kind, but it's going to cause a round of
      churn for getting rid of them..
      
      This is the really low-hanging fruit of removing a couple of zero-sized
      arrays in some core code.  We have had a round of these patches before,
      and we'll have many more coming, and there is nothing special about
      these except that they were particularly trivial, and triggered more
      warnings than most.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d82973e
  4. 03 5月, 2020 2 次提交
  5. 02 5月, 2020 7 次提交
    • P
      net: schedule: add action gate offloading · d29bdd69
      Po Liu 提交于
      Add the gate action to the flow action entry. Add the gate parameters to
      the tc_setup_flow_action() queueing to the entries of flow_action_entry
      array provide to the driver.
      Signed-off-by: NPo Liu <Po.Liu@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d29bdd69
    • P
      net: qos: introduce a gate control flow action · a51c328d
      Po Liu 提交于
      Introduce a ingress frame gate control flow action.
      Tc gate action does the work like this:
      Assume there is a gate allow specified ingress frames can be passed at
      specific time slot, and be dropped at specific time slot. Tc filter
      chooses the ingress frames, and tc gate action would specify what slot
      does these frames can be passed to device and what time slot would be
      dropped.
      Tc gate action would provide an entry list to tell how much time gate
      keep open and how much time gate keep state close. Gate action also
      assign a start time to tell when the entry list start. Then driver would
      repeat the gate entry list cyclically.
      For the software simulation, gate action requires the user assign a time
      clock type.
      
      Below is the setting example in user space. Tc filter a stream source ip
      address is 192.168.0.20 and gate action own two time slots. One is last
      200ms gate open let frame pass another is last 100ms gate close let
      frames dropped. When the ingress frames have reach total frames over
      8000000 bytes, the excessive frames will be dropped in that 200000000ns
      time slot.
      
      > tc qdisc add dev eth0 ingress
      
      > tc filter add dev eth0 parent ffff: protocol ip \
      	   flower src_ip 192.168.0.20 \
      	   action gate index 2 clockid CLOCK_TAI \
      	   sched-entry open 200000000 -1 8000000 \
      	   sched-entry close 100000000 -1 -1
      
      > tc chain del dev eth0 ingress chain 0
      
      "sched-entry" follow the name taprio style. Gate state is
      "open"/"close". Follow with period nanosecond. Then next item is internal
      priority value means which ingress queue should put. "-1" means
      wildcard. The last value optional specifies the maximum number of
      MSDU octets that are permitted to pass the gate during the specified
      time interval.
      Base-time is not set will be 0 as default, as result start time would
      be ((N + 1) * cycletime) which is the minimal of future time.
      
      Below example shows filtering a stream with destination mac address is
      10:00:80:00:00:00 and ip type is ICMP, follow the action gate. The gate
      action would run with one close time slot which means always keep close.
      The time cycle is total 200000000ns. The base-time would calculate by:
      
       1357000000000 + (N + 1) * cycletime
      
      When the total value is the future time, it will be the start time.
      The cycletime here would be 200000000ns for this case.
      
      > tc filter add dev eth0 parent ffff:  protocol ip \
      	   flower skip_hw ip_proto icmp dst_mac 10:00:80:00:00:00 \
      	   action gate index 12 base-time 1357000000000 \
      	   sched-entry close 200000000 -1 -1 \
      	   clockid CLOCK_TAI
      Signed-off-by: NPo Liu <Po.Liu@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a51c328d
    • C
      net: Replace the limit of TCP_LINGER2 with TCP_FIN_TIMEOUT_MAX · f0628c52
      Cambda Zhu 提交于
      This patch changes the behavior of TCP_LINGER2 about its limit. The
      sysctl_tcp_fin_timeout used to be the limit of TCP_LINGER2 but now it's
      only the default value. A new macro named TCP_FIN_TIMEOUT_MAX is added
      as the limit of TCP_LINGER2, which is 2 minutes.
      
      Since TCP_LINGER2 used sysctl_tcp_fin_timeout as the default value
      and the limit in the past, the system administrator cannot set the
      default value for most of sockets and let some sockets have a greater
      timeout. It might be a mistake that let the sysctl to be the limit of
      the TCP_LINGER2. Maybe we can add a new sysctl to set the max of
      TCP_LINGER2, but FIN-WAIT-2 timeout is usually no need to be too long
      and 2 minutes are legal considering TCP specs.
      
      Changes in v3:
      - Remove the new socket option and change the TCP_LINGER2 behavior so
        that the timeout can be set to value between sysctl_tcp_fin_timeout
        and 2 minutes.
      
      Changes in v2:
      - Add int overflow check for the new socket option.
      
      Changes in v1:
      - Add a new socket option to set timeout greater than
        sysctl_tcp_fin_timeout.
      Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f0628c52
    • D
      ipv6: Use global sernum for dst validation with nexthop objects · 8f34e53b
      David Ahern 提交于
      Nik reported a bug with pcpu dst cache when nexthop objects are
      used illustrated by the following:
          $ ip netns add foo
          $ ip -netns foo li set lo up
          $ ip -netns foo addr add 2001:db8:11::1/128 dev lo
          $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1
          $ ip li add veth1 type veth peer name veth2
          $ ip li set veth1 up
          $ ip addr add 2001:db8:10::1/64 dev veth1
          $ ip li set dev veth2 netns foo
          $ ip -netns foo li set veth2 up
          $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2
          $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1
          $ ip -6 route add 2001:db8:11::1/128 nhid 100
      
          Create a pcpu entry on cpu 0:
          $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
      
          Re-add the route entry:
          $ ip -6 ro del 2001:db8:11::1
          $ ip -6 route add 2001:db8:11::1/128 nhid 100
      
          Route get on cpu 0 returns the stale pcpu:
          $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
          RTNETLINK answers: Network is unreachable
      
          While cpu 1 works:
          $ taskset -a -c 1 ip -6 route get 2001:db8:11::1
          2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium
      
      Conversion of FIB entries to work with external nexthop objects
      missed an important difference between IPv4 and IPv6 - how dst
      entries are invalidated when the FIB changes. IPv4 has a per-network
      namespace generation id (rt_genid) that is bumped on changes to the FIB.
      Checking if a dst_entry is still valid means comparing rt_genid in the
      rtable to the current value of rt_genid for the namespace.
      
      IPv6 also has a per network namespace counter, fib6_sernum, but the
      count is saved per fib6_node. With the per-node counter only dst_entries
      based on fib entries under the node are invalidated when changes are
      made to the routes - limiting the scope of invalidations. IPv6 uses a
      reference in the rt6_info, 'from', to track the corresponding fib entry
      used to create the dst_entry. When validating a dst_entry, the 'from'
      is used to backtrack to the fib6_node and check the sernum of it to the
      cookie passed to the dst_check operation.
      
      With the inline format (nexthop definition inline with the fib6_info),
      dst_entries cached in the fib6_nh have a 1:1 correlation between fib
      entries, nexthop data and dst_entries. With external nexthops, IPv6
      looks more like IPv4 which means multiple fib entries across disparate
      fib6_nodes can all reference the same fib6_nh. That means validation
      of dst_entries based on external nexthops needs to use the IPv4 format
      - the per-network namespace counter.
      
      Add sernum to rt6_info and set it when creating a pcpu dst entry. Update
      rt6_get_cookie to return sernum if it is set and update dst_check for
      IPv6 to look for sernum set and based the check on it if so. Finally,
      rt6_get_pcpu_route needs to validate the cached entry before returning
      a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and
      __mkroute_output for IPv4).
      
      This problem only affects routes using the new, external nexthops.
      
      Thanks to the kbuild test robot for catching the IS_ENABLED needed
      around rt_genid_ipv6 before I sent this out.
      
      Fixes: 5b98324e ("ipv6: Allow routes to use nexthop objects")
      Reported-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid Ahern <dsahern@kernel.org>
      Reviewed-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Tested-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8f34e53b
    • S
      bpf: Bpf_{g,s}etsockopt for struct bpf_sock_addr · beecf11b
      Stanislav Fomichev 提交于
      Currently, bpf_getsockopt and bpf_setsockopt helpers operate on the
      'struct bpf_sock_ops' context in BPF_PROG_TYPE_SOCK_OPS program.
      Let's generalize them and make them available for 'struct bpf_sock_addr'.
      That way, in the future, we can allow those helpers in more places.
      
      As an example, let's expose those 'struct bpf_sock_addr' based helpers to
      BPF_CGROUP_INET{4,6}_CONNECT hooks. That way we can override CC before the
      connection is made.
      
      v3:
      * Expose custom helpers for bpf_sock_addr context instead of doing
        generic bpf_sock argument (as suggested by Daniel). Even with
        try_socket_lock that doesn't sleep we have a problem where context sk
        is already locked and socket lock is non-nestable.
      
      v2:
      * s/BPF_PROG_TYPE_CGROUP_SOCKOPT/BPF_PROG_TYPE_SOCK_OPS/
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200430233152.199403-1-sdf@google.com
      beecf11b
    • M
      docs: networking: convert x25-iface.txt to ReST · 883780af
      Mauro Carvalho Chehab 提交于
      Not much to be done here:
      
      - add SPDX header;
      - adjust title markup;
      - remove a tail whitespace;
      - add to networking/index.rst.
      Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      883780af
    • S
      bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS · d46edd67
      Song Liu 提交于
      Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
      Typical userspace tools use kernel.bpf_stats_enabled as follows:
      
        1. Enable kernel.bpf_stats_enabled;
        2. Check program run_time_ns;
        3. Sleep for the monitoring period;
        4. Check program run_time_ns again, calculate the difference;
        5. Disable kernel.bpf_stats_enabled.
      
      The problem with this approach is that only one userspace tool can toggle
      this sysctl. If multiple tools toggle the sysctl at the same time, the
      measurement may be inaccurate.
      
      To fix this problem while keep backward compatibility, introduce a new
      bpf command BPF_ENABLE_STATS. On success, this command enables stats and
      returns a valid fd. BPF_ENABLE_STATS takes argument "type". Currently,
      only one type, BPF_STATS_RUN_TIME, is supported. We can extend the
      command to support other types of stats in the future.
      
      With BPF_ENABLE_STATS, user space tool would have the following flow:
      
        1. Get a fd with BPF_ENABLE_STATS, and make sure it is valid;
        2. Check program run_time_ns;
        3. Sleep for the monitoring period;
        4. Check program run_time_ns again, calculate the difference;
        5. Close the fd.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200430071506.1408910-2-songliubraving@fb.com
      d46edd67
  6. 01 5月, 2020 14 次提交