1. 07 May, 2020 3 commits
    • net: flow_offload: skip hw stats check for FLOW_ACTION_HW_STATS_DONT_CARE · 16f80360
      Pablo Neira Ayuso committed
      This patch adds FLOW_ACTION_HW_STATS_DONT_CARE, which tells the driver
      that the frontend does not need counters; a request with this hw stats
      type never fails. The FLOW_ACTION_HW_STATS_DISABLED type explicitly
      requests that the driver disable the stats; if the driver cannot
      disable counters, it bails out.
      
      TCA_ACT_HW_STATS_* maintains the 1:1 mapping with FLOW_ACTION_HW_STATS_*
      except for disabled, which is mapped to FLOW_ACTION_HW_STATS_DISABLED
      (this is 0 in tc). Add tc_act_hw_stats() to perform the mapping between
      TCA_ACT_HW_STATS_* and FLOW_ACTION_HW_STATS_*.
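The mapping described above can be sketched as follows. This is a minimal Python model rather than the kernel's C helper: the numeric bit values are assumptions chosen for illustration, and raising `ValueError` on out-of-range input is a choice made by this sketch only.

```python
# Assumed tc-side values: "disabled" is 0, the other types are bit flags.
TCA_ACT_HW_STATS_IMMEDIATE = 1 << 0
TCA_ACT_HW_STATS_DELAYED = 1 << 1
TCA_ACT_HW_STATS_ANY = TCA_ACT_HW_STATS_IMMEDIATE | TCA_ACT_HW_STATS_DELAYED

# Assumed flow_offload-side values: 0 means "don't care", disabled gets its own bit.
FLOW_ACTION_HW_STATS_DONT_CARE = 0
FLOW_ACTION_HW_STATS_IMMEDIATE = 1 << 0
FLOW_ACTION_HW_STATS_DELAYED = 1 << 1
FLOW_ACTION_HW_STATS_ANY = FLOW_ACTION_HW_STATS_IMMEDIATE | FLOW_ACTION_HW_STATS_DELAYED
FLOW_ACTION_HW_STATS_DISABLED = 1 << 2


def tc_act_hw_stats(hw_stats: int) -> int:
    """Map a TCA_ACT_HW_STATS_* value to a FLOW_ACTION_HW_STATS_* value."""
    if not 0 <= hw_stats <= TCA_ACT_HW_STATS_ANY:
        # Sketch-only behaviour; the real helper may handle this differently.
        raise ValueError(f"unknown hw_stats value: {hw_stats}")
    if hw_stats == 0:
        # tc uses 0 for "disabled"; flow_offload has an explicit type for it.
        return FLOW_ACTION_HW_STATS_DISABLED
    # Every other value maps 1:1.
    return hw_stats
```

The point of the special case is that 0 means different things on each side: "disabled" in tc, "don't care" in flow_offload.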
      
      Fixes: 319a1d19 ("flow_offload: check for basic action hw stats type")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: refine tcp_pacing_delay() for very low pacing rates · 8dc242ad
      Eric Dumazet committed
      With the addition of the horizon feature to sch_fq, we noticed some
      suboptimal behavior of extremely low pacing rate TCP flows, especially
      when TCP is not aware of a drop happening in lower stacks.
      
      Back in commit 3f80e08f ("tcp: add tcp_reset_xmit_timer() helper"),
      tcp_pacing_delay() was added to estimate an extra delay to add to standard
      rto timers.
      
      This patch removes the skb argument from this helper and
      tcp_reset_xmit_timer() because it makes more sense to simply
      consider the time at which the next packet is allowed to be sent,
      instead of the time of whatever packet has been sent.
      
      This avoids arming RTO timer too soon and removes
      spurious horizon drops.
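The idea of basing the extra delay on the next allowed send time can be sketched like this. The function and parameter names are hypothetical and the units (nanoseconds) are an assumption of the sketch; the kernel helper works on its own socket fields and timer units.

```python
def pacing_delay_ns(next_send_time_ns: int, now_ns: int) -> int:
    """Extra delay until the next packet may be sent; never negative."""
    return max(0, next_send_time_ns - now_ns)


def rto_deadline_ns(now_ns: int, rto_ns: int, next_send_time_ns: int) -> int:
    """Deadline of the retransmit timer: the standard RTO plus the pacing delay."""
    return now_ns + rto_ns + pacing_delay_ns(next_send_time_ns, now_ns)
```

With a very low pacing rate the next send time can be far in the future, so the timer is pushed out accordingly instead of firing before the packet has even left the queue.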
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Implement draft-ietf-6man-rfc4941bis · 969c5464
      Fernando Gont committed
      Implement the upcoming rev of RFC4941 (IPv6 temporary addresses):
      https://tools.ietf.org/html/draft-ietf-6man-rfc4941bis-09
      
      * Reduces the default Valid Lifetime to 2 days
        The number of extra addresses employed when Valid Lifetime was
        7 days exacerbated the stress caused on network
        elements/devices. Additionally, the motivation for temporary
        addresses is indeed privacy and reduced exposure. With a
        default Valid Lifetime of 7 days, an address that becomes
        revealed by active communication is reachable and exposed for
        one whole week. The only use case for a Valid Lifetime of 7
        days could be some application that is expecting to have
        long-lived connections. But if you want to have long-lived
        connections, you shouldn't be using a temporary address in the
        first place. Additionally, in the era of mobile devices, general
        applications should nevertheless be prepared and robust to
        address changes (e.g. nodes swap wifi <-> 4G, etc.)
      
      * Employs different IIDs for different prefixes
        To avoid network activity correlation among addresses configured
        for different prefixes
      
      * Uses a simpler algorithm for IID generation
        No need to store "history" anywhere
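A history-free, per-prefix interface identifier (IID) can be sketched as below. This is only an illustration of the approach: the function names are hypothetical, the /64 prefix length is assumed, and the reserved-IID check covers just two well-known reserved ranges rather than the full list a real implementation would consult.

```python
import ipaddress
import secrets

# Reserved subnet anycast range (assumed bounds, partial reserved-IID list).
RESERVED_ANYCAST_LOW = 0xFDFFFFFFFFFFFF80
RESERVED_ANYCAST_HIGH = 0xFDFFFFFFFFFFFFFF


def is_reserved_iid(iid: int) -> bool:
    """True for the all-zero IID and the reserved subnet anycast range."""
    return iid == 0 or RESERVED_ANYCAST_LOW <= iid <= RESERVED_ANYCAST_HIGH


def temporary_address(prefix: str) -> ipaddress.IPv6Address:
    """Draw a fresh random 64-bit IID for this prefix; nothing is stored."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("this sketch assumes /64 prefixes")
    while True:
        iid = secrets.randbits(64)
        if not is_reserved_iid(iid):
            return ipaddress.IPv6Address(int(net.network_address) | iid)
```

Calling `temporary_address()` once per prefix gives each prefix its own unrelated IID, which is what prevents correlation across prefixes.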
      Signed-off-by: Fernando Gont <fgont@si6networks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 06 May, 2020 1 commit
    • erspan: Add type I version 0 support. · f989d546
      William Tu committed
      The Type I ERSPAN frame format is based on the barebones
      IP + GRE(4-byte) encapsulation on top of the raw mirrored frame.
      Both type I and II use 0x88BE as protocol type. Unlike type II
      and III, no sequence number or key is required.
      To create a type I erspan tunnel device:
        $ ip link add dev erspan11 type erspan \
                  local 172.16.1.100 remote 172.16.1.200 \
                  erspan_ver 0
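The 4-byte GRE header mentioned above can be illustrated with a short sketch. It assumes the base GRE layout of a 16-bit flags/version field followed by a 16-bit protocol type, with no optional fields because type I carries no sequence number or key; the function name is hypothetical.

```python
import struct

ERSPAN_PROTO_TYPE = 0x88BE  # protocol type used by ERSPAN type I and II


def erspan_type1_gre_header() -> bytes:
    """Base GRE header: no flags set, so no key or sequence number follows."""
    flags_and_version = 0x0000
    return struct.pack("!HH", flags_and_version, ERSPAN_PROTO_TYPE)
```

The mirrored frame follows these four bytes directly, which is what makes the type I encapsulation so compact.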
      Signed-off-by: William Tu <u9012063@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 05 May, 2020 2 commits
    • bonding: remove useless stats_lock_key · e7511f56
      Cong Wang committed
      After commit b3e80d44
      ("bonding: fix lockdep warning in bond_get_stats()") the dynamic
      key is no longer necessary, as we compute nest level at run-time.
      So, we can just remove it to save some lockdep key entries.
      
      Test commands:
       ip link add bond0 type bond
       ip link add bond1 type bond
       ip link set bond0 master bond1
       ip link set bond0 nomaster
       ip link set bond1 master bond0
      
      Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: fix tcm_parent in tc filter dump · a7df4870
      Cong Wang committed
      When we tell kernel to dump filters from root (ffff:ffff),
      those filters on ingress (ffff:0000) are matched, but their
      true parents must be dumped as they are. However, kernel
      dumps just whatever we tell it, that is either ffff:ffff
      or ffff:0000:
      
       $ nl-cls-list --dev=dummy0 --parent=root
       cls basic dev dummy0 id none parent root prio 49152 protocol ip match-all
       cls basic dev dummy0 id :1 parent root prio 49152 protocol ip match-all
       $ nl-cls-list --dev=dummy0 --parent=ffff:
       cls basic dev dummy0 id none parent ffff: prio 49152 protocol ip match-all
       cls basic dev dummy0 id :1 parent ffff: prio 49152 protocol ip match-all
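For readers unfamiliar with the notation, the handles above are 32-bit values written as `major:minor` in hexadecimal. The helpers below are a hypothetical sketch of that encoding and are not part of tc or libnl; tools usually omit a zero minor when printing, so `ffff:` and `ffff:0000` denote the same value.

```python
def parse_tc_handle(text: str) -> int:
    """Parse 'major:minor' (hex, either side may be empty) into a 32-bit handle."""
    major, _, minor = text.partition(":")
    return (int(major or "0", 16) << 16) | int(minor or "0", 16)


def format_tc_handle(handle: int) -> str:
    """Format a 32-bit handle as zero-padded 'major:minor' hex."""
    major = (handle >> 16) & 0xFFFF
    minor = handle & 0xFFFF
    return f"{major:04x}:{minor:04x}"
```

Root (`ffff:ffff`) and ingress (`ffff:0000`) share the same major number, which is why a dump from root also matches filters attached to ingress.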
      
      This is confusing and misleading; more importantly, this is
      a regression since 4.15, so the old behavior must be restored.
      
      And, when tc filters are installed on a tc class, the parent
      should be the classid, rather than the qdisc handle. Commit
      edf6711c ("net: sched: remove classid and q fields from tcf_proto")
      removed the classid we save for filters, we can just restore
      this classid in tcf_block.
      
      Steps to reproduce this:
       ip li set dev dummy0 up
       tc qd add dev dummy0 ingress
       tc filter add dev dummy0 parent ffff: protocol arp basic action pass
       tc filter show dev dummy0 root
      
      Before this patch:
       filter protocol arp pref 49152 basic
       filter protocol arp pref 49152 basic handle 0x1
      	action order 1: gact action pass
      	 random type none pass val 0
      	 index 1 ref 1 bind 1
      
      After this patch:
       filter parent ffff: protocol arp pref 49152 basic
       filter parent ffff: protocol arp pref 49152 basic handle 0x1
       	action order 1: gact action pass
       	 random type none pass val 0
      	 index 1 ref 1 bind 1
      
      Fixes: a10fa201 ("net: sched: propagate q and parent from caller down to tcf_fill_node")
      Fixes: edf6711c ("net: sched: remove classid and q fields from tcf_proto")
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 02 May, 2020 4 commits
    • net: schedule: add action gate offloading · d29bdd69
      Po Liu committed
      Add the gate action to the flow action entry. tc_setup_flow_action()
      now copies the gate parameters into the entries of the
      flow_action_entry array provided to the driver.
      Signed-off-by: Po Liu <Po.Liu@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: qos: introduce a gate control flow action · a51c328d
      Po Liu committed
      Introduce an ingress frame gate control flow action.
      The tc gate action works like this:
      Assume there is a gate that lets specified ingress frames pass during
      certain time slots and drops them during others. The tc filter
      selects the ingress frames, and the tc gate action specifies in which
      time slots these frames are passed to the device and in which time
      slots they are dropped.
      The tc gate action provides an entry list that tells how long the gate
      stays open and how long it stays closed. The gate action also
      assigns a start time that tells when the entry list starts. The driver
      then repeats the gate entry list cyclically.
      For the software simulation, the gate action requires the user to
      assign a clock type.
      
      Below is a configuration example in user space. The tc filter matches a
      stream whose source IP address is 192.168.0.20, and the gate action has
      two time slots. The first lasts 200ms with the gate open, letting frames
      pass; the second lasts 100ms with the gate closed, dropping frames. Once
      the ingress frames total more than 8000000 bytes, the excess frames
      are dropped in that 200000000ns time slot.
      
      > tc qdisc add dev eth0 ingress
      
      > tc filter add dev eth0 parent ffff: protocol ip \
      	   flower src_ip 192.168.0.20 \
      	   action gate index 2 clockid CLOCK_TAI \
      	   sched-entry open 200000000 -1 8000000 \
      	   sched-entry close 100000000 -1 -1
      
      > tc chain del dev eth0 ingress chain 0
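How a cyclically repeated entry list determines the gate state at a given time can be sketched as follows. The schedule reuses the two `sched-entry` durations from the example above; the function name is hypothetical, and the sketch ignores the priority and byte-limit fields.

```python
# (state, duration in ns), taken from the two sched-entry lines above.
SCHEDULE = [("open", 200_000_000), ("close", 100_000_000)]


def gate_state(schedule, start_ns: int, now_ns: int) -> str:
    """Return the gate state at now_ns for a list that repeats from start_ns."""
    cycle_ns = sum(duration for _, duration in schedule)
    offset = (now_ns - start_ns) % cycle_ns  # position inside the current cycle
    for state, duration in schedule:
        if offset < duration:
            return state
        offset -= duration
    raise AssertionError("unreachable: offset is always inside the cycle")
```

With this schedule the cycle is 300ms long: frames pass during the first 200ms of each cycle and are dropped during the remaining 100ms.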
      
      "sched-entry" follow the name taprio style. Gate state is
      "open"/"close". Follow with period nanosecond. Then next item is internal
      priority value, which selects the ingress queue to use. "-1" means
      wildcard. The last value is optional and specifies the maximum number
      of MSDU octets that are permitted to pass the gate during the specified
      time interval.
      If base-time is not set, it defaults to 0; as a result, the start time
      is ((N + 1) * cycletime), the nearest such time in the future.
      
      The example below filters a stream whose destination MAC address is
      10:00:80:00:00:00 and whose IP protocol is ICMP, followed by the gate
      action. The gate action runs with a single close time slot, which means
      the gate always stays closed. The cycle time is 200000000ns in total.
      The start time is calculated as:

       1357000000000 + (N + 1) * cycletime

      The first such value that lies in the future becomes the start time.
      The cycletime here is 200000000ns.
      
      > tc filter add dev eth0 parent ffff:  protocol ip \
      	   flower skip_hw ip_proto icmp dst_mac 10:00:80:00:00:00 \
      	   action gate index 12 base-time 1357000000000 \
      	   sched-entry close 200000000 -1 -1 \
      	   clockid CLOCK_TAI
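The start-time rule, base-time + (N + 1) * cycletime with the smallest N that lands in the future, can be written out as a small sketch. The function name is hypothetical, and returning the base time unchanged when it already lies in the future is an assumption of this sketch that the text above does not spell out.

```python
def gate_start_time(base_ns: int, cycle_ns: int, now_ns: int) -> int:
    """Earliest time of the form base + (N + 1) * cycle that is after now."""
    if base_ns > now_ns:
        # Assumption: a base time in the future is used as-is.
        return base_ns
    n = (now_ns - base_ns) // cycle_ns  # full cycles elapsed since base time
    return base_ns + (n + 1) * cycle_ns
```

For example, with the default base time of 0 and a 200000000ns cycle, a request made at 450000000ns yields N = 2 and a start time of 600000000ns.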
      Signed-off-by: Po Liu <Po.Liu@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Replace the limit of TCP_LINGER2 with TCP_FIN_TIMEOUT_MAX · f0628c52
      Cambda Zhu committed
      This patch changes the behavior of TCP_LINGER2 about its limit. The
      sysctl_tcp_fin_timeout used to be the limit of TCP_LINGER2 but now it's
      only the default value. A new macro named TCP_FIN_TIMEOUT_MAX is added
      as the limit of TCP_LINGER2, which is 2 minutes.
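The new behavior can be modeled roughly as below. This is a simplified sketch in seconds, not the kernel code: the function name is hypothetical, the default of 60 seconds for the sysctl is an assumption, and treating 0 as "use the sysctl default" and negative values as "no FIN-WAIT-2 lingering" are also assumptions of the sketch.

```python
TCP_FIN_TIMEOUT_MAX = 120  # seconds; the 2-minute limit named in the commit


def effective_fin_timeout(linger2: int, sysctl_tcp_fin_timeout: int = 60):
    """Model of the FIN-WAIT-2 timeout chosen for a socket (seconds)."""
    if linger2 < 0:
        return None  # assumption: negative disables lingering in FIN-WAIT-2
    if linger2 == 0:
        return sysctl_tcp_fin_timeout  # sysctl is now only the default
    return min(linger2, TCP_FIN_TIMEOUT_MAX)  # capped by the new macro
```

Under this model a socket can ask for 90 seconds even when the sysctl default is 60, which was not possible while the sysctl also acted as the limit.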
      
      Since TCP_LINGER2 used sysctl_tcp_fin_timeout as both the default value
      and the limit in the past, the system administrator could not set the
      default value for most sockets while letting some sockets have a
      greater timeout. Making the sysctl the limit of TCP_LINGER2 was
      probably a mistake. We could add a new sysctl to set the maximum of
      TCP_LINGER2, but the FIN-WAIT-2 timeout usually does not need to be
      long, and 2 minutes is legal considering the TCP specs.
      
      Changes in v3:
      - Remove the new socket option and change the TCP_LINGER2 behavior so
        that the timeout can be set to value between sysctl_tcp_fin_timeout
        and 2 minutes.
      
      Changes in v2:
      - Add int overflow check for the new socket option.
      
      Changes in v1:
      - Add a new socket option to set timeout greater than
        sysctl_tcp_fin_timeout.
      Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Use global sernum for dst validation with nexthop objects · 8f34e53b
      David Ahern committed
      Nik reported a bug with pcpu dst cache when nexthop objects are
      used illustrated by the following:
          $ ip netns add foo
          $ ip -netns foo li set lo up
          $ ip -netns foo addr add 2001:db8:11::1/128 dev lo
          $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1
          $ ip li add veth1 type veth peer name veth2
          $ ip li set veth1 up
          $ ip addr add 2001:db8:10::1/64 dev veth1
          $ ip li set dev veth2 netns foo
          $ ip -netns foo li set veth2 up
          $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2
          $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1
          $ ip -6 route add 2001:db8:11::1/128 nhid 100
      
          Create a pcpu entry on cpu 0:
          $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
      
          Re-add the route entry:
          $ ip -6 ro del 2001:db8:11::1
          $ ip -6 route add 2001:db8:11::1/128 nhid 100
      
          Route get on cpu 0 returns the stale pcpu:
          $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
          RTNETLINK answers: Network is unreachable
      
          While cpu 1 works:
          $ taskset -a -c 1 ip -6 route get 2001:db8:11::1
          2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium
      
      Conversion of FIB entries to work with external nexthop objects
      missed an important difference between IPv4 and IPv6 - how dst
      entries are invalidated when the FIB changes. IPv4 has a per-network
      namespace generation id (rt_genid) that is bumped on changes to the FIB.
      Checking if a dst_entry is still valid means comparing rt_genid in the
      rtable to the current value of rt_genid for the namespace.
      
      IPv6 also has a per network namespace counter, fib6_sernum, but the
      count is saved per fib6_node. With the per-node counter only dst_entries
      based on fib entries under the node are invalidated when changes are
      made to the routes - limiting the scope of invalidations. IPv6 uses a
      reference in the rt6_info, 'from', to track the corresponding fib entry
      used to create the dst_entry. When validating a dst_entry, the 'from'
      is used to backtrack to the fib6_node and check the sernum of it to the
      cookie passed to the dst_check operation.
      
      With the inline format (nexthop definition inline with the fib6_info),
      dst_entries cached in the fib6_nh have a 1:1 correlation between fib
      entries, nexthop data and dst_entries. With external nexthops, IPv6
      looks more like IPv4 which means multiple fib entries across disparate
      fib6_nodes can all reference the same fib6_nh. That means validation
      of dst_entries based on external nexthops needs to use the IPv4 format
      - the per-network namespace counter.
      
      Add sernum to rt6_info and set it when creating a pcpu dst entry. Update
      rt6_get_cookie to return sernum if it is set and update dst_check for
      IPv6 to look for sernum set and base the check on it if so. Finally,
      rt6_get_pcpu_route needs to validate the cached entry before returning
      a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and
      __mkroute_output for IPv4).
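The validation scheme can be illustrated with a toy model. All class and function names here are hypothetical; the sketch only shows the idea of stamping a cached per-CPU entry with a namespace-wide counter and checking that stamp before reuse, not the actual kernel data structures.

```python
class Netns:
    """Toy namespace holding one global FIB serial number."""

    def __init__(self):
        self.sernum = 1

    def fib_changed(self):
        self.sernum += 1  # any FIB change invalidates stamped entries


class PcpuDst:
    def __init__(self, ns: Netns):
        self.sernum = ns.sernum  # stamp taken when the entry is created


def get_pcpu_route(cache: dict, cpu: int, ns: Netns) -> PcpuDst:
    """Return the cached entry for this CPU, replacing it if it is stale."""
    dst = cache.get(cpu)
    if dst is None or dst.sernum != ns.sernum:
        dst = cache[cpu] = PcpuDst(ns)
    return dst
```

In the reproduction above, deleting and re-adding the route corresponds to `fib_changed()`: the next lookup on CPU 0 then discards the stale entry instead of returning it.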
      
      This problem only affects routes using the new, external nexthops.
      
      Thanks to the kbuild test robot for catching the IS_ENABLED needed
      around rt_genid_ipv6 before I sent this out.
      
      Fixes: 5b98324e ("ipv6: Allow routes to use nexthop objects")
      Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 01 May, 2020 11 commits
    • tunnel: Propagate ECT(1) when decapsulating as recommended by RFC6040 · b7237487
      Toke Høiland-Jørgensen committed
      RFC 6040 recommends propagating an ECT(1) mark from an outer tunnel header
      to the inner header if that inner header is already marked as ECT(0). When
      RFC 6040 decapsulation was implemented, this case of propagation was not
      added. This simply appears to be an oversight, so let's fix that.
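The decapsulation rules, including the ECT(1) propagation case this patch adds, can be sketched as a small function. It follows the RFC 6040 decapsulation table as I understand it and uses the standard two-bit ECN codepoints; the function name is hypothetical and `None` stands for "drop the packet".

```python
# ECN field codepoints (two bits of the IP header).
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def decapsulate_ecn(inner: int, outer: int):
    """Return the ECN value of the forwarded inner header, or None to drop."""
    if inner == NOT_ECT:
        # A CE mark cannot be conveyed to a Not-ECT inner packet.
        return None if outer == CE else NOT_ECT
    if outer == CE:
        return CE  # congestion marks always propagate
    if inner == ECT_0 and outer == ECT_1:
        return ECT_1  # the case added by this patch
    return inner
```

Before the patch, the `ECT_0` inner with `ECT_1` outer combination left the inner header unchanged.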
      
      Fixes: eccc1bb8 ("tunnel: drop packet if ECN present with not-ECT")
      Reported-by: Bob Briscoe <ietf@bobbriscoe.net>
      Reported-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
      Cc: Dave Taht <dave.taht@gmail.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: add infrastructure to expose policies to userspace · d07dcf9a
      Johannes Berg committed
      Add, and use in generic netlink, helpers to dump out a netlink
      policy to userspace, including all the range validation data,
      nested policies etc.
      
      This lets userspace discover what the kernel understands.
      
      For families/commands other than generic netlink, the helpers
      need to be used directly in an appropriate command, or we can
      add some infrastructure (a new netlink family) that those can
      register their policies with for introspection. I'm not that
      familiar with non-generic netlink, so that's left out for now.
      
      The data exposed to userspace also includes min and max length
      for binary/string data. I've done that instead of letting the
      userspace tools figure out whether min/max is intended based
      on the type so that we can extend this later in the kernel; we
      might want to just use the range data, for example.
      
      Because of this, I opted to not directly expose the NLA_*
      values, even if some of them are already exposed via BPF, as
      with min/max length we don't need to have different types here
      for NLA_BINARY/NLA_MIN_LEN/NLA_EXACT_LEN, we just make them
      all NL_ATTR_TYPE_BINARY with min/max length optionally set.
      
      Similarly, we don't really need NLA_MSECS, and perhaps can
      remove it in the future - but not if we encode it into the
      userspace API now. It gets mapped to NL_ATTR_TYPE_U64 here.
      
      Note that the exposing here corresponds to the strict policy
      interpretation, and NLA_UNSPEC items are omitted entirely.
      To get those, change them to NLA_MIN_LEN which behaves in
      exactly the same way, but is exposed.
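The type mapping described in the message can be sketched as below. The function name and the dictionary layout are hypothetical, only the types named above are covered, and how `length` translates into minimum and maximum for each kernel type is an assumption of this sketch.

```python
def exposed_attr_type(nla_type: str, length=None):
    """Return a dict describing what userspace would see, or None if omitted."""
    if nla_type == "NLA_UNSPEC":
        return None  # omitted entirely from the dump
    if nla_type == "NLA_MSECS":
        return {"type": "NL_ATTR_TYPE_U64"}
    if nla_type in ("NLA_BINARY", "NLA_MIN_LEN", "NLA_EXACT_LEN"):
        out = {"type": "NL_ATTR_TYPE_BINARY"}
        if length is not None:
            # Assumed semantics: MIN_LEN sets a minimum, BINARY a maximum,
            # EXACT_LEN both.
            if nla_type in ("NLA_MIN_LEN", "NLA_EXACT_LEN"):
                out["min_length"] = length
            if nla_type in ("NLA_BINARY", "NLA_EXACT_LEN"):
                out["max_length"] = length
        return out
    raise ValueError(f"type not covered by this sketch: {nla_type}")
```

Collapsing the three binary variants into one exposed type is what keeps the kernel-internal NLA_* values out of the userspace API.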
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: factor out policy range helpers · 2c28ae48
      Johannes Berg committed
      Add helpers to get the policy's signed/unsigned range
      validation data.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: remove NLA_EXACT_LEN_WARN · c7721c05
      Johannes Berg committed
      Use a validation type instead, so we can later expose
      the NLA_* values to userspace for policy descriptions.
      
      Some transformations were done with this spatch:
      
          @@
          identifier p;
          expression X, L, A;
          @@
          struct nla_policy p[X] = {
          [A] =
          -{ .type = NLA_EXACT_LEN_WARN, .len = L },
          +NLA_POLICY_EXACT_LEN_WARN(L),
          ...
          };
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: allow NLA_MSECS to have range validation · da4063bd
      Johannes Berg committed
      Since NLA_MSECS is really equivalent to NLA_U64, allow
      it to have range validation as well.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: extend policy range validation · d06a09b9
      Johannes Berg committed
      Using a pointer to a struct indicating the min/max values,
      extend the ability to do range validation for arbitrary
      values. Small values in the s16 range can be kept in the
      policy directly.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: remove type-unsafe validation_data pointer · 47a1494b
      Johannes Berg committed
      In the netlink policy, we currently have a void *validation_data
      that's pointing to different things:
       * a u32 value for bitfield32,
       * the netlink policy for nested/nested array
       * the string for NLA_REJECT
      
      Remove the pointer and place appropriate type-safe items in the
      union instead.
      
      While at it, completely dissolve the pointer for the bitfield32
      case and just put the value there directly.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add hrtimer slack to sack compression · a70437cc
      Eric Dumazet committed
      Add a sysctl to control hrtimer slack, default of 100 usec.

      This gives the opportunity to reduce system overhead,
      and help very short RTT flows.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • docs: networking: convert radiotap-headers.txt to ReST · 66d495d0
      Mauro Carvalho Chehab committed
      - add SPDX header;
      - adjust title markup;
      - mark code blocks and literals as such;
      - adjust indentation, whitespaces and blank lines where needed;
      - add to networking/index.rst.
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: move option parsing into mptcp_incoming_options() · cfde141e
      Paolo Abeni committed
      The mptcp_options_received structure carries several per
      packet flags (mp_capable, mp_join, etc.). Such fields must
      be cleared on each packet, even on dropped ones or packets
      not carrying any MPTCP options, but the current mptcp
      code clears them only on TCP option reset.

      In several races/corner cases we end up with stray bits in
      incoming options, leading to WARN_ON splats, e.g.:
      
      [  171.164906] Bad mapping: ssn=32714 map_seq=1 map_data_len=32713
      [  171.165006] WARNING: CPU: 1 PID: 5026 at net/mptcp/subflow.c:533 warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.167632] Modules linked in: ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel geneve ip6_udp_tunnel udp_tunnel macsec macvtap tap ipvlan macvlan 8021q garp mrp xfrm_interface veth netdevsim nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun binfmt_misc intel_rapl_msr intel_rapl_common rfkill kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_console ata_generic virtio_blk virtio_net net_failover failover ata_piix libata
      [  171.199464] CPU: 1 PID: 5026 Comm: repro Not tainted 5.7.0-rc1.mptcp_f227fdf5d388+ #95
      [  171.200886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
      [  171.202546] RIP: 0010:warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.206537] Code: c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 8b 55 3c 44 89 e6 48 c7 c7 20 51 13 95 e8 37 8b 22 fe <0f> 0b 48 83 c4 08 5b 5d 41 5c c3 89 4c 24 04 e8 db d6 94 fe 8b 4c
      [  171.220473] RSP: 0018:ffffc90000150560 EFLAGS: 00010282
      [  171.221639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [  171.223108] RDX: 0000000000000000 RSI: 0000000000000008 RDI: fffff5200002a09e
      [  171.224388] RBP: ffff8880aa6e3c00 R08: 0000000000000001 R09: fffffbfff2ec9955
      [  171.225706] R10: ffffffff9764caa7 R11: fffffbfff2ec9954 R12: 0000000000007fca
      [  171.227211] R13: ffff8881066f4a7f R14: ffff8880aa6e3c00 R15: 0000000000000020
      [  171.228460] FS:  00007f8623719740(0000) GS:ffff88810be00000(0000) knlGS:0000000000000000
      [  171.230065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  171.231303] CR2: 00007ffdab190a50 CR3: 00000001038ea006 CR4: 0000000000160ee0
      [  171.232586] Call Trace:
      [  171.233109]  <IRQ>
      [  171.233531] get_mapping_status (linux-mptcp/net/mptcp/subflow.c:691)
      [  171.234371] mptcp_subflow_data_available (linux-mptcp/net/mptcp/subflow.c:736 linux-mptcp/net/mptcp/subflow.c:832)
      [  171.238181] subflow_state_change (linux-mptcp/net/mptcp/subflow.c:1085 (discriminator 1))
      [  171.239066] tcp_fin (linux-mptcp/net/ipv4/tcp_input.c:4217)
      [  171.240123] tcp_data_queue (linux-mptcp/./include/linux/compiler.h:199 linux-mptcp/net/ipv4/tcp_input.c:4822)
      [  171.245083] tcp_rcv_established (linux-mptcp/./include/linux/skbuff.h:1785 linux-mptcp/./include/net/tcp.h:1774 linux-mptcp/./include/net/tcp.h:1847 linux-mptcp/net/ipv4/tcp_input.c:5238 linux-mptcp/net/ipv4/tcp_input.c:5730)
      [  171.254089] tcp_v4_rcv (linux-mptcp/./include/linux/spinlock.h:393 linux-mptcp/net/ipv4/tcp_ipv4.c:2009)
      [  171.258969] ip_protocol_deliver_rcu (linux-mptcp/net/ipv4/ip_input.c:204 (discriminator 1))
      [  171.260214] ip_local_deliver_finish (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/ipv4/ip_input.c:232)
      [  171.261389] ip_local_deliver (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:252)
      [  171.265884] ip_rcv (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:539)
      [  171.273666] process_backlog (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/core/dev.c:6135)
      [  171.275328] net_rx_action (linux-mptcp/net/core/dev.c:6572 linux-mptcp/net/core/dev.c:6640)
      [  171.280472] __do_softirq (linux-mptcp/./arch/x86/include/asm/jump_label.h:25 linux-mptcp/./include/linux/jump_label.h:200 linux-mptcp/./include/trace/events/irq.h:142 linux-mptcp/kernel/softirq.c:293)
      [  171.281379] do_softirq_own_stack (linux-mptcp/arch/x86/entry/entry_64.S:1083)
      [  171.282358]  </IRQ>
      
      We could address the issue by explicitly clearing the relevant fields
      in several places - tcp_parse_option, tcp_fast_parse_options,
      possibly others.
      
      Instead we move the MPTCP option parsing into the already existing
      mptcp ingress hook, so that we need to clear the fields in a single
      place.
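The benefit of parsing in a single ingress hook can be illustrated with a toy model. The names below are hypothetical and the input is simplified to a list of MPTCP option subtype names; the point is only that building a fresh structure per packet leaves no stale flags behind.

```python
from dataclasses import dataclass


@dataclass
class OptionsReceived:
    """Per-packet flags; every field starts cleared."""
    mp_capable: bool = False
    mp_join: bool = False
    dss: bool = False


def parse_incoming_options(subtypes) -> OptionsReceived:
    """Parse one packet's MPTCP options into a freshly cleared structure."""
    opts = OptionsReceived()  # the single place where fields are reset
    for name in subtypes:
        if name == "MP_CAPABLE":
            opts.mp_capable = True
        elif name == "MP_JOIN":
            opts.mp_join = True
        elif name == "DSS":
            opts.dss = True
    return opts
```

A packet without MPTCP options therefore always yields all-false flags, regardless of what the previous packet carried.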
      
      This allows us to drop an MPTCP hook from the TCP code and
      remove the quite large mptcp_options_received from the tcp_sock
      struct. On the flip side, the MPTCP sockets will traverse the
      option space twice (in tcp_parse_option() and in
      mptcp_incoming_options()). That looks acceptable: we already
      do that for syn and 3rd ack packets, plain TCP sockets will
      benefit from it, and even MPTCP sockets will experience better
      code locality, reducing the jumps between TCP and MPTCP code.
      
      v1 -> v2:
       - rebased on current '-net' tree
      
      Fixes: 648ef4b8 ("mptcp: Implement MPTCP receive path")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: consolidate synack processing. · 263e1201
      Paolo Abeni committed
      Currently the MPTCP code uses 2 hooks to process syn-ack
      packets, mptcp_rcv_synsent() and the sk_rx_dst_set()
      callback.
      
      We can drop the first, moving the relevant code into the
      latter, reducing the hooking into the TCP code. This is
      also needed by the next patch.
      
      v1 -> v2:
       - use local tcp sock ptr instead of casting the sk variable
         several times - DaveM
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 29 Apr, 2020 2 commits
  7. 28 Apr, 2020 2 commits
    • netfilter: nf_tables: allow up to 64 bytes in the set element data area · fdb9c405
      Pablo Neira Ayuso committed
      So far, the set elements could store up to 128 bits in the data area.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • switchdev: mrp: Extend switchdev API to offload MRP · c284b545
      Horatiu Vultur committed
      Extend switchdev API to add support for MRP. The HW is notified in
      the following cases:

      SWITCHDEV_OBJ_ID_MRP: This is used when an MRP instance is added/removed
        from the MRP ring.

      SWITCHDEV_OBJ_ID_RING_ROLE_MRP: This is used when the role of the node
        changes. The currently supported roles are MRM and MRC.

      SWITCHDEV_OBJ_ID_RING_TEST_MRP: This is used to start/stop sending
        MRP_Test frames on the MRP ring ports. This is called only on nodes that have
        the role MRM. In case this fails, the SW will generate the frames.

      SWITCHDEV_OBJ_ID_RING_STATE_MRP: This is used when the ring changes its state
        to open or closed. This is required to notify HW because the MRP_Test frame
        contains the field MRP_InState which contains this information.
      
      SWITCHDEV_ATTR_ID_MRP_PORT_STATE: This is used when the port's state is
        changed. It can be in blocking/forwarding mode.
      
      SWITCHDEV_ATTR_ID_MRP_PORT_ROLE: This is used when port's role changes. The
        roles of the port can be primary/secondary. This is required to notify HW
        because the MRP_Test frame contains the field MRP_PortRole that contains this
        information.
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 26 Apr, 2020 1 commit
    • tcp: mptcp: use mptcp receive buffer space to select rcv window · 071c8ed6
      Florian Westphal committed
      In MPTCP, the receive window is shared across all subflows, because it
      refers to the mptcp-level sequence space.
      
      MPTCP receivers already place incoming packets on the mptcp socket
      receive queue and will charge it to the mptcp socket rcvbuf until
      userspace consumes the data.
      
      Update __tcp_select_window to use the occupancy of the parent/mptcp
      socket instead of the subflow socket in case the tcp socket is part
      of a logical mptcp connection.
      
      This commit doesn't change choice of initial window for passive or active
      connections.
      While it would be possible to change those as well, this adds complexity
      (especially when handling MP_JOIN requests).  Furthermore, the MPTCP RFC
      specifically says that a MPTCP sender 'MUST NOT use the RCV.WND field
      of a TCP segment at the connection level if it does not also carry a DSS
      option with a Data ACK field.'
      
      SYN/SYNACK packets do not carry a DSS option with a Data ACK field.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 25 Apr, 2020 2 commits
  10. 24 Apr, 2020 2 commits
  11. 23 Apr, 2020 2 commits
  12. 19 Apr, 2020 1 commit
  13. 15 Apr, 2020 3 commits
  14. 14 Apr, 2020 1 commit
  15. 08 Apr, 2020 3 commits
    • net: ipv6: do not consider routes via gateways for anycast address check · 03e2a984
      Tim Stallard committed
      The behaviour for what is considered an anycast address changed in
      commit 45e4fd26 ("ipv6: Only create RTF_CACHE routes after
      encountering pmtu exception"). This now considers the first
      address in a subnet where there is a route via a gateway
      to be an anycast address.
      
      This breaks path MTU discovery and traceroutes when a host in a
      remote network uses the address at the start of a prefix
      (eg 2600:: advertised as 2600::/48 in the DFZ) as ICMP errors
      will not be sent to anycast addresses.
      
      This patch excludes any routes with a gateway, or via point to
      point links, like the behaviour previously from
      rt6_is_gw_or_nonexthop in net/ipv6/route.c.
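The check described above can be approximated with the sketch below. It is not the kernel function: the name and parameters are hypothetical, and the prefix-length condition (shorter than /127) is an assumption of the sketch rather than something stated in the commit message.

```python
import ipaddress


def is_anycast_destination(daddr: str, route_prefix: str,
                           has_gateway: bool,
                           point_to_point: bool = False) -> bool:
    """True if daddr is treated as the subnet-router anycast address of the route."""
    if has_gateway or point_to_point:
        # Routes via a gateway or a point-to-point link are excluded by the patch.
        return False
    net = ipaddress.IPv6Network(route_prefix)
    # Assumed condition: only the first address of a prefix shorter than /127.
    return net.prefixlen < 127 and ipaddress.IPv6Address(daddr) == net.network_address
```

In the test setup that follows, `2001:db8:100::` is reached through a gateway route, so with the patch it is no longer classified as anycast and ICMP errors are sent to it.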
      
      This can be tested with:
      ip link add v1 type veth peer name v2
      ip netns add test
      ip netns exec test ip link set lo up
      ip link set v2 netns test
      ip link set v1 up
      ip netns exec test ip link set v2 up
      ip addr add 2001:db8::1/64 dev v1 nodad
      ip addr add 2001:db8:100:: dev lo nodad
      ip netns exec test ip addr add 2001:db8::2/64 dev v2 nodad
      ip netns exec test ip route add unreachable 2001:db8:1::1
      ip netns exec test ip route add 2001:db8:100::/64 via 2001:db8::1
      ip netns exec test sysctl net.ipv6.conf.all.forwarding=1
      ip route add 2001:db8:1::1 via 2001:db8::2
      ping -I 2001:db8::1 2001:db8:1::1 -c1
      ping -I 2001:db8:100:: 2001:db8:1::1 -c1
      ip addr delete 2001:db8:100:: dev lo
      ip netns delete test
      
      Currently the first ping will get back a destination unreachable ICMP
      error, but the second will never get a response, with "icmp6_send:
      acast source" logged. After this patch, both get destination
      unreachable ICMP replies.
      
      Fixes: 45e4fd26 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
      Signed-off-by: Tim Stallard <code@timstallard.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sock.h: fix skb_steal_sock() kernel-doc · 045065f0
      Lothar Rubusch committed
      Fix warnings related to kernel-doc notation, and wording in
      function description.
      Signed-off-by: Lothar Rubusch <l.rubusch@gmail.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Tested-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Bluetooth: debugfs option to unset MITM flag · c2aa30db
      Archie Pusaka committed
      The BT qualification test SM/MAS/PKE/BV-01-C needs us to turn off
      the MITM flag when pairing, and at the same time also set the io
      capability to something other than no input no output.
      
      Currently the MITM flag is only unset when the io capability is set
      to no input no output, therefore the test cannot be executed.
      
      This patch introduces a debugfs option to force MITM flag to be
      turned off.
      Signed-off-by: Archie Pusaka <apusaka@chromium.org>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>