1. 07 12月, 2019 2 次提交
    • G
      tcp: fix rejected syncookies due to stale timestamps · 04d26e7b
      Guillaume Nault 提交于
      If no synflood happens for a long enough period of time, then the
      synflood timestamp isn't refreshed and jiffies can advance so much
      that time_after32() can't accurately compare them any more.
      
      Therefore, we can end up in a situation where time_after32(now,
      last_overflow + HZ) returns false, just because these two values are
      too far apart. In that case, the synflood timestamp isn't updated as
      it should be, which can trick tcp_synq_no_recent_overflow() into
      rejecting valid syncookies.
      
      For example, let's consider the following scenario on a system
      with HZ=1000:
      
        * The synflood timestamp is 0, either because that's the timestamp
          of the last synflood or, more commonly, because we're working with
          a freshly created socket.
      
        * We receive a new SYN, which triggers synflood protection. Let's say
          that this happens when jiffies == 2147484649 (that is,
          'synflood timestamp' + HZ + 2^31 + 1).
      
        * Then tcp_synq_overflow() doesn't update the synflood timestamp,
          because time_after32(2147484649, 1000) returns false.
          With:
            - 2147484649: the value of jiffies, aka. 'now'.
            - 1000: the value of 'last_overflow' + HZ.
      
        * A bit later, we receive the ACK completing the 3WHS. But
          cookie_v[46]_check() rejects it because tcp_synq_no_recent_overflow()
          says that we're not under synflood. That's because
          time_after32(2147484649, 120000) returns false.
          With:
            - 2147484649: the value of jiffies, aka. 'now'.
            - 120000: the value of 'last_overflow' + TCP_SYNCOOKIE_VALID.
      
          Of course, in reality jiffies would have increased a bit, but this
          condition will last for the next 119 seconds, which is far enough
          to accommodate for jiffie's growth.
      
      Fix this by updating the overflow timestamp whenever jiffies isn't
      within the [last_overflow, last_overflow + HZ] range. That shouldn't
      have any performance impact since the update still happens at most once
      per second.
      
      Now we're guaranteed to have fresh timestamps while under synflood, so
      tcp_synq_no_recent_overflow() can safely use it with time_after32() in
      such situations.
      
      Stale timestamps can still make tcp_synq_no_recent_overflow() return
      the wrong verdict when not under synflood. This will be handled in the
      next patch.
      
      For 64 bits architectures, the problem was introduced with the
      conversion of ->tw_ts_recent_stamp to 32 bits integer by commit
      cca9bab1 ("tcp: use monotonic timestamps for PAWS").
      The problem has always been there on 32 bits architectures.
      
      Fixes: cca9bab1 ("tcp: use monotonic timestamps for PAWS")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04d26e7b
    • J
      net: core: rename indirect block ingress cb function · dbad3408
      John Hurley 提交于
      With indirect blocks, a driver can register for callbacks from a device
      that is does not 'own', for example, a tunnel device. When registering to
      or unregistering from a new device, a callback is triggered to generate
      a bind/unbind event. This, in turn, allows the driver to receive any
      existing rules or to properly clean up installed rules.
      
      When first added, it was assumed that all indirect block registrations
      would be for ingress offloads. However, the NFP driver can, in some
      instances, support clsact qdisc binds for egress offload.
      
      Change the name of the indirect block callback command in flow_offload to
      remove the 'ingress' identifier from it. While this does not change
      functionality, a follow up patch will implement a more more generic
      callback than just those currently just supporting ingress offload.
      
      Fixes: 4d12ba42 ("nfp: flower: allow offloading of matches on 'internal' ports")
      Signed-off-by: NJohn Hurley <john.hurley@netronome.com>
      Acked-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dbad3408
  2. 05 12月, 2019 2 次提交
  3. 04 12月, 2019 1 次提交
    • Y
      cls_flower: Fix the behavior using port ranges with hw-offload · 8ffb055b
      Yoshiki Komachi 提交于
      The recent commit 5c72299f ("net: sched: cls_flower: Classify
      packets using port ranges") had added filtering based on port ranges
      to tc flower. However the commit missed necessary changes in hw-offload
      code, so the feature gave rise to generating incorrect offloaded flow
      keys in NIC.
      
      One more detailed example is below:
      
      $ tc qdisc add dev eth0 ingress
      $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
        dst_port 100-200 action drop
      
      With the setup above, an exact match filter with dst_port == 0 will be
      installed in NIC by hw-offload. IOW, the NIC will have a rule which is
      equivalent to the following one.
      
      $ tc qdisc add dev eth0 ingress
      $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
        dst_port 0 action drop
      
      The behavior was caused by the flow dissector which extracts packet
      data into the flow key in the tc flower. More specifically, regardless
      of exact match or specified port ranges, fl_init_dissector() set the
      FLOW_DISSECTOR_KEY_PORTS flag in struct flow_dissector to extract port
      numbers from skb in skb_flow_dissect() called by fl_classify(). Note
      that device drivers received the same struct flow_dissector object as
      used in skb_flow_dissect(). Thus, offloaded drivers could not identify
      which of these is used because the FLOW_DISSECTOR_KEY_PORTS flag was
      set to struct flow_dissector in either case.
      
      This patch adds the new FLOW_DISSECTOR_KEY_PORTS_RANGE flag and the new
      tp_range field in struct fl_flow_key to recognize which filters are applied
      to offloaded drivers. At this point, when filters based on port ranges
      passed to drivers, drivers return the EOPNOTSUPP error because they do
      not support the feature (the newly created FLOW_DISSECTOR_KEY_PORTS_RANGE
      flag).
      
      Fixes: 5c72299f ("net: sched: cls_flower: Classify packets using port ranges")
      Signed-off-by: NYoshiki Komachi <komachi.yoshiki@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8ffb055b
  4. 29 11月, 2019 2 次提交
  5. 27 11月, 2019 2 次提交
  6. 24 11月, 2019 1 次提交
    • X
      sctp: cache netns in sctp_ep_common · 31243461
      Xin Long 提交于
      This patch is to fix a data-race reported by syzbot:
      
        BUG: KCSAN: data-race in sctp_assoc_migrate / sctp_hash_obj
      
        write to 0xffff8880b67c0020 of 8 bytes by task 18908 on cpu 1:
          sctp_assoc_migrate+0x1a6/0x290 net/sctp/associola.c:1091
          sctp_sock_migrate+0x8aa/0x9b0 net/sctp/socket.c:9465
          sctp_accept+0x3c8/0x470 net/sctp/socket.c:4916
          inet_accept+0x7f/0x360 net/ipv4/af_inet.c:734
          __sys_accept4+0x224/0x430 net/socket.c:1754
          __do_sys_accept net/socket.c:1795 [inline]
          __se_sys_accept net/socket.c:1792 [inline]
          __x64_sys_accept+0x4e/0x60 net/socket.c:1792
          do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        read to 0xffff8880b67c0020 of 8 bytes by task 12003 on cpu 0:
          sctp_hash_obj+0x4f/0x2d0 net/sctp/input.c:894
          rht_key_get_hash include/linux/rhashtable.h:133 [inline]
          rht_key_hashfn include/linux/rhashtable.h:159 [inline]
          rht_head_hashfn include/linux/rhashtable.h:174 [inline]
          head_hashfn lib/rhashtable.c:41 [inline]
          rhashtable_rehash_one lib/rhashtable.c:245 [inline]
          rhashtable_rehash_chain lib/rhashtable.c:276 [inline]
          rhashtable_rehash_table lib/rhashtable.c:316 [inline]
          rht_deferred_worker+0x468/0xab0 lib/rhashtable.c:420
          process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
          worker_thread+0xa0/0x800 kernel/workqueue.c:2415
          kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
          ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      It was caused by rhashtable access asoc->base.sk when sctp_assoc_migrate
      is changing its value. However, what rhashtable wants is netns from asoc
      base.sk, and for an asoc, its netns won't change once set. So we can
      simply fix it by caching netns since created.
      
      Fixes: d6c0256a ("sctp: add the rhashtable apis for sctp global transport hashtable")
      Reported-by: syzbot+e3b35fe7918ff0ee474e@syzkaller.appspotmail.com
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      31243461
  7. 23 11月, 2019 2 次提交
  8. 22 11月, 2019 7 次提交
    • T
      mac80211: Use Airtime-based Queue Limits (AQL) on packet dequeue · 7a89233a
      Toke Høiland-Jørgensen 提交于
      The previous commit added the ability to throttle stations when they queue
      too much airtime in the hardware. This commit enables the functionality by
      calculating the expected airtime usage of each packet that is dequeued from
      the TXQs in mac80211, and accounting that as pending airtime.
      
      The estimated airtime for each skb is stored in the tx_info, so we can
      subtract the same amount from the running total when the skb is freed or
      recycled. The throttling mechanism relies on this accounting to be
      accurate (i.e., that we are not freeing skbs without subtracting any
      airtime they were accounted for), so we put the subtraction into
      ieee80211_report_used_skb(). As an optimisation, we also subtract the
      airtime on regular TX completion, zeroing out the value stored in the
      packet afterwards, to avoid having to do an expensive lookup of the station
      from the packet data on every packet.
      
      This patch does *not* include any mechanism to wake a throttled TXQ again,
      on the assumption that this will happen anyway as a side effect of whatever
      freed the skb (most commonly a TX completion).
      Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/r/20191119060610.76681-5-kyan@google.comSigned-off-by: NJohannes Berg <johannes.berg@intel.com>
      7a89233a
    • K
      mac80211: Implement Airtime-based Queue Limit (AQL) · 3ace10f5
      Kan Yan 提交于
      In order for the Fq_CoDel algorithm integrated in mac80211 layer to operate
      effectively to control excessive queueing latency, the CoDel algorithm
      requires an accurate measure of how long packets stays in the queue, AKA
      sojourn time. The sojourn time measured at the mac80211 layer doesn't
      include queueing latency in the lower layer (firmware/hardware) and CoDel
      expects lower layer to have a short queue. However, most 802.11ac chipsets
      offload tasks such TX aggregation to firmware or hardware, thus have a deep
      lower layer queue.
      
      Without a mechanism to control the lower layer queue size, packets only
      stay in mac80211 layer transiently before being sent to firmware queue.
      As a result, the sojourn time measured by CoDel in the mac80211 layer is
      almost always lower than the CoDel latency target, hence CoDel does little
      to control the latency, even when the lower layer queue causes excessive
      latency.
      
      The Byte Queue Limits (BQL) mechanism is commonly used to address the
      similar issue with wired network interface. However, this method cannot be
      applied directly to the wireless network interface. "Bytes" is not a
      suitable measure of queue depth in the wireless network, as the data rate
      can vary dramatically from station to station in the same network, from a
      few Mbps to over Gbps.
      
      This patch implements an Airtime-based Queue Limit (AQL) to make CoDel work
      effectively with wireless drivers that utilized firmware/hardware
      offloading. AQL allows each txq to release just enough packets to the lower
      layer to form 1-2 large aggregations to keep hardware fully utilized and
      retains the rest of the frames in mac80211 layer to be controlled by the
      CoDel algorithm.
      Signed-off-by: NKan Yan <kyan@google.com>
      [ Toke: Keep API to set pending airtime internal, fix nits in commit msg ]
      Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/r/20191119060610.76681-4-kyan@google.comSigned-off-by: NJohannes Berg <johannes.berg@intel.com>
      3ace10f5
    • T
      mac80211: Import airtime calculation code from mt76 · db3e1c40
      Toke Høiland-Jørgensen 提交于
      Felix recently added code to calculate airtime of packets to the mt76
      driver. Import this into mac80211 so we can use it for airtime queue limit
      calculations.
      
      The airtime.c file is copied verbatim from the mt76 driver, and adjusted to
      be usable in mac80211. This involves:
      
      - Switching to mac80211 data structures.
      - Adding support for 160 MHz channels and HE mode.
      - Moving the symbol and duration calculations around a bit to avoid
        rounding with the higher rates and longer symbol times used for HE rates.
      
      The per-rate TX rate calculation is also split out to its own function so
      it can be used directly for the AQL calculations later.
      Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/r/20191119060610.76681-3-kyan@google.com
      [fix HE_GROUP_IDX() to use 3 * bw, since there are 3 _gi values]
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      db3e1c40
    • P
      ipv4: use dst hint for ipv4 list receive · 02b24941
      Paolo Abeni 提交于
      This is alike the previous change, with some additional ipv4 specific
      quirk. Even when using the route hint we still have to do perform
      additional per packet checks about source address validity: a new
      helper is added to wrap them.
      
      Hints are explicitly disabled if the destination is a local broadcast,
      that keeps the code simple and local broadcast are a slower path anyway.
      
      UDP flood performances vs recvmmsg() receiver:
      
      vanilla		patched		delta
      Kpps		Kpps		%
      1683		1871		+11
      
      In the worst case scenario - each packet has a different
      destination address - the performance delta is within noise
      range.
      
      v3 -> v4:
       - re-enable hints for forward
      
      v2 -> v3:
       - really fix build (sic) and hint usage check
       - use fib4_has_custom_rules() helpers (David A.)
       - add ip_extract_route_hint() helper (Edward C.)
       - use prev skb as hint instead of copying data (Willem)
      
      v1 -> v2:
       - fix build issue with !CONFIG_IP_MULTIPLE_TABLES
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02b24941
    • P
      ipv4: move fib4_has_custom_rules() helper to public header · c43c3d76
      Paolo Abeni 提交于
      So that we can use it in the next patch.
      Additionally constify the helper argument.
      Suggested-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c43c3d76
    • P
      ipv6: keep track of routes using src · b9b33e7c
      Paolo Abeni 提交于
      Use a per namespace counter, increment it on successful creation
      of any route using the source address, decrement it on deletion
      of such routes.
      
      This allows us to check easily if the routing decision in the
      current namespace depends on the packet source. Will be used
      by the next patch.
      Suggested-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b9b33e7c
    • P
      ipv6: add fib6_has_custom_rules() helper · 1f8ac570
      Paolo Abeni 提交于
      It wraps the namespace field with the same name, to easily
      access it regardless of build options.
      Suggested-by: NDavid Ahern <dsahern@gmail.com>
      Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f8ac570
  9. 21 11月, 2019 5 次提交
  10. 20 11月, 2019 1 次提交
  11. 19 11月, 2019 1 次提交
    • J
      page_pool: add destroy attempts counter and rename tracepoint · 7c9e6942
      Jesper Dangaard Brouer 提交于
      When Jonathan change the page_pool to become responsible to its
      own shutdown via deferred work queue, then the disconnect_cnt
      counter was removed from xdp memory model tracepoint.
      
      This patch change the page_pool_inflight tracepoint name to
      page_pool_release, because it reflects the new responsability
      better.  And it reintroduces a counter that reflect the number of
      times page_pool_release have been tried.
      
      The counter is also used by the code, to only empty the alloc
      cache once.  With a stuck work queue running every second and
      counter being 64-bit, it will overrun in approx 584 billion
      years. For comparison, Earth lifetime expectancy is 7.5 billion
      years, before the Sun will engulf, and destroy, the Earth.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c9e6942
  12. 17 11月, 2019 1 次提交
  13. 16 11月, 2019 4 次提交
    • P
      netfilter: nf_flow_table_offload: add IPv6 support · 5c27d8d7
      Pablo Neira Ayuso 提交于
      Add nf_flow_rule_route_ipv6() and use it from the IPv6 and the inet
      flowtable type definitions. Rename the nf_flow_rule_route() function to
      nf_flow_rule_route_ipv4().
      
      Adjust maximum number of actions, which now becomes 16 to leave
      sufficient room for the IPv6 address mangling for NAT.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      5c27d8d7
    • V
      net: dsa: ocelot: add tagger for Ocelot/Felix switches · 8dce89aa
      Vladimir Oltean 提交于
      While it is entirely possible that this tagger format is in fact more
      generic than just these 2 switch families, I don't have that knowledge.
      The Seville switch in NXP T1040 has a similar frame format, but there
      are enough differences (e.g. DEST field starts at bit 57 instead of 56)
      that calling this file tag_vitesse.c is a bit of a stretch at the
      moment. The frame format has been listed in a comment so that people who
      add support for further Vitesse switches can rework this tagger while
      keeping compatibility with Felix.
      
      The "ocelot" name was chosen instead of "felix" because even the Ocelot
      switch can act as a DSA device when it is used in NPI mode, and the Felix
      tagger format is almost identical. Currently it is only used for the
      Felix switch embedded in the NXP LS1028A chip.
      
      The ABI for this tagger should be considered "not stable" at the moment.
      The DSA tag is always placed before the Ethernet header and therefore,
      we are using the long prefix for RX tags to avoid putting the DSA master
      port in promiscuous mode. Once there will be an API in DSA for drivers
      to request DSA masters to be in promiscuous mode unconditionally, we
      will switch to the "no prefix" extraction frame header, which will save
      16 padding bytes for each RX frame.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8dce89aa
    • U
      net/smc: introduce bookkeeping of SMCD link groups · 5edd6b9c
      Ursula Braun 提交于
      If the ism module is unloaded return control from exit routine only,
      if all link groups are freed.
      If an IB device is thrown away return control from device removal only,
      if all link groups belonging to this device are freed.
      A counters for the total number of SMCD link groups per ISM device is
      introduced. ism module unloading continues only if the total number of
      SMCD link groups for all ISM devices is zero. ISM device
      removal continues only it the total number of SMCD link groups per ISM
      device has decreased to zero.
      Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com>
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5edd6b9c
    • U
      net/smc: immediate termination for SMCD link groups · 42bfba9e
      Ursula Braun 提交于
      SMCD link group termination is called when peer signals its shutdown
      of its corresponding link group. For regular shutdowns no connections
      exist anymore. For abnormal shutdowns connections must be killed and
      their DMBs must be unregistered immediately. That means the SMCR method
      to delay the link group freeing several seconds does not fit.
      
      This patch adds immediate termination of a link group and its SMCD
      connections and makes sure all SMCD link group related cleanup steps
      are finished.
      Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com>
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      42bfba9e
  14. 15 11月, 2019 7 次提交
  15. 13 11月, 2019 2 次提交
    • P
      netfilter: nft_meta: offload support for interface index · 25da5eb3
      Pablo Neira Ayuso 提交于
      This patch adds support for offloading the NFT_META_IIF selector.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      25da5eb3
    • P
      netfilter: nf_flow_table: hardware offload support · c29f74e0
      Pablo Neira Ayuso 提交于
      This patch adds the dataplane hardware offload to the flowtable
      infrastructure. Three new flags represent the hardware state of this
      flow:
      
      * FLOW_OFFLOAD_HW: This flow entry resides in the hardware.
      * FLOW_OFFLOAD_HW_DYING: This flow entry has been scheduled to be remove
        from hardware. This might be triggered by either packet path (via TCP
        RST/FIN packet) or via aging.
      * FLOW_OFFLOAD_HW_DEAD: This flow entry has been already removed from
        the hardware, the software garbage collector can remove it from the
        software flowtable.
      
      This patch supports for:
      
      * IPv4 only.
      * Aging via FLOW_CLS_STATS, no packet and byte counter synchronization
        at this stage.
      
      This patch also adds the action callback that specifies how to convert
      the flow entry into the flow_rule object that is passed to the driver.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c29f74e0