1. 23 Nov 2019, 1 commit
  2. 22 Nov 2019, 15 commits
    • ipv4: use dst hint for ipv4 list receive · 02b24941
      Paolo Abeni committed
      This is like the previous change, with an additional ipv4-specific
      quirk: even when using the route hint, we still have to perform
      per-packet checks on source address validity, so a new helper is
      added to wrap them.
      
      Hints are explicitly disabled if the destination is a local
      broadcast; that keeps the code simple, and local broadcasts are a
      slower path anyway.
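
      A minimal sketch of the hint extraction described above, using the
      helper name from the changelog below (the exact in-tree signature
      may differ):

        /* Reuse the previous skb as a routing hint only when it is
         * safe: no custom FIB rules in place and not a local broadcast.
         */
        static struct sk_buff *ip_extract_route_hint(const struct net *net,
                                                     struct sk_buff *skb,
                                                     int rt_type)
        {
                if (fib4_has_custom_rules(net) || rt_type == RTN_BROADCAST)
                        return NULL;

                return skb;
        }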
      
      UDP flood performance vs. a recvmmsg() receiver:
      
      vanilla		patched		delta
      Kpps		Kpps		%
      1683		1871		+11
      
      In the worst-case scenario - each packet has a different
      destination address - the performance delta is within noise
      range.
      
      v3 -> v4:
       - re-enable hints for forward
      
      v2 -> v3:
       - really fix build (sic) and hint usage check
       - use fib4_has_custom_rules() helpers (David A.)
       - add ip_extract_route_hint() helper (Edward C.)
       - use prev skb as hint instead of copying data (Willem)
      
      v1 -> v2:
       - fix build issue with !CONFIG_IP_MULTIPLE_TABLES
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: move fib4_has_custom_rules() helper to public header · c43c3d76
      Paolo Abeni committed
      Move it so that it can be used in the next patch.
      Additionally, constify the helper argument.
      Suggested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: introduce and use route lookup hints for list input · 197dbf24
      Paolo Abeni committed
      When doing RX batch packet processing, we currently always repeat
      the route lookup for each ingress packet. When no custom rules are
      in place and there are no routes depending on source addresses, we
      know that packets with the same destination address will use the
      same dst.
      
      This change tries to avoid the per-packet route lookup by caching
      the destination address of the latest successful lookup and
      reusing it for the next packet when the above conditions hold.
      Ingress traffic for most servers should fit this pattern.
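
      A sketch of the corresponding hint check, combining the
      fib6_has_custom_rules() helper named in the changelog below with
      the counter from the companion patch (fib6_routes_require_src is
      an assumption based on this series):

        static struct sk_buff *ip6_extract_route_hint(const struct net *net,
                                                      struct sk_buff *skb)
        {
                if (fib6_routes_require_src(net) ||
                    fib6_has_custom_rules(net))
                        return NULL;

                return skb;
        }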
      
      The measured performance delta under UDP flood vs. a recvmmsg()
      receiver is as follows:
      
      vanilla		patched		delta
      Kpps		Kpps		%
      1431		1674		+17
      
      In the worst-case scenario - each packet has a different
      destination address - the performance delta is within noise
      range.
      
      v3 -> v4:
       - support hints for SUBFLOW build, too (David A.)
       - several style fixes (Eric)
      
      v2 -> v3:
       - add fib6_has_custom_rules() helpers (David A.)
       - add ip6_extract_route_hint() helper (Edward C.)
       - use hint directly in ip6_list_rcv_finish() (Willem)
      
      v1 -> v2:
       - fix build issue with !CONFIG_IPV6_MULTIPLE_TABLES
       - fix potential race when fib6_has_custom_rules is set
         while processing a packet batch
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: keep track of routes using src · b9b33e7c
      Paolo Abeni committed
      Use a per-namespace counter: increment it on successful creation
      of any route using the source address, and decrement it on
      deletion of such routes.

      This allows us to easily check whether the routing decision in the
      current namespace depends on the packet source. It will be used by
      the next patch.
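
      A sketch of the bookkeeping described above; the field name
      net->ipv6.fib6_routes_require_src is an assumption based on this
      series:

        static inline bool fib6_routes_require_src(const struct net *net)
        {
                return net->ipv6.fib6_routes_require_src > 0;
        }

        static inline void fib6_routes_require_src_inc(struct net *net)
        {
                net->ipv6.fib6_routes_require_src++;
        }

        static inline void fib6_routes_require_src_dec(struct net *net)
        {
                net->ipv6.fib6_routes_require_src--;
        }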
      Suggested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: ocelot: add hardware timestamping support for Felix · c0bcf537
      Yangbo Lu committed
      This patch reuses ocelot functions where possible to enable the
      PTP clock and to support hardware timestamping on Felix.
      On the TX path, timestamping works on packets that require a
      timestamp: the injection header is configured accordingly, and an
      skb clone of each such packet is added to a list. The TX timestamp
      is finally handled in the threaded interrupt handler when the PTP
      timestamp FIFO is ready.
      On the RX path, timestamping is always active; the RX timestamp
      can be retrieved from the extraction header.
      Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: skmsg, fix potential psock NULL pointer dereference · 8163999d
      John Fastabend committed
      Report from Dan Carpenter:
      
       net/core/skmsg.c:792 sk_psock_write_space()
       error: we previously assumed 'psock' could be null (see line 790)
      
       net/core/skmsg.c
         789 psock = sk_psock(sk);
         790 if (likely(psock && sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)))
       Check for NULL
         791 schedule_work(&psock->work);
         792 write_space = psock->saved_write_space;
                           ^^^^^^^^^^^^^^^^^^^^^^^^
         793          rcu_read_unlock();
         794          write_space(sk);
      
      Ensure the psock dereference on line 792 only occurs if psock is
      not NULL.
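
      One way to guard the dereference, as a sketch (not necessarily the
      verbatim upstream diff):

        static void sk_psock_write_space(struct sock *sk)
        {
                struct sk_psock *psock;
                void (*write_space)(struct sock *sk) = NULL;

                rcu_read_lock();
                psock = sk_psock(sk);
                if (likely(psock)) {
                        if (sk_psock_test_state(psock,
                                                SK_PSOCK_TX_ENABLED))
                                schedule_work(&psock->work);
                        /* dereference only after the NULL check */
                        write_space = psock->saved_write_space;
                }
                rcu_read_unlock();
                if (write_space)
                        write_space(sk);
        }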
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Fix Kconfig indentation, continued · 43da1411
      Krzysztof Kozlowski committed
      Adjust indentation from spaces to a tab (plus an optional two
      spaces) as per the coding style. This fixes various indentation
      mix-ups (seven spaces, tab plus one space, etc.).
      Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lwtunnel: check erspan options before allocating tun_info · 1841b982
      Xin Long committed
      As Jakub suggested on another patch, it's better to do the check
      on erspan options before allocating memory.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lwtunnel: be STRICT to validate the new LWTUNNEL_IP(6)_OPTS · 7b6a70f7
      Xin Long committed
      LWTUNNEL_IP(6)_OPTS are new items in ip(6)_tun_policy, which are
      parsed by nla_parse_nested_deprecated(). We should check them
      strictly by setting .strict_start_type = LWTUNNEL_IP(6)_OPTS.
      
      This patch also adds missing LWTUNNEL_IP6_OPTS in ip6_tun_policy.
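
      The usual idiom sets strict_start_type on the UNSPEC slot of the
      policy; a sketch for the IPv4 case (the remaining attribute
      entries are elided):

        static const struct nla_policy ip_tun_policy[LWTUNNEL_IP_MAX + 1] = {
                [LWTUNNEL_IP_UNSPEC] = {
                        /* attrs from this type onward are parsed strictly */
                        .strict_start_type = LWTUNNEL_IP_OPTS,
                },
                /* existing LWTUNNEL_IP_* attribute policies go here */
        };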
      
      Fixes: 4ece4778 ("lwtunnel: add options setting and dumping for geneve")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: remove the unnecessary strict_start_type in some policies · f3bed7f8
      Xin Long committed
      ct_policy and mpls_policy are parsed with nla_parse_nested(),
      which already does NL_VALIDATE_STRICT validation, so there is no
      need to set strict_start_type, whose purpose is to make only some
      attributes be parsed with NL_VALIDATE_STRICT.

      This patch removes it, and does the same for rtm_nh_policy, which
      is parsed by nlmsg_parse().
      Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: allow flower to match erspan options · 79b1011c
      Xin Long committed
      This patch allows matching erspan options.

      The options can be described in the form
      VER:INDEX:DIR:HWID/VER:INDEX_MASK:DIR_MASK:HWID_MASK.
      When ver is set to 1, index is applied while dir and hwid are
      ignored; when ver is set to 2, dir and hwid are used while index
      is ignored.

      Unlike geneve, only one option can be set. Also, geneve options,
      vxlan options and erspan options can't be set at the same time.
      
        # ip link add name erspan1 type erspan external
        # tc qdisc add dev erspan1 ingress
        # tc filter add dev erspan1 protocol ip parent ffff: \
            flower \
              enc_src_ip 10.0.99.192 \
              enc_dst_ip 10.0.99.193 \
              enc_key_id 11 \
              erspan_opts 1:12:0:0/1:ffff:0:0 \
              ip_proto udp \
              action mirred egress redirect dev eth0
      
      v1->v2:
        - improve some extack error messages.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: allow flower to match vxlan options · d8f9dfae
      Xin Long committed
      This patch allows matching the gbp option in vxlan.

      The option can be described in the form GBP/GBP_MASK, where GBP is
      represented as a 32-bit hexadecimal value. Unlike geneve, only one
      option can be set. Also, geneve options and vxlan options can't be
      set at the same time.
      
        # ip link add name vxlan0 type vxlan dstport 0 external
        # tc qdisc add dev vxlan0 ingress
        # tc filter add dev vxlan0 protocol ip parent ffff: \
            flower \
              enc_src_ip 10.0.99.192 \
              enc_dst_ip 10.0.99.193 \
              enc_key_id 11 \
              vxlan_opts 01020304/ffffffff \
              ip_proto udp \
              action mirred egress redirect dev eth0
      
      v1->v2:
        - add .strict_start_type for enc_opts_policy as Jakub noticed.
        - use Duplicate instead of Wrong in err msg for extack as Jakub
          suggested.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: add erspan option support to act_tunnel_key · e20d4ff2
      Xin Long committed
      This patch allows setting erspan options using the act_tunnel_key
      action. Unlike geneve options, only one option can be set. Also,
      geneve options, vxlan options and erspan options can't be set at
      the same time.

      Options are expressed as ver:index:dir:hwid; when ver is set to 1,
      index is applied while dir and hwid are ignored, and when ver is
      set to 2, dir and hwid are used while index is ignored.
      
        # ip link add name erspan1 type erspan external
        # tc qdisc add dev eth0 ingress
        # tc filter add dev eth0 protocol ip parent ffff: \
                 flower indev eth0 \
                    ip_proto udp \
                    action tunnel_key \
                        set src_ip 10.0.99.192 \
                        dst_ip 10.0.99.193 \
                        dst_port 6081 \
                        id 11 \
                        erspan_opts 1:2:0:0 \
                action mirred egress redirect dev erspan1
      
      v1->v2:
        - do the validation when dst is not yet allocated as Jakub suggested.
        - use Duplicate instead of Wrong in err msg for extack.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: add vxlan option support to act_tunnel_key · fca3f91c
      Xin Long committed
      This patch allows setting vxlan options using the act_tunnel_key
      action. Unlike geneve options, only one option can be set. Also,
      geneve options and vxlan options can't be set at the same time.
      
      gbp is the only param for vxlan options:
      
        # ip link add name vxlan0 type vxlan dstport 0 external
        # tc qdisc add dev eth0 ingress
        # tc filter add dev eth0 protocol ip parent ffff: \
                 flower indev eth0 \
                    ip_proto udp \
                    action tunnel_key \
                        set src_ip 10.0.99.192 \
                        dst_ip 10.0.99.193 \
                        dst_port 6081 \
                        id 11 \
                        vxlan_opts 01020304 \
                action mirred egress redirect dev vxlan0
      
      v1->v2:
        - add .strict_start_type for enc_opts_policy as Jakub noticed.
        - use Duplicate instead of Wrong in err msg for extack as Jakub
          suggested.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vsock: avoid assigning the transport if its initialization fails · 039fccca
      Stefano Garzarella committed
      If transport->init() fails, we can't assign the transport to the
      socket, because it's not initialized correctly; any future calls
      to the transport callbacks would behave unexpectedly.
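
      The gist of the fix, as a sketch (local variable names follow the
      description; this is not the verbatim diff):

        /* initialize first; assign to the socket only on success */
        ret = transport->init(vsk, psk);
        if (ret)
                return ret;

        vsk->transport = transport;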
      
      Fixes: c0cfa2d8 ("vsock: add multi-transports support")
      Reported-and-tested-by: syzbot+e2e5c07bf353b2f79daa@syzkaller.appspotmail.com
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 21 Nov 2019, 11 commits
    • tcp: warn if offset reaches the maxlen limit when using snprintf · 9bb59a21
      Hangbin Liu committed
      snprintf() returns the number of characters that would have been
      written, not the number actually written. As such, 'offs' may grow
      larger than 'tbl.maxlen', making 'tbl.maxlen - offs' negative;
      since the parameter is size_t, the value would wrap around to a
      huge number.

      Using scnprintf() would hide the limit error, and the buffer is
      still large enough for now, so let's just add a WARN_ON_ONCE in
      case it reaches the limit in the future.
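
      A small userspace demo of the return-value semantics behind the
      bug (illustrative only, not kernel code):

        #include <stdio.h>

        int main(void)
        {
                char buf[8];
                size_t maxlen = sizeof(buf);
                /* Returns the length it *would* have written (26), even
                 * though only 7 chars plus the NUL terminator fit. */
                int offs = snprintf(buf, maxlen,
                                    "abcdefghijklmnopqrstuvwxyz");

                printf("returned %d, stored \"%s\"\n", offs, buf);
                /* With size_t operands, maxlen - offs wraps to a huge
                 * value instead of going negative. */
                printf("maxlen - offs = %zu\n", maxlen - (size_t)offs);
                return 0;
        }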
      
      v2: Use WARN_ON_ONCE as Jiri and Eric suggested.
      Suggested-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip_gre: make non-tun-dst gre tunnels store tunnel info as metadata_dst on receive · c0d59da7
      wenxu committed
      Currently a collect_md gre tunnel stores the tunnel info
      (metadata_dst) in skb_dst, and a non-tun-dst gre tunnel can
      already add the tunnel header through lwtunnel.

      When an arp request is received on a non-tun-dst gre tunnel, the
      arp response packet is sent through the non-tun-dst tunnel without
      tunnel info, which leads to the arp response packet being dropped.

      If the non-tun-dst gre tunnel also stores the tunnel info as
      metadata_dst, the arp response packet gets the related tunnel info
      set in iptunnel_metadata_reply.
      
      The following is the test script:
      
      ip netns add cl
      ip l add dev vethc type veth peer name eth0 netns cl
      
      ifconfig vethc 172.168.0.7/24 up
      ip l add dev tun1000 type gretap key 1000
      
      ip link add user1000 type vrf table 1
      ip l set user1000 up
      ip l set dev tun1000 master user1000
      ifconfig tun1000 10.0.1.1/24 up
      
      ip netns exec cl ifconfig eth0 172.168.0.17/24 up
      ip netns exec cl ip l add dev tun type gretap local 172.168.0.17 remote 172.168.0.7 key 1000
      ip netns exec cl ifconfig tun 10.0.1.7/24 up
      ip r r 10.0.1.7 encap ip id 1000 dst 172.168.0.17 key dev tun1000 table 1
      
      With this patch, 'ip netns exec cl ping 10.0.1.1' succeeds.
      Signed-off-by: wenxu <wenxu@ucloud.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipconfig: Wait for deferred device probes · e2ffe3ff
      Thomas Bogendoerfer committed
      If network device drivers use deferred probing, it was possible
      that ipconfig's wait for devices to show up had already ended by
      the time the device eventually appeared. By calling
      wait_for_device_probe() we now make sure deferred probing is done
      before checking for available devices.
      Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: page_pool: add the possibility to sync DMA memory for device · e68bc756
      Lorenzo Bianconi committed
      Introduce the following parameters in order to add the possibility to sync
      DMA memory for device before putting allocated pages in the page_pool
      caches:
      - PP_FLAG_DMA_SYNC_DEV: if set in page_pool_params flags, all pages that
        the driver gets from page_pool will be DMA-synced-for-device according
        to the length provided by the device driver. Please note
        DMA-sync-for-CPU is still the device driver's responsibility
      - offset: DMA address offset where the DMA engine starts copying rx data
      - max_len: maximum DMA memory size page_pool is allowed to flush. This
        is currently used in __page_pool_alloc_pages_slow routine when pages
        are allocated from page allocator
      These parameters are supposed to be set by device drivers.
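
      A sketch of how a driver might configure this, assuming the usual
      page_pool setup (pool_size and headroom values are illustrative):

        struct page_pool_params pp_params = {
                .flags     = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
                .order     = 0,
                .pool_size = 1024,
                .nid       = NUMA_NO_NODE,
                .dev       = dev,                 /* device doing the DMA */
                .dma_dir   = DMA_FROM_DEVICE,
                .offset    = XDP_PACKET_HEADROOM, /* where rx data starts */
                .max_len   = PAGE_SIZE - XDP_PACKET_HEADROOM,
        };
        struct page_pool *pool = page_pool_create(&pp_params);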
      
      This optimization reduces the length of the DMA-sync-for-device.
      The optimization is valid because pages are initially
      DMA-synced-for-device as defined via max_len. At RX time, the driver
      will perform a DMA-sync-for-CPU on the memory for the packet length.
      What is important is the memory occupied by packet payload, because
      this is the area CPU is allowed to read and modify. As we don't track
      cache lines written into by the CPU, simply use the packet payload
      length as dma_sync_size at page_pool recycle time. This also takes
      into account any tail extension.
      Tested-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: pie: enable timestamp based delay calculation · cec2975f
      Gautam Ramakrishnan committed
      RFC 8033 suggests an alternative approach to calculate the queue
      delay in PIE by using a timestamp on every enqueued packet. This
      patch adds an implementation of that approach and sets it as the
      default method to calculate queue delay. The previous method (based
      on Little's law) to calculate queue delay is set as optional.
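
      A sketch of the timestamp approach (helper and field names are
      illustrative, not the exact sch_pie.c symbols):

        struct pie_skb_cb {
                psched_time_t enqueue_time;
        };

        static struct pie_skb_cb *pie_skb_cb(struct sk_buff *skb)
        {
                return (struct pie_skb_cb *)qdisc_skb_cb(skb)->data;
        }

        /* enqueue path: stamp the packet */
        pie_skb_cb(skb)->enqueue_time = psched_get_time();

        /* dequeue path: queue delay is simply now - enqueue time */
        psched_time_t qdelay = psched_get_time() -
                               pie_skb_cb(skb)->enqueue_time;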
      Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
      Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
      Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
      Acked-by: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • page_pool: Don't recycle non-reusable pages · d5394610
      Saeed Mahameed committed
      A page is NOT reusable when at least one of the following is true:
      1) it was allocated when the system was under some pressure
         (page_is_pfmemalloc);
      2) it belongs to a different NUMA node than pool->p.nid.
      
      To update pool->p.nid users should call page_pool_update_nid().
      
      Holding on to such pages in the pool will hurt the consumer performance
      when the pool migrates to a different numa node.
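
      The two conditions boil down to a simple predicate; a sketch
      consistent with the description above (the in-tree helper may
      differ in detail):

        static bool pool_page_reusable(struct page_pool *pool,
                                       struct page *page)
        {
                return !page_is_pfmemalloc(page) &&
                       page_to_nid(page) == pool->p.nid;
        }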
      
      Performance testing:
      XDP drop/tx rate and TCP single/multi stream, on the mlx5 driver,
      while migrating the rx ring IRQ from a close to a far NUMA node:
      
      mlx5 internal page cache was locally disabled to get pure page pool
      results.
      
      CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
      NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G)
      
      XDP Drop/TX single core:
      NUMA  | XDP  | Before    | After
      ---------------------------------------
      Close | Drop | 11   Mpps | 10.9 Mpps
      Far   | Drop | 4.4  Mpps | 5.8  Mpps
      
      Close | TX   | 6.5 Mpps  | 6.5 Mpps
      Far   | TX   | 3.5 Mpps  | 4  Mpps
      
      The improvement is about 30% in drop packet rate and 15% in tx
      packet rate for the far-NUMA test.
      No degradation for the close-NUMA tests.
      
      TCP single/multi cpu/stream:
      NUMA  | #cpu | Before  | After
      --------------------------------------
      Close | 1    | 18 Gbps | 18 Gbps
      Far   | 1    | 15 Gbps | 18 Gbps
      Close | 12   | 80 Gbps | 80 Gbps
      Far   | 12   | 68 Gbps | 80 Gbps
      
      In all test cases we see improvement for the far numa case, and no
      impact on the close numa case.
      
      The impact of adding a per-page check is negligible and shows no
      performance degradation whatsoever. Functionality-wise it also
      seems more correct and more robust for page pool to verify when
      pages should be recycled, since page pool can't guarantee where
      pages are coming from.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • page_pool: Add API to update numa node · bc836748
      Saeed Mahameed committed
      Add page_pool_update_nid() to be called by page pool consumers when they
      detect numa node changes.
      
      It will update the page pool nid value to start allocating from the new
      effective numa node.
      
      This mitigates the page pool allocating pages from the wrong NUMA
      node (the one where the pool was originally allocated) and holding
      on to pages that belong to a different NUMA node, which causes
      performance degradation.
      
      For pages that are already out with consumers and could be
      returned to the pool, the next patch adds a per-page check to
      avoid recycling them back into the pool, returning them to the
      page allocator instead.
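
      A hedged sketch of how a consumer might use the new API from its
      NAPI context (rxq is a hypothetical driver structure holding the
      pool pointer):

        int nid = numa_mem_id();  /* node of the CPU running NAPI */

        if (rxq->page_pool->p.nid != nid)
                page_pool_update_nid(rxq->page_pool, nid);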
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: nft_payload: add C-VLAN offload support · 89d8fd44
      Pablo Neira Ayuso committed
      Match on h_vlan_encapsulated_proto and set up the protocol
      dependency. Check for the protocol dependency before accessing the
      tci field. Allow matching on the encapsulated ethertype too.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: nft_payload: add VLAN offload support · a82055af
      Pablo Neira Ayuso committed
      Match on the ethertype and set up the protocol dependency. Check
      for the protocol dependency before accessing the tci field. Allow
      matching on the encapsulated ethertype too.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: nf_tables_offload: allow ethernet interface type only · 8819efc9
      Pablo Neira Ayuso committed
      Hardware offload support at this stage assumes an ethernet device
      in place. The flow dissector provides the intermediate
      representation to express this selector, so extend it to allow
      storing the interface type. Flower does not use this, so
      skb_flow_dissect_meta() is not extended to match on this new
      field.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lwtunnel: add support for multiple geneve opts · 2f1d370b
      Xin Long committed
      The geneve RFC (draft-ietf-nvo3-geneve-14) allows a geneve packet
      to carry multiple geneve opts, so lwtunnel needs to support adding
      multiple geneve opts in one lwtunnel route. vxlan and erspan are
      still limited to a single option.

      With this patch, iproute2 can do things like:
      
        # ip r a 1.1.1.0/24 encap ip id 1 geneve_opts 0:0:12121212,1:2:12121212 \
          dst 10.1.0.2 dev geneve1
      
        # ip r a 1.1.1.0/24 encap ip id 1 vxlan_opts 456 \
          dst 10.1.0.2 dev erspan1
      
        # ip r a 1.1.1.0/24 encap ip id 1 erspan_opts 1:123:0:0 \
          dst 10.1.0.2 dev erspan1
      
      These are pretty much like cls_flower and act_tunnel_key.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 19 Nov 2019, 4 commits
  5. 18 Nov 2019, 1 commit
    • bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails · 1e0bd5a0
      Andrii Nakryiko committed
      Commit 92117d84 ("bpf: fix refcnt overflow") turned refcounting of
      bpf_map into a potentially failing operation when the refcount
      reaches the BPF_MAX_REFCNT limit (32k). With a 32-bit counter it
      is otherwise possible in practice to overflow the refcounter and
      make it wrap around to 0, causing an erroneous map free while
      there are still references to it, leading to use-after-free
      problems.
      
      But having failing refcounting operations is problematic in some
      cases. One example is the mmap() interface. After establishing an
      initial memory mapping, the user is allowed to arbitrarily
      map/remap/unmap parts of the mapped memory, arbitrarily splitting
      it into multiple non-contiguous regions. All of this happens
      without any control from the users of the mmap subsystem. Rather,
      the mmap subsystem sends notifications to the original creator of
      the memory mapping through open/close callbacks, which are
      optionally specified during initial memory-mapping creation. These
      callbacks are used to maintain an accurate refcount for bpf_map
      (see the next patch in this series). The problem is that the
      open() callback is not supposed to fail, because the memory-mapped
      resource is set up and properly referenced. This poses a problem
      for using memory mapping with BPF maps.
      
      One solution is to maintain a separate refcount for just memory
      mappings and do a single bpf_map_inc/bpf_map_put when it goes
      from/to zero, respectively. There are similar use cases in current
      work on tcp-bpf, necessitating an extra counter as well. This
      seems like a rather unfortunate and ugly solution that doesn't
      scale well to various new use cases.
      
      Another approach is to use the non-failing refcount_t type, which
      uses a 32-bit counter internally but, once reaching the overflow
      state at UINT_MAX, stays there. This ultimately causes a memory
      leak, but prevents use-after-free.
      
      But given refcounting is not the most performance-critical
      operation with BPF maps (it's not used from running BPF program
      code), we can also just switch to a 64-bit counter that can't
      overflow in practice, potentially disadvantaging 32-bit platforms
      a tiny bit. This simplifies the semantics and allows the
      above-described scenarios to not worry about failing refcount
      increment operations.
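
      A sketch of the resulting non-failing increments (consistent with
      the interface cleanup described at the end of this message;
      upstream details may vary):

        void bpf_map_inc(struct bpf_map *map)
        {
                atomic64_inc(&map->refcnt);
        }

        void bpf_map_inc_with_uref(struct bpf_map *map)
        {
                atomic64_inc(&map->refcnt);
                atomic64_inc(&map->usercnt);
        }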
      
      In terms of struct bpf_map size, we are still good and use the same amount of
      space:
      
      BEFORE (3 cache lines, 8 bytes of padding at the end):
      struct bpf_map {
      	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
      	struct bpf_map *           inner_map_meta;       /*     8     8 */
      	void *                     security;             /*    16     8 */
      	enum bpf_map_type  map_type;                     /*    24     4 */
      	u32                        key_size;             /*    28     4 */
      	u32                        value_size;           /*    32     4 */
      	u32                        max_entries;          /*    36     4 */
      	u32                        map_flags;            /*    40     4 */
      	int                        spin_lock_off;        /*    44     4 */
      	u32                        id;                   /*    48     4 */
      	int                        numa_node;            /*    52     4 */
      	u32                        btf_key_type_id;      /*    56     4 */
      	u32                        btf_value_type_id;    /*    60     4 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct btf *               btf;                  /*    64     8 */
      	struct bpf_map_memory memory;                    /*    72    16 */
      	bool                       unpriv_array;         /*    88     1 */
      	bool                       frozen;               /*    89     1 */
      
      	/* XXX 38 bytes hole, try to pack */
      
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	atomic_t                   refcnt __attribute__((__aligned__(64))); /*   128     4 */
      	atomic_t                   usercnt;              /*   132     4 */
      	struct work_struct work;                         /*   136    32 */
      	char                       name[16];             /*   168    16 */
      
      	/* size: 192, cachelines: 3, members: 21 */
      	/* sum members: 146, holes: 1, sum holes: 38 */
      	/* padding: 8 */
      	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
      } __attribute__((__aligned__(64)));
      
      AFTER (same 3 cache lines, no extra padding now):
      struct bpf_map {
      	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
      	struct bpf_map *           inner_map_meta;       /*     8     8 */
      	void *                     security;             /*    16     8 */
      	enum bpf_map_type  map_type;                     /*    24     4 */
      	u32                        key_size;             /*    28     4 */
      	u32                        value_size;           /*    32     4 */
      	u32                        max_entries;          /*    36     4 */
      	u32                        map_flags;            /*    40     4 */
      	int                        spin_lock_off;        /*    44     4 */
      	u32                        id;                   /*    48     4 */
      	int                        numa_node;            /*    52     4 */
      	u32                        btf_key_type_id;      /*    56     4 */
      	u32                        btf_value_type_id;    /*    60     4 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct btf *               btf;                  /*    64     8 */
      	struct bpf_map_memory memory;                    /*    72    16 */
      	bool                       unpriv_array;         /*    88     1 */
      	bool                       frozen;               /*    89     1 */
      
      	/* XXX 38 bytes hole, try to pack */
      
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	atomic64_t                 refcnt __attribute__((__aligned__(64))); /*   128     8 */
      	atomic64_t                 usercnt;              /*   136     8 */
      	struct work_struct work;                         /*   144    32 */
      	char                       name[16];             /*   176    16 */
      
      	/* size: 192, cachelines: 3, members: 21 */
      	/* sum members: 154, holes: 1, sum holes: 38 */
      	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
      } __attribute__((__aligned__(64)));
      
      This patch, while modifying all users of bpf_map_inc, also cleans
      up its interface to match bpf_map_put, with separate operations
      for bpf_map_inc and bpf_map_inc_with_uref (matching bpf_map_put
      and bpf_map_put_with_uref, respectively). Also, given there are no
      users of bpf_map_inc_not_zero specifying uref=true, remove the
      uref flag and default to uref=false internally.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com
  6. 17 Nov 2019, 8 commits
    • ipmr: Fix skb headroom in ipmr_get_route() · 7901cd97
      Guillaume Nault committed
      In route.c, inet_rtm_getroute_build_skb() creates an skb with no
      headroom. This skb is then used by inet_rtm_getroute(), which may
      pass it to rt_fill_info() and, from there, to ipmr_get_route().
      The latter might try to reuse this skb by cloning it and
      prepending an IPv4 header. But since the original skb has no
      headroom, skb_push() triggers skb_under_panic():
      
      skbuff: skb_under_panic: text:00000000ca46ad8a len:80 put:20 head:00000000cd28494e data:000000009366fd6b tail:0x3c end:0xec0 dev:veth0
      ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:108!
      invalid opcode: 0000 [#1] SMP KASAN PTI
      CPU: 6 PID: 587 Comm: ip Not tainted 5.4.0-rc6+ #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
      RIP: 0010:skb_panic+0xbf/0xd0
      Code: 41 a2 ff 8b 4b 70 4c 8b 4d d0 48 c7 c7 20 76 f5 8b 44 8b 45 bc 48 8b 55 c0 48 8b 75 c8 41 54 41 57 41 56 41 55 e8 75 dc 7a ff <0f> 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
      RSP: 0018:ffff888059ddf0b0 EFLAGS: 00010286
      RAX: 0000000000000086 RBX: ffff888060a315c0 RCX: ffffffff8abe4822
      RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88806c9a79cc
      RBP: ffff888059ddf118 R08: ffffed100d9361b1 R09: ffffed100d9361b0
      R10: ffff88805c68aee3 R11: ffffed100d9361b1 R12: ffff88805d218000
      R13: ffff88805c689fec R14: 000000000000003c R15: 0000000000000ec0
      FS:  00007f6af184b700(0000) GS:ffff88806c980000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffc8204a000 CR3: 0000000057b40006 CR4: 0000000000360ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       skb_push+0x7e/0x80
       ipmr_get_route+0x459/0x6fa
       rt_fill_info+0x692/0x9f0
       inet_rtm_getroute+0xd26/0xf20
       rtnetlink_rcv_msg+0x45d/0x630
       netlink_rcv_skb+0x1a5/0x220
       rtnetlink_rcv+0x15/0x20
       netlink_unicast+0x305/0x3a0
       netlink_sendmsg+0x575/0x730
       sock_sendmsg+0xb5/0xc0
       ___sys_sendmsg+0x497/0x4f0
       __sys_sendmsg+0xcb/0x150
       __x64_sys_sendmsg+0x48/0x50
       do_syscall_64+0xd2/0xac0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Actually, the original skb used to have enough headroom, but the
      skb_reserve() call was lost with the introduction of
      inet_rtm_getroute_build_skb() by commit 404eb77e ("ipv4: support
      sport, dport and ip_proto in RTM_GETROUTE").
      
      We could reserve some headroom again in inet_rtm_getroute_build_skb(),
      but this function shouldn't be responsible for handling the special
      case of ipmr_get_route(). Let's handle that directly in
      ipmr_get_route() by calling skb_realloc_headroom() instead of
      skb_clone().
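
      The gist of the fix, as a sketch (not the verbatim diff): allocate
      a copy with enough headroom for the IPv4 header instead of a plain
      clone.

        struct sk_buff *skb2;

        skb2 = skb_realloc_headroom(skb, sizeof(struct iphdr));
        if (!skb2)
                return -ENOMEM;
        /* skb_push(skb2, sizeof(struct iphdr)) is now safe */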
      
      Fixes: 404eb77e ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: fix fastopen for non-blocking connect() · 8204df72
      Ursula Braun committed
      FASTOPEN does not work with SMC sockets. Since SMC allows fallback
      to native TCP during connection start, the FASTOPEN setsockopts
      trigger this fallback if the SMC socket is still in state
      SMC_INIT. But if a FASTOPEN setsockopt is called after a
      non-blocking connect(), this is broken, and the fallback does not
      make sense.
      This change complements
      commit cd206360 ("net/smc: avoid fallback in case of non-blocking connect")
      and fixes the syzbot-reported problem "WARNING in smc_unhash_sk".
      
      Reported-by: syzbot+8488cc4cf1c9e09b8b86@syzkaller.appspotmail.com
      Fixes: e1bbdd57 ("net/smc: reduce sock_put() for fallback sockets")
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: allow fast GRO for skbs with Ethernet header in head · 8aef998d
      Alexander Lobakin committed
      Commit 78d3fd0b ("gro: Only use skb_gro_header for completely
      non-linear packets") back in May '09 (v2.6.31-rc1) changed the
      original condition '!skb_headlen(skb)' to
      'skb->mac_header == skb->tail' in gro_reset_offset(), saying:
      "Since the drivers that need this optimisation all provide
      completely non-linear packets" (note that this condition later
      became the current 'skb_mac_header(skb) == skb_tail_pointer(skb)'
      with commit ced14f68 ("net: Correct comparisons and calculations
      using skb->tail and skb->transport_header") without any functional
      changes).
      
      For now, we have the following rough statistics for v5.4-rc7:
      1) napi_gro_frags: 14
      2) napi_gro_receive with skb->head containing (most of) payload: 83
      3) napi_gro_receive with skb->head containing all the headers: 20
      4) napi_gro_receive with skb->head containing only Ethernet header: 2
      
      With the current condition, fast GRO with the usage of
      NAPI_GRO_CB(skb)->frag0 is available only in the [1] case.
      Packets pushed by [2] and [3] go through the 'slow' path, but
      it's not a problem for them as they already contain all the needed
      headers in skb->head, so pskb_may_pull() only moves skb->data.
      
      The layout of skbs in the fourth [4] case at the moment of
      dev_gro_receive() is identical to skbs that have come through [1],
      as napi_frags_skb() pulls the Ethernet header into skb->head. The
      only difference is that the mentioned condition is always false
      for them, because skb_put() and friends irreversibly alter the
      tail pointer. They also go through the 'slow' path, but now every
      single pskb_may_pull() in every single .gro_receive() will call
      the *really* slow __pskb_pull_tail() to pull headers to head. This
      significantly decreases the overall performance for no visible
      reason.
      
      The only two users of method [4] are:
      * drivers/staging/qlge
      * drivers/net/wireless/iwlwifi (all three variants: dvm, mvm, mvm-mq)

      Note that in the case of the wireless drivers we can't use [1]
      (napi_gro_frags()), at least for now, and the mac80211 stack
      always performs pushes and pulls anyway, so the performance hit is
      unavoidable.
      
      At the moment of v2.6.31 the mentioned change was necessary
      (that's why I don't add a "Fixes:" tag), but it became obsolete
      since skb_gro_mac_header() has gone in commit a50e233c ("net-gro:
      restore frag0 optimization"), so we can simply revert the
      condition in gro_reset_offset() to let skbs from [4] go through
      the 'fast' path just like in case [1].
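
      The gist of the reverted condition, per the description above
      (illustrative; the in-tree function differs in detail):

        static void gro_reset_offset(struct sk_buff *skb)
        {
                const struct skb_shared_info *pinfo = skb_shinfo(skb);
                const skb_frag_t *frag0 = &pinfo->frags[0];

                NAPI_GRO_CB(skb)->frag0 = NULL;
                NAPI_GRO_CB(skb)->frag0_len = 0;

                /* was: skb_mac_header(skb) == skb_tail_pointer(skb) */
                if (!skb_headlen(skb) && pinfo->nr_frags &&
                    !PageHighMem(skb_frag_page(frag0))) {
                        NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
                        NAPI_GRO_CB(skb)->frag0_len = skb_frag_size(frag0);
                }
        }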
      
      This was tested on a 600 MHz MIPS CPU with a custom driver, and
      this patch gave boosts of up to 40 Mbps to method [4] in both
      directions compared to net-next, which made the overall
      performance relatively close to [1] (without it, [4] is the
      slowest).
      
      v2:
      - Add more references and explanations to commit message
      - Fix some typos ibid
      - No functional changes
      Signed-off-by: Alexander Lobakin <alobakin@dlink.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rds: ib: update WR sizes when bringing up connection · a36e629e
      Dag Moxnes committed
      Currently WR sizes are updated from rds_ib_sysctl_max_send_wr and
      rds_ib_sysctl_max_recv_wr when a connection is shut down. As a
      result, a connection that is down while rds_ib_sysctl_max_send_wr
      or rds_ib_sysctl_max_recv_wr is updated will not pick up the new
      sizes when it comes back up.

      Move the resizing of WRs to rds_ib_setup_qp so that connections
      are set up with the most current WR sizes.
      Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • page_pool: do not release pool until inflight == 0 · c3f812ce
      Jonathan Lemon committed
      The page pool keeps track of the number of pages in flight, and
      it isn't safe to remove the pool until all pages are returned.
      
      Disallow removing the pool until all pages are back, so the pool
      is always available for page producers.
      
      Make the page pool responsible for its own delayed destruction
      instead of relying on XDP, so the page pool can be used without
      the xdp memory model.
      
      When all pages are returned, free the pool and notify xdp if the
      pool is registered with the xdp memory system.  Have the callback
      perform a table walk since some drivers (cpsw) may share the pool
      among multiple xdp_rxq_info.
      
      Note that the increment of pages_state_release_cnt may result in
      inflight == 0, resulting in the pool being released.
      
      Fixes: d956a048 ("xdp: force mem allocator removal and periodic warning")
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: remove unused constant · ab8536ca
      Ursula Braun committed
      The constant SMC_CLOSE_WAIT_LISTEN_CLCSOCK_TIME is defined, but
      since commit 3d502067 ("net/smc: simplify wait when closing listen
      socket") it is no longer used. Remove it.
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: use rcu_barrier() on module unload · 4ead9c96
      Ursula Braun committed
      Add rcu_barrier() to make sure no RCU readers or callbacks are
      pending when the module is unloaded.
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: guarantee removal of link groups in reboot · a33a803c
      Ursula Braun committed
      When rebooting, it should be guaranteed that all link groups are
      cleaned up and freed.
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>