1. 20 4月, 2018 4 次提交
  2. 19 4月, 2018 2 次提交
    • E
      ipv6: frags: fix a lockdep false positive · 415787d7
      Eric Dumazet 提交于
      lockdep does not know that the locks used by IPv4 defrag
      and IPv6 reassembly units are of different classes.
      
      It complains because of following chains :
      
      1) sch_direct_xmit()        (lock txq->_xmit_lock)
          dev_hard_start_xmit()
           xmit_one()
            dev_queue_xmit_nit()
             packet_rcv_fanout()
              ip_check_defrag()
               ip_defrag()
                spin_lock()     (lock frag queue spinlock)
      
      2) ip6_input_finish()
          ipv6_frag_rcv()       (lock frag queue spinlock)
           ip6_frag_queue()
            icmpv6_param_prob() (lock txq->_xmit_lock at some point)
      
      We could add lockdep annotations, but we also can make sure IPv6
      calls icmpv6_param_prob() only after the release of the frag queue spinlock,
      since this naturally makes frag queue spinlock a leaf in lock hierarchy.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      415787d7
    • S
      hv_netvsc: propogate Hyper-V friendly name into interface alias · 0fe554a4
      Stephen Hemminger 提交于
      This patch implement the 'Device Naming' feature of the Hyper-V
      network device API. In Hyper-V on the host through the GUI or PowerShell
      it is possible to enable the device naming feature which causes
      the host to make available to the guest the name of the device.
      This shows up in the RNDIS protocol as the friendly name.
      
      The name has no particular meaning and is limited to 256 characters.
      The value can only be set via PowerShell on the host, but could
      be scripted for mass deployments. The default value is the
      string 'Network Adapter' and since that is the same for all devices
      and useless, the driver ignores it.
      
      In Windows, the value goes into a registry key for use in SNMP
      ifAlias. For Linux, this patch puts the value in the network
      device alias property; where it is visible in ip tools and SNMP.
      
      The host provided ifAlias is just a suggestion, and can be
      overridden by later ip commands.
      
      Also requires exporting dev_set_alias in netdev core.
      Signed-off-by: NStephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0fe554a4
  3. 18 4月, 2018 25 次提交
    • D
      net/ipv6: Remove unused code and variables for rt6_info · 77634cc6
      David Ahern 提交于
      Drop unneeded elements from rt6_info struct and rearrange layout to
      something more relevant for the data path.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77634cc6
    • D
      net/ipv6: Flip FIB entries to fib6_info · 8d1c802b
      David Ahern 提交于
      Convert all code paths referencing a FIB entry from
      rt6_info to fib6_info.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8d1c802b
    • D
      net/ipv6: separate handling of FIB entries from dst based routes · 93531c67
      David Ahern 提交于
      Last step before flipping the data type for FIB entries:
      - use fib6_info_alloc to create FIB entries in ip6_route_info_create
        and addrconf_dst_alloc
      - use fib6_info_release in place of dst_release, ip6_rt_put and
        rt6_release
      - remove the dst_hold before calling __ip6_ins_rt or ip6_del_rt
      - when purging routes, drop per-cpu routes
      - replace inc and dec of rt6i_ref with fib6_info_hold and fib6_info_release
      - use rt->from since it points to the FIB entry
      - drop references to exception bucket, fib6_metrics and per-cpu from
        dst entries (those are relevant for fib entries only)
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93531c67
    • D
      net/ipv6: introduce fib6_info struct and helpers · a64efe14
      David Ahern 提交于
      Add fib6_info struct and alloc, destroy, hold and release helpers.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a64efe14
    • D
      net/ipv6: Cleanup exception and cache route handling · 23fb93a4
      David Ahern 提交于
      IPv6 FIB will only contain FIB entries with exception routes added to
      the FIB entry. Once this transformation is complete, FIB lookups will
      return a fib6_info with the lookup functions still returning a dst
      based rt6_info. The current code uses rt6_info for both paths and
      overloads the rt6_info variable usually called 'rt'.
      
      This patch introduces a new 'f6i' variable name for the result of the FIB
      lookup and keeps 'rt' as the dst based return variable. 'f6i' becomes a
      fib6_info in a later patch which is why it is introduced as f6i now;
      avoids the additional churn in the later patch.
      
      In addition, remove RTF_CACHE and dst checks from fib6 add and delete
      since they can not happen now and will never happen after the data
      type flip.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      23fb93a4
    • D
      net/ipv6: Add gfp_flags to route add functions · acb54e3c
      David Ahern 提交于
      Most FIB entries can be added using memory allocated with GFP_KERNEL.
      Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that
      can be reached from the packet path (e.g., ndisc and autoconfig) or
      atomic notifiers use GFP_ATOMIC; paths from user context (adding
      addresses and routes) use GFP_KERNEL.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      acb54e3c
    • D
      net/ipv6: Create a neigh_lookup for FIB entries · f8a1b43b
      David Ahern 提交于
      The router discovery code has a FIB entry and wants to validate the
      gateway has a neighbor entry. Refactor the existing dst_neigh_lookup
      for IPv6 and create a new function that takes the gateway and device
      and returns a neighbor entry. Use the new function in
      ndisc_router_discovery to validate the gateway.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8a1b43b
    • D
      net/ipv6: Move dst flags to booleans in fib entries · 3b6761d1
      David Ahern 提交于
      Continuing to wean FIB paths off of dst_entry, use a bool to hold
      requests for certain dst settings. Add a helper to convert the
      flags to DST flags when a FIB entry is converted to a dst_entry.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b6761d1
    • D
      net/ipv6: Add rt6_info create function for ip6_pol_route_lookup · dec9b0e2
      David Ahern 提交于
      ip6_pol_route_lookup is the lookup function for ip6_route_lookup and
      rt6_lookup. At the moment it returns either a reference to a FIB entry
      or a cached exception. To move FIB entries to a separate struct, this
      lookup function needs to convert FIB entries to an rt6_info that is
      returned to the caller.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dec9b0e2
    • D
      net/ipv6: Add fib6_null_entry · 421842ed
      David Ahern 提交于
      ip6_null_entry will stay a dst based return for lookups that fail to
      match an entry.
      
      Add a new fib6_null_entry which constitutes the root node and leafs
      for fibs. Replace existing references to ip6_null_entry with the
      new fib6_null_entry when dealing with FIBs.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      421842ed
    • D
      net/ipv6: move expires into rt6_info · 14895687
      David Ahern 提交于
      Add expires to rt6_info for FIB entries, and add fib6 helpers to
      manage it. Data path use of dst.expires remains.
      
      The transition is fairly straightforward: when working with fib entries,
      rt->dst.expires is just rt->expires, rt6_clean_expires is replaced with
      fib6_clean_expires, rt6_set_expires becomes fib6_set_expires, and
      rt6_check_expired becomes fib6_check_expired, where the fib6 versions
      are added by this patch.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      14895687
    • D
      net/ipv6: move metrics from dst to rt6_info · d4ead6b3
      David Ahern 提交于
      Similar to IPv4, add fib metrics to the fib struct, which at the moment
      is rt6_info. Will be moved to fib6_info in a later patch. Copy metrics
      into dst by reference using refcount.
      
      To make the transition:
      - add dst_metrics to rt6_info. Default to dst_default_metrics if no
        metrics are passed during route add. No need for a separate pmtu
        entry; it can reference the MTU slot in fib6_metrics
      
      - ip6_convert_metrics allocates memory in the FIB entry and uses
        ip_metrics_convert to copy from netlink attribute to metrics entry
      
      - the convert metrics call is done in ip6_route_info_create simplifying
        the route add path
        + fib6_commit_metrics and fib6_copy_metrics and the temporary
          mx6_config are no longer needed
      
      - add fib6_metric_set helper to change the value of a metric in the
        fib entry since dst_metric_set can no longer be used
      
      - cow_metrics for IPv6 can drop to dst_cow_metrics_generic
      
      - rt6_dst_from_metrics_check is no longer needed
      
      - rt6_fill_node needs the FIB entry and dst as separate arguments to
        keep compatibility with existing output. Current dst address is
        renamed to dest.
        (to be consistent with IPv4 rt6_fill_node really should be split
        into 2 functions similar to fib_dump_info and rt_fill_info)
      
      - rt6_fill_node no longer needs the temporary metrics variable
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4ead6b3
    • D
      net/ipv6: Defer initialization of dst to data path · 6edb3c96
      David Ahern 提交于
      Defer setting dst input, output and error until fib entry is copied.
      
      The reject path from ip6_route_info_create is moved to a new function
      ip6_rt_init_dst_reject with a helper doing the conversion from fib6_type
      to dst error.
      
      The remainder of the new ip6_rt_init_dst is an amalgamtion of dst code
      from addrconf_dst_alloc and the non-reject path of ip6_route_info_create.
      The dst output function is always ip6_output and the input function is
      either ip6_input (local routes), ip6_mc_input (multicast routes) or
      ip6_forward (anything else).
      
      A couple of places using dst.error are updated to look at rt6i_flags.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6edb3c96
    • D
      net/ipv6: Move nexthop data to fib6_nh · 5e670d84
      David Ahern 提交于
      Introduce fib6_nh structure and move nexthop related data from
      rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or
      lwtstate from a FIB lookup perspective are converted to use fib6_nh;
      datapath references to dst version are left as is.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e670d84
    • D
      net/ipv6: Save route type in rt6_info · e8478e80
      David Ahern 提交于
      The RTN_ type for IPv6 FIB entries is currently embedded in rt6i_flags
      and dst.error. Since dst is going to be removed, it can no longer be
      relied on for FIB dumps so save the route type as fib6_type.
      
      fc_type is set in current users based on the algorithm in rt6_fill_node:
        - rt6i_flags contains RTF_LOCAL: fc_type = RTN_LOCAL
        - rt6i_flags contains RTF_ANYCAST: fc_type = RTN_ANYCAST
        - else fc_type = RTN_UNICAST
      
      Similarly, fib6_type is set in the rt6_info templates based on the
      RTF_REJECT section of rt6_fill_node converting dst.error to RTN type.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e8478e80
    • D
      net/ipv6: Move support functions up in route.c · ae90d867
      David Ahern 提交于
      Code move only.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae90d867
    • D
      net/ipv6: Pass net namespace to route functions · afb1d4b5
      David Ahern 提交于
      Pass network namespace reference into route add, delete and get
      functions.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      afb1d4b5
    • D
      net/ipv6: Pass net to fib6_update_sernum · 7aef6859
      David Ahern 提交于
      Pass net namespace to fib6_update_sernum. It can not be marked const
      as fib6_new_sernum will change ipv6.fib6_sernum.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7aef6859
    • D
      net: Handle null dst in rtnl_put_cacheinfo · 3940746d
      David Ahern 提交于
      Need to keep expires time for IPv6 routes in a dump of FIB entries.
      Update rtnl_put_cacheinfo to allow dst to be NULL in which case
      rta_cacheinfo will only contain non-dst data.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3940746d
    • D
      net: Move fib_convert_metrics to metrics file · a919525a
      David Ahern 提交于
      Move logic of fib_convert_metrics into ip_metrics_convert. This allows
      the code that converts netlink attributes into metrics struct to be
      re-used in a later patch by IPv6.
      
      This is mostly a code move with the following changes to variable names:
        - fi->fib_net becomes net
        - fc_mx and fc_mx_len are passed as inputs pulled from fib_config
        - metrics array is passed as an input from fi->fib_metrics->metrics
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a919525a
    • L
      ipv6: send netlink notifications for manually configured addresses · a2d481b3
      Lorenzo Bianconi 提交于
      Send a netlink notification when userspace adds a manually configured
      address if DAD is enabled and optimistic flag isn't set.
      Moreover send RTM_DELADDR notifications for tentative addresses.
      
      Some userspace applications (e.g. NetworkManager) are interested in
      addr netlink events albeit the address is still in tentative state,
      however events are not sent if DAD process is not completed.
      If the address is added and immediately removed userspace listeners
      are not notified. This behaviour can be easily reproduced by using
      veth interfaces:
      
      $ ip -b - <<EOF
      > link add dev vm1 type veth peer name vm2
      > link set dev vm1 up
      > link set dev vm2 up
      > addr add 2001:db8:a:b:1:2:3:4/64 dev vm1
      > addr del 2001:db8:a:b:1:2:3:4/64 dev vm1
      EOF
      
      This patch reverts the behaviour introduced by the commit f784ad3d
      ("ipv6: do not send RTM_DELADDR for tentative addresses")
      Suggested-by: NThomas Haller <thaller@redhat.com>
      Signed-off-by: NLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a2d481b3
    • S
      net/ncsi: Refactor MAC, VLAN filters · 062b3e1b
      Samuel Mendoza-Jonas 提交于
      The NCSI driver defines a generic ncsi_channel_filter struct that can be
      used to store arbitrarily formatted filters, and several generic methods
      of accessing data stored in such a filter.
      However in both the driver and as defined in the NCSI specification
      there are only two actual filters: VLAN ID filters and MAC address
      filters. The splitting of the MAC filter into unicast, multicast, and
      mixed is also technically not necessary as these are stored in the same
      location in hardware.
      
      To save complexity, particularly in the set up and accessing of these
      generic filters, remove them in favour of two specific structs. These
      can be acted on directly and do not need several generic helper
      functions to use.
      
      This also fixes a memory error found by KASAN on ARM32 (which is not
      upstream yet), where response handlers accessing a filter's data field
      could write past allocated memory.
      
      [  114.926512] ==================================================================
      [  114.933861] BUG: KASAN: slab-out-of-bounds in ncsi_configure_channel+0x4b8/0xc58
      [  114.941304] Read of size 2 at addr 94888558 by task kworker/0:2/546
      [  114.947593]
      [  114.949146] CPU: 0 PID: 546 Comm: kworker/0:2 Not tainted 4.16.0-rc6-00119-ge156398bfcad #13
      ...
      [  115.170233] The buggy address belongs to the object at 94888540
      [  115.170233]  which belongs to the cache kmalloc-32 of size 32
      [  115.181917] The buggy address is located 24 bytes inside of
      [  115.181917]  32-byte region [94888540, 94888560)
      [  115.192115] The buggy address belongs to the page:
      [  115.196943] page:9eeac100 count:1 mapcount:0 mapping:94888000 index:0x94888fc1
      [  115.204200] flags: 0x100(slab)
      [  115.207330] raw: 00000100 94888000 94888fc1 0000003f 00000001 9eea2014 9eecaa74 96c003e0
      [  115.215444] page dumped because: kasan: bad access detected
      [  115.221036]
      [  115.222544] Memory state around the buggy address:
      [  115.227384]  94888400: fb fb fb fb fc fc fc fc 04 fc fc fc fc fc fc fc
      [  115.233959]  94888480: 00 00 00 fc fc fc fc fc 00 04 fc fc fc fc fc fc
      [  115.240529] >94888500: 00 00 04 fc fc fc fc fc 00 00 04 fc fc fc fc fc
      [  115.247077]                                             ^
      [  115.252523]  94888580: 00 04 fc fc fc fc fc fc 06 fc fc fc fc fc fc fc
      [  115.259093]  94888600: 00 00 06 fc fc fc fc fc 00 00 04 fc fc fc fc fc
      [  115.265639] ==================================================================
      Reported-by: NJoel Stanley <joel@jms.id.au>
      Signed-off-by: NSamuel Mendoza-Jonas <sam@mendozajonas.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      062b3e1b
    • E
      KEYS: DNS: limit the length of option strings · c210f7b4
      Eric Biggers 提交于
      Adding a dns_resolver key whose payload contains a very long option name
      resulted in that string being printed in full.  This hit the WARN_ONCE()
      in set_precision() during the printk(), because printk() only supports a
      precision of up to 32767 bytes:
      
          precision 1000000 too large
          WARNING: CPU: 0 PID: 752 at lib/vsprintf.c:2189 vsnprintf+0x4bc/0x5b0
      
      Fix it by limiting option strings (combined name + value) to a much more
      reasonable 128 bytes.  The exact limit is arbitrary, but currently the
      only recognized option is formatted as "dnserror=%lu" which fits well
      within this limit.
      
      Also ratelimit the printks.
      
      Reproducer:
      
          perl -e 'print "#", "A" x 1000000, "\x00"' | keyctl padd dns_resolver desc @s
      
      This bug was found using syzkaller.
      Reported-by: NMark Rutland <mark.rutland@arm.com>
      Fixes: 4a2d7892 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]")
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c210f7b4
    • S
      ipv6: Count interface receive statistics on the ingress netdev · bdb7cc64
      Stephen Suryaputra 提交于
      The statistics such as InHdrErrors should be counted on the ingress
      netdev rather than on the dev from the dst, which is the egress.
      Signed-off-by: NStephen Suryaputra <ssuryaextr@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bdb7cc64
    • D
      net/ipv6: Make __inet6_bind static · 032234d8
      David Ahern 提交于
      BPF core gets access to __inet6_bind via ipv6_bpf_stub_impl, so it is
      not invoked directly outside of af_inet6.c. Make it static and move
      inet6_bind after to avoid forward declaration.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      032234d8
  4. 17 4月, 2018 9 次提交
    • J
      xdp: avoid leaking info stored in frame data on page reuse · 6dfb970d
      Jesper Dangaard Brouer 提交于
      The bpf infrastructure and verifier goes to great length to avoid
      bpf progs leaking kernel (pointer) info.
      
      For queueing an xdp_buff via XDP_REDIRECT, xdp_frame info stores
      kernel info (incl pointers) in top part of frame data (xdp->data_hard_start).
      Checks are in place to assure enough headroom is available for this.
      
      This info is not cleared, and if the frame is reused, then a
      malicious user could use bpf_xdp_adjust_head helper to move
      xdp->data into this area.  Thus, making this area readable.
      
      This is not super critical as XDP progs requires root or
      CAP_SYS_ADMIN, which are privileged enough for such info.  An
      effort (is underway) towards moving networking bpf hooks to the
      lesser privileged mode CAP_NET_ADMIN, where leaking such info
      should be avoided.  Thus, this patch to clear the info when
      needed.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6dfb970d
    • J
      xdp: transition into using xdp_frame for ndo_xdp_xmit · 44fa2dbd
      Jesper Dangaard Brouer 提交于
      Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct
      xdp_buff.  This brings xdp_return_frame and ndp_xdp_xmit in sync.
      
      This builds towards changing the API further to become a bulk API,
      because xdp_buff is not a queue-able object while xdp_frame is.
      
      V4: Adjust for commit 59655a5b ("tuntap: XDP_TX can use native XDP")
      V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      44fa2dbd
    • J
      xdp: transition into using xdp_frame for return API · 03993094
      Jesper Dangaard Brouer 提交于
      Changing API xdp_return_frame() to take struct xdp_frame as argument,
      seems like a natural choice. But there are some subtle performance
      details here that needs extra care, which is a deliberate choice.
      
      When de-referencing xdp_frame on a remote CPU during DMA-TX
      completion, result in the cache-line is change to "Shared"
      state. Later when the page is reused for RX, then this xdp_frame
      cache-line is written, which change the state to "Modified".
      
      This situation already happens (naturally) for, virtio_net, tun and
      cpumap as the xdp_frame pointer is the queued object.  In tun and
      cpumap, the ptr_ring is used for efficiently transferring cache-lines
      (with pointers) between CPUs. Thus, the only option is to
      de-referencing xdp_frame.
      
      It is only the ixgbe driver that had an optimization, in which it can
      avoid doing the de-reference of xdp_frame.  The driver already have
      TX-ring queue, which (in case of remote DMA-TX completion) have to be
      transferred between CPUs anyhow.  In this data area, we stored a
      struct xdp_mem_info and a data pointer, which allowed us to avoid
      de-referencing xdp_frame.
      
      To compensate for this, a prefetchw is used for telling the cache
      coherency protocol about our access pattern.  My benchmarks show that
      this prefetchw is enough to compensate the ixgbe driver.
      
      V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")
      V8: Adjust for commit bd658dda ("net/mlx5e: Separate dma base address
      and offset in dma_sync call")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03993094
    • J
      xdp: allow page_pool as an allocator type in xdp_return_frame · 57d0a1c1
      Jesper Dangaard Brouer 提交于
      New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.
      
      The registered allocator page_pool pointer is not available directly
      from xdp_rxq_info, but it could be (if needed).  For now, the driver
      should keep separate track of the page_pool pointer, which it should
      use for RX-ring page allocation.
      
      As suggested by Saeed, to maintain a symmetric API it is the drivers
      responsibility to allocate/create and free/destroy the page_pool.
      Thus, after the driver have called xdp_rxq_info_unreg(), it is drivers
      responsibility to free the page_pool, but with a RCU free call.  This
      is done easily via the page_pool helper page_pool_destroy() (which
      avoids touching any driver code during the RCU callback, which could
      happen after the driver have been unloaded).
      
      V8: address issues found by kbuild test robot
       - Address sparse should be static warnings
       - Allow xdp.o to be compiled without page_pool.o
      
      V9: Remove inline from .c file, compiler knows best
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      57d0a1c1
    • J
      page_pool: refurbish version of page_pool code · ff7d6b27
      Jesper Dangaard Brouer 提交于
      Need a fast page recycle mechanism for ndo_xdp_xmit API for returning
      pages on DMA-TX completion time, which have good cross CPU
      performance, given DMA-TX completion time can happen on a remote CPU.
      
      Refurbish my page_pool code, that was presented[1] at MM-summit 2016.
      Adapted page_pool code to not depend the page allocator and
      integration into struct page.  The DMA mapping feature is kept,
      even-though it will not be activated/used in this patchset.
      
      [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
      
      V2: Adjustments requested by Tariq
       - Changed page_pool_create return codes, don't return NULL, only
         ERR_PTR, as this simplifies err handling in drivers.
      
      V4: many small improvements and cleanups
      - Add DOC comment section, that can be used by kernel-doc
      - Improve fallback mode, to work better with refcnt based recycling
        e.g. remove a WARN as pointed out by Tariq
        e.g. quicker fallback if ptr_ring is empty.
      
      V5: Fixed SPDX license as pointed out by Alexei
      
      V6: Adjustments requested by Eric Dumazet
       - Adjust ____cacheline_aligned_in_smp usage/placement
       - Move rcu_head in struct page_pool
       - Free pages quicker on destroy, minimize resources delayed an RCU period
       - Remove code for forward/backward compat ABI interface
      
      V8: Issues found by kbuild test robot
       - Address sparse should be static warnings
       - Only compile+link when a driver use/select page_pool,
         mlx5 selects CONFIG_PAGE_POOL, although its first used in two patches
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ff7d6b27
    • J
      xdp: rhashtable with allocator ID to pointer mapping · 8d5d8852
      Jesper Dangaard Brouer 提交于
      Use the IDA infrastructure for getting a cyclic increasing ID number,
      that is used for keeping track of each registered allocator per
      RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
      uses a radix tree, use a dynamic rhashtable, for creating ID to
      pointer lookup table, because this is faster.
      
      The problem that is being solved here is that, the xdp_rxq_info
      pointer (stored in xdp_buff) cannot be used directly, as the
      guaranteed lifetime is too short.  The info is needed on a
      (potentially) remote CPU during DMA-TX completion time . In an
      xdp_frame the xdp_mem_info is stored, when it got converted from an
      xdp_buff, which is sufficient for the simple page refcnt based recycle
      schemes.
      
      For more advanced allocators there is a need to store a pointer to the
      registered allocator.  Thus, there is a need to guard the lifetime or
      validity of the allocator pointer, which is done through this
      rhashtable ID map to pointer. The removal and validity of of the
      allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
      allocator will be created by the driver, and registered with
      xdp_rxq_info_reg_mem_model().
      
      It is up-to debate who is responsible for freeing the allocator
      pointer or invoking the allocator destructor function.  In any case,
      this must happen via RCU freeing.
      
      Use the IDA infrastructure for getting a cyclic increasing ID number,
      that is used for keeping track of each registered allocator per
      RX-queue xdp_rxq_info.
      
      V4: Per req of Jason Wang
      - Use xdp_rxq_info_reg_mem_model() in all drivers implementing
        XDP_REDIRECT, even-though it's not strictly necessary when
        allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).
      
      V6: Per req of Alex Duyck
      - Introduce rhashtable_lookup() call in later patch
      
      V8: Address sparse should be static warnings (from kbuild test robot)
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8d5d8852
    • J
      xdp: introduce xdp_return_frame API and use in cpumap · 5ab073ff
      Jesper Dangaard Brouer 提交于
      Introduce an xdp_return_frame API, and convert over cpumap as
      the first user, given it have queued XDP frame structure to leverage.
      
      V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck.
      V6: Remove comment that id will be added later (Req by Alex Duyck)
      V8: Rename enum mem_type to xdp_mem_type (found by kbuild test robot)
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ab073ff
    • E
      tcp: implement mmap() for zero copy receive · 93ab6cc6
      Eric Dumazet 提交于
      Some networks can make sure TCP payload can exactly fit 4KB pages,
      with well chosen MSS/MTU and architectures.
      
      Implement mmap() system call so that applications can avoid
      copying data without complex splice() games.
      
      Note that a successful mmap( X bytes) on TCP socket is consuming
      bytes, as if recvmsg() has been done. (tp->copied += X)
      
      Only PROT_READ mappings are accepted, as skb page frags
      are fundamentally shared and read only.
      
      If tcp_mmap() finds data that is not a full page, or a patch of
      urgent data, -EINVAL is returned, no bytes are consumed.
      
      Application must fallback to recvmsg() to read the problematic sequence.
      
      mmap() wont block,  regardless of socket being in blocking or
      non-blocking mode. If not enough bytes are in receive queue,
      mmap() would return -EAGAIN, or -EIO if socket is in a state
      where no other bytes can be added into receive queue.
      
      An application might use SO_RCVLOWAT, poll() and/or ioctl( FIONREAD)
      to efficiently use mmap()
      
      On the sender side, MSG_EOR might help to clearly separate unaligned
      headers and 4K-aligned chunks if necessary.
      
      Tested:
      
      mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
      MTU set to 4168  (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)
      
      Without mmap() (tcp_mmap -s)
      
      received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
        cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
      received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
        cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
      received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
        cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
      received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
        cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches
      
      With mmap() on receiver (tcp_mmap -s -z)
      
      received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
        cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
      received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
        cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
      received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
        cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
      received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
        cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93ab6cc6
    • E
      tcp: avoid extra wakeups for SO_RCVLOWAT users · 03f45c88
      Eric Dumazet 提交于
      SO_RCVLOWAT is properly handled in tcp_poll(), so that POLLIN is only
      generated when enough bytes are available in receive queue, after
      David change (commit c7004482 "tcp: Respect SO_RCVLOWAT in tcp_poll().")
      
      But TCP still calls sk->sk_data_ready() for each chunk added in receive
      queue, meaning thread is awaken, and goes back to sleep shortly after.
      
      Tested:
      
      tcp_mmap test program, receiving 32768 MB of data with SO_RCVLOWAT set to 512KB
      
      -> Should get ~2 wakeups (c-switches) per MB, regardless of how many
      (tiny or big) packets were received.
      
      High speed (mostly full size GRO packets)
      
      received 32768 MB (100 % mmap'ed) in 8.03112 s, 34.2266 Gbit,
        cpu usage user:0.037 sys:1.404, 43.9758 usec per MB, 65497 c-switches
      
      received 32768 MB (99.9954 % mmap'ed) in 7.98453 s, 34.4263 Gbit,
        cpu usage user:0.03 sys:1.422, 44.3115 usec per MB, 65485 c-switches
      
      Low speed (sender is ratelimited and sends 1-MSS at a time, so GRO is not helping)
      
      received 22474.5 MB (100 % mmap'ed) in 6015.35 s, 0.0313414 Gbit,
        cpu usage user:0.05 sys:1.586, 72.7952 usec per MB, 44950 c-switches
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03f45c88