1. 08 11月, 2016 1 次提交
  2. 05 11月, 2016 2 次提交
    • L
      net: inet: Support UID-based routing in IP protocols. · e2d118a1
      Lorenzo Colitti 提交于
      - Use the UID in routing lookups made by protocol connect() and
        sendmsg() functions.
      - Make sure that routing lookups triggered by incoming packets
        (e.g., Path MTU discovery) take the UID of the socket into
        account.
      - For packets not associated with a userspace socket, (e.g., ping
        replies) use UID 0 inside the user namespace corresponding to
        the network namespace the socket belongs to. This allows
        all namespaces to apply routing and iptables rules to
        kernel-originated traffic in that namespaces by matching UID 0.
        This is better than using the UID of the kernel socket that is
        sending the traffic, because the UID of kernel sockets created
        at namespace creation time (e.g., the per-processor ICMP and
        TCP sockets) is the UID of the user that created the socket,
        which might not be mapped in the namespace.
      
      Tested: compiles allnoconfig, allyesconfig, allmodconfig
      Tested: https://android-review.googlesource.com/253302Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2d118a1
    • L
      net: core: add UID to flows, rules, and routes · 622ec2c9
      Lorenzo Colitti 提交于
      - Define a new FIB rule attributes, FRA_UID_RANGE, to describe a
        range of UIDs.
      - Define a RTA_UID attribute for per-UID route lookups and dumps.
      - Support passing these attributes to and from userspace via
        rtnetlink. The value INVALID_UID indicates no UID was
        specified.
      - Add a UID field to the flow structures.
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      622ec2c9
  3. 04 11月, 2016 3 次提交
    • W
      ipv4: add IP_RECVFRAGSIZE cmsg · 70ecc248
      Willem de Bruijn 提交于
      The IP stack records the largest fragment of a reassembled packet
      in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
      that arrived fragmented, expose the value to allow applications to
      estimate receive path MTU.
      
      Tested:
        Sent data over a veth pair of which the source has a small mtu.
        Sent data using netcat, received using a dedicated process.
      
        Verified that the cmsg IP_RECVFRAGSIZE is returned only when
        data arrives fragmented, and in that cases matches the veth mtu.
      
          ip link add veth0 type veth peer name veth1
      
          ip netns add from
          ip netns add to
      
          ip link set dev veth1 netns to
          ip netns exec to ip addr add dev veth1 192.168.10.1/24
          ip netns exec to ip link set dev veth1 up
      
          ip link set dev veth0 netns from
          ip netns exec from ip addr add dev veth0 192.168.10.2/24
          ip netns exec from ip link set dev veth0 up
          ip netns exec from ip link set dev veth0 mtu 1300
          ip netns exec from ethtool -K veth0 ufo off
      
          dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload
      
          ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
          ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload
      
        using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70ecc248
    • C
      net: ip, raw_diag -- Use jump for exiting from nested loop · 9999370f
      Cyrill Gorcunov 提交于
      I managed to miss that sk_for_each is called under "for"
      cycle so need to use goto here to return matching socket.
      
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: Andrey Vagin <avagin@openvz.org>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9999370f
    • C
      net: ip, raw_diag -- Fix socket leaking for destroy request · cd05a0ec
      Cyrill Gorcunov 提交于
      In raw_diag_destroy the helper raw_sock_get returns
      with sock_hold call, so we have to put it then.
      
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: Andrey Vagin <avagin@openvz.org>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd05a0ec
  4. 03 11月, 2016 2 次提交
    • C
      net: ip, diag -- Adjust raw_abort to use unlocked __udp_disconnect · 3de864f8
      Cyrill Gorcunov 提交于
      While being preparing patches for killing raw sockets via
      diag netlink interface I noticed that my runs are stuck:
      
       | [root@pcs7 ~]# cat /proc/`pidof ss`/stack
       | [<ffffffff816d1a76>] __lock_sock+0x80/0xc4
       | [<ffffffff816d206a>] lock_sock_nested+0x47/0x95
       | [<ffffffff8179ded6>] udp_disconnect+0x19/0x33
       | [<ffffffff8179b517>] raw_abort+0x33/0x42
       | [<ffffffff81702322>] sock_diag_destroy+0x4d/0x52
      
      which has not been the case before. I narrowed it down to the commit
      
       | commit 286c72de
       | Author: Eric Dumazet <edumazet@google.com>
       | Date:   Thu Oct 20 09:39:40 2016 -0700
       |
       |     udp: must lock the socket in udp_disconnect()
      
      where we start locking the socket for different reason.
      
      So the raw_abort escaped the renaming and we have to
      fix this typo using __udp_disconnect instead.
      
      Fixes: 286c72de ("udp: must lock the socket in udp_disconnect()")
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Andrey Vagin <avagin@openvz.org>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3de864f8
    • E
      tcp: enhance tcp collapsing · 2331ccc5
      Eric Dumazet 提交于
      As Ilya Lesokhin suggested, we can collapse two skbs at retransmit
      time even if the skb at the right has fragments.
      
      We simply have to use more generic skb_copy_bits() instead of
      skb_copy_from_linear_data() in tcp_collapse_retrans()
      
      Also need to guard this skb_copy_bits() in case there is nothing to
      copy, otherwise skb_put() could panic if left skb has frags.
      
      Tested:
      
      Used following packetdrill test
      
      // Establish a connection.
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      +.100 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
         +0 write(4, ..., 200) = 200
         +0 > P. 1:201(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 201:401(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 401:601(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 601:801(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 801:1001(200) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1001:1101(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1101:1201(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1201:1301(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1301:1401(100) ack 1
      
      +.100 < . 1:1(0) ack 1 win 257 <nop,nop,sack 1001:1401>
      // Check that TCP collapse works :
         +0 > P. 1:1001(1000) ack 1
      Reported-by: NIlya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2331ccc5
  5. 02 11月, 2016 2 次提交
    • P
      netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c · 8db4c5be
      Pablo Neira Ayuso 提交于
      We need this split to reuse existing codebase for the upcoming nf_tables
      socket expression.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      8db4c5be
    • F
      netfilter: nf_tables: add fib expression · f6d0cbcf
      Florian Westphal 提交于
      Add FIB expression, supported for ipv4, ipv6 and inet family (the latter
      just dispatches to ipv4 or ipv6 one based on nfproto).
      
      Currently supports fetching output interface index/name and the
      rtm_type associated with an address.
      
      This can be used for adding path filtering. rtm_type is useful
      to e.g. enforce a strong-end host model where packets
      are only accepted if daddr is configured on the interface the
      packet arrived on.
      
      The fib expression is a native nftables alternative to the
      xtables addrtype and rp_filter matches.
      
      FIB result order for oif/oifname retrieval is as follows:
       - if packet is local (skb has rtable, RTF_LOCAL set, this
         will also catch looped-back multicast packets), set oif to
         the loopback interface.
       - if fib lookup returns an error, or result points to local,
         store zero result.  This means '--local' option of -m rpfilter
         is not supported. It is possible to use 'fib type local' or add
         explicit saddr/daddr matching rules to create exceptions if this
         is really needed.
       - store result in the destination register.
         In case of multiple routes, search set for desired oif in case
         strict matching is requested.
      
      ipv4 and ipv6 behave fib expressions are supposed to behave the same.
      
      [ I have collapsed Arnd Bergmann's ("netfilter: nf_tables: fib warnings")
      
      	http://patchwork.ozlabs.org/patch/688615/
      
        to address fallout from this patch after rebasing nf-next, that was
        posted to address compilation warnings. --pablo ]
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      f6d0cbcf
  6. 01 11月, 2016 2 次提交
    • D
      net: Enable support for VRF with ipv4 multicast · e58e4159
      David Ahern 提交于
      Enable support for IPv4 multicast:
      - similar to unicast the flow struct is updated to L3 master device
        if relevant prior to calling fib_rules_lookup. The table id is saved
        to the lookup arg so the rule action for ipmr can return the table
        associated with the device.
      
      - ip_mr_forward needs to check for master device mismatch as well
        since the skb->dev is set to it
      
      - allow multicast address on VRF device for Rx by checking for the
        daddr in the VRF device as well as the original ingress device
      
      - on Tx need to drop to __mkroute_output when FIB lookup fails for
        multicast destination address.
      
      - if CONFIG_IP_MROUTE_MULTIPLE_TABLES is enabled VRF driver creates
        IPMR FIB rules on first device create similar to FIB rules. In
        addition the VRF driver does not divert IPv4 multicast packets:
        it breaks on Tx since the fib lookup fails on the mcast address.
      
      With this patch, ipmr forwarding and local rx/tx work.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e58e4159
    • N
  7. 30 10月, 2016 3 次提交
  8. 28 10月, 2016 3 次提交
    • J
      genetlink: mark families as __ro_after_init · 56989f6d
      Johannes Berg 提交于
      Now genl_register_family() is the only thing (other than the
      users themselves, perhaps, but I didn't find any doing that)
      writing to the family struct.
      
      In all families that I found, genl_register_family() is only
      called from __init functions (some indirectly, in which case
      I've add __init annotations to clarifly things), so all can
      actually be marked __ro_after_init.
      
      This protects the data structure from accidental corruption.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      56989f6d
    • J
      genetlink: statically initialize families · 489111e5
      Johannes Berg 提交于
      Instead of providing macros/inline functions to initialize
      the families, make all users initialize them statically and
      get rid of the macros.
      
      This reduces the kernel code size by about 1.6k on x86-64
      (with allyesconfig).
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      489111e5
    • J
      genetlink: no longer support using static family IDs · a07ea4d9
      Johannes Berg 提交于
      Static family IDs have never really been used, the only
      use case was the workaround I introduced for those users
      that assumed their family ID was also their multicast
      group ID.
      
      Additionally, because static family IDs would never be
      reserved by the generic netlink code, using a relatively
      low ID would only work for built-in families that can be
      registered immediately after generic netlink is started,
      which is basically only the control family (apart from
      the workaround code, which I also had to add code for so
      it would reserve those IDs)
      
      Thus, anything other than GENL_ID_GENERATE is flawed and
      luckily not used except in the cases I mentioned. Move
      those workarounds into a few lines of code, and then get
      rid of GENL_ID_GENERATE entirely, making it more robust.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a07ea4d9
  9. 27 10月, 2016 1 次提交
    • E
      udp: fix IP_CHECKSUM handling · 10df8e61
      Eric Dumazet 提交于
      First bug was added in commit ad6f939a ("ip: Add offset parameter to
      ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
      AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
      ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
      
      Then commit e6afc8ac ("udp: remove headers from UDP packets before
      queueing") forgot to adjust the offsets now UDP headers are pulled
      before skb are put in receive queue.
      
      Fixes: ad6f939a ("ip: Add offset parameter to ip_cmsg_recv")
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Sam Kumar <samanthakumar@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10df8e61
  10. 24 10月, 2016 1 次提交
    • C
      net: ip, diag -- Add diag interface for raw sockets · 432490f9
      Cyrill Gorcunov 提交于
      In criu we are actively using diag interface to collect sockets
      present in the system when dumping applications. And while for
      unix, tcp, udp[lite], packet, netlink it works as expected,
      the raw sockets do not have. Thus add it.
      
      v2:
       - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
       - implement @destroy for diag requests (by dsa@)
      
      v3:
       - add export of raw_abort for IPv6 (by dsa@)
       - pass net-admin flag into inet_sk_diag_fill due to
         changes in net-next branch (by dsa@)
      
      v4:
       - use @pad in struct inet_diag_req_v2 for raw socket
         protocol specification: raw module carries sockets
         which may have custom protocol passed from socket()
         syscall and sole @sdiag_protocol is not enough to
         match underlied ones
       - start reporting protocol specifed in socket() call
         when sockets are raw ones for the same reason: user
         space tools like ss may parse this attribute and use
         it for socket matching
      
      v5 (by eric.dumazet@):
       - use sock_hold in raw_sock_get instead of atomic_inc,
         we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
         when looking up so counter won't be zero here.
      
      v6:
       - use sdiag_raw_protocol() helper which will access @pad
         structure used for raw sockets protocol specification:
         we can't simply rename this member without breaking uapi
      
      v7:
       - sine sdiag_raw_protocol() helper is not suitable for
         uapi lets rather make an alias structure with proper
         names. __check_inet_diag_req_raw helper will catch
         if any of structure unintentionally changed.
      
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Andrey Vagin <avagin@openvz.org>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      432490f9
  11. 23 10月, 2016 3 次提交
    • P
      udp: use it's own memory accounting schema · 850cbadd
      Paolo Abeni 提交于
      Completely avoid default sock memory accounting and replace it
      with udp-specific accounting.
      
      Since the new memory accounting model encapsulates completely
      the required locking, remove the socket lock on both enqueue and
      dequeue, and avoid using the backlog on enqueue.
      
      Be sure to clean-up rx queue memory on socket destruction, using
      udp its own sk_destruct.
      
      Tested using pktgen with random src port, 64 bytes packet,
      wire-speed on a 10G link as sender and udp_sink as the receiver,
      using an l4 tuple rxhash to stress the contention, and one or more
      udp_sink instances with reuseport.
      
      nr readers      Kpps (vanilla)  Kpps (patched)
      1               170             440
      3               1250            2150
      6               3000            3650
      9               4200            4450
      12              5700            6250
      
      v4 -> v5:
        - avoid unneeded test in first_packet_length
      
      v3 -> v4:
        - remove useless sk_rcvqueues_full() call
      
      v2 -> v3:
        - do not set the now unsed backlog_rcv callback
      
      v1 -> v2:
        - add memory pressure support
        - fixed dropwatch accounting for ipv6
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      850cbadd
    • P
      udp: implement memory accounting helpers · f970bd9e
      Paolo Abeni 提交于
      Avoid using the generic helpers.
      Use the receive queue spin lock to protect the memory
      accounting operation, both on enqueue and on dequeue.
      
      On dequeue perform partial memory reclaiming, trying to
      leave a quantum of forward allocated memory.
      
      On enqueue use a custom helper, to allow some optimizations:
      - use a plain spin_lock() variant instead of the slightly
        costly spin_lock_irqsave(),
      - avoid dst_force check, since the calling code has already
        dropped the skb dst
      - avoid orphaning the skb, since skb_steal_sock() already did
        the work for us
      
      The above needs custom memory reclaiming on shutdown, provided
      by the udp_destruct_sock().
      
      v5 -> v6:
        - don't orphan the skb on enqueue
      
      v4 -> v5:
        - replace the mem_lock with the receive queue spin lock
        - ensure that the bh is always allowed to enqueue at least
          a skb, even if sk_rcvbuf is exceeded
      
      v3 -> v4:
        - reworked memory accunting, simplifying the schema
        - provide an helper for both memory scheduling and enqueuing
      
      v1 -> v2:
        - use a udp specific destrctor to perform memory reclaiming
        - remove a couple of helpers, unneeded after the above cleanup
        - do not reclaim memory on dequeue if not under memory
          pressure
        - reworked the fwd accounting schema to avoid potential
          integer overflow
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f970bd9e
    • W
      ipv4: use the right lock for ping_group_range · 396a30cc
      WANG Cong 提交于
      This reverts commit a681574c
      ("ipv4: disable BH in set_ping_group_range()") because we never
      read ping_group_range in BH context (unlike local_port_range).
      
      Then, since we already have a lock for ping_group_range, those
      using ip_local_ports.lock for ping_group_range are clearly typos.
      
      We might consider to share a same lock for both ping_group_range
      and local_port_range w.r.t. space saving, but that should be for
      net-next.
      
      Fixes: a681574c ("ipv4: disable BH in set_ping_group_range()")
      Fixes: ba6b918a ("ping: move ping_group_range out of CONFIG_SYSCTL")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Eric Salo <salo@google.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      396a30cc
  12. 21 10月, 2016 4 次提交
    • J
      ipv4/6: use core net MTU range checking · b96f9afe
      Jarod Wilson 提交于
      ipv4/ip_tunnel:
      - min_mtu = 68, max_mtu = 0xFFF8 - dev->hard_header_len - t_hlen
      - preserve all ndo_change_mtu checks for now to prevent regressions
      
      ipv6/ip6_tunnel:
      - min_mtu = 68, max_mtu = 0xFFF8 - dev->hard_header_len
      - preserve all ndo_change_mtu checks for now to prevent regressions
      
      ipv6/ip6_vti:
      - min_mtu = 1280, max_mtu = 65535
      - remove redundant vti6_change_mtu
      
      ipv6/sit:
      - min_mtu = 1280, max_mtu = 0xFFF8 - t_hlen
      - remove redundant ipip6_tunnel_change_mtu
      
      CC: netdev@vger.kernel.org
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NJarod Wilson <jarod@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b96f9afe
    • E
      ipv4: disable BH in set_ping_group_range() · a681574c
      Eric Dumazet 提交于
      In commit 4ee3bd4a ("ipv4: disable BH when changing ip local port
      range") Cong added BH protection in set_local_port_range() but missed
      that same fix was needed in set_ping_group_range()
      
      Fixes: b8f1a556 ("udp: Add function to make source port for UDP tunnels")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NEric Salo <salo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a681574c
    • E
      udp: must lock the socket in udp_disconnect() · 286c72de
      Eric Dumazet 提交于
      Baozeng Ding reported KASAN traces showing uses after free in
      udp_lib_get_port() and other related UDP functions.
      
      A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash.
      
      I could write a reproducer with two threads doing :
      
      static int sock_fd;
      static void *thr1(void *arg)
      {
      	for (;;) {
      		connect(sock_fd, (const struct sockaddr *)arg,
      			sizeof(struct sockaddr_in));
      	}
      }
      
      static void *thr2(void *arg)
      {
      	struct sockaddr_in unspec;
      
      	for (;;) {
      		memset(&unspec, 0, sizeof(unspec));
      	        connect(sock_fd, (const struct sockaddr *)&unspec,
      			sizeof(unspec));
              }
      }
      
      Problem is that udp_disconnect() could run without holding socket lock,
      and this was causing list corruptions.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NBaozeng Ding <sploving1@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      286c72de
    • S
      net: add recursion limit to GRO · fcd91dd4
      Sabrina Dubroca 提交于
      Currently, GRO can do unlimited recursion through the gro_receive
      handlers.  This was fixed for tunneling protocols by limiting tunnel GRO
      to one level with encap_mark, but both VLAN and TEB still have this
      problem.  Thus, the kernel is vulnerable to a stack overflow, if we
      receive a packet composed entirely of VLAN headers.
      
      This patch adds a recursion counter to the GRO layer to prevent stack
      overflow.  When a gro_receive function hits the recursion limit, GRO is
      aborted for this skb and it is processed normally.  This recursion
      counter is put in the GRO CB, but could be turned into a percpu counter
      if we run out of space in the CB.
      
      Thanks to Vladimír Beneš <vbenes@redhat.com> for the initial bug report.
      
      Fixes: CVE-2016-7039
      Fixes: 9b174d88 ("net: Add Transparent Ethernet Bridging GRO support.")
      Fixes: 66e5133f ("vlan: Add GRO support for non hardware accelerated vlan")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NJiri Benc <jbenc@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fcd91dd4
  13. 20 10月, 2016 1 次提交
    • E
      tcp: relax listening_hash operations · 9652dc2e
      Eric Dumazet 提交于
      softirq handlers use RCU protection to lookup listeners,
      and write operations all happen from process context.
      We do not need to block BH for dump operations.
      
      Also SYN_RECV since request sockets are stored in the ehash table :
      
       1) inet_diag_dump_icsk() no longer need to clear
          cb->args[3] and cb->args[4] that were used as cursors while
          iterating the old per listener hash table.
      
       2) Also factorize a test : No need to scan listening_hash[]
          if r->id.idiag_dport is not zero.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9652dc2e
  14. 19 10月, 2016 1 次提交
  15. 18 10月, 2016 2 次提交
  16. 17 10月, 2016 1 次提交
    • D
      net: Require exact match for TCP socket lookups if dif is l3mdev · a04a480d
      David Ahern 提交于
      Currently, socket lookups for l3mdev (vrf) use cases can match a socket
      that is bound to a port but not a device (ie., a global socket). If the
      sysctl tcp_l3mdev_accept is not set this leads to ack packets going out
      based on the main table even though the packet came in from an L3 domain.
      The end result is that the connection does not establish creating
      confusion for users since the service is running and a socket shows in
      ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
      skb came through an interface enslaved to an l3mdev device and the
      tcp_l3mdev_accept is not set.
      
      skb's through an l3mdev interface are marked by setting a flag in
      inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
      flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
      flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
      inet_skb_parm struct is moved in the cb per commit 971f10ec, so the
      match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
      move is done after the socket lookup, so IP6CB is used.
      
      The flags field in inet_skb_parm struct needs to be increased to add
      another flag. There is currently a 1-byte hole following the flags,
      so it can be expanded to u16 without increasing the size of the struct.
      
      Fixes: 193125db ("net: Introduce VRF device driver")
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a04a480d
  17. 14 10月, 2016 1 次提交
  18. 08 10月, 2016 2 次提交
  19. 04 10月, 2016 1 次提交
    • A
      skb_splice_bits(): get rid of callback · 25869262
      Al Viro 提交于
      since pipe_lock is the outermost now, we don't need to drop/regain
      socket locks around the call of splice_to_pipe() from skb_splice_bits(),
      which kills the need to have a socket-specific callback; we can just
      call splice_to_pipe() and be done with that.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      25869262
  20. 30 9月, 2016 2 次提交
  21. 28 9月, 2016 2 次提交