1. 31 12月, 2016 1 次提交
    • D
      net: Allow IP_MULTICAST_IF to set index to L3 slave · 7bb387c5
      David Ahern 提交于
      IP_MULTICAST_IF fails if sk_bound_dev_if is already set and the new index
      does not match it. e.g.,
      
          ntpd[15381]: setsockopt IP_MULTICAST_IF 192.168.1.23 fails: Invalid argument
      
      Relax the check in setsockopt to allow setting mc_index to an L3 slave if
      sk_bound_dev_if points to an L3 master.
      
      Make a similar change for IPv6. In this case change the device lookup to
      take the rcu_read_lock avoiding a refcnt. The rcu lock is also needed for
      the lookup of a potential L3 master device.
      
      This really only silences a setsockopt failure since uses of mc_index are
      secondary to sk_bound_dev_if if it is set. In both cases, if either index
      is an L3 slave or master, lookups are directed to the same FIB table so
      relaxing the check at setsockopt time causes no harm.
      
      Patch is based on a suggested change by Darwin for a problem noted in
      their code base.
      Suggested-by: NDarwin Dingel <darwin.dingel@alliedtelesis.co.nz>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7bb387c5
  2. 25 12月, 2016 1 次提交
  3. 24 12月, 2016 1 次提交
  4. 08 11月, 2016 1 次提交
  5. 04 11月, 2016 1 次提交
    • W
      ipv4: add IP_RECVFRAGSIZE cmsg · 70ecc248
      Willem de Bruijn 提交于
      The IP stack records the largest fragment of a reassembled packet
      in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
      that arrived fragmented, expose the value to allow applications to
      estimate receive path MTU.
      
      Tested:
        Sent data over a veth pair of which the source has a small mtu.
        Sent data using netcat, received using a dedicated process.
      
        Verified that the cmsg IP_RECVFRAGSIZE is returned only when
        data arrives fragmented, and in that cases matches the veth mtu.
      
          ip link add veth0 type veth peer name veth1
      
          ip netns add from
          ip netns add to
      
          ip link set dev veth1 netns to
          ip netns exec to ip addr add dev veth1 192.168.10.1/24
          ip netns exec to ip link set dev veth1 up
      
          ip link set dev veth0 netns from
          ip netns exec from ip addr add dev veth0 192.168.10.2/24
          ip netns exec from ip link set dev veth0 up
          ip netns exec from ip link set dev veth0 mtu 1300
          ip netns exec from ethtool -K veth0 ufo off
      
          dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload
      
          ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
          ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload
      
        using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70ecc248
  6. 27 10月, 2016 1 次提交
    • E
      udp: fix IP_CHECKSUM handling · 10df8e61
      Eric Dumazet 提交于
      First bug was added in commit ad6f939a ("ip: Add offset parameter to
      ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
      AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
      ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
      
      Then commit e6afc8ac ("udp: remove headers from UDP packets before
      queueing") forgot to adjust the offsets now UDP headers are pulled
      before skb are put in receive queue.
      
      Fixes: ad6f939a ("ip: Add offset parameter to ip_cmsg_recv")
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Sam Kumar <samanthakumar@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10df8e61
  7. 09 9月, 2016 1 次提交
  8. 17 5月, 2016 1 次提交
  9. 12 5月, 2016 1 次提交
    • D
      net: original ingress device index in PKTINFO · 0b922b7a
      David Ahern 提交于
      Applications such as OSPF and BFD need the original ingress device not
      the VRF device; the latter can be derived from the former. To that end
      add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
      the skb control buffer similar to IPv6. From there the pktinfo can just
      pull it from cb with the PKTINFO_SKB_CB cast.
      
      The previous patch moving the skb->dev change to L3 means nothing else
      is needed for IPv6; it just works.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0b922b7a
  10. 26 4月, 2016 1 次提交
  11. 14 4月, 2016 1 次提交
  12. 08 4月, 2016 1 次提交
  13. 05 4月, 2016 1 次提交
  14. 17 2月, 2016 1 次提交
  15. 13 2月, 2016 1 次提交
  16. 11 2月, 2016 1 次提交
  17. 05 11月, 2015 1 次提交
  18. 24 6月, 2015 1 次提交
  19. 07 6月, 2015 1 次提交
    • E
      inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations · 90c337da
      Eric Dumazet 提交于
      When an application needs to force a source IP on an active TCP socket
      it has to use bind(IP, port=x).
      
      As most applications do not want to deal with already used ports, x is
      often set to 0, meaning the kernel is in charge to find an available
      port.
      But kernel does not know yet if this socket is going to be a listener or
      be connected.
      It has very limited choices (no full knowledge of final 4-tuple for a
      connect())
      
      With limited ephemeral port range (about 32K ports), it is very easy to
      fill the space.
      
      This patch adds a new SOL_IP socket option, asking kernel to ignore
      the 0 port provided by application in bind(IP, port=0) and only
      remember the given IP address.
      
      The port will be automatically chosen at connect() time, in a way
      that allows sharing a source port as long as the 4-tuples are unique.
      
      This new feature is available for both IPv4 and IPv6 (Thanks Neal)
      
      Tested:
      
      Wrote a test program and checked its behavior on IPv4 and IPv6.
      
      strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
      connect().
      Also getsockname() show that the port is still 0 right after bind()
      but properly allocated after connect().
      
      socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
      setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
      bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
      getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
      connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
      getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
      
      IPv6 test :
      
      socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
      setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
      bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
      getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
      connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
      getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
      
      I was able to bind()/connect() a million concurrent IPv4 sockets,
      instead of ~32000 before patch.
      
      lpaa23:~# ulimit -n 1000010
      lpaa23:~# ./bind --connect --num-flows=1000000 &
      1000000 sockets
      
      lpaa23:~# grep TCP /proc/net/sockstat
      TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66
      
      Check that a given source port is indeed used by many different
      connections :
      
      lpaa23:~# ss -t src :40000 | head -10
      State      Recv-Q Send-Q   Local Address:Port          Peer Address:Port
      ESTAB      0      0           127.0.0.2:40000         127.0.202.33:44983
      ESTAB      0      0           127.0.0.2:40000         127.2.27.240:44983
      ESTAB      0      0           127.0.0.2:40000           127.2.98.5:44983
      ESTAB      0      0           127.0.0.2:40000        127.0.124.196:44983
      ESTAB      0      0           127.0.0.2:40000         127.2.139.38:44983
      ESTAB      0      0           127.0.0.2:40000          127.1.59.80:44983
      ESTAB      0      0           127.0.0.2:40000          127.3.6.228:44983
      ESTAB      0      0           127.0.0.2:40000          127.0.38.53:44983
      ESTAB      0      0           127.0.0.2:40000         127.1.197.10:44983
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      90c337da
  20. 04 4月, 2015 2 次提交
  21. 19 3月, 2015 2 次提交
  22. 09 3月, 2015 1 次提交
    • W
      ip: fix error queue empty skb handling · c247f053
      Willem de Bruijn 提交于
      When reading from the error queue, msg_name and msg_control are only
      populated for some errors. A new exception for empty timestamp skbs
      added a false positive on icmp errors without payload.
      
      `traceroute -M udpconn` only displayed gateways that return payload
      with the icmp error: the embedded network headers are pulled before
      sock_queue_err_skb, leaving an skb with skb->len == 0 otherwise.
      
      Fix this regression by refining when msg_name and msg_control
      branches are taken. The solutions for the two fields are independent.
      
      msg_name only makes sense for errors that configure serr->port and
      serr->addr_offset. Test the first instead of skb->len. This also fixes
      another issue. saddr could hold the wrong data, as serr->addr_offset
      is not initialized  in some code paths, pointing to the start of the
      network header. It is only valid when serr->port is set (non-zero).
      
      msg_control support differs between IPv4 and IPv6. IPv4 only honors
      requests for ICMP and timestamps with SOF_TIMESTAMPING_OPT_CMSG. The
      skb->len test can simply be removed, because skb->dev is also tested
      and never true for empty skbs. IPv6 honors requests for all errors
      aside from local errors and timestamps on empty skbs.
      
      In both cases, make the policy more explicit by moving this logic to
      a new function that decides whether to process msg_control and that
      optionally prepares the necessary fields in skb->cb[]. After this
      change, the IPv4 and IPv6 paths are more similar.
      
      The last case is rxrpc. Here, simply refine to only match timestamps.
      
      Fixes: 49ca0d8b ("net-timestamp: no-payload option")
      Reported-by: NJan Niehusmann <jan@gondor.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        v1->v2
        - fix local origin test inversion in ip6_datagram_support_cmsg
        - make v4 and v6 code paths more similar by introducing analogous
          ipv4_datagram_support_cmsg
        - fix compile bug in rxrpc
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c247f053
  23. 03 2月, 2015 1 次提交
    • W
      net-timestamp: no-payload option · 49ca0d8b
      Willem de Bruijn 提交于
      Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
      timestamps, this loops timestamps on top of empty packets.
      
      Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
      cmsg reception (aside from timestamps) are no longer possible. This
      works together with a follow on patch that allows administrators to
      only allow tx timestamping if it does not loop payload or metadata.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes (rfc -> v1)
        - add documentation
        - remove unnecessary skb->len test (thanks to Richard Cochran)
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49ca0d8b
  24. 16 1月, 2015 1 次提交
  25. 06 1月, 2015 3 次提交
  26. 11 12月, 2014 1 次提交
  27. 09 12月, 2014 2 次提交
    • W
      net-timestamp: allow reading recv cmsg on errqueue with origin tstamp · 829ae9d6
      Willem de Bruijn 提交于
      Allow reading of timestamps and cmsg at the same time on all relevant
      socket families. One use is to correlate timestamps with egress
      device, by asking for cmsg IP_PKTINFO.
      
      on AF_INET sockets, call the relevant function (ip_cmsg_recv). To
      avoid changing legacy expectations, only do so if the caller sets a
      new timestamping flag SOF_TIMESTAMPING_OPT_CMSG.
      
      on AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already
      returned for all origins. only change is to set ifindex, which is
      not initialized for all error origins.
      
      In both cases, only generate the pktinfo message if an ifindex is
      known. This is not the case for ACK timestamps.
      
      The difference between the protocol families is probably a historical
      accident as a result of the different conditions for generating cmsg
      in the relevant ip(v6)_recv_error function:
      
      ipv4:        if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
      ipv6:        if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {
      
      At one time, this was the same test bar for the ICMP/ICMP6
      distinction. This is no longer true.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        v1 -> v2
          large rewrite
          - integrate with existing pktinfo cmsg generation code
          - on ipv4: only send with new flag, to maintain legacy behavior
          - on ipv6: send at most a single pktinfo cmsg
          - on ipv6: initialize fields if not yet initialized
      
      The recv cmsg interfaces are also relevant to the discussion of
      whether looping packet headers is problematic. For v6, cmsgs that
      identify many headers are already returned. This patch expands
      that to v4. If it sounds reasonable, I will follow with patches
      
      1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY
         (http://patchwork.ozlabs.org/patch/366967/)
      2. sysctl to conditionally drop all timestamps that have payload or
         cmsg from users without CAP_NET_RAW.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      829ae9d6
    • W
      ipv4: warn once on passing AF_INET6 socket to ip_recv_error · 7ce875e5
      Willem de Bruijn 提交于
      One line change, in response to catching an occurrence of this bug.
      See also fix f4713a3d ("net-timestamp: make tcp_recvmsg call ...")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ce875e5
  28. 12 11月, 2014 1 次提交
  29. 06 11月, 2014 1 次提交
    • D
      net: Add and use skb_copy_datagram_msg() helper. · 51f3d02b
      David S. Miller 提交于
      This encapsulates all of the skb_copy_datagram_iovec() callers
      with call argument signature "skb, offset, msghdr->msg_iov, length".
      
      When we move to iov_iters in the networking, the iov_iter object will
      sit in the msghdr.
      
      Having a helper like this means there will be less places to touch
      during that transformation.
      
      Based upon descriptions and patch from Al Viro.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51f3d02b
  30. 10 9月, 2014 1 次提交
  31. 02 9月, 2014 1 次提交
    • W
      sock: deduplicate errqueue dequeue · 364a9e93
      Willem de Bruijn 提交于
      sk->sk_error_queue is dequeued in four locations. All share the
      exact same logic. Deduplicate.
      
      Also collapse the two critical sections for dequeue (at the top of
      the recv handler) and signal (at the bottom).
      
      This moves signal generation for the next packet forward, which should
      be harmless.
      
      It also changes the behavior if the recv handler exits early with an
      error. Previously, a signal for follow-up packets on the errqueue
      would then not be scheduled. The new behavior, to always signal, is
      arguably a bug fix.
      
      For rxrpc, the change causes the same function to be called repeatedly
      for each queued packet (because the recv handler == sk_error_report).
      It is likely that all packets will fail for the same reason (e.g.,
      memory exhaustion).
      
      This code runs without sk_lock held, so it is not safe to trust that
      sk->sk_err is immutable inbetween releasing q->lock and the subsequent
      test. Introduce int err just to avoid this potential race.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      364a9e93
  32. 30 7月, 2014 1 次提交
  33. 27 2月, 2014 1 次提交
    • H
      ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMIT · 1b346576
      Hannes Frederic Sowa 提交于
      IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
      generation of fragments if the interface mtu is exceeded, it is very
      hard to make use of this option in already deployed name server software
      for which I introduced this option.
      
      This patch adds yet another new IP_MTU_DISCOVER option to not honor any
      path mtu information and not accepting new icmp notifications destined for
      the socket this option is enabled on. But we allow outgoing fragmentation
      in case the packet size exceeds the outgoing interface mtu.
      
      As such this new option can be used as a drop-in replacement for
      IP_PMTUDISC_DONT, which is currently in use by most name server software
      making the adoption of this option very smooth and easy.
      
      The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
      ignoring incoming path MTU updates and not honoring discovered path MTUs
      in the output path.
      
      Fixes: 482fc609 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b346576
  34. 20 2月, 2014 1 次提交
  35. 20 1月, 2014 1 次提交
    • H
      ipv6: make IPV6_RECVPKTINFO work for ipv4 datagrams · 4b261c75
      Hannes Frederic Sowa 提交于
      We currently don't report IPV6_RECVPKTINFO in cmsg access ancillary data
      for IPv4 datagrams on IPv6 sockets.
      
      This patch splits the ip6_datagram_recv_ctl into two functions, one
      which handles both protocol families, AF_INET and AF_INET6, while the
      ip6_datagram_recv_specific_ctl only handles IPv6 cmsg data.
      
      ip6_datagram_recv_*_ctl never reported back any errors, so we can make
      them return void. Also provide a helper for protocols which don't offer dual
      personality to further use ip6_datagram_recv_ctl, which is exported to
      modules.
      
      I needed to shuffle the code for ping around a bit to make it easier to
      implement dual personality for ping ipv6 sockets in future.
      Reported-by: NGert Doering <gert@space.net>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b261c75