1. 22 2月, 2017 1 次提交
  2. 05 2月, 2017 1 次提交
  3. 05 1月, 2017 1 次提交
  4. 31 12月, 2016 1 次提交
    • D
      net: Allow IP_MULTICAST_IF to set index to L3 slave · 7bb387c5
      David Ahern 提交于
      IP_MULTICAST_IF fails if sk_bound_dev_if is already set and the new index
      does not match it. e.g.,
      
          ntpd[15381]: setsockopt IP_MULTICAST_IF 192.168.1.23 fails: Invalid argument
      
      Relax the check in setsockopt to allow setting mc_index to an L3 slave if
      sk_bound_dev_if points to an L3 master.
      
      Make a similar change for IPv6. In this case change the device lookup to
      take the rcu_read_lock avoiding a refcnt. The rcu lock is also needed for
      the lookup of a potential L3 master device.
      
      This really only silences a setsockopt failure since uses of mc_index are
      secondary to sk_bound_dev_if if it is set. In both cases, if either index
      is an L3 slave or master, lookups are directed to the same FIB table so
      relaxing the check at setsockopt time causes no harm.
      
      Patch is based on a suggested change by Darwin for a problem noted in
      their code base.
      Suggested-by: NDarwin Dingel <darwin.dingel@alliedtelesis.co.nz>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7bb387c5
  5. 30 12月, 2016 1 次提交
    • W
      net: fix incorrect original ingress device index in PKTINFO · f0c16ba8
      Wei Zhang 提交于
      When we send a packet for our own local address on a non-loopback
      interface (e.g. eth0), due to the change had been introduced from
      commit 0b922b7a ("net: original ingress device index in PKTINFO"), the
      original ingress device index would be set as the loopback interface.
      However, the packet should be considered as if it is being arrived via the
      sending interface (eth0), otherwise it would break the expectation of the
      userspace application (e.g. the DHCPRELEASE message from dhcp_release
      binary would be ignored by the dnsmasq daemon, since it come from lo which
      is not the interface dnsmasq bind to)
      
      Fixes: 0b922b7a ("net: original ingress device index in PKTINFO")
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NWei Zhang <asuka.com@163.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f0c16ba8
  6. 25 12月, 2016 1 次提交
  7. 24 12月, 2016 1 次提交
  8. 08 11月, 2016 1 次提交
  9. 04 11月, 2016 1 次提交
    • W
      ipv4: add IP_RECVFRAGSIZE cmsg · 70ecc248
      Willem de Bruijn 提交于
      The IP stack records the largest fragment of a reassembled packet
      in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
      that arrived fragmented, expose the value to allow applications to
      estimate receive path MTU.
      
      Tested:
        Sent data over a veth pair of which the source has a small mtu.
        Sent data using netcat, received using a dedicated process.
      
        Verified that the cmsg IP_RECVFRAGSIZE is returned only when
        data arrives fragmented, and in that cases matches the veth mtu.
      
          ip link add veth0 type veth peer name veth1
      
          ip netns add from
          ip netns add to
      
          ip link set dev veth1 netns to
          ip netns exec to ip addr add dev veth1 192.168.10.1/24
          ip netns exec to ip link set dev veth1 up
      
          ip link set dev veth0 netns from
          ip netns exec from ip addr add dev veth0 192.168.10.2/24
          ip netns exec from ip link set dev veth0 up
          ip netns exec from ip link set dev veth0 mtu 1300
          ip netns exec from ethtool -K veth0 ufo off
      
          dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload
      
          ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
          ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload
      
        using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70ecc248
  10. 27 10月, 2016 1 次提交
    • E
      udp: fix IP_CHECKSUM handling · 10df8e61
      Eric Dumazet 提交于
      First bug was added in commit ad6f939a ("ip: Add offset parameter to
      ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
      AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
      ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
      
      Then commit e6afc8ac ("udp: remove headers from UDP packets before
      queueing") forgot to adjust the offsets now UDP headers are pulled
      before skb are put in receive queue.
      
      Fixes: ad6f939a ("ip: Add offset parameter to ip_cmsg_recv")
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Sam Kumar <samanthakumar@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10df8e61
  11. 09 9月, 2016 1 次提交
  12. 17 5月, 2016 1 次提交
  13. 12 5月, 2016 1 次提交
    • D
      net: original ingress device index in PKTINFO · 0b922b7a
      David Ahern 提交于
      Applications such as OSPF and BFD need the original ingress device not
      the VRF device; the latter can be derived from the former. To that end
      add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
      the skb control buffer similar to IPv6. From there the pktinfo can just
      pull it from cb with the PKTINFO_SKB_CB cast.
      
      The previous patch moving the skb->dev change to L3 means nothing else
      is needed for IPv6; it just works.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0b922b7a
  14. 26 4月, 2016 1 次提交
  15. 14 4月, 2016 1 次提交
  16. 08 4月, 2016 1 次提交
  17. 05 4月, 2016 1 次提交
  18. 17 2月, 2016 1 次提交
  19. 13 2月, 2016 1 次提交
  20. 11 2月, 2016 1 次提交
  21. 05 11月, 2015 1 次提交
  22. 24 6月, 2015 1 次提交
  23. 07 6月, 2015 1 次提交
    • E
      inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations · 90c337da
      Eric Dumazet 提交于
      When an application needs to force a source IP on an active TCP socket
      it has to use bind(IP, port=x).
      
      As most applications do not want to deal with already used ports, x is
      often set to 0, meaning the kernel is in charge to find an available
      port.
      But kernel does not know yet if this socket is going to be a listener or
      be connected.
      It has very limited choices (no full knowledge of final 4-tuple for a
      connect())
      
      With limited ephemeral port range (about 32K ports), it is very easy to
      fill the space.
      
      This patch adds a new SOL_IP socket option, asking kernel to ignore
      the 0 port provided by application in bind(IP, port=0) and only
      remember the given IP address.
      
      The port will be automatically chosen at connect() time, in a way
      that allows sharing a source port as long as the 4-tuples are unique.
      
      This new feature is available for both IPv4 and IPv6 (Thanks Neal)
      
      Tested:
      
      Wrote a test program and checked its behavior on IPv4 and IPv6.
      
      strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
      connect().
      Also getsockname() show that the port is still 0 right after bind()
      but properly allocated after connect().
      
      socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
      setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
      bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
      getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
      connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
      getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
      
      IPv6 test :
      
      socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
      setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
      bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
      getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
      connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
      getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
      
      I was able to bind()/connect() a million concurrent IPv4 sockets,
      instead of ~32000 before patch.
      
      lpaa23:~# ulimit -n 1000010
      lpaa23:~# ./bind --connect --num-flows=1000000 &
      1000000 sockets
      
      lpaa23:~# grep TCP /proc/net/sockstat
      TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66
      
      Check that a given source port is indeed used by many different
      connections :
      
      lpaa23:~# ss -t src :40000 | head -10
      State      Recv-Q Send-Q   Local Address:Port          Peer Address:Port
      ESTAB      0      0           127.0.0.2:40000         127.0.202.33:44983
      ESTAB      0      0           127.0.0.2:40000         127.2.27.240:44983
      ESTAB      0      0           127.0.0.2:40000           127.2.98.5:44983
      ESTAB      0      0           127.0.0.2:40000        127.0.124.196:44983
      ESTAB      0      0           127.0.0.2:40000         127.2.139.38:44983
      ESTAB      0      0           127.0.0.2:40000          127.1.59.80:44983
      ESTAB      0      0           127.0.0.2:40000          127.3.6.228:44983
      ESTAB      0      0           127.0.0.2:40000          127.0.38.53:44983
      ESTAB      0      0           127.0.0.2:40000         127.1.197.10:44983
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      90c337da
  24. 04 4月, 2015 2 次提交
  25. 19 3月, 2015 2 次提交
  26. 09 3月, 2015 1 次提交
    • W
      ip: fix error queue empty skb handling · c247f053
      Willem de Bruijn 提交于
      When reading from the error queue, msg_name and msg_control are only
      populated for some errors. A new exception for empty timestamp skbs
      added a false positive on icmp errors without payload.
      
      `traceroute -M udpconn` only displayed gateways that return payload
      with the icmp error: the embedded network headers are pulled before
      sock_queue_err_skb, leaving an skb with skb->len == 0 otherwise.
      
      Fix this regression by refining when msg_name and msg_control
      branches are taken. The solutions for the two fields are independent.
      
      msg_name only makes sense for errors that configure serr->port and
      serr->addr_offset. Test the first instead of skb->len. This also fixes
      another issue. saddr could hold the wrong data, as serr->addr_offset
      is not initialized  in some code paths, pointing to the start of the
      network header. It is only valid when serr->port is set (non-zero).
      
      msg_control support differs between IPv4 and IPv6. IPv4 only honors
      requests for ICMP and timestamps with SOF_TIMESTAMPING_OPT_CMSG. The
      skb->len test can simply be removed, because skb->dev is also tested
      and never true for empty skbs. IPv6 honors requests for all errors
      aside from local errors and timestamps on empty skbs.
      
      In both cases, make the policy more explicit by moving this logic to
      a new function that decides whether to process msg_control and that
      optionally prepares the necessary fields in skb->cb[]. After this
      change, the IPv4 and IPv6 paths are more similar.
      
      The last case is rxrpc. Here, simply refine to only match timestamps.
      
      Fixes: 49ca0d8b ("net-timestamp: no-payload option")
      Reported-by: NJan Niehusmann <jan@gondor.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        v1->v2
        - fix local origin test inversion in ip6_datagram_support_cmsg
        - make v4 and v6 code paths more similar by introducing analogous
          ipv4_datagram_support_cmsg
        - fix compile bug in rxrpc
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c247f053
  27. 03 2月, 2015 1 次提交
    • W
      net-timestamp: no-payload option · 49ca0d8b
      Willem de Bruijn 提交于
      Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
      timestamps, this loops timestamps on top of empty packets.
      
      Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
      cmsg reception (aside from timestamps) are no longer possible. This
      works together with a follow on patch that allows administrators to
      only allow tx timestamping if it does not loop payload or metadata.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes (rfc -> v1)
        - add documentation
        - remove unnecessary skb->len test (thanks to Richard Cochran)
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49ca0d8b
  28. 16 1月, 2015 1 次提交
  29. 06 1月, 2015 3 次提交
  30. 11 12月, 2014 1 次提交
  31. 09 12月, 2014 2 次提交
    • W
      net-timestamp: allow reading recv cmsg on errqueue with origin tstamp · 829ae9d6
      Willem de Bruijn 提交于
      Allow reading of timestamps and cmsg at the same time on all relevant
      socket families. One use is to correlate timestamps with egress
      device, by asking for cmsg IP_PKTINFO.
      
      on AF_INET sockets, call the relevant function (ip_cmsg_recv). To
      avoid changing legacy expectations, only do so if the caller sets a
      new timestamping flag SOF_TIMESTAMPING_OPT_CMSG.
      
      on AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already
      returned for all origins. only change is to set ifindex, which is
      not initialized for all error origins.
      
      In both cases, only generate the pktinfo message if an ifindex is
      known. This is not the case for ACK timestamps.
      
      The difference between the protocol families is probably a historical
      accident as a result of the different conditions for generating cmsg
      in the relevant ip(v6)_recv_error function:
      
      ipv4:        if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
      ipv6:        if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {
      
      At one time, this was the same test bar for the ICMP/ICMP6
      distinction. This is no longer true.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        v1 -> v2
          large rewrite
          - integrate with existing pktinfo cmsg generation code
          - on ipv4: only send with new flag, to maintain legacy behavior
          - on ipv6: send at most a single pktinfo cmsg
          - on ipv6: initialize fields if not yet initialized
      
      The recv cmsg interfaces are also relevant to the discussion of
      whether looping packet headers is problematic. For v6, cmsgs that
      identify many headers are already returned. This patch expands
      that to v4. If it sounds reasonable, I will follow with patches
      
      1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY
         (http://patchwork.ozlabs.org/patch/366967/)
      2. sysctl to conditionally drop all timestamps that have payload or
         cmsg from users without CAP_NET_RAW.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      829ae9d6
    • W
      ipv4: warn once on passing AF_INET6 socket to ip_recv_error · 7ce875e5
      Willem de Bruijn 提交于
      One line change, in response to catching an occurrence of this bug.
      See also fix f4713a3d ("net-timestamp: make tcp_recvmsg call ...")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ce875e5
  32. 12 11月, 2014 1 次提交
  33. 06 11月, 2014 1 次提交
    • D
      net: Add and use skb_copy_datagram_msg() helper. · 51f3d02b
      David S. Miller 提交于
      This encapsulates all of the skb_copy_datagram_iovec() callers
      with call argument signature "skb, offset, msghdr->msg_iov, length".
      
      When we move to iov_iters in the networking, the iov_iter object will
      sit in the msghdr.
      
      Having a helper like this means there will be less places to touch
      during that transformation.
      
      Based upon descriptions and patch from Al Viro.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51f3d02b
  34. 10 9月, 2014 1 次提交
  35. 02 9月, 2014 1 次提交
    • W
      sock: deduplicate errqueue dequeue · 364a9e93
      Willem de Bruijn 提交于
      sk->sk_error_queue is dequeued in four locations. All share the
      exact same logic. Deduplicate.
      
      Also collapse the two critical sections for dequeue (at the top of
      the recv handler) and signal (at the bottom).
      
      This moves signal generation for the next packet forward, which should
      be harmless.
      
      It also changes the behavior if the recv handler exits early with an
      error. Previously, a signal for follow-up packets on the errqueue
      would then not be scheduled. The new behavior, to always signal, is
      arguably a bug fix.
      
      For rxrpc, the change causes the same function to be called repeatedly
      for each queued packet (because the recv handler == sk_error_report).
      It is likely that all packets will fail for the same reason (e.g.,
      memory exhaustion).
      
      This code runs without sk_lock held, so it is not safe to trust that
      sk->sk_err is immutable inbetween releasing q->lock and the subsequent
      test. Introduce int err just to avoid this potential race.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      364a9e93