  1. 11 May 2018, 4 commits
  2. 09 May 2018, 6 commits
  3. 07 May 2018, 1 commit
  4. 03 May 2018, 2 commits
    • tcp: restore autocorking · 114f39fe
      Committed by Eric Dumazet
      When adding rb-tree for TCP retransmit queue, we inadvertently broke
      TCP autocorking.
      
      tcp_should_autocork() should really check if the rtx queue is not empty.
      
      Tested:
      
      Before the fix :
      $ nstat -n;./netperf -H 10.246.7.152 -Cc -- -m 500;nstat | grep AutoCork
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      540000 262144    500    10.00      2682.85   2.47     1.59     3.618   2.329
      TcpExtTCPAutoCorking            33                 0.0
      
      // Same test, but forcing TCP_NODELAY
      $ nstat -n;./netperf -H 10.246.7.152 -Cc -- -D -m 500;nstat | grep AutoCork
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET : nodelay
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      540000 262144    500    10.00      1408.75   2.44     2.96     6.802   8.259
      TcpExtTCPAutoCorking            1                  0.0
      
      After the fix :
      $ nstat -n;./netperf -H 10.246.7.152 -Cc -- -m 500;nstat | grep AutoCork
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      540000 262144    500    10.00      5472.46   2.45     1.43     1.761   1.027
      TcpExtTCPAutoCorking            361293             0.0
      
      // With TCP_NODELAY option
      $ nstat -n;./netperf -H 10.246.7.152 -Cc -- -D -m 500;nstat | grep AutoCork
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET : nodelay
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      540000 262144    500    10.00      5454.96   2.46     1.63     1.775   1.174
      TcpExtTCPAutoCorking            315448             0.0
      
      Fixes: 75c119af ("tcp: implement rb-tree based retransmit queue")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Michael Wenig <mwenig@vmware.com>
      Tested-by: Michael Wenig <mwenig@vmware.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      114f39fe
    • ipv4: fix fnhe usage by non-cached routes · 94720e3a
      Committed by Julian Anastasov
      Allow some non-cached routes to use non-expired fnhe:
      
      1. ip_del_fnhe: moved above and now called by find_exception.
      The 4.5+ commit deed49df expires fnhe only when caching
      routes. Change that to:
      
      1.1. use fnhe for non-cached local output routes, with the help
      from (2)
      
      1.2. allow __mkroute_input to detect expired fnhe (outdated
      fnhe_gw, for example) when do_cache is false, e.g. when itag!=0
      for unicast destinations.
      
      2. __mkroute_output: keep fi to allow local routes with orig_oif != 0
      to use fnhe info even when the new route will not be cached into fnhe.
      After commit 839da4d9 ("net: ipv4: set orig_oif based on fib
      result for local traffic") it means all local routes will be affected
      because they are not cached. This change is used to solve a PMTU
      problem with IPVS (and probably Netfilter DNAT) setups that redirect
      local clients from target local IP (local route to Virtual IP)
      to a new remote IP target, e.g. an IPVS TUN real server. Loopback has
      64K MTU and we need to create fnhe on the local route that will
      keep the reduced PMTU for the Virtual IP. Without this change
      fnhe_pmtu is updated from ICMP but never exposed to non-cached
      local routes. This includes routes with flowi4_oif!=0 for 4.6+ and
      with flowi4_oif=any for 4.14+.
      
      3. update_or_create_fnhe: make sure fnhe_expires is not 0 for
      new entries
      
      Fixes: 839da4d9 ("net: ipv4: set orig_oif based on fib result for local traffic")
      Fixes: d6d5e999 ("route: do not cache fib route info on local routes with oif")
      Fixes: deed49df ("route: check and remove route cache when we get route")
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      94720e3a
  5. 02 May 2018, 5 commits
    • tcp_bbr: fix to zero idle_restart only upon S/ACKed data · e6e6a278
      Committed by Neal Cardwell
      Previously the bbr->idle_restart tracking was zeroing out the
      bbr->idle_restart bit upon ACKs that did not SACK or ACK anything,
      e.g. receiving incoming data or receiver window updates. In such
      situations BBR would forget that this was a restart-from-idle
      situation, and if the min_rtt had expired it would unnecessarily enter
      PROBE_RTT (even though we were actually restarting from idle but had
      merely forgotten that fact).
      
      The fix is simple: we need to remember we are restarting from idle
      until we receive a S/ACK for some data (a S/ACK for the first flight
      of data we send as we are restarting).
      
      This commit is a stable candidate for kernels back as far as 4.9.
      
      Fixes: 0f8782ea ("tcp_bbr: add BBR congestion control")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: Yousuk Seung <ysseung@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e6e6a278
    • udp: Complement partial checksum for GSO packet · 6c035ba7
      Committed by Sean Tranchetti
      Using the udp_v4_check() function to calculate the pseudo header
      for the newly segmented UDP packets results in assigning the complement
      of the value to the UDP header checksum field.
      
      Always undo the complement of the partial checksum value in order to
      match the case where GSO is not used on the UDP transmit path.
      
      Fixes: ee80d1eb ("udp: add udp gso")
      Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c035ba7
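
      The one's-complement relationship behind this fix, as a standalone toy
      illustration (not the kernel code): with CHECKSUM_PARTIAL the UDP header
      field is expected to hold the folded but non-inverted pseudo-header sum,
      while a udp_v4_check()-style helper returns the inverted value, so the
      GSO path has to apply ~ to undo that complement.

      #include <stdint.h>
      #include <stdio.h>

      /* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
      static uint16_t csum_fold(uint32_t sum)
      {
              while (sum >> 16)
                      sum = (sum & 0xffff) + (sum >> 16);
              return (uint16_t)sum;
      }

      int main(void)
      {
              /* Toy pseudo-header words (addresses, protocol, length); values arbitrary. */
              uint32_t acc = 0x0a00 + 0x0001 + 0x0a00 + 0x0002 + 0x0011 + 0x01f4;
              uint16_t partial  = csum_fold(acc);       /* value CHECKSUM_PARTIAL expects in uh->check */
              uint16_t inverted = (uint16_t)~partial;   /* what a udp_v4_check()-style helper returns */

              printf("partial=0x%04x inverted=0x%04x undone=0x%04x\n",
                     partial, inverted, (uint16_t)~inverted);
              return 0;
      }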
    • tcp: send in-queue bytes in cmsg upon read · b75eba76
      Committed by Soheil Hassas Yeganeh
      Applications with many concurrent connections, high variance
      in receive queue length and tight memory bounds cannot
      allocate worst-case buffer size to drain sockets. Knowing
      the size of receive queue length, applications can optimize
      how they allocate buffers to read from the socket.
      
      The number of bytes pending on the socket is directly
      available through ioctl(FIONREAD/SIOCINQ) and can be
      approximated using getsockopt(MEMINFO) (rmem_alloc includes
      skb overheads in addition to application data). But, both of
      these options add an extra syscall per recvmsg. Moreover,
      ioctl(FIONREAD/SIOCINQ) takes the socket lock.
      
      Add the TCP_INQ socket option to TCP. When this socket
      option is set, recvmsg() relays the number of bytes available
      on the socket for reading to the application via the
      TCP_CM_INQ control message.
      
      Calculate the number of bytes after releasing the socket lock
      to include the processed backlog, if any. To avoid an extra
      branch in the hot path of recvmsg() for this new control
      message, move all cmsg processing inside an existing branch for
      processing receive timestamps. Since the socket lock is not held
      when calculating the size of receive queue, TCP_INQ is a hint.
      For example, it can overestimate the queue size by one byte,
      if FIN is received.
      
      With this method, applications can start reading from the socket
      using a small buffer, and then use larger buffers based on the
      remaining data when needed.
      
      V3 change-log:
      	As suggested by David Miller, added loads with barrier
      	to check whether we have multiple threads calling recvmsg
      	in parallel. When that happens we lock the socket to
      	calculate inq.
      V4 change-log:
      	Removed inline from a static function.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Suggested-by: David Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b75eba76
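
      A minimal user-space sketch of consuming this hint (illustrative only;
      assumes a connected TCP socket, and defines the option values locally in
      case the installed uapi headers predate this patch):

      #include <string.h>
      #include <sys/types.h>
      #include <sys/socket.h>
      #include <sys/uio.h>
      #include <netinet/in.h>

      #ifndef TCP_INQ
      #define TCP_INQ 36              /* socket option value added by this patch */
      #endif
      #ifndef TCP_CM_INQ
      #define TCP_CM_INQ TCP_INQ      /* cmsg type carrying the in-queue byte count */
      #endif

      /* Read data and pick up the TCP_CM_INQ hint attached by the kernel.
       * TCP_INQ must have been enabled once after connect(), e.g.:
       *     int one = 1;
       *     setsockopt(fd, IPPROTO_TCP, TCP_INQ, &one, sizeof(one));
       */
      static ssize_t recv_with_inq(int fd, void *buf, size_t len, int *inq)
      {
              char control[CMSG_SPACE(sizeof(int))];
              struct iovec iov = { .iov_base = buf, .iov_len = len };
              struct msghdr msg = {
                      .msg_iov = &iov, .msg_iovlen = 1,
                      .msg_control = control, .msg_controllen = sizeof(control),
              };
              struct cmsghdr *cm;
              ssize_t ret = recvmsg(fd, &msg, 0);

              *inq = -1;              /* -1: no hint seen on this call */
              for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
                      if (cm->cmsg_level == IPPROTO_TCP && cm->cmsg_type == TCP_CM_INQ)
                              memcpy(inq, CMSG_DATA(cm), sizeof(*inq));
              return ret;
      }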
    • udp: disable gso with no_check_tx · a8c744a8
      Committed by Willem de Bruijn
      Syzbot managed to send a udp gso packet without checksum offload into
      the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This
      triggered the skb_warn_bad_offload.
      
        RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
         skb_gso_segment include/linux/netdevice.h:4038 [inline]
         validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
         __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
         dev_queue_xmit+0x17/0x20 net/core/dev.c:3618
      
      UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
      udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
      catch and fail in this case.
      
      After the integrity checks jump directly to the CHECKSUM_PARTIAL case
      to avoid reading the no_check_tx flags again (a TOCTTOU race).
      
      Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a8c744a8
    • tcp: fix TCP_REPAIR_QUEUE bound checking · bf2acc94
      Committed by Eric Dumazet
      syzbot is able to produce a nasty WARN_ON() in tcp_verify_left_out()
      with following C-repro :
      
      socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
      setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
      setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0
      bind(3, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
      sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
      	1242, MSG_FASTOPEN, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("127.0.0.1")}, 16) = 1242
      setsockopt(3, SOL_TCP, TCP_REPAIR_WINDOW, "\4\0\0@+\205\0\0\377\377\0\0\377\377\377\177\0\0\0\0", 20) = 0
      writev(3, [{"\270", 1}], 1)             = 1
      setsockopt(3, SOL_TCP, TCP_REPAIR_OPTIONS, "\10\0\0\0\0\0\0\0\0\0\0\0|\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 386) = 0
      writev(3, [{"\210v\r[\226\320t\231qwQ\204\264l\254\t\1\20\245\214p\350H\223\254;\\\37\345\307p$"..., 3144}], 1) = 3144
      
      The 3rd system call looks odd :
      setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0
      
      This patch makes sure the bound check uses an unsigned compare
      (a small illustration follows this entry).
      
      Fixes: ee995283 ("tcp: Initial repair mode")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf2acc94
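
      Why the unsigned compare matters, as a standalone illustration (not the
      kernel code; TCP_QUEUES_NR mirrors the uapi enum value of 3):

      #include <stdio.h>

      #define TCP_QUEUES_NR 3   /* TCP_NO_QUEUE, TCP_RECV_QUEUE, TCP_SEND_QUEUE */

      int main(void)
      {
              int val = -1;   /* what the repro passes to setsockopt(TCP_REPAIR_QUEUE) */

              /* Signed compare: -1 < 3, so the invalid queue id used to be accepted. */
              printf("signed  : %s\n", val < TCP_QUEUES_NR ? "accepted" : "rejected");

              /* Unsigned compare: (unsigned int)-1 == UINT_MAX, well above the bound. */
              printf("unsigned: %s\n", (unsigned int)val < (unsigned int)TCP_QUEUES_NR ?
                                       "accepted" : "rejected");
              return 0;
      }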
  6. 01 May 2018, 1 commit
  7. 30 April 2018, 2 commits
    • erspan: auto detect truncated packets. · 1baf5ebf
      Committed by William Tu
      Currently the truncated bit is set only when the mirrored packet
      is larger than the mtu.  In certain cases, the packet might already
      have been truncated before being sent to the erspan tunnel.  For such
      cases, the patch detects whether the IP header's total length is
      larger than the actual skb->len.  If so, the mirrored packet was
      truncated, and the erspan truncate bit is set.

      I tested the patch using the bpf_skb_change_tail helper function to
      shrink the packet size and send it to the erspan tunnel.
      Reported-by: Xiaoyan Jin <xiaoyanj@vmware.com>
      Signed-off-by: William Tu <u9012063@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1baf5ebf
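
      A user-space model of that check (illustrative only; the real change
      lives in the erspan transmit path): a mirrored frame counts as already
      truncated when its IP header advertises more bytes than the buffer holds.

      #include <stdbool.h>
      #include <stddef.h>
      #include <arpa/inet.h>
      #include <netinet/ip.h>

      /* True if the captured frame is shorter than what the IP header advertises. */
      static bool mirrored_frame_truncated(const struct iphdr *iph, size_t frame_len)
      {
              return ntohs(iph->tot_len) > frame_len;
      }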
    • tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive · 05255b82
      Committed by Eric Dumazet
      When adding tcp mmap() implementation, I forgot that socket lock
      had to be taken before current->mm->mmap_sem. syzbot eventually caught
      the bug.
      
      Since we cannot lock the socket in the tcp mmap() handler, we have to
      split the operation into two phases.
      
      1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
        This operation does not involve any TCP locking.
      
      2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
        the transfer of pages from skbs to one VMA.
        This operation only uses down_read(&current->mm->mmap_sem) after
        holding the TCP lock, thus solving the lockdep issue.
      
      This new implementation was suggested by Andy Lutomirski with great details.
      
      Benefits are :
      
      - Better scalability, in case multiple threads reuse VMAs
         (without mmap()/munmap() calls) since mmap_sem won't be write-locked.
      
      - Better error recovery.
         The previous mmap() model had to provide the expected size of the
         mapping. If for some reason one part could not be mapped (partial MSS),
         the whole operation had to be aborted.
         With the tcp_zerocopy_receive struct, the kernel can report how
         many bytes were successfully mapped, and how many bytes should
         be read to skip the problematic sequence.
      
      - No more memory allocation to hold an array of page pointers.
        16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
      
      - skbs are freed while mmap_sem has been released
      
      The following patch updates the tcp_mmap tool to demonstrate one
      possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...).
      
      Note that memcg might require additional changes.
      
      Fixes: 93ab6cc6 ("tcp: implement mmap() for zero copy receive")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Cc: linux-mm@kvack.org
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05255b82
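
      A condensed user-space sketch of the two-phase flow above (illustrative
      only, error handling trimmed; the struct mirrors the uapi layout this
      patch adds and is defined locally in case installed headers are older):

      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/socket.h>
      #include <netinet/in.h>

      #ifndef TCP_ZEROCOPY_RECEIVE
      #define TCP_ZEROCOPY_RECEIVE 35
      struct tcp_zerocopy_receive {
              uint64_t address;        /* in: address of the mapping */
              uint32_t length;         /* in: bytes to map, out: bytes mapped */
              uint32_t recv_skip_hint; /* out: bytes to read() normally to resync */
      };
      #endif

      static int zerocopy_chunk(int fd, size_t chunk)
      {
              /* Phase 1: mmap() only reserves VMA space, no TCP locking involved. */
              void *addr = mmap(NULL, chunk, PROT_READ, MAP_SHARED, fd, 0);
              if (addr == MAP_FAILED)
                      return -1;

              /* Phase 2: getsockopt() maps received pages into that VMA. */
              struct tcp_zerocopy_receive zc;
              socklen_t len = sizeof(zc);

              memset(&zc, 0, sizeof(zc));
              zc.address = (uint64_t)(unsigned long)addr;
              zc.length = chunk;
              if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &len) == 0) {
                      /* zc.length bytes are now readable at addr; zc.recv_skip_hint
                       * bytes (if any) must be consumed with a regular read(). */
              }
              munmap(addr, chunk);
              return 0;
      }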
  8. 28 April 2018, 2 commits
  9. 27 April 2018, 7 commits
    • tcp: ignore Fast Open on repair mode · 16ae6aa1
      Committed by Yuchung Cheng
      The TCP repair sequence of operation is to first set the socket in
      repair mode, then inject the TCP stats into the socket with repair
      socket options, then call connect() to re-activate the socket. The
      connect syscall simply returns and sets the state to ESTABLISHED.
      As a result, Fast Open is meaningless for TCP repair.

      However, allowing the sendto() system call with the MSG_FASTOPEN flag
      halfway through the repair operation could unexpectedly cause data to
      be sent before the operation finishes changing the internal TCP stats
      (e.g. MSS).  This in turn triggers TCP warnings on inconsistent
      packet accounting.

      The fix is to simply disallow the Fast Open operation once the socket
      is in repair mode.
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      16ae6aa1
    • udp: add gso segment cmsg · 2e8de857
      Committed by Willem de Bruijn
      Allow specifying segment size in the send call.
      
      The new control message performs the same function as socket option
      UDP_SEGMENT while avoiding the extra system call.
      
      [ Export udp_cmsg_send for ipv6. -DaveM ]
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2e8de857
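
      A hedged sketch of attaching that control message to a single sendmsg()
      call (illustrative only; assumes a connected UDP socket, and SOL_UDP /
      UDP_SEGMENT are defined locally in case the installed headers are older):

      #include <stdint.h>
      #include <string.h>
      #include <sys/types.h>
      #include <sys/socket.h>
      #include <sys/uio.h>
      #include <netinet/in.h>

      #ifndef SOL_UDP
      #define SOL_UDP 17
      #endif
      #ifndef UDP_SEGMENT
      #define UDP_SEGMENT 103
      #endif

      /* Send one large buffer; the kernel cuts it into gso_size-byte datagrams. */
      static ssize_t send_gso(int fd, const void *buf, size_t len, uint16_t gso_size)
      {
              char control[CMSG_SPACE(sizeof(uint16_t))] = {0};
              struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
              struct msghdr msg = {
                      .msg_iov = &iov, .msg_iovlen = 1,
                      .msg_control = control, .msg_controllen = sizeof(control),
              };
              struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

              cm->cmsg_level = SOL_UDP;
              cm->cmsg_type = UDP_SEGMENT;
              cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
              memcpy(CMSG_DATA(cm), &gso_size, sizeof(gso_size));

              return sendmsg(fd, &msg, 0);
      }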
    • udp: paged allocation with gso · 15e36f5b
      Committed by Willem de Bruijn
      When sending large datagrams that are later segmented, store data in
      page frags to avoid copying from linear in skb_segment.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      15e36f5b
    • udp: better wmem accounting on gso · ad405857
      Committed by Willem de Bruijn
      skb_segment by default transfers allocated wmem from the gso skb
      to the tail of the segment list. This underreports real truesize
      of the list, especially if the tail might be dropped.
      
      Similar to tcp_gso_segment, update wmem_alloc with the aggregate
      list truesize and make each segment responsible for its own
      share by setting skb->destructor.
      
      Clear gso_skb->destructor prior to calling skb_segment to skip
      the default assignment to tail.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad405857
    • udp: generate gso with UDP_SEGMENT · bec1f6f6
      Committed by Willem de Bruijn
      Support generic segmentation offload for udp datagrams. Callers can
      concatenate and send at once the payload of multiple datagrams with
      the same destination.
      
      To set segment size, the caller sets socket option UDP_SEGMENT to the
      length of each discrete payload. This value must be smaller than or
      equal to the relevant MTU.
      
      A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
      per send call basis.
      
      Total byte length may then exceed MTU. If not an exact multiple of
      segment size, the last segment will be shorter.
      
      The implementation adds a gso_size field to the udp socket, ip(v6)
      cmsg cookie and inet_cork structure to be able to set the value at
      setsockopt or cmsg time and to work with both lockless and corked
      paths.
      
      Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.
      
          tcp tso
           3197 MB/s 54232 msg/s 54232 calls/s
               6,457,754,262      cycles
      
          tcp gso
           1765 MB/s 29939 msg/s 29939 calls/s
              11,203,021,806      cycles
      
          tcp without tso/gso *
            739 MB/s 12548 msg/s 12548 calls/s
              11,205,483,630      cycles
      
          udp
            876 MB/s 14873 msg/s 624666 calls/s
              11,205,777,429      cycles
      
          udp gso
           2139 MB/s 36282 msg/s 36282 calls/s
              11,204,374,561      cycles
      
         [*] after reverting commit 0a6b2a1d
             ("tcp: switch to GSO being always on")
      
      Measured total system cycles ('-a') for one core while pinning both
      the network receive path and benchmark process to that core:
      
        perf stat -a -C 12 -e cycles \
          ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4
      
      Note the reduction in calls/s with GSO. Bytes per syscall
      increases from 1470 to 61818.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bec1f6f6
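
      A minimal sketch of the socket-option path described above (illustrative
      only; the helper name is made up; assumes a UDP socket whose payloads
      stay within the limits mentioned above). Calling enable_udp_gso(fd, 1400)
      once makes the kernel segment every subsequent large send() payload into
      1400-byte datagrams:

      #include <sys/socket.h>
      #include <netinet/in.h>

      #ifndef SOL_UDP
      #define SOL_UDP 17
      #endif
      #ifndef UDP_SEGMENT
      #define UDP_SEGMENT 103
      #endif

      /* After this call, every payload passed to send()/sendto() on the socket
       * is cut by the kernel into gso_size-byte UDP datagrams (the last one may
       * be shorter when the length is not an exact multiple). */
      static int enable_udp_gso(int fd, int gso_size)
      {
              return setsockopt(fd, SOL_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));
      }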
    • udp: add udp gso · ee80d1eb
      Committed by Willem de Bruijn
      Implement generic segmentation offload support for udp datagrams. A
      follow-up patch adds support to the protocol stack to generate such
      packets.
      
      UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
      a large payload into a number of discrete UDP datagrams.
      
      The implementation adds a GSO type, SKB_GSO_UDP_L4, to differentiate it
      from UFO (SKB_GSO_UDP).
      
      IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
      registered.
      
      [ Export __udp_gso_segment for ipv6. -DaveM ]
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee80d1eb
    • udp: expose inet cork to udp · 1cd7884d
      Committed by Willem de Bruijn
      UDP segmentation offload needs access to inet_cork in the udp layer.
      Pass the struct to ip(6)_make_skb instead of allocating it on the
      stack in that function itself.
      
      This patch is a noop otherwise.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1cd7884d
  10. 25 April 2018, 7 commits
    • ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_servers · c04d2cb2
      Committed by Chris Novakovic
      Distributed filesystems are most effective when the server and client
      clocks are synchronised. Embedded devices often use NFS for their
      root filesystem but typically do not contain an RTC, so the clocks of
      the NFS server and the embedded device will be out-of-sync when the root
      filesystem is mounted (and may not be synchronised until late in the
      boot process).
      
      Extend ipconfig with the ability to export IP addresses of NTP servers
      it discovers to /proc/net/ipconfig/ntp_servers. They can be supplied as
      follows:
      
       - If ipconfig is configured manually via the "ip=" or "nfsaddrs="
         kernel command line parameters, one NTP server can be specified in
         the new "<ntp0-ip>" parameter.
       - If ipconfig is autoconfigured via DHCP, request DHCP option 42 in
         the DHCPDISCOVER message, and record the IP addresses of up to three
         NTP servers sent by the responding DHCP server in the subsequent
         DHCPOFFER message.
      
      ipconfig will only write the NTP server IP addresses it discovers to
      /proc/net/ipconfig/ntp_servers, one per line (in the order received from
      the DHCP server, if DHCP autoconfiguration is used); making use of these
      NTP servers is the responsibility of a user space process (e.g. an
      initrd/initram script that invokes an NTP client before mounting an NFS
      root filesystem).
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c04d2cb2
    • ipconfig: Create /proc/net/ipconfig directory · 4d019b3f
      Committed by Chris Novakovic
      To allow ipconfig to report IP configuration details to user space
      processes without cluttering /proc/net, create a new subdirectory
      /proc/net/ipconfig. All files containing IP configuration details should
      be written to this directory.
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4d019b3f
    • ipconfig: Correctly initialise ic_nameservers · 300eec7c
      Committed by Chris Novakovic
      ic_nameservers, which stores the list of name servers discovered by
      ipconfig, is initialised (i.e. has all of its elements set to NONE, or
      0xffffffff) by ic_nameservers_predef() in the following scenarios:
      
       - before the "ip=" and "nfsaddrs=" kernel command line parameters are
         parsed (in ip_auto_config_setup());
       - before autoconfiguring via DHCP or BOOTP (in ic_bootp_init()), in
         order to clear any values that may have been set after parsing "ip="
         or "nfsaddrs=" and are no longer needed.
      
      This means that ic_nameservers_predef() is not called when neither "ip="
      nor "nfsaddrs=" is specified on the kernel command line. In this
      scenario, every element in ic_nameservers remains set to 0x00000000,
      which is indistinguishable from ANY and causes pnp_seq_show() to write
      the following (bogus) information to /proc/net/pnp:
      
        #MANUAL
        nameserver 0.0.0.0
        nameserver 0.0.0.0
        nameserver 0.0.0.0
      
      This is potentially problematic for systems that blindly link
      /etc/resolv.conf to /proc/net/pnp.
      
      Ensure that ic_nameservers is also initialised when neither "ip=" nor
      "nfsaddrs=" are specified by calling ic_nameservers_predef() in
      ip_auto_config(), but only when ip_auto_config_setup() was not called
      earlier. This causes the following to be written to /proc/net/pnp, and
      is consistent with what gets written when ipconfig is configured
      manually but no name servers are specified on the kernel command line:
      
        #MANUAL
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      300eec7c
    • ipconfig: BOOTP: Request CONF_NAMESERVERS_MAX name servers · de1fa15b
      Committed by Chris Novakovic
      When ipconfig is autoconfigured via BOOTP, the request packet
      initialised by ic_bootp_init_ext() always allocates 8 bytes for the name
      server option, limiting the BOOTP server to responding with at most 2
      name servers even though ipconfig in fact supports an arbitrary number
      of name servers (as defined by CONF_NAMESERVERS_MAX, which is currently
      3).
      
      Only request name servers in the request packet if CONF_NAMESERVERS_MAX
      is positive (to comply with [1, §3.8]), and allocate enough space in the
      packet for CONF_NAMESERVERS_MAX name servers to indicate the maximum
      number we can accept in response.
      
      [1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions":
          https://tools.ietf.org/rfc/rfc2132.txt
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de1fa15b
    • ipconfig: BOOTP: Don't request IEN-116 name servers · 4e1a8af2
      Committed by Chris Novakovic
      When ipconfig is autoconfigured via BOOTP, the request packet
      initialised by ic_bootp_init_ext() allocates 8 bytes for tag 5 ("Name
      Server" [1, §3.7]), but tag 5 in the response isn't processed by
      ic_do_bootp_ext(). Instead, allocate the 8 bytes to tag 6 ("Domain Name
      Server" [1, §3.8]), which is processed by ic_do_bootp_ext(), and appears
      to have been the intended tag to request.
      
      This won't cause any breakage for existing users, as tag 5 responses
      provided by BOOTP servers weren't being processed anyway.
      
      [1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions":
          https://tools.ietf.org/rfc/rfc2132.txt
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4e1a8af2
    • ipconfig: Tidy up reporting of name servers · e18bdc83
      Committed by Chris Novakovic
      Commit 5e953778 ("ipconfig: add
      nameserver IPs to kernel-parameter ip=") adds the IP addresses of
      discovered name servers to the summary printed by ipconfig when
      configuration is complete. It appears the intention in ip_auto_config()
      was to print the name servers on a new line (especially given the
      spacing and lack of comma before "nameserver0="), but they're actually
      printed on the same line as the NFS root filesystem configuration
      summary:
      
        [    0.686186] IP-Config: Complete:
        [    0.686226]      device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1
        [    0.686328]      host=test, domain=example.com, nis-domain=(none)
        [    0.686386]      bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=     nameserver0=10.0.0.1
      
      This makes it harder to read and parse ipconfig's output. Instead, print
      the name servers on a separate line:
      
        [    0.791250] IP-Config: Complete:
        [    0.791289]      device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1
        [    0.791407]      host=test, domain=example.com, nis-domain=(none)
        [    0.791475]      bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=
        [    0.791476]      nameserver0=10.0.0.1
      Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e18bdc83
    • tcp: md5: only call tp->af_specific->md5_lookup() for md5 sockets · 8c2320e8
      Committed by Eric Dumazet
      RETPOLINE made calls to tp->af_specific->md5_lookup() quite expensive,
      given they have no result.
      We can omit the calls for sockets that have no md5 keys.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8c2320e8
  11. 24 April 2018, 3 commits
    • Revert "net: init sk_cookie for inet socket" · a06ac0d6
      Committed by Yafang Shao
      This reverts commit c6849a3a ("net: init sk_cookie for inet socket").

      Per discussion with Eric, when updating sock_net(sk)->cookie_gen, the
      whole cache line will be invalidated, as this cache line is shared
      with all cpus; that may cause a large performance hit.

      Below is the data from Eric:
      "Performance is reduced from ~5 Mpps to ~3.8 Mpps with 16 RX queues on
      my host" when running a synflood test.

      We have to revert it to prevent cache line false sharing.
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a06ac0d6
    • netfilter: xtables: use ipt_get_target_c instead of ipt_get_target · dc3c09d3
      Committed by Taehee Yoo
      ipt_get_target is used to get a struct xt_entry_target
      and ipt_get_target_c is used to get a const struct xt_entry_target.
      However, in ipt_do_table, ipt_get_target is used to get
      a const struct xt_entry_target; it should be replaced by ipt_get_target_c.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      dc3c09d3
    • netfilter: add NAT support for shifted portmap ranges · 2eb0f624
      Committed by Thierry Du Tre
      This is a patch proposal to support shifted ranges in portmaps.  (i.e. tcp/udp
      incoming port 5000-5100 on WAN redirected to LAN 192.168.1.5:2000-2100)
      
      Currently DNAT only works for a single port or identical port ranges
      (i.e. ports 5000-5100 on the WAN interface redirected to a LAN host
      while the original destination port is not altered). When different
      port ranges are configured, either 'random' mode should be used, or
      else all incoming connections are mapped onto the first port in the
      redirect range (in the described example, WAN:5000-5100 will all be
      mapped to 192.168.1.5:2000).
      
      This patch introduces a new mode, indicated by the flag
      NF_NAT_RANGE_PROTO_OFFSET, which uses a base port value to calculate an
      offset from the destination port present in the incoming stream. That
      offset is then applied as an index into the redirect port range (index
      modulo range width to handle range overflow), as illustrated by the
      sketch after this entry.
      
      In described example the base port would be 5000. An incoming stream with
      destination port 5004 would result in an offset value 4 which means that the
      NAT'ed stream will be using destination port 2004.
      
      Other possibilities include deterministic mapping of larger or multiple ranges
      to a smaller range: WAN:5000-5999 -> LAN:5000-5099 (maps WAN port 5*xx to port
      50xx).
      
      This patch does not change any current behavior. It just adds new NAT proto
      range functionality which must be selected via the specific flag when it is
      intended to be used.
      
      A patch for iptables (libipt_DNAT.c + libip6t_DNAT.c) will also be proposed
      which makes this functionality immediately available.
      Signed-off-by: Thierry Du Tre <thierry@dtsystems.be>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      2eb0f624
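
      The mapping arithmetic described above, as a standalone illustration (not
      the netfilter code; the helper name is made up): destination port 5004
      with base 5000 gives offset 4, which indexes into 2000-2100 to yield 2004,
      with a modulo to wrap offsets that exceed the redirect range width.

      #include <stdint.h>
      #include <stdio.h>

      /* Shifted portmap: map an incoming port into the redirect range using the
       * offset from base_port, wrapping modulo the redirect range width. */
      static uint16_t shifted_dnat_port(uint16_t dport, uint16_t base_port,
                                        uint16_t min_port, uint16_t max_port)
      {
              uint16_t width = max_port - min_port + 1;

              return min_port + (uint16_t)(dport - base_port) % width;
      }

      int main(void)
      {
              /* WAN 5000-5100 redirected to 192.168.1.5:2000-2100, base 5000 */
              printf("%u\n", shifted_dnat_port(5004, 5000, 2000, 2100)); /* 2004 */
              /* Larger range onto smaller: WAN 5000-5999 -> LAN 5000-5099 */
              printf("%u\n", shifted_dnat_port(5437, 5000, 5000, 5099)); /* 5037 */
              return 0;
      }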