1. 11 Sep, 2016 2 commits
  2. 10 Sep, 2016 2 commits
    • E
      ip_tunnel: do not clear l4 hashes · bf8d85d4
      Committed by Eric Dumazet
      If skb has a valid l4 hash, there is no point in clearing the hash and
      forcing a further flow dissection when a tunnel encapsulation is added.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • G
      ipv4: fix value of ->nlmsg_flags reported in RTM_NEWROUTE events · b93e1fa7
      Committed by Guillaume Nault
      fib_table_insert() inconsistently fills the nlmsg_flags field in its
      notification messages.
      
      Since commit b8f55831 ("[RTNETLINK]: Fix sending netlink message
      when replace route."), the netlink message has its nlmsg_flags set to
      NLM_F_REPLACE if the route replaced a preexisting one.
      
      Then commit a2bb6d7d ("ipv4: include NLM_F_APPEND flag in append
      route notifications") started setting nlmsg_flags to NLM_F_APPEND if
      the route matched a preexisting one but was appended.
      
      In other cases (exclusive creation or prepend), nlmsg_flags is 0.
      
      This patch sets ->nlmsg_flags in all situations, preserving the
      semantic of the NLM_F_* bits:
      
        * NLM_F_CREATE: a new fib entry has been created for this route.
        * NLM_F_EXCL: no other fib entry existed for this route.
        * NLM_F_REPLACE: this route has overwritten a preexisting fib entry.
        * NLM_F_APPEND: the new fib entry was added after other entries for
          the same route.
      
      As a result, the following flag combinations can now be reported
      (iproute2's terminology in parentheses):
      
        * NLM_F_CREATE | NLM_F_EXCL: route didn't exist, exclusive creation
          ("add").
        * NLM_F_CREATE | NLM_F_APPEND: route did already exist, new route
          added after preexisting ones ("append").
        * NLM_F_CREATE: route did already exist, new route added before
          preexisting ones ("prepend").
        * NLM_F_REPLACE: route did already exist, new route replaced the
          first preexisting one ("change").
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
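The flag combinations listed in this commit message can be turned into a small lookup table. The following is a minimal userspace sketch, not kernel code: the constants mirror the NLM_F_* values from `<linux/netlink.h>`, while `describe_newroute` and the table name are hypothetical helpers introduced only for illustration.

```python
# Values mirror the NLM_F_* modifiers for NEW requests in <linux/netlink.h>.
NLM_F_REPLACE = 0x100
NLM_F_EXCL = 0x200
NLM_F_CREATE = 0x400
NLM_F_APPEND = 0x800

_ROUTE_FLAGS = NLM_F_REPLACE | NLM_F_EXCL | NLM_F_CREATE | NLM_F_APPEND

# Combinations listed in the commit message, keyed to iproute2's terminology.
_OPERATIONS = {
    NLM_F_CREATE | NLM_F_EXCL: "add",
    NLM_F_CREATE | NLM_F_APPEND: "append",
    NLM_F_CREATE: "prepend",
    NLM_F_REPLACE: "change",
}


def describe_newroute(nlmsg_flags):
    """Name the operation behind an RTM_NEWROUTE notification.

    Bits outside the four route-related flags are ignored. Returns None
    for any combination the commit message does not list.
    """
    return _OPERATIONS.get(nlmsg_flags & _ROUTE_FLAGS)
```

A monitoring tool could call `describe_newroute(flags)` on the `nlmsg_flags` field of each notification it receives; a return value of None would suggest a kernel that predates this patch (for example, flags of 0).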
  3. 09 Sep, 2016 3 commits
    • E
      ipv4: accept u8 in IP_TOS ancillary data · e895cdce
      Committed by Eric Dumazet
      In commit f02db315 ("ipv4: IP_TOS and IP_TTL can be specified as
      ancillary data") Francesco added IP_TOS values specified as integer.
      
      However, kernel sends to userspace (at recvmsg() time) an IP_TOS value
      in a single byte, when IP_RECVTOS is set on the socket.
      
      It can be very useful to reflect all ancillary options as given by the
      kernel in a subsequent sendmsg(), instead of aborting the sendmsg() with
      EINVAL after Francesco's patch.
      
      So this patch extends IP_TOS ancillary data to accept a u8, so that a
      UDP server can simply reuse the same ancillary block without having to
      mangle it.
      
      Jesper can then augment
      https://github.com/netoptimizer/network-testing/blob/master/src/udp_example02.c
      to add TOS reflection ;)
      
      Fixes: f02db315 ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Francesco Fusco <ffusco@redhat.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
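The reflection pattern this commit enables can be sketched from userspace with Python's `socket.recvmsg()` and `socket.sendmsg()`. This is a rough illustration under stated assumptions: a Linux UDP socket with IP_RECVTOS already enabled, and a kernel that includes this patch. The helper names `tos_cmsg`, `tos_cmsgs_to_reflect` and `echo_once` are hypothetical.

```python
import socket
import struct


def tos_cmsg(tos, as_byte=True):
    """Build an IP_TOS control message for sendmsg().

    as_byte=True gives the one-byte form the kernel itself delivers at
    recvmsg() time; as_byte=False gives the native-int form.
    """
    fmt = "B" if as_byte else "i"
    return (socket.IPPROTO_IP, socket.IP_TOS, struct.pack(fmt, tos))


def tos_cmsgs_to_reflect(ancdata):
    """Keep only the IP_TOS items from recvmsg() ancillary data, unmodified."""
    return [
        (level, ctype, data)
        for level, ctype, data in ancdata
        if level == socket.IPPROTO_IP and ctype == socket.IP_TOS
    ]


def echo_once(sock, bufsize=2048):
    """Echo one datagram back to its sender with the TOS it arrived with.

    Assumes the caller has enabled IP_RECVTOS on sock beforehand.
    """
    payload, ancdata, _flags, peer = sock.recvmsg(bufsize, socket.CMSG_SPACE(1))
    return sock.sendmsg([payload], tos_cmsgs_to_reflect(ancdata), 0, peer)
```

On kernels without this patch, the one-byte control message passed back by `echo_once` would be expected to make `sendmsg()` fail with EINVAL, as the commit message describes; converting it with `tos_cmsg(value, as_byte=False)` is the kind of mangling the patch makes unnecessary.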
    • Y
      tcp: use an RB tree for ooo receive queue · 9f5afeae
      Committed by Yaogong Wang
      Over the years, TCP BDP has increased by several orders of magnitude,
      and some people are considering reaching the 2 Gbytes limit.
      
      Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
      MSS.
      
      In presence of packet losses (or reorders), TCP stores incoming packets
      into an out of order queue, and number of skbs sitting there waiting for
      the missing packets to be received can be in the 10^5 range.
      
      Most packets are appended to the tail of this queue, and when
      packets can finally be transferred to receive queue, we scan the queue
      from its head.
      
      However, in presence of heavy losses, we might have to find an arbitrary
      point in this queue, involving a linear scan for every incoming packet,
      throwing away cpu caches.
      
      This patch converts it to a RB tree, to get bounded latencies.
      
      Yaogong wrote a preliminary patch about 2 years ago.
      Eric did the rebase, added ofo_last_skb cache, polishing and tests.
      
      Tested with network dropping between 1 and 10 % packets, with good
      success (about 30 % increase of throughput in stress tests)
      
      Next step would be to also use an RB tree for the write queue at sender
      side ;)
      Signed-off-by: Yaogong Wang <wygivan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
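The idea of replacing a linear scan with an ordered structure can be illustrated with a toy model. The sketch below is not the kernel implementation: it swaps in a sorted Python list searched with `bisect` as a stand-in for the RB tree (the lookup is logarithmic, although list insertion itself still shifts elements), and its tail fast path is only loosely analogous to the ofo_last_skb cache mentioned above. It ignores sequence-number wraparound and overlap trimming, and the class name `OooQueue` is hypothetical.

```python
import bisect


class OooQueue:
    """Out-of-order segments kept sorted by starting sequence number."""

    def __init__(self):
        self.starts = []  # sorted start sequence numbers
        self.ends = {}    # start -> end (exclusive)

    def insert(self, start, end):
        if not self.starts or start > self.starts[-1]:
            # Fast path: most segments land behind the current tail.
            self.starts.append(start)
        elif start in self.ends:
            # Duplicate start: keep the longer segment.
            self.ends[start] = max(self.ends[start], end)
            return
        else:
            # Slow path: binary search for the insertion point.
            bisect.insort(self.starts, start)
        self.ends[start] = end

    def drain(self, rcv_nxt):
        """Remove segments from the head that are now in order.

        Returns the advanced rcv_nxt.
        """
        i = 0
        while i < len(self.starts) and self.starts[i] <= rcv_nxt:
            rcv_nxt = max(rcv_nxt, self.ends.pop(self.starts[i]))
            i += 1
        del self.starts[:i]
        return rcv_nxt
```

For example, after inserting the segments (3000, 4000), (1000, 2000) and (2000, 3000) in that order, calling `drain(1000)` walks the head of the queue and advances the receive point past all three.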
    • L
      net: inet: diag: expose the socket mark to privileged processes. · d545caca
      Committed by Lorenzo Colitti
      This adds the capability for a process that has CAP_NET_ADMIN on
      a socket to see the socket mark in socket dumps.
      
      Commit a52e95ab ("net: diag: allow socket bytecode filters to
      match socket marks") recently gave privileged processes the
      ability to filter socket dumps based on mark. This patch is
      complementary: it ensures that the mark is also passed to
      userspace in the socket's netlink attributes.  It is useful for
      tools like ss which display information about sockets.
      
      Tested: https://android-review.googlesource.com/270210
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 08 Sep, 2016 1 commit
  5. 02 Sep, 2016 2 commits
  6. 31 Aug, 2016 1 commit
    • R
      net: lwtunnel: Handle fragmentation · 14972cbd
      Committed by Roopa Prabhu
      Today mpls iptunnel lwtunnel_output redirect expects the tunnel
      output function to handle fragmentation. This is ok but can be
      avoided if we did not do the mpls output redirect too early.
      i.e. we could wait until ip fragmentation is done and then call
      mpls output for each ip fragment.
      
      To make this work we will need,
      1) the lwtunnel state to carry encap headroom
      2) and do the redirect to the encap output handler on the ip fragment
      (essentially do the output redirect after fragmentation)
      
      This patch adds tunnel headroom in lwtstate to make sure we
      account for tunnel data in mtu calculations during fragmentation
      and adds new xmit redirect handler to redirect to lwtunnel xmit func
      after ip fragmentation.
      
      This includes IPV6 and some mtu fixes and testing from David Ahern.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
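Why the encap headroom has to enter the MTU calculation can be shown with simple arithmetic. This is an illustrative model only, assuming a 20-byte IPv4 header without options and an encapsulation that adds a fixed number of bytes per fragment (for example 8 bytes for two MPLS labels); the function names are hypothetical and do not correspond to kernel symbols.

```python
IP_HEADER_LEN = 20  # assumption: IPv4 header without options


def fragment_payload_sizes(payload_len, mtu, headroom=0):
    """Split an IP payload into fragment payload sizes.

    headroom is the room reserved per fragment for encapsulation added
    after fragmentation. Every fragment except the last must carry a
    multiple of 8 bytes.
    """
    per_frag = (mtu - headroom - IP_HEADER_LEN) // 8 * 8
    if per_frag <= 0:
        raise ValueError("MTU too small for the requested headroom")
    sizes = []
    remaining = payload_len
    while remaining > per_frag:
        sizes.append(per_frag)
        remaining -= per_frag
    sizes.append(remaining)
    return sizes


def wire_sizes(payload_sizes, encap_len):
    """On-wire size of each fragment once the encap header is pushed."""
    return [IP_HEADER_LEN + size + encap_len for size in payload_sizes]
```

With a 1500-byte MTU and 8 bytes of encapsulation, fragmenting a 3000-byte payload without headroom yields fragments that grow to 1508 bytes on the wire once the labels are pushed; reserving the 8 bytes up front keeps the largest fragment at exactly 1500.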
  7. 30 Aug, 2016 2 commits
  8. 29 Aug, 2016 2 commits
    • E
      tcp: add tcp_add_backlog() · c9c33212
      Committed by Eric Dumazet
      When TCP operates in lossy environments (between 1 and 10 % packet
      losses), many SACK blocks can be exchanged, and I noticed we could
      drop them on busy senders, if these SACK blocks have to be queued
      into the socket backlog.
      
      While the main cause is the poor performance of RACK/SACK processing,
      we can try to avoid these drops of valuable information that can lead to
      spurious timeouts and retransmits.
      
      Cause of the drops is the skb->truesize overestimation caused by :
      
      - drivers allocating ~2048 (or more) bytes as a fragment to hold an
        Ethernet frame.
      
      - various pskb_may_pull() calls bringing the headers into skb->head
        might have pulled all the frame content, but skb->truesize could
        not be lowered, as the stack has no idea of each fragment truesize.
      
      The backlog drops are also more visible on bidirectional flows, since
      their sk_rmem_alloc can be quite big.
      
      Let's add some room for the backlog, as only the socket owner
      can selectively take action to lower memory needs, like collapsing
      receive queues or partial ofo pruning.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
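The effect of truesize overestimation on backlog admission can be approximated with a toy counter. Everything numeric here is an assumption chosen for illustration (a 2304-byte truesize for a small ACK held in a 2048-byte fragment plus metadata, an 80-byte payload, a 64 KB allowance); the commit message does not give these values, and `packets_admitted` is a hypothetical helper, not a kernel function.

```python
def packets_admitted(packet_truesize, rmem_alloc, limit):
    """Count packets accepted into an initially empty backlog.

    Models a check made before each enqueue: a packet is dropped once
    the bytes already charged (backlog + receive queue) exceed limit.
    """
    if packet_truesize <= 0:
        raise ValueError("packet_truesize must be positive")
    queued = 0
    admitted = 0
    while queued + rmem_alloc <= limit:
        queued += packet_truesize
        admitted += 1
    return admitted
```

Under these assumed numbers, a limit of 23040 bytes admits 289 packets if each were charged its 80-byte payload, but only 11 when each is charged 2304 bytes. With 20000 bytes already sitting in the receive queue, the same limit admits just 2 such packets, while adding 64 KB of extra room raises that to 30, which is the kind of slack the commit message argues for.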
    • T
      tcp: Set read_sock and peek_len proto_ops · 32035585
      Committed by Tom Herbert
      In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
      tcp_peek_len (which is just a stub function that calls tcp_inq).
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 26 Aug, 2016 2 commits
  10. 25 Aug, 2016 2 commits
  11. 24 Aug, 2016 7 commits
  12. 23 Aug, 2016 2 commits
    • G
      net: ipconfig: Fix NULL pointer dereference on RARP/BOOTP/DHCP timeout · 1ae292a2
      Committed by Geert Uytterhoeven
      If no RARP, BOOTP, or DHCP response is received, ic_dev is never set,
      causing a NULL pointer dereference in ic_close_devs():
      
          Sending DHCP requests ...... timed out!
          Unable to handle kernel NULL pointer dereference at virtual address 00000004
      
      To fix this, add a check to avoid dereferencing ic_dev if it is still
      NULL.
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Fixes: 2647cffb ("net: ipconfig: Support using "delayed" DHCP replies")
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      net: ip_finish_output_gso: Allow fragmenting segments of tunneled skbs if their DF is unset · c0451fe1
      Committed by Shmulik Ladkani
      In b8247f09,
      
         "net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs"
      
      gso skbs arriving from an ingress interface that go through UDP
      tunneling, are allowed to be fragmented if the resulting encapsulated
      segments exceed the dst mtu of the egress interface.
      
      This aligned the behavior of gso skbs to non-gso skbs going through udp
      encapsulation path.
      
      However the non-gso vs gso anomaly is present also in the following
      cases of a GRE tunnel:
       - ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set
         (e.g. OvS vport-gre with df_default=false)
       - ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set
      
      In both of the above cases, the non-gso skbs get fragmented, whereas the
      gso skbs (having skb_gso_network_seglen that exceeds dst mtu) get dropped,
      as they don't go through the segment+fragment code path.
      
      Fix: Setting IPSKB_FRAG_SEGS if the tunnel specified IP_DF bit is NOT set.
      
      Tunnels that do set IP_DF, will not go to fragmentation of segments.
      This preserves behavior of ip_gre in (the default) pmtudisc mode.
      
      Fixes: b8247f09 ("net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs")
      Reported-by: wenxu <wenxu@ucloud.cn>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Tested-by: wenxu <wenxu@ucloud.cn>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
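The behavior described in this commit message can be summarized as a small decision function. This is a deliberately simplified model of the outcome for a tunneled GSO skb, not a transcription of ip_finish_output_gso(); the function name, its parameters and the returned labels are all hypothetical.

```python
def gso_output_path(seglen, mtu, tunnel_df_set):
    """Pick the output path for a tunneled GSO skb (simplified model).

    seglen stands for the network-layer length of each resulting
    segment, mtu for the dst MTU of the egress interface.
    """
    if seglen <= mtu:
        # Segments fit: no fragmentation question arises.
        return "transmit"
    if tunnel_df_set:
        # The tunnel asked for DF: segments are not fragmented, which
        # preserves the default pmtudisc behavior.
        return "no_fragmentation"
    # DF not set: segment first, then fragment each oversized segment,
    # matching what already happens to non-GSO skbs.
    return "segment_then_fragment"
```

For instance, a segment length of 1550 against a 1500-byte MTU with DF unset takes the segment-then-fragment path under this model, whereas the commit message notes such skbs were previously dropped.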
  13. 20 Aug, 2016 3 commits
  14. 19 Aug, 2016 3 commits
  15. 18 Aug, 2016 1 commit
  16. 16 Aug, 2016 1 commit
  17. 13 Aug, 2016 1 commit
    • P
      netfilter: remove ip_conntrack* sysctl compat code · adf05168
      Committed by Pablo Neira Ayuso
      This backward compatibility has been around for more than ten years,
      since Yasuyuki Kozakai introduced IPv6 in conntrack. These days, we have
      alternate /proc/net/nf_conntrack* entries, the ctnetlink interface and
      the conntrack utility got adopted by many people in the user community
      according to what I observed on the netfilter user mailing list.
      
      So let's get rid of this.
      
      Note that nf_conntrack_htable_size and unsigned int nf_conntrack_max do
      not need to be exported as symbol anymore.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  18. 12 Aug, 2016 2 commits
  19. 11 Aug, 2016 1 commit