1. 16 Jul, 2016 1 commit
  2. 12 Jul, 2016 2 commits
    • ipv4: af_inet: make it explicitly non-modular · d3fc0353
      Committed by Paul Gortmaker
      The Makefile controlling compilation of this file is obj-y,
      meaning that it currently is never being built as a module.
      
      Since MODULE_ALIAS is a no-op for non-modular code, we can simply
      remove the MODULE_ALIAS_NETPROTO variant used here.
      
      We replace module.h with kmod.h since the file does make use of
      request_module() in order to load other modules from here.
      
      We don't have to worry about init.h coming in via the removed
      module.h since the file explicitly includes init.h already.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tunnels: correct conditional build of MPLS and IPv6 · aa9667e7
      Committed by Simon Horman
      Using a combination of #if conditionals and goto labels to unwind
      tunnel4_init seems unwieldy. This patch takes a simpler approach of
      directly unregistering previously registered protocols when an error
      occurs.
      
      This fixes a number of problems with the current implementation
      including the potential presence of labels when they are unused
      and the potential absence of unregister code when it is needed.
      
      Fixes: 8afe97e5 ("tunnels: support MPLS over IPv4 tunnels")
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
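The unwinding style described in the tunnels commit can be sketched roughly as follows. This is an illustration only, not the kernel code: `proto_register()`, `proto_unregister()`, the `SKETCH_*` switches and the `fail_on` hook are hypothetical stand-ins for the real registration calls and config options.

```c
#include <stdbool.h>

/* Hypothetical build-time switches standing in for the kernel config options. */
#define SKETCH_IPV6 1
#define SKETCH_MPLS 1

enum { PROTO_IPIP, PROTO_IPV6, PROTO_MPLS, PROTO_COUNT };

static bool registered[PROTO_COUNT];
static int fail_on = -1;	/* hypothetical hook: protocol whose registration fails */

static int proto_register(int p)
{
	if (p == fail_on)
		return -1;
	registered[p] = true;
	return 0;
}

static void proto_unregister(int p)
{
	registered[p] = false;
}

/* Each failure path directly undoes what was registered before it,
 * so no goto label can end up unused or missing under any config. */
static int tunnel_init_sketch(void)
{
	if (proto_register(PROTO_IPIP))
		return -1;
#if SKETCH_IPV6
	if (proto_register(PROTO_IPV6)) {
		proto_unregister(PROTO_IPIP);
		return -1;
	}
#endif
#if SKETCH_MPLS
	if (proto_register(PROTO_MPLS)) {
		proto_unregister(PROTO_IPIP);
#if SKETCH_IPV6
		proto_unregister(PROTO_IPV6);
#endif
		return -1;
	}
#endif
	return 0;
}
```

The cost of this shape is some repeated unregister calls; the benefit is that every conditional block carries its own cleanup.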
  3. 10 Jul, 2016 3 commits
  4. 03 Jul, 2016 1 commit
    • netfilter: Convert FWINV<[foo]> macros and uses to NF_INVF · c37a2dfa
      Committed by Joe Perches
      netfilter uses multiple FWINV #defines with identical form that hide a
      specific structure variable and dereference it with an invflags member.
      
      $ git grep "#define FWINV"
      include/linux/netfilter_bridge/ebtables.h:#define FWINV(bool,invflg) ((bool) ^ !!(info->invflags & invflg))
      net/bridge/netfilter/ebtables.c:#define FWINV2(bool, invflg) ((bool) ^ !!(e->invflags & invflg))
      net/ipv4/netfilter/arp_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg)))
      net/ipv4/netfilter/ip_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg)))
      net/ipv6/netfilter/ip6_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ip6info->invflags & (invflg)))
      net/netfilter/xt_tcpudp.c:#define FWINVTCP(bool, invflg) ((bool) ^ !!(tcpinfo->invflags & (invflg)))
      
      Consolidate these macros into a single NF_INVF macro.
      
      Miscellanea:
      
      o Neaten the alignment around these uses
      o A few lines are > 80 columns for intelligibility
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
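A consolidated macro of this kind might look like the sketch below, where the structure pointer becomes an explicit argument instead of a hidden variable. The macro body follows the FWINV forms listed in the commit message; the argument order, `struct sketch_rule` and `SKETCH_INV_SRCIP` are assumptions made for illustration.

```c
#include <stdbool.h>

/* Consolidated form: the pointer is passed in, not captured by name. */
#define NF_INVF(ptr, flag, boolean) \
	((boolean) ^ !!((ptr)->invflags & (flag)))

/* Hypothetical rule structure and inversion flag. */
#define SKETCH_INV_SRCIP 0x01

struct sketch_rule {
	unsigned int src;
	unsigned char invflags;
};

/* True when the packet source fails the rule, honouring the "!" inversion. */
static bool src_mismatch(const struct sketch_rule *r, unsigned int pkt_src)
{
	return NF_INVF(r, SKETCH_INV_SRCIP, pkt_src != r->src);
}
```

With `invflags` clear, a differing source is a mismatch; with `SKETCH_INV_SRCIP` set, the result is flipped.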
  5. 01 Jul, 2016 2 commits
  6. 30 Jun, 2016 2 commits
    • ipv4: Fix ip_skb_dst_mtu to use the sk passed by ip_finish_output · fedbb6b4
      Committed by Shmulik Ladkani
      ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
      calls ip_sk_use_pmtu which casts sk as an inet_sk).
      
      However, in the case of UDP tunneling, the skb->sk is not necessarily an
      inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
      tun/tap).
      
      OTOH, the sk passed as an argument throughout IP stack's output path is
      the one which is of PMTU interest:
       - In case of local sockets, sk is same as skb->sk;
       - In case of a udp tunnel, sk is the tunneling socket.
      
      Fix by passing ip_finish_output's sk to ip_skb_dst_mtu.
      This augments 7026b1dd 'netfilter: Pass socket pointer down through okfn().'
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add an ability to dump and restore window parameters · b1ed4c4f
      Committed by Andrey Vagin
      We found that sometimes a restored tcp socket doesn't work.

      The cause of this bug is incorrect window parameters: in this case
      tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
      other side drops packets with this seq, because seq is less than
      tp->rcv_nxt ( tcp_sequence() ).
      
      Data from a send queue is sent only if there is enough space in a
      window, so when we restore unacked data, we need to expand a window to
      fit this data.
      
      This was in the first version of this patch:
      "tcp: extend window to fit all restored unacked data in a send queue"

      Then Alexey recommended restoring the window parameters instead of
      adjusting them according to the data in the send queue. This sounds reasonable.

      rcv_wnd has to be restored, because it was reported to the other side
      and the offered window is never shrunk.
      One of the reasons why we need to restore snd_wnd was described above.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 29 Jun, 2016 1 commit
  8. 28 Jun, 2016 2 commits
  9. 24 Jun, 2016 1 commit
  10. 23 Jun, 2016 1 commit
  11. 19 Jun, 2016 2 commits
    • ipv6: RFC 4884 partial support for SIT/GRE tunnels · 20e1954f
      Committed by Eric Dumazet
      When receiving an ICMPv4 message containing extensions as
      defined in RFC 4884, and translating it to ICMPv6 at SIT
      or GRE tunnel, we need some extra manipulation in order
      to properly forward the extensions.
      
      This patch only takes care of Time Exceeded messages as they
      are the ones that typically carry information from various
      routers in a fabric during a traceroute session.
      
      It also avoids complex skb logic if the data_len is not
      a multiple of 8.
      
      The RFC states:
      
         The "original datagram" field MUST contain at least 128 octets.
         If the original datagram did not contain 128 octets, the
         "original datagram" field MUST be zero padded to 128 octets.
      
      In practice routers use 128 bytes of original datagram, not more.
      
      Initial translation was added in commit ca15a078
      ("sit: generate icmpv6 error when receiving icmpv4 error")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Oussama Ghorbel <ghorbel@pivasoftware.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
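One way to see why a data_len that is not a multiple of 8 is awkward: RFC 4884 counts the padded "original datagram" in 32-bit words for ICMPv4 but in 64-bit words for ICMPv6. The helper below is a hypothetical sketch of that unit conversion, not the code from this patch; the function name and the choice to also reject lengths under 128 octets are assumptions.

```c
/* Convert an ICMPv4 RFC 4884 length (32-bit words) to the ICMPv6
 * length field (64-bit words). Returns -1 when the datagram length is
 * shorter than 128 octets or is not a multiple of 8, i.e. when the
 * extensions cannot be forwarded without re-padding the payload. */
static int icmp4_len_to_icmp6_len(unsigned int len_words32)
{
	unsigned int data_len = len_words32 * 4;	/* octets */

	if (data_len < 128 || (data_len % 8) != 0)
		return -1;
	return (int)(data_len / 8);
}
```

For the common 128-octet case the ICMPv4 length of 32 maps to an ICMPv6 length of 16.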
    • gre: better support for ICMP messages for gre+ipv6 · 9b8c6d7b
      Committed by Eric Dumazet
      ipgre_err() can call ip6_err_gen_icmpv6_unreach() for proper
      support of ipv4+gre+icmp+ipv6+... frames, used for example
      by traceroute/mtr.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 18 Jun, 2016 3 commits
  13. 17 Jun, 2016 1 commit
    • net: xfrm: fix old-style declaration · 318d3cc0
      Committed by Arnd Bergmann
      Modern C standards expect the '__inline__' keyword to come before the return
      type in a declaration, and we get a couple of warnings for this with "make W=1"
      in the xfrm{4,6}_policy.c files:
      
      net/ipv6/xfrm6_policy.c:369:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
       static int inline xfrm6_net_sysctl_init(struct net *net)
      net/ipv6/xfrm6_policy.c:374:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
       static void inline xfrm6_net_sysctl_exit(struct net *net)
      net/ipv4/xfrm4_policy.c:339:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
       static int inline xfrm4_net_sysctl_init(struct net *net)
      net/ipv4/xfrm4_policy.c:344:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
       static void inline xfrm4_net_sysctl_exit(struct net *net)
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
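The warning concerns only the order of the specifiers. A minimal illustration, using a hypothetical function name rather than the xfrm ones:

```c
/* Old style, flagged by -Wold-style-declaration:
 *
 *     static int inline sketch_sysctl_init(void)
 *
 * Preferred order: storage class, then inline, then the return type. */
static inline int sketch_sysctl_init(void)
{
	return 0;
}
```

Both spellings declare the same function; only the second is free of the warning.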
  14. 16 Jun, 2016 2 commits
    • gre: fix error handler · e582615a
      Committed by Eric Dumazet
      1) gre_parse_header() can be called from gre_err()
      
         At this point transport header points to ICMP header, not the inner
      header.
      
      2) We can not really change transport header as ipgre_err() will later
      assume transport header still points to ICMP header (using icmp_hdr())
      
      3) pskb_may_pull() logic in gre_parse_header() only really works
        if we are interested in the zone pointed to by skb->data
      
      4) As Jiri explained in commit b7f8fe25 ("gre: do not pull header in
      ICMP error processing") we should not pull headers in error handler.
      
      So this fix:

      A) Changes gre_parse_header() to use skb->data instead of
      skb_transport_header().

      B) Adds a nhs parameter to gre_parse_header() so that we can skip the
      IP header that has not been pulled on the error path.
        This offset is 0 for the normal receive path.

      C) Removes obsolete IPV6 includes.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
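The idea of parsing relative to the data pointer plus an explicit offset can be sketched in plain C as below. This is a simplified userspace illustration, not gre_parse_header() itself: the function name, the flat byte buffer in place of an skb, and the `SKETCH_*` constants are assumptions, with the flag bit values taken from the GRE header layout (checksum, key, sequence).

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_GRE_CSUM 0x8000	/* checksum present */
#define SKETCH_GRE_KEY  0x2000	/* key present */
#define SKETCH_GRE_SEQ  0x1000	/* sequence number present */

/* Returns the GRE header length found at data + nhs, or -1 if the
 * buffer is too short. nhs is the number of bytes to skip first:
 * 0 on the normal receive path, the length of the not-yet-pulled
 * IP header on the error path. */
static int gre_hdr_len_sketch(const uint8_t *data, size_t len, size_t nhs)
{
	size_t hdr_len = 4;	/* flags/version + protocol */
	uint16_t flags;

	if (len < nhs + hdr_len)
		return -1;

	flags = (uint16_t)((data[nhs] << 8) | data[nhs + 1]);
	if (flags & SKETCH_GRE_CSUM)
		hdr_len += 4;
	if (flags & SKETCH_GRE_KEY)
		hdr_len += 4;
	if (flags & SKETCH_GRE_SEQ)
		hdr_len += 4;

	if (len < nhs + hdr_len)
		return -1;
	return (int)hdr_len;
}
```

The same routine then serves both callers, and nothing has to move the transport header that the ICMP error handling still relies on.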
    • net: ipv4: Add ability to have GRE ignore DF bit in IPv4 payloads · 22a59be8
      Committed by Philip Prindeville
      In the presence of firewalls which improperly block ICMP Unreachable
      (including Fragmentation Required) messages, Path MTU Discovery is
      prevented from working.

      A workaround is to handle IPv4 payloads opaquely, ignoring the DF bit (as
      is done for other payloads like AppleTalk) and doing transparent
      fragmentation and reassembly.

      Redux includes the enforcement of mutual exclusion between this feature
      and Path MTU Discovery as suggested by Alexander Duyck.

      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Philip Prindeville <philipp@redfish-solutions.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 15 Jun, 2016 5 commits
  16. 12 Jun, 2016 1 commit
  17. 11 Jun, 2016 3 commits
  18. 09 Jun, 2016 1 commit
    • net: Add l3mdev rule · 96c63fa7
      Committed by David Ahern
      Currently, VRFs require 1 oif and 1 iif rule per address family per
      VRF. As the number of VRF devices increases it brings scalability
      issues with the increasing rule list. All of the VRF rules have the
      same format with the exception of the specific table id to direct the
      lookup. Since the table id is available from the oif or iif in the
      loopup, the VRF rules can be consolidated to a single rule that pulls
      the table from the VRF device.
      
      This patch introduces a new rule attribute l3mdev. The l3mdev rule
      means the table id used for the lookup is pulled from the L3 master
      device (e.g., VRF) rather than being statically defined. With the
      l3mdev rule all of the basic VRF FIB rules are reduced to 1 l3mdev
      rule per address family (IPv4 and IPv6).
      
      If an admin wishes to insert higher priority rules for specific VRFs
      those rules will co-exist with the l3mdev rule. This capability means
      current VRF scripts will co-exist with this new simpler implementation.
      
      Currently, the rules list for both ipv4 and ipv6 look like this:
          $ ip  ru ls
          1000:       from all oif vrf1 lookup 1001
          1000:       from all iif vrf1 lookup 1001
          1000:       from all oif vrf2 lookup 1002
          1000:       from all iif vrf2 lookup 1002
          1000:       from all oif vrf3 lookup 1003
          1000:       from all iif vrf3 lookup 1003
          1000:       from all oif vrf4 lookup 1004
          1000:       from all iif vrf4 lookup 1004
          1000:       from all oif vrf5 lookup 1005
          1000:       from all iif vrf5 lookup 1005
          1000:       from all oif vrf6 lookup 1006
          1000:       from all iif vrf6 lookup 1006
          1000:       from all oif vrf7 lookup 1007
          1000:       from all iif vrf7 lookup 1007
          1000:       from all oif vrf8 lookup 1008
          1000:       from all iif vrf8 lookup 1008
          ...
          32765:      from all lookup local
          32766:      from all lookup main
          32767:      from all lookup default
      
      With the l3mdev rule the list is just the following regardless of the
      number of VRFs:
          $ ip ru ls
          1000:       from all lookup [l3mdev table]
          32765:      from all lookup local
          32766:      from all lookup main
          32767:      from all lookup default
      
      (Note: the above pretty print of the rule is based on an iproute2
             prototype. Actual verbiage may change)
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
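The difference between a static-table rule and an l3mdev rule can be sketched as a small lookup helper. The types and names here (`struct sketch_dev`, `struct sketch_fib_rule`, `rule_table()`) are hypothetical and only model the idea that the table id comes from the L3 master device of the oif or iif.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device: l3mdev_table is 0 when the device has no L3 master. */
struct sketch_dev {
	int ifindex;
	uint32_t l3mdev_table;
};

/* Hypothetical rule: either a fixed table id or "take it from the device". */
struct sketch_fib_rule {
	int is_l3mdev;
	uint32_t table;
};

/* Returns the table id to use for the lookup, or 0 if the rule does not apply. */
static uint32_t rule_table(const struct sketch_fib_rule *rule,
			   const struct sketch_dev *dev)
{
	if (!rule->is_l3mdev)
		return rule->table;	/* classic rule: statically defined table */
	if (dev == NULL)
		return 0;		/* no oif/iif: an l3mdev rule cannot match */
	return dev->l3mdev_table;	/* table pulled from the L3 master device */
}
```

One such rule covers any number of VRFs, because the per-VRF part lives in the device rather than in the rule list.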
  19. 08 Jun, 2016 2 commits
    • tcp: accept RST if SEQ matches right edge of right-most SACK block · e00431bc
      Committed by Pau Espin Pedrol
      RFC 5961 advises to only accept RST packets containing a seq number
      matching the next expected seq number instead of the whole receive
      window in order to avoid spoofing attacks.
      
      However, this situation is not optimal when SACK is in use at the
      time the RST is sent. I recently ran into a scenario in which packet
      losses were high while uploading data to a server, and userspace was
      willing to frequently terminate connections by sending a RST. In
      this case, the ACK sent on the receiver side (rcv_nxt) is frozen waiting
      for a lost packet retransmission and SACK blocks are used to let the
      client continue uploading data. At some point later on, the client sends
      the RST (snd_nxt), which matches the next expected seq number of the
      right-most SACK block on the receiver side which is going forward
      receiving data.
      
      In this scenario, as RFC 5961 defines, the RST SEQ doesn't match the
      frozen main ACK at the receiver side and thus gets dropped, and a challenge
      ACK is sent, which usually gets lost due to network conditions. The main
      consequence is that the connection stays alive for a while even if it
      made sense to accept the RST. This can get really bad if lots of
      connections like this one are created in a few seconds, easily consuming
      all the resources of the server.
      
      For security reasons, not all SACK blocks are checked (there could be a
      large number of SACK blocks => acceptable SEQ numbers). Furthermore, it
      wouldn't make sense to check for RST in blocks other than the right-most
      received one because the sender is not expected to be sending new data
      after the RST. For simplicity, only up to the 4 most recently updated
      SACK blocks (selective_acks[4] field) are compared to find the
      right-most block, as those are usually the ones most likely
      to contain it.

      This patch was tested in a 3.18 kernel and proved to improve the
      situation in the scenario described above.
      Signed-off-by: Pau Espin Pedrol <pau.espin@tessares.net>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
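The acceptance rule described above can be sketched as follows: a RST is taken if its sequence number equals rcv_nxt, or the right edge of the right-most of the tracked SACK blocks. This is an illustrative userspace model under assumed names (`struct sack_block`, `seq_after()`, `rst_seq_acceptable()`), not the patched kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

struct sack_block {
	uint32_t start_seq;
	uint32_t end_seq;
};

/* Wraparound-safe "a comes after b" for 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* sacks holds the few most recently updated SACK blocks (at most 4 in
 * the commit's description); num_sacks is how many are valid. */
static bool rst_seq_acceptable(uint32_t rst_seq, uint32_t rcv_nxt,
			       const struct sack_block *sacks, int num_sacks)
{
	uint32_t max_end;
	int i;

	if (rst_seq == rcv_nxt)
		return true;		/* the plain RFC 5961 match */
	if (num_sacks <= 0)
		return false;

	/* Find the right edge of the right-most block. */
	max_end = sacks[0].end_seq;
	for (i = 1; i < num_sacks; i++)
		if (seq_after(sacks[i].end_seq, max_end))
			max_end = sacks[i].end_seq;

	return rst_seq == max_end;
}
```

Only one extra sequence number becomes acceptable, which keeps the spoofing window close to what RFC 5961 intends.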
    • gue: Implement direction IP encapsulation · c1e48af7
      Committed by Tom Herbert
      This patch implements direct encapsulation of IPv4 and IPv6 packets
      in UDP. This is done as version "1" of GUE, as explained in I-D
      draft-ietf-nvo3-gue-03.

      Changes here are only in the receive path; fou with IPxIPx already
      supports the transmit side. Both the normal receive path and the
      GRO path are modified to check for the GUE version and to check the
      IP version in the case that the GUE version is "1".
      
      Tested:
      
      IPIP with direct GUE encap
        1 TCP_STREAM
          4530 Mbps
        200 TCP_RR
          1297625 tps
          135/232/444 90/95/99% latencies
      
      IP4IP6 with direct GUE encap
        1 TCP_STREAM
          4903 Mbps
        200 TCP_RR
          1184481 tps
          149/253/473 90/95/99% latencies
      
      IP6IP6 direct GUE encap
        1 TCP_STREAM
          5146 Mbps
        200 TCP_RR
          1202879 tps
          146/251/472 90/95/99% latencies
      
      SIT with direct GUE encap
        1 TCP_STREAM
          6111 Mbps
        200 TCP_RR
          1250337 tps
          139/241/467 90/95/99% latencies
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
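The receive-side check can be pictured with the sketch below. It assumes, as I read the draft, that the GUE version occupies the top two bits of the first payload byte, so an IPv4 or IPv6 header (version nibble 4 or 6) starts with the bit pattern 01 and therefore reads as GUE version 1. The enum and function names are hypothetical.

```c
#include <stdint.h>

enum gue_payload {
	GUE_V0_HEADER,		/* regular GUE header follows */
	GUE_DIRECT_IPV4,	/* version 1: payload is an IPv4 packet */
	GUE_DIRECT_IPV6,	/* version 1: payload is an IPv6 packet */
	GUE_DROP
};

/* Classify a UDP payload from its first byte. */
static enum gue_payload gue_classify(uint8_t first_byte)
{
	switch (first_byte >> 6) {		/* 2-bit GUE version */
	case 0:
		return GUE_V0_HEADER;
	case 1:
		switch (first_byte >> 4) {	/* IP version nibble */
		case 4:
			return GUE_DIRECT_IPV4;
		case 6:
			return GUE_DIRECT_IPV6;
		default:
			return GUE_DROP;
		}
	default:
		return GUE_DROP;
	}
}
```

A typical IPv4 first byte of 0x45 and an IPv6 first byte of 0x60 both fall into the version 1 branch.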
  20. 06 Jun, 2016 1 commit
    • net: disable fragment reassembly if high_thresh is zero · 30759219
      Committed by Michal Kubeček
      Before commit 6d7b857d ("net: use lib/percpu_counter API for
      fragmentation mem accounting"), setting the reassembly high threshold
      to 0 prevented fragment reassembly, as the first fragment would always be
      evicted before the second could be added to the queue. While inefficient,
      some users apparently relied on this method.

      Since the commit mentioned above, a percpu counter is used for
      reassembly memory accounting, and the high batch size avoids taking the
      slow path in the most common scenarios. As a result, a whole full-sized
      packet can be reassembled without the percpu counter's main counter
      changing its value, so that even with high_thresh set to 0, fragmented
      packets can still be reassembled and processed.

      Add an explicit check preventing reassembly if the high threshold is zero.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
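The effect of the explicit check can be sketched like this. The structure and function are hypothetical; `mem_estimate` stands for the approximate value read from the batched counter, which may lag behind the memory actually in use.

```c
#include <stdbool.h>

/* Hypothetical view of the reassembly limits. */
struct sketch_frag_limits {
	long high_thresh;	/* configured high threshold, 0 = disabled */
	long mem_estimate;	/* possibly stale reading of the batched counter */
};

/* Decide whether a new reassembly queue may be created. */
static bool may_start_reassembly(const struct sketch_frag_limits *f)
{
	/* Explicit check: a zero threshold disables reassembly even when
	 * the stale estimate still reads 0 and would pass the comparison. */
	if (f->high_thresh == 0)
		return false;
	return f->mem_estimate <= f->high_thresh;
}
```

Without the first test, a threshold of 0 and an estimate of 0 would compare as "within limits".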
  21. 04 Jun, 2016 1 commit
  22. 03 Jun, 2016 1 commit
  23. 24 May, 2016 1 commit
    • ipv4: Fix non-initialized TTL when CONFIG_SYSCTL=n · 049bbf58
      Committed by Ezequiel Garcia
      Commit fa50d974 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
      moves the default TTL assignment, and as a side effect IPv4 TTL now
      has a default value only if sysctl support is enabled (CONFIG_SYSCTL=y).

      The sysctl_ip_default_ttl is fundamental for IP to work properly,
      as it provides the TTL to be used as default. The default TTL may be
      used in ip_select_ttl, through the following flow:
      
        ip_select_ttl
          ip4_dst_hoplimit
            net->ipv4.sysctl_ip_default_ttl
      
      This commit fixes the issue by assigning net->ipv4.sysctl_ip_default_ttl
      in net_init_net, called during ipv4's initialization.
      
      Without this commit, a kernel built without sysctl support will send
      all IP packets with zero TTL (unless a TTL is explicitly set, e.g.
      with setsockopt).
      
      Given a similar issue might appear on the other knobs that were
      namespaceified, this commit also moves them.

      Fixes: fa50d974 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
      Signed-off-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
      Signed-off-by: David S. Miller <davem@davemloft.net>