1. 29 Aug, 2015 4 commits
    • net: Add helper function to compare inetpeer addresses · d39d14ff
      Committed by David Ahern
      tcp_metrics and inetpeer both have functions to compare inetpeer
      addresses. Consolidate into 1 version.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add set,get helpers for inetpeer addresses · 3abef286
      Committed by David Ahern
      Use inetpeer set,get helpers in tcp_metrics rather than peeking into
      the inetpeer_addr struct.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Introduce ipv4_addr_hash and use it for tcp metrics · 72afa352
      Committed by David Ahern
      Refactors a common line into a helper function.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • IGMP: Inhibit reports for local multicast groups · df2cf4a7
      Committed by Philip Downey
      The range of addresses between 224.0.0.0 and 224.0.0.255, inclusive, is
      reserved for the use of routing protocols and other low-level topology
      discovery or maintenance protocols, such as gateway discovery and
      group membership reporting.  Multicast routers should not forward any
      multicast datagram with destination addresses in this range,
      regardless of its TTL.
      
      Currently, IGMP reports are generated for this reserved range of
      addresses even though a router will ignore this information since it
      has no purpose.  However, the presence of reserved group addresses in
      an IGMP membership report uses up network bandwidth and can also
      obscure addresses of interest when inspecting membership reports using
      packet inspection or debug messages.
      
      Although the RFCs for the various versions of IGMP (e.g. RFC 3376 for
      v3) do not specify that the reserved addresses be excluded from
      membership reports, excluding them should do no harm.  In particular
      there should be no adverse effect in any IGMP snooping functionality
      since 224.0.0.x is specifically excluded as per RFC 4541 (IGMP and MLD
      Snooping Switches Considerations) section 2.1.2. Data Forwarding
      Rules:
      
          2) Packets with a destination IP (DIP) address in the 224.0.0.X
             range which are not IGMP must be forwarded on all ports.
      
      IGMP reports for local multicast groups can now be optionally
      inhibited by means of a system control variable (by setting the value
      to zero) e.g.:
          echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports
      
      To retain backwards compatibility the previous behaviour is retained
      by default on system boot or reverted by setting the value back to
      non-zero e.g.:
          echo 1 >  /proc/sys/net/ipv4/igmp_link_local_mcast_reports
      Signed-off-by: Philip Downey <pdowney@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
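The filtering rule described in the IGMP commit above can be sketched roughly as follows. This is a userspace Python model, not the kernel code: the function names `is_link_local_mcast` and `should_send_report` and the `link_local_reports` parameter are hypothetical stand-ins for the kernel logic and for the `igmp_link_local_mcast_reports` sysctl.

```python
import ipaddress

# 224.0.0.0 through 224.0.0.255 inclusive: the reserved range named in
# the commit message.
LINK_LOCAL_MCAST = ipaddress.ip_network("224.0.0.0/24")


def is_link_local_mcast(group: str) -> bool:
    """Return True if the group address lies in 224.0.0.0/24."""
    return ipaddress.ip_address(group) in LINK_LOCAL_MCAST


def should_send_report(group: str, link_local_reports: bool = True) -> bool:
    """Hypothetical model of the report decision.

    link_local_reports mirrors the sysctl: True (non-zero, the default)
    keeps the previous behaviour; False (zero) inhibits reports for the
    reserved range only.
    """
    if is_link_local_mcast(group) and not link_local_reports:
        return False
    return True
```

With the flag cleared, a group such as 224.0.0.5 would be left out of reports, while a group outside the reserved range, such as 239.1.1.1, would still be reported.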
  2. 28 Aug, 2015 2 commits
  3. 26 Aug, 2015 4 commits
    • ah4: Fix error return in ah_input(). · 94c10f0e
      Committed by David S. Miller
      Noticed by Herbert Xu.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: refine pacing rate determination · 43e122b0
      Committed by Eric Dumazet
      When TCP pacing was added back in linux-3.12, we chose
      to apply a fixed ratio of 200 % against current rate,
      to allow probing for optimal throughput even during
      slow start phase, where cwnd can be doubled every other RTT.
      
      At Google, we found it was better applying a different ratio
      while in Congestion Avoidance phase.
      This ratio was set to 120 %.
      
      We've used the normal tcp_in_slow_start() helper for a while,
      then tuned the condition to select the conservative ratio
      as soon as cwnd >= ssthresh/2 :
      
      - After cwnd reduction, it is safer to ramp up more slowly,
        as we approach optimal cwnd.
      - Initial ramp up (ssthresh == INFINITY) still allows doubling
        cwnd every other RTT.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
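The ratio selection described in the pacing commit above can be illustrated with a small Python model. It is a sketch under stated assumptions, not the kernel implementation: the function names are hypothetical, and the rate formula (ratio × MSS × cwnd / smoothed RTT) is a simplification chosen for illustration. Only the 200 % and 120 % ratios and the `cwnd >= ssthresh/2` condition come from the commit message.

```python
SLOW_START_RATIO = 200  # percent, used while probing for bandwidth
CONG_AVOID_RATIO = 120  # percent, the conservative ratio


def pacing_ratio_percent(cwnd: int, ssthresh: float) -> int:
    """Pick the conservative ratio as soon as cwnd >= ssthresh / 2."""
    if cwnd < ssthresh / 2:
        return SLOW_START_RATIO
    return CONG_AVOID_RATIO


def pacing_rate_bytes_per_sec(mss: int, cwnd: int, srtt_us: int,
                              ssthresh: float) -> int:
    """Simplified rate: ratio * (mss * cwnd) bytes per smoothed RTT."""
    ratio = pacing_ratio_percent(cwnd, ssthresh)
    return ratio * mss * cwnd * 1_000_000 // (100 * srtt_us)
```

During the initial ramp-up, `ssthresh` can be modelled as `float("inf")`, so the 200 % ratio always applies; after a cwnd reduction the 120 % ratio takes over once cwnd reaches half of ssthresh.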
    • xfrm: Use VRF master index if output device is enslaved · 4ec3b28c
      Committed by David Ahern
      Directs route lookups to VRF table. Compiles out if NET_VRF is not
      enabled. With this patch, ipsec tunnels can be brought up successfully
      in VRFs, even with duplicate network configuration.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix slow start after idle vs TSO/GSO · 6f021c62
      Committed by Eric Dumazet
      slow start after idle might reduce cwnd, but we perform this
      after first packet was cooked and sent.
      
      With TSO/GSO, it means that we might send a full TSO packet
      even if cwnd should have been reduced to IW10.
      
      Moving the SSAI check in skb_entail() makes sense, because
      we slightly reduce number of times this check is done,
      especially for large send() and TCP Small queue callbacks from
      softirq context.
      
      As Neal pointed out, we also need to perform the check
      if/when receive window opens.
      
      Tested:
      
      The following packetdrill test demonstrates the problem:
      // Test of slow start after idle
      
      `sysctl -q net.ipv4.tcp_slow_start_after_idle=1`
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +0    < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.100 < . 1:1(0) ack 1 win 511
      +0    accept(3, ..., ...) = 4
      +0    setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0
      
      +0    write(4, ..., 26000) = 26000
      +0    > . 1:5001(5000) ack 1
      +0    > . 5001:10001(5000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10 }%
      
      +.100 < . 1:1(0) ack 10001 win 511
      +0    %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
      +0    > . 10001:20001(10000) ack 1
      +0    > P. 20001:26001(6000) ack 1
      
      +.100 < . 1:1(0) ack 26001 win 511
      +0    %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%
      
      +4 write(4, ..., 20000) = 20000
      // If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
      +0    > . 26001:31001(5000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
      +0    > . 31001:36001(5000) ack 1
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
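To make the expected cwnd values in the packetdrill test above easier to follow, here is a rough Python model of a slow-start-after-idle reduction. It assumes the window is halved once per elapsed RTO and never drops below the initial window; the function name, the millisecond units, and the 300 ms RTO used in the usage note are illustrative assumptions, not values taken from the commit.

```python
def cwnd_after_idle(cwnd: int, idle_ms: int, rto_ms: int,
                    init_cwnd: int = 10) -> int:
    """Model of cwnd after an idle period (assumed halving per RTO).

    The window is never reduced below min(init_cwnd, cwnd).
    """
    restart_cwnd = min(init_cwnd, cwnd)
    remaining = idle_ms - rto_ms
    while remaining > 0 and cwnd > restart_cwnd:
        cwnd >>= 1          # halve for each further RTO spent idle
        remaining -= rto_ms
    return max(cwnd, restart_cwnd)
```

For the test scenario (cwnd of 36, then 4 seconds idle), `cwnd_after_idle(36, 4000, 300)` yields 10 under these assumptions, which agrees with the `tcpi_snd_cwnd == 10` assertion after the second write.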
  4. 25 Aug, 2015 1 commit
  5. 24 Aug, 2015 2 commits
  6. 21 Aug, 2015 6 commits
  7. 20 Aug, 2015 1 commit
    • vrf: vrf_master_ifindex_rcu is not always called with rcu read lock · 18041e31
      Committed by Nikolay Aleksandrov
      While running net-next I hit this:
      [  634.073119] ===============================
      [  634.073150] [ INFO: suspicious RCU usage. ]
      [  634.073182] 4.2.0-rc6+ #45 Not tainted
      [  634.073213] -------------------------------
      [  634.073244] include/net/vrf.h:38 suspicious rcu_dereference_check()
      usage!
      [  634.073274]
                     other info that might help us debug this:
      
      [  634.073307]
                     rcu_scheduler_active = 1, debug_locks = 1
      [  634.073338] 2 locks held by swapper/0/0:
      [  634.073369]  #0:  (((&n->timer))){+.-...}, at: [<ffffffff8112bc35>]
      call_timer_fn+0x5/0x480
      [  634.073412]  #1:  (slock-AF_INET){+.-...}, at: [<ffffffff8174f0f5>]
      icmp_send+0x155/0x5f0
      [  634.073450]
                     stack backtrace:
      [  634.073483] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6+ #45
      [  634.073514] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
      VirtualBox 12/01/2006
      [  634.073545]  0000000000000000 0593ba8242d9ace4 ffff88002fc03b48
      ffffffff81803f1b
      [  634.073612]  0000000000000000 ffffffff81e12500 ffff88002fc03b78
      ffffffff811003c5
      [  634.073642]  0000000000000000 ffff88002ec4e600 ffffffff81f00f80
      ffff88002fc03cf0
      [  634.073669] Call Trace:
      [  634.073694]  <IRQ>  [<ffffffff81803f1b>] dump_stack+0x4c/0x65
      [  634.073728]  [<ffffffff811003c5>] lockdep_rcu_suspicious+0xc5/0x100
      [  634.073763]  [<ffffffff8174eb56>] icmp_route_lookup+0x176/0x5c0
      [  634.073793]  [<ffffffff8174f2fb>] ? icmp_send+0x35b/0x5f0
      [  634.073818]  [<ffffffff8174f274>] ? icmp_send+0x2d4/0x5f0
      [  634.073844]  [<ffffffff8174f3ce>] icmp_send+0x42e/0x5f0
      [  634.073873]  [<ffffffff8170b662>] ipv4_link_failure+0x22/0xa0
      [  634.073899]  [<ffffffff8174bdda>] arp_error_report+0x3a/0x80
      [  634.073926]  [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
      [  634.073952]  [<ffffffff816d396e>] neigh_invalidate+0x8e/0x110
      [  634.073984]  [<ffffffff816d62ae>] neigh_timer_handler+0x1ae/0x290
      [  634.074013]  [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
      [  634.074013]  [<ffffffff8112bce3>] call_timer_fn+0xb3/0x480
      [  634.074013]  [<ffffffff8112bc35>] ? call_timer_fn+0x5/0x480
      [  634.074013]  [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
      [  634.074013]  [<ffffffff8112c2bc>] run_timer_softirq+0x20c/0x430
      [  634.074013]  [<ffffffff810af50e>] __do_softirq+0xde/0x630
      [  634.074013]  [<ffffffff810afc97>] irq_exit+0x117/0x120
      [  634.074013]  [<ffffffff81810976>] smp_apic_timer_interrupt+0x46/0x60
      [  634.074013]  [<ffffffff8180e950>] apic_timer_interrupt+0x70/0x80
      [  634.074013]  <EOI>  [<ffffffff8106b9d6>] ? native_safe_halt+0x6/0x10
      [  634.074013]  [<ffffffff81101d8d>] ? trace_hardirqs_on+0xd/0x10
      [  634.074013]  [<ffffffff81027d43>] default_idle+0x23/0x200
      [  634.074013]  [<ffffffff8102852f>] arch_cpu_idle+0xf/0x20
      [  634.074013]  [<ffffffff810f89ba>] default_idle_call+0x2a/0x40
      [  634.074013]  [<ffffffff810f8dcc>] cpu_startup_entry+0x39c/0x4c0
      [  634.074013]  [<ffffffff817f9cad>] rest_init+0x13d/0x150
      [  634.074013]  [<ffffffff81f69038>] start_kernel+0x4a8/0x4c9
      [  634.074013]  [<ffffffff81f68120>] ?
      early_idt_handler_array+0x120/0x120
      [  634.074013]  [<ffffffff81f68339>] x86_64_start_reservations+0x2a/0x2c
      [  634.074013]  [<ffffffff81f68485>] x86_64_start_kernel+0x14a/0x16d
      
      It would seem vrf_master_ifindex_rcu() can be called without RCU held in
      other contexts as well so introduce a new helper which acquires rcu and
      returns the ifindex.
      Also add curly braces around both the "if" and "else" parts as per the
      style guide.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 19 Aug, 2015 2 commits
  9. 18 Aug, 2015 7 commits
    • net: Change pseudohdr argument of inet_proto_csum_replace* to be a bool · 4b048d6d
      Committed by Tom Herbert
      inet_proto_csum_replace4,2,16 take a pseudohdr argument which indicates
      the checksum field carries a pseudo header. This argument should be a
      boolean instead of an int.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lwt: Add support to redirect dst.input · 25368623
      Committed by Tom Herbert
      This patch adds the capability to redirect dst input in the same way
      that dst output is redirected by LWT.
      
      Also, save the original dst.input and dst.out when setting up
      lwtunnel redirection. These can be called by the client as a
      pass-through.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: nf_conntrack: add efficient mark to zone mapping · 5e8018fc
      Committed by Daniel Borkmann
      This work adds the possibility of deriving the zone id from the skb->mark
      field in a scalable manner. This allows for having only a single template
      serving hundreds/thousands of different zones, for example, instead of the
      need to have one match for each zone as an extra CT jump target.
      
      Note that we'd need to have this information attached to the template as at
      the time when we're trying to lookup a possible ct object, we already need
      to know zone information for a possible match when going into
      __nf_conntrack_find_get(). This work provides a minimal implementation for
      a possible mapping.
      
      In order to not add/expose an extra ct->status bit, the zone structure has
      been extended to carry a flag for deriving the mark.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
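The mark-to-zone idea in the commit above can be pictured with a short Python model. It is only a sketch: the `Zone` class, the `FLAG_MARK` constant, and `zone_for_packet` are hypothetical names chosen for illustration, standing in for a zone structure that carries a "derive from mark" flag as the commit message describes.

```python
from dataclasses import dataclass

DEFAULT_ZONE_ID = 0
FLAG_MARK = 1 << 0  # hypothetical flag: derive the zone id from the mark


@dataclass(frozen=True)
class Zone:
    id: int = DEFAULT_ZONE_ID
    flags: int = 0


def zone_for_packet(template_zone: Zone, skb_mark: int) -> Zone:
    """Return the zone to use for a conntrack lookup.

    If the template's zone carries FLAG_MARK, the packet mark becomes
    the zone id; otherwise the template's static zone is used as-is.
    """
    if template_zone.flags & FLAG_MARK:
        return Zone(id=skb_mark, flags=template_zone.flags)
    return template_zone
```

A single template with the flag set can therefore serve many zones: packets marked 1 and 2 resolve to zones 1 and 2 without a separate rule per zone.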
    • netfilter: nf_conntrack: add direction support for zones · deedb590
      Committed by Daniel Borkmann
      This work adds a direction parameter to netfilter zones, so identity
      separation can be performed only in original/reply or both directions
      (default). This basically opens up the possibility of doing NAT with
      conflicting IP address/port tuples from multiple, isolated tenants
      on a host (e.g. from a netns) without requiring each tenant to NAT
      twice resp. to use its own dedicated IP address to SNAT to, meaning
      overlapping tuples can be made unique with the zone identifier in
      original direction, where the NAT engine will then allocate a unique
      tuple in the commonly shared default zone for the reply direction.
      In some restricted, local DNAT cases, also port redirection could be
      used for making the reply traffic unique w/o requiring SNAT.
      
      The consensus we've reached and discussed at NFWS and since the initial
      implementation [1] was to directly integrate the direction meta data
      into the existing zones infrastructure, as opposed to the ct->mark
      approach we proposed initially.
      
      As we pass the nf_conntrack_zone object directly around, we don't have
      to touch all call-sites, but only those, that contain equality checks
      of zones. Thus, based on the current direction (original or reply),
      we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID.
      CT expectations are direction-agnostic entities when expectations are
      being compared among themselves, so we can only use the identifier
      in this case.
      
      Note that zone identifiers can not be included into the hash mix
      anymore as they don't contain a "stable" value that would be equal
      for both directions at all times, f.e. if only zone->id would
      unconditionally be xor'ed into the table slot hash, then replies won't
      find the corresponding conntracking entry anymore.
      
      If no particular direction is specified when configuring zones, the
      behaviour is exactly as we expect currently (both directions).
      
      Support has been added for the CT netlink interface as well as the
      x_tables raw CT target, which both already offer existing interfaces
      to user space for the configuration of zones.
      
      Below is a minimal, simplified collision example (script in [2]) with
      netperf sessions:
      
        +--- tenant-1 ---+   mark := 1
        |    netperf     |--+
        +----------------+  |                CT zone := mark [ORIGINAL]
         [ip,sport] := X   +--------------+  +--- gateway ---+
                           | mark routing |--|     SNAT      |-- ... +
                           +--------------+  +---------------+       |
        +--- tenant-2 ---+  |                                     ~~~|~~~
        |    netperf     |--+                +-----------+           |
        +----------------+   mark := 2       | netserver |------ ... +
         [ip,sport] := X                     +-----------+
                                              [ip,port] := Y
      On the gateway netns, example:
      
        iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL
        iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully
      
        iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark
        iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark
      
      conntrack dump from gateway netns:
      
        netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns
      
        tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1
                                 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024
                     [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1
      
        tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2
                                 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555
                     [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1
      
        tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1
                              src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438
                     [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1
      
        tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2
                              src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889
                     [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2
      
      Taking this further, test script in [2] creates 200 tenants and runs
      original-tuple colliding netperf sessions each. A conntrack -L dump in
      the gateway netns also confirms 200 overlapping entries, all in ESTABLISHED
      state as expected.
      
      I also did run various other tests with some permutations of the script,
      to mention some: SNAT in random/random-fully/persistent mode, no zones (no
      overlaps), static zones (original, reply, both directions), etc.
      
        [1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/
        [2] https://paste.fedoraproject.org/242835/65657871/
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
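The direction-aware zone comparison described in the commit above can be modelled in a few lines of Python. This is an illustrative sketch, not the kernel code: all constant and function names here are hypothetical, and the lookup key is simplified to a (tuple, zone id) pair.

```python
DIR_ORIGINAL = 0
DIR_REPLY = 1

# Hypothetical direction bitmask for a zone.
ZONE_DIR_ORIG = 1 << DIR_ORIGINAL
ZONE_DIR_REPL = 1 << DIR_REPLY
ZONE_DIR_BOTH = ZONE_DIR_ORIG | ZONE_DIR_REPL  # default behaviour

DEFAULT_ZONE_ID = 0


def effective_zone_id(zone_id: int, zone_dir: int, direction: int) -> int:
    """Return the actual id if the zone applies to this direction,
    otherwise the shared default zone id."""
    if zone_dir & (1 << direction):
        return zone_id
    return DEFAULT_ZONE_ID


def lookup_key(tuple_, zone_id: int, zone_dir: int, direction: int):
    """Simplified conntrack key: the tuple plus the effective zone id."""
    return (tuple_, effective_zone_id(zone_id, zone_dir, direction))


# Two tenants send from the same [ip, sport] X to the same server Y.
X_TO_Y = ("40.1.1.1", 5555, "10.1.1.2", 12865)
key_t1 = lookup_key(X_TO_Y, 1, ZONE_DIR_ORIG, DIR_ORIGINAL)
key_t2 = lookup_key(X_TO_Y, 2, ZONE_DIR_ORIG, DIR_ORIGINAL)
```

In this model `key_t1` and `key_t2` differ even though the original tuples collide, while in the reply direction both zones collapse to the default id, which is why the NAT engine has to allocate a unique reply tuple.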
    • inet: Move VRF table lookup to inlined function · dc028da5
      Committed by David Ahern
      Table lookup compiles out when VRF is not enabled.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lwtunnel: rename ip lwtunnel attributes · a1c234f9
      Committed by Jiri Benc
      We already have IFLA_IPTUN_ netlink attributes. The IP_TUN_ attributes look
      very similar, yet they serve a very different purpose. This is confusing for
      anyone trying to implement a user space tool supporting lwt.
      
      As the IP_TUN_ attributes are used only for the lightweight tunnels, prefix
      them with LWTUNNEL_IP_ instead to make their purpose clear. Also, it's more
      logical to have them in lwtunnel.h together with the encap enum.
      
      Fixes: 3093fbe7 ("route: Per route IP tunnel metadata via lightweight tunnel")
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "net: limit tcp/udp rmem/wmem to SOCK_{RCV,SND}BUF_MIN" · 5d37852b
      Committed by Calvin Owens
      Commit 8133534c ("net: limit tcp/udp rmem/wmem to
      SOCK_{RCV,SND}BUF_MIN") modified four sysctls to enforce that the values
      written to them are not less than SOCK_MIN_{RCV,SND}BUF.
      
      That change causes 4096 to no longer be accepted as a valid value for
      'min' in tcp_wmem and udp_wmem_min. 4096 has been the default for both
      of those sysctls for a long time, and unfortunately seems to be an
      extremely popular setting. This change breaks a large number of sysctl
      configurations at Facebook.
      
      That commit referred to b1cb59cf ("net: sysctl_net_core: check
      SNDBUF and RCVBUF for min length"), which chose to use the SOCK_MIN
      constants as the lower limits to avoid nasty bugs. But AFAICS, a limit
      of SOCK_MIN_SNDBUF isn't necessary to do that: the BUG_ON cited in the
      commit message seems to have happened because unix_stream_sendmsg()
      expects a minimum of a full page (ie SK_MEM_QUANTUM) and the math broke,
      not because it had less than SOCK_MIN_SNDBUF allocated.
      
      This particular issue doesn't seem to affect TCP however: using a
      setting of "1 1 1" for tcp_{r,w}mem works, although it's obviously
      suboptimal. SK_MEM_QUANTUM would be a nice minimum, but it's 64K on
      some archs, so there would still be breakage.
      
      Since a value of one doesn't seem to cause any problems, we can drop the
      minimum 8133534c added to fix this.
      
      This reverts commit 8133534c.
      
      Fixes: 8133534c ("net: limit tcp/udp rmem/wmem to SOCK_MIN...")
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sorin Dumitru <sorin@returnze.ro>
      Signed-off-by: Calvin Owens <calvinowens@fb.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 17 Aug, 2015 1 commit
  11. 14 Aug, 2015 10 commits