1. 17 Feb, 2014 1 commit
  2. 14 Feb, 2014 2 commits
    • F
      ipv4: ipconfig.c: add parentheses in an if statement · 357137a4
      FX Le Bail committed
      Even though the 'time_before' macro expands with parentheses, the condition looks bad without them.
      Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      357137a4
    • F
      net: ip, ipv6: handle gso skbs in forwarding path · fe6cc55f
      Florian Westphal committed
      Marcelo Ricardo Leitner reported problems when the forwarding link path
      has a lower mtu than the incoming one if the inbound interface supports GRO.
      
      Given:
      Host <mtu1500> R1 <mtu1200> R2
      
      Host sends tcp stream which is routed via R1 and R2.  R1 performs GRO.
      
      In this case, the kernel will fail to send ICMP fragmentation needed
      messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
      checks in forward path. Instead, Linux tries to send out packets exceeding
      the mtu.
      
      When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
      not fragment the packets when forwarding, and again tries to send out
      packets exceeding R1-R2 link mtu.
      
      This alters the forwarding dstmtu checks to take the individual gso
      segment lengths into account.
      
      For ipv6, we send out pkt too big error for gso if the individual
      segments are too big.
      
      For ipv4, we either send icmp fragmentation needed, or, if the DF bit
      is not set, perform software segmentation and let the output path
      create fragments when the packet is leaving the machine.
      It is not 100% correct as the error message will contain the headers of
      the GRO skb instead of the original/segmented one, but it seems to
      work fine in my (limited) tests.
      
      Eric Dumazet suggested simply shrinking the mss via ->gso_size to avoid
      software segmentation.
      
      However it turns out that skb_segment() assumes skb nr_frags is related
      to mss size so we would BUG there.  I don't want to mess with it considering
      Herbert and Eric disagree on what the correct behavior should be.
      
      Hannes Frederic Sowa notes that when we would shrink gso_size
      skb_segment would then also need to deal with the case where
      SKB_MAX_FRAGS would be exceeded.
      
      This uses software segmentation in the forward path when we hit ipv4
      non-DF packets and the outgoing link mtu is too small.  It's not perfect,
      but given the lack of bug reports wrt. GRO fwd being broken this is a
      rare case anyway.  Also it's not like this could not be improved later
      once the dust settles.
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fe6cc55f
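The forwarding decision described in this commit message can be sketched in user-space C. This is only an illustration under stated assumptions: `struct pkt`, `seg_len()`, `forward_decision()` and the `FWD_*` names are hypothetical and are not the kernel's data structures or functions. The point it shows is that the MTU comparison uses the length of one GSO segment rather than the length of the aggregated packet.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a forwarded packet (not struct sk_buff). */
struct pkt {
    unsigned int len;      /* total length of the (possibly aggregated) packet */
    unsigned int hdr_len;  /* network + transport header bytes per segment */
    unsigned int gso_size; /* payload bytes per segment, 0 if not a GSO packet */
    bool ipv6;
    bool df;               /* IPv4 Don't Fragment bit */
};

enum fwd_action {
    FWD_SEND,         /* every segment fits, forward unchanged */
    FWD_ICMP_TOO_BIG, /* ICMP fragmentation needed / ICMPv6 packet too big */
    FWD_SW_SEGMENT,   /* segment in software, output path fragments */
    FWD_FRAGMENT      /* non-GSO IPv4 without DF: ordinary fragmentation */
};

/* Length that one packet will have on the wire after segmentation. */
static unsigned int seg_len(const struct pkt *p)
{
    return p->gso_size ? p->hdr_len + p->gso_size : p->len;
}

static enum fwd_action forward_decision(const struct pkt *p, unsigned int mtu)
{
    if (seg_len(p) <= mtu)
        return FWD_SEND;
    if (p->ipv6 || p->df)
        return FWD_ICMP_TOO_BIG;
    return p->gso_size ? FWD_SW_SEGMENT : FWD_FRAGMENT;
}
```

With the topology from the report (1500-byte segments arriving at R1, 1200-byte MTU towards R2), a GRO-aggregated IPv4 packet with DF set yields `FWD_ICMP_TOO_BIG`, and the same packet without DF yields `FWD_SW_SEGMENT`.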
  3. 11 Feb, 2014 1 commit
    • J
      tcp: tsq: fix nonagle handling · bf06200e
      John Ogness committed
      Commit 46d3ceab ("tcp: TCP Small Queues") introduced a possible
      regression for applications using TCP_NODELAY.
      
      If a TCP session is throttled because of tsq, we should consult
      tp->nonagle when TX completion is done, so that an additional
      segment can be sent, especially if this segment is not a full MSS.
      Otherwise this segment is only sent after an RTO.
      
      [edumazet] : Cooked the changelog, added another fix about testing
      sk_wmem_alloc twice because TX completion can happen right before
      setting TSQ_THROTTLED bit.
      
      This problem is particularly visible with recent auto corking,
      but might also be triggered with low tcp_limit_output_bytes
      values or NIC drivers delaying TX completion by hundreds of usec,
      and very low rtt.
      
      Thomas Glanzmann for example reported an iscsi regression, caused
      by tcp auto corking making this bug quite visible.
      
      Fixes: 46d3ceab ("tcp: TCP Small Queues")
      Signed-off-by: John Ogness <john.ogness@linutronix.de>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Thomas Glanzmann <thomas@glanzmann.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf06200e
  4. 10 Feb, 2014 1 commit
  5. 07 Feb, 2014 2 commits
    • E
      tcp: remove 1ms offset in srtt computation · 4a5ab4e2
      Eric Dumazet committed
      TCP pacing depends on an accurate srtt estimation.
      
      Current srtt estimation uses jiffy resolution,
      and has an artificial offset of at least 1 ms, which can produce
      slowdowns when FQ/pacing is used, especially in DC world,
      where typical rtt is below 1 ms.
      
      We are planning a switch to usec resolution for linux-3.15,
      but in the meantime, this patch removes the 1 ms offset.
      
      All we need is to have tp->srtt minimal value of 1 to differentiate
      the case of srtt being initialized or not, not 8.
      
      The problematic behavior was observed on a 40Gbit testbed,
      where 32 concurrent netperf were reaching 12Gbps of aggregate
      speed, instead of line speed.
      
      This patch also has the effect of reporting more accurate srtt and send
      rates to iproute2 ss command as in :
      
      $ ss -i dst cca2
      Netid  State      Recv-Q Send-Q          Local Address:Port          Peer Address:Port
      tcp    ESTAB      0      0                10.244.129.1:56984         10.244.129.2:12865
      	 cubic wscale:6,6 rto:200 rtt:0.25/0.25 ato:40 mss:1448 cwnd:10 send 463.4Mbps rcv_rtt:1 rcv_space:29200
      tcp    ESTAB      0      390960           10.244.129.1:60247         10.244.129.2:50204
      	 cubic wscale:6,6 rto:200 rtt:0.875/0.75 mss:1448 cwnd:73 ssthresh:51 send 966.4Mbps unacked:73 retrans:0/121 rcv_space:29200
      Reported-by: Vytautas Valancius <valas@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4a5ab4e2
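The effect of the minimal value described above can be sketched in user-space C. This is a simplified illustration, not the kernel's estimator: `srtt_update()` and `srtt8` are hypothetical names, the variance (mdev) part is left out, and the only assumptions taken from the message are that the smoothed RTT is stored scaled by 8 and that 0 is reserved for "not initialised".

```c
/*
 * srtt8 holds 8 * smoothed RTT in clock ticks; 0 means "not initialised".
 * sample is a new RTT measurement in ticks and is assumed to be >= 0.
 */
static unsigned int srtt_update(unsigned int srtt8, long sample)
{
    if (srtt8 != 0) {
        /* Classic 1/8 gain filter: srtt += (sample - srtt) / 8, in scaled form. */
        long err = sample - (long)(srtt8 >> 3);
        srtt8 = (unsigned int)((long)srtt8 + err);
    } else {
        /* First measurement initialises the estimate directly. */
        srtt8 = (unsigned int)(sample << 3);
    }
    /* Keep the stored value non-zero so it still reads as "initialised". */
    return srtt8 ? srtt8 : 1;
}
```

In this sketch a first sample of 0 ticks stores 1 (a reported RTT of 1/8 tick). Rounding every sample up to at least 1 tick before filtering would instead store 8, which is the full-tick offset the message refers to.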
    • G
      ipv4: Fix runtime WARNING in rtmsg_ifa() · 63b5f152
      Geert Uytterhoeven committed
      On m68k/ARAnyM:
      
      WARNING: CPU: 0 PID: 407 at net/ipv4/devinet.c:1599 0x316a99()
      Modules linked in:
      CPU: 0 PID: 407 Comm: ifconfig Not tainted
      3.13.0-atari-09263-g0c71d68014d1 #1378
      Stack from 10c4fdf0:
              10c4fdf0 002ffabb 000243e8 00000000 008ced6c 00024416 00316a99 0000063f
              00316a99 00000009 00000000 002501b4 00316a99 0000063f c0a86117 00000080
              c0a86117 00ad0c90 00250a5a 00000014 00ad0c90 00000000 00000000 00000001
              00b02dd0 00356594 00000000 00356594 c0a86117 eff6c9e4 008ced6c 00000002
              008ced60 0024f9b4 00250b52 00ad0c90 00000000 00000000 00252390 00ad0c90
              eff6c9e4 0000004f 00000000 00000000 eff6c9e4 8000e25c eff6c9e4 80001020
      Call Trace: [<000243e8>] warn_slowpath_common+0x52/0x6c
       [<00024416>] warn_slowpath_null+0x14/0x1a
       [<002501b4>] rtmsg_ifa+0xdc/0xf0
       [<00250a5a>] __inet_insert_ifa+0xd6/0x1c2
       [<0024f9b4>] inet_abc_len+0x0/0x42
       [<00250b52>] inet_insert_ifa+0xc/0x12
       [<00252390>] devinet_ioctl+0x2ae/0x5d6
      
      Adding some debugging code reveals that inet_fill_ifaddr() fails in
      
          put_cacheinfo(skb, ifa->ifa_cstamp, ifa->ifa_tstamp,
                                    preferred, valid))
      
      nla_put complains:
      
          lib/nlattr.c:454: skb_tailroom(skb) = 12, nla_total_size(attrlen) = 20
      
      Apparently commit 5c766d64 ("ipv4: introduce address lifetime") forgot to
      take into account the addition of struct ifa_cacheinfo in inet_nlmsg_size().
      Hence add it, as is already done for ipv6.
      Suggested-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      63b5f152
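The number 20 in the nla_put complaint above can be reproduced with a short user-space calculation. This is a sketch: the `_sketch` names are hypothetical stand-ins that mirror the netlink attribute layout (a 4-byte attribute header, payload padded to a 4-byte boundary) and the four 32-bit fields of struct ifa_cacheinfo.

```c
#include <stddef.h>
#include <stdint.h>

/* Netlink attributes are padded to 4-byte boundaries. */
#define NLA_ALIGNTO_SKETCH 4u
#define NLA_ALIGN_SKETCH(len) \
    (((len) + NLA_ALIGNTO_SKETCH - 1) & ~(NLA_ALIGNTO_SKETCH - 1))
/* Attribute header: 16-bit length plus 16-bit type. */
#define NLA_HDRLEN_SKETCH 4u

/* Mirror of the four 32-bit fields carried in the cacheinfo attribute. */
struct ifa_cacheinfo_sketch {
    uint32_t ifa_prefered;
    uint32_t ifa_valid;
    uint32_t cstamp;
    uint32_t tstamp;
};

/* Space one attribute occupies in the message: header + payload, padded. */
static size_t nla_total_size_sketch(size_t payload)
{
    return NLA_ALIGN_SKETCH(NLA_HDRLEN_SKETCH + payload);
}
```

A 16-byte payload plus the 4-byte header gives 20 bytes, which is more than the 12 bytes of tailroom left in the skb, so the message size estimate has to account for this attribute.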
  6. 06 Feb, 2014 3 commits
    • P
      netfilter: nf_tables: add reject module for NFPROTO_INET · 05513e9e
      Patrick McHardy committed
      Add a reject module for NFPROTO_INET. It does nothing but dispatch
      to the AF-specific modules based on the hook family.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      05513e9e
    • P
      netfilter: nft_reject: split up reject module into IPv4 and IPv6 specifc parts · cc4723ca
      Patrick McHardy committed
      Currently the nft_reject module depends on symbols from ipv6. This is
      wrong since no generic module should force IPv6 support to be loaded.
      Split up the module into AF-specific and a generic part.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      cc4723ca
    • A
      netfilter: nf_nat_h323: fix crash in nf_ct_unlink_expect_report() · 829d9315
      Alexey Dobriyan committed
      Similar bug fixed in SIP module in 3f509c68 ("netfilter: nf_nat_sip: fix
      incorrect handling of EBUSY for RTCP expectation").
      
      BUG: unable to handle kernel paging request at 00100104
      IP: [<f8214f07>] nf_ct_unlink_expect_report+0x57/0xf0 [nf_conntrack]
      ...
      Call Trace:
        [<c0244bd8>] ? del_timer+0x48/0x70
        [<f8215687>] nf_ct_remove_expectations+0x47/0x60 [nf_conntrack]
        [<f8211c99>] nf_ct_delete_from_lists+0x59/0x90 [nf_conntrack]
        [<f8212e5e>] death_by_timeout+0x14e/0x1c0 [nf_conntrack]
        [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
        [<c024442d>] call_timer_fn+0x1d/0x80
        [<c024461e>] run_timer_softirq+0x18e/0x1a0
        [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
        [<c023e6f3>] __do_softirq+0xa3/0x170
        [<c023e650>] ? __local_bh_enable+0x70/0x70
        <IRQ>
        [<c023e587>] ? irq_exit+0x67/0xa0
        [<c0202af6>] ? do_IRQ+0x46/0xb0
        [<c027ad05>] ? clockevents_notify+0x35/0x110
        [<c066ac6c>] ? common_interrupt+0x2c/0x40
        [<c056e3c1>] ? cpuidle_enter_state+0x41/0xf0
        [<c056e6fb>] ? cpuidle_idle_call+0x8b/0x100
        [<c02085f8>] ? arch_cpu_idle+0x8/0x30
        [<c027314b>] ? cpu_idle_loop+0x4b/0x140
        [<c0273258>] ? cpu_startup_entry+0x18/0x20
        [<c066056d>] ? rest_init+0x5d/0x70
        [<c0813ac8>] ? start_kernel+0x2ec/0x2f2
        [<c081364f>] ? repair_env_string+0x5b/0x5b
        [<c0813269>] ? i386_start_kernel+0x33/0x35
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      829d9315
  7. 05 Feb, 2014 1 commit
  8. 04 Feb, 2014 1 commit
  9. 31 Jan, 2014 1 commit
  10. 28 Jan, 2014 3 commits
    • D
      net: gre: use icmp_hdr() to get inner ip header · c0c0c50f
      Duan Jiong committed
      When dealing with icmp messages, skb->data points to the
      ip header that triggered the sending of the icmp message.
      
      In gre_cisco_err(), parse_gre_header() is called, and
      iptunnel_pull_header() is called to pull the skb at the end of
      parse_gre_header(), so skb->data no longer points to the
      inner ip header.
      
      Unfortunately, the ipgre_err still needs those ip addresses in
      inner ip header to look up tunnel by ip_tunnel_lookup().
      
      So just use icmp_hdr() to get inner ip header instead of skb->data.
      Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0c0c50f
    • H
      net: Fix memory leak if TPROXY used with TCP early demux · a452ce34
      Holger Eitzenberger committed
      I see a memory leak when using a transparent HTTP proxy using TPROXY
      together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):
      
      unreferenced object 0xffff88008cba4a40 (size 1696):
        comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
        hex dump (first 32 bytes):
          0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00  .. j@..7..2.....
          02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9
          [<ffffffff81270185>] sk_prot_alloc+0x29/0xc5
          [<ffffffff812702cf>] sk_clone_lock+0x14/0x283
          [<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b
          [<ffffffff8129a893>] netlink_broadcast+0x14/0x16
          [<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3
          [<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d
          [<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0
          [<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e
          [<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55
          [<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725
          [<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154
          [<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514
          [<ffffffff8127aa77>] process_backlog+0xee/0x1c5
          [<ffffffff8127c949>] net_rx_action+0xa7/0x200
          [<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157
      
      But there are many more, resulting in the machine going OOM after some
      days.
      
      From looking at the TPROXY code, and with help from Florian, I see
      that the memory leak is introduced in tcp_v4_early_demux():
      
        void tcp_v4_early_demux(struct sk_buff *skb)
        {
          /* ... */
      
          iph = ip_hdr(skb);
          th = tcp_hdr(skb);
      
          if (th->doff < sizeof(struct tcphdr) / 4)
              return;
      
          sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
                             iph->saddr, th->source,
                             iph->daddr, ntohs(th->dest),
                             skb->skb_iif);
          if (sk) {
              skb->sk = sk;
      
      where the socket is assigned unconditionally to skb->sk, also bumping
      the refcnt on it.  This is problematic, because in our case the skb
      has already a socket assigned in the TPROXY target.  This then results
      in the leak I see.
      
      The very same issue seems to exist with IPv6, but I haven't tested it.
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a452ce34
    • S
      net: ipv4: Use PTR_ERR_OR_ZERO · 27d79f3b
      Sachin Kamat committed
      PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it
      also include missing err.h header.
      Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27d79f3b
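For readers unfamiliar with the helper, the semantics of PTR_ERR_OR_ZERO can be sketched in user space. This is a re-implementation for illustration, modelled on the error-pointer convention of the kernel's err.h (negative errno values encoded in the top 4095 addresses); it is not the kernel header itself.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/* Decode the errno value carried by an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Error pointers occupy the last MAX_ERRNO addresses of the address space. */
static inline bool IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Return the errno if ptr is an error pointer, otherwise 0. */
static inline int PTR_ERR_OR_ZERO(const void *ptr)
{
    return IS_ERR(ptr) ? (int)PTR_ERR(ptr) : 0;
}
```

A typical use is the tail of an init function: `return PTR_ERR_OR_ZERO(p);` yields 0 for a valid pointer and the negative errno for a failed allocation or registration.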
  11. 25 Jan, 2014 1 commit
  12. 24 Jan, 2014 5 commits
  13. 23 Jan, 2014 2 commits
    • V
      net/ipv4: queue work on power efficient wq · 906e073f
      viresh kumar committed
      Workqueues used in the ipv4 layer have no real dependency on being scheduled
      on the cpu which queued them.
      
      On an idle system, it is observed that an idle cpu wakes up many times just to
      service this work. It would be better if we can schedule it on a cpu which the
      scheduler believes to be the most appropriate one.
      
      This patch replaces normal workqueues with power efficient versions. This
      doesn't change existing behavior of code unless CONFIG_WQ_POWER_EFFICIENT is
      enabled.
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      906e073f
    • C
      tcp: metrics: Fix rcu-race when deleting multiple entries · 00ca9c5b
      Christoph Paasch committed
      In bbf852b9 I introduced the tmlist, which allows deleting
      multiple entries from the cache that match a specified destination if no
      source-IP is specified.
      
      However, as the cache is an RCU-list, we should not create this tmlist, as
      it will change the tcpm_next pointer of the element that will be deleted
      and so a thread iterating over the cache's entries while holding the
      RCU-lock might get "redirected" to this tmlist.
      
      This patch fixes this, by reverting back to the old behavior prior to
      bbf852b9, which means that we simply change the tcpm_next
      pointer of the previous element (pp) to jump over the one we are
      deleting.
      The difference is that we call kfree_rcu() directly on the cache entry,
      which allows us to delete multiple entries from the list.
      
      Fixes: bbf852b9 (tcp: metrics: Delete all entries matching a certain destination)
      Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      00ca9c5b
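The unlink pattern this message describes (make the previous element's next pointer jump over the entry being deleted, and leave the deleted entry's own next pointer alone) can be sketched in single-threaded user-space C. All names here (`struct entry`, `push()`, `delete_matching()`) are hypothetical. Plain `free()` stands in for kfree_rcu() only because this sketch has no concurrent readers; in the kernel the deferred free is what lets a reader that is still on the removed entry follow its intact next pointer.

```c
#include <stdlib.h>

/* Hypothetical cache entry on a singly linked list. */
struct entry {
    struct entry *next;
    unsigned int daddr;
};

/* Prepend a new entry; returns the new list head. */
static struct entry *push(struct entry *head, unsigned int daddr)
{
    struct entry *e = malloc(sizeof(*e));

    if (!e)
        abort();
    e->next = head;
    e->daddr = daddr;
    return e;
}

/* Remove every entry matching daddr; returns how many were removed. */
static unsigned int delete_matching(struct entry **head, unsigned int daddr)
{
    struct entry **pp = head; /* link that currently points at e */
    struct entry *e;
    unsigned int removed = 0;

    while ((e = *pp) != NULL) {
        if (e->daddr == daddr) {
            *pp = e->next; /* predecessor skips e; e->next is not modified */
            free(e);       /* stand-in for kfree_rcu() */
            removed++;
        } else {
            pp = &e->next;
        }
    }
    return removed;
}
```

Because `pp` is only advanced when an entry is kept, consecutive matching entries are removed in one pass without a temporary list.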
  14. 22 Jan, 2014 4 commits
  15. 20 Jan, 2014 1 commit
    • H
      ipv6: make IPV6_RECVPKTINFO work for ipv4 datagrams · 4b261c75
      Hannes Frederic Sowa committed
      We currently don't report IPV6_RECVPKTINFO in cmsg access ancillary data
      for IPv4 datagrams on IPv6 sockets.
      
      This patch splits the ip6_datagram_recv_ctl into two functions, one
      which handles both protocol families, AF_INET and AF_INET6, while the
      ip6_datagram_recv_specific_ctl only handles IPv6 cmsg data.
      
      ip6_datagram_recv_*_ctl never reported back any errors, so we can make
      them return void. Also provide a helper for protocols which don't offer dual
      personality to further use ip6_datagram_recv_ctl, which is exported to
      modules.
      
      I needed to shuffle the code for ping around a bit to make it easier to
      implement dual personality for ping ipv6 sockets in future.
      Reported-by: Gert Doering <gert@space.net>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b261c75
  16. 19 Jan, 2014 2 commits
  17. 18 Jan, 2014 3 commits
  18. 17 Jan, 2014 1 commit
    • P
      net/ipv4: don't use module_init in non-modular gre_offload · cf172283
      Paul Gortmaker committed
      Recent commit 438e38fa
      ("gre_offload: statically build GRE offloading support") added
      new module_init/module_exit calls to the gre_offload.c file.
      
      The file is obj-y and can't be anything other than built-in.
      Currently it can never be built modular, so using module_init
      as an alias for __initcall can be somewhat misleading.
      
      Fix this up now, so that we can relocate module_init from
      init.h into module.h in the future.  If we don't do this, we'd
      have to add module.h to obviously non-modular code, and that
      would be a worse thing.  We also make the inclusion explicit.
      
      Note that direct use of __initcall is discouraged, vs. one
      of the priority categorized subgroups.  As __initcall gets
      mapped onto device_initcall, our use of device_initcall
      directly in this change means that the runtime impact is
      zero -- it will remain at level 6 in initcall ordering.
      
      As for the module_exit, rather than replace it with __exitcall,
      we simply remove it, since it appears only UML does anything
      with those, and even for UML, there is no relevant cleanup
      to be done here.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf172283
  19. 15 Jan, 2014 4 commits
  20. 14 Jan, 2014 1 commit
    • N
      inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait sockets · 70315d22
      Neal Cardwell committed
      Fix inet_diag_dump_icsk() to reflect the fact that both TCP_TIME_WAIT
      and TCP_FIN_WAIT2 connections are represented by inet_timewait_sock
      (not just TIME_WAIT), and for such sockets the tw_substate field holds
      the real state, which can be either TCP_TIME_WAIT or TCP_FIN_WAIT2.
      
      This brings the inet_diag state-matching code in line with the field
      it uses to populate idiag_state. This is also analogous to the info
      exported in /proc/net/tcp, where get_tcp4_sock() exports sk->sk_state
      and get_timewait4_sock() exports tw->tw_substate.
      
      Before fixing this, (a) neither "ss -nemoi" nor "ss -nemoi state
      fin-wait-2" would return a socket in TCP_FIN_WAIT2; and (b) "ss -nemoi
      state time-wait" would also return sockets in state TCP_FIN_WAIT2.
      
      This is an old bug that predates 05dbc7b5 ("tcp/dccp: remove twchain").
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      70315d22
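The state-matching rule from this message can be sketched in user-space C. The struct and function names below are hypothetical and much simpler than the kernel's socket types. The state numbers follow the usual TCP state enumeration (ESTABLISHED = 1, FIN_WAIT2 = 5, TIME_WAIT = 6), and the filter is assumed to be a bitmask with one bit per state.

```c
#include <stdbool.h>

enum {
    ST_ESTABLISHED = 1,
    ST_FIN_WAIT2   = 5,
    ST_TIME_WAIT   = 6
};

/* Hypothetical stand-in for a socket that may be a timewait mini-socket. */
struct sock_sketch {
    int state;       /* ST_TIME_WAIT for every timewait mini-socket */
    int tw_substate; /* real state of a timewait mini-socket */
};

/* State that should be reported and matched for this socket. */
static int effective_state(const struct sock_sketch *sk)
{
    return sk->state == ST_TIME_WAIT ? sk->tw_substate : sk->state;
}

/* True if the socket's effective state is selected in the filter bitmask. */
static bool matches_filter(const struct sock_sketch *sk, unsigned int states)
{
    return (states & (1u << effective_state(sk))) != 0;
}
```

With this rule a mini-socket whose substate is FIN_WAIT2 matches a fin-wait-2 filter and no longer matches a time-wait filter, which corresponds to cases (a) and (b) in the message.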