1. 21 3月, 2014 1 次提交
  2. 12 3月, 2014 1 次提交
    • E
      tcp: tcp_release_cb() should release socket ownership · c3f9b018
      Eric Dumazet 提交于
      Lars Persson reported following deadlock :
      
      -000 |M:0x0:0x802B6AF8(asm) <-- arch_spin_lock
      -001 |tcp_v4_rcv(skb = 0x8BD527A0) <-- sk = 0x8BE6B2A0
      -002 |ip_local_deliver_finish(skb = 0x8BD527A0)
      -003 |__netif_receive_skb_core(skb = 0x8BD527A0, ?)
      -004 |netif_receive_skb(skb = 0x8BD527A0)
      -005 |elk_poll(napi = 0x8C770500, budget = 64)
      -006 |net_rx_action(?)
      -007 |__do_softirq()
      -008 |do_softirq()
      -009 |local_bh_enable()
      -010 |tcp_rcv_established(sk = 0x8BE6B2A0, skb = 0x87D3A9E0, th = 0x814EBE14, ?)
      -011 |tcp_v4_do_rcv(sk = 0x8BE6B2A0, skb = 0x87D3A9E0)
      -012 |tcp_delack_timer_handler(sk = 0x8BE6B2A0)
      -013 |tcp_release_cb(sk = 0x8BE6B2A0)
      -014 |release_sock(sk = 0x8BE6B2A0)
      -015 |tcp_sendmsg(?, sk = 0x8BE6B2A0, ?, ?)
      -016 |sock_sendmsg(sock = 0x8518C4C0, msg = 0x87D8DAA8, size = 4096)
      -017 |kernel_sendmsg(?, ?, ?, ?, size = 4096)
      -018 |smb_send_kvec()
      -019 |smb_send_rqst(server = 0x87C4D400, rqst = 0x87D8DBA0)
      -020 |cifs_call_async()
      -021 |cifs_async_writev(wdata = 0x87FD6580)
      -022 |cifs_writepages(mapping = 0x852096E4, wbc = 0x87D8DC88)
      -023 |__writeback_single_inode(inode = 0x852095D0, wbc = 0x87D8DC88)
      -024 |writeback_sb_inodes(sb = 0x87D6D800, wb = 0x87E4A9C0, work = 0x87D8DD88)
      -025 |__writeback_inodes_wb(wb = 0x87E4A9C0, work = 0x87D8DD88)
      -026 |wb_writeback(wb = 0x87E4A9C0, work = 0x87D8DD88)
      -027 |wb_do_writeback(wb = 0x87E4A9C0, force_wait = 0)
      -028 |bdi_writeback_workfn(work = 0x87E4A9CC)
      -029 |process_one_work(worker = 0x8B045880, work = 0x87E4A9CC)
      -030 |worker_thread(__worker = 0x8B045880)
      -031 |kthread(_create = 0x87CADD90)
      -032 |ret_from_kernel_thread(asm)
      
      Bug occurs because __tcp_checksum_complete_user() enables BH, assuming
      it is running from softirq context.
      
      Lars trace involved a NIC without RX checksum support but other points
      are problematic as well, like the prequeue stuff.
      
      Problem is triggered by a timer, that found socket being owned by user.
      
      tcp_release_cb() should call tcp_write_timer_handler() or
      tcp_delack_timer_handler() in the appropriate context :
      
      BH disabled and socket lock held, but 'owned' field cleared,
      as if they were running from timer handlers.
      
      Fixes: 6f458dfb ("tcp: improve latencies of timer triggered events")
      Reported-by: NLars Persson <lars.persson@axis.com>
      Tested-by: NLars Persson <lars.persson@axis.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3f9b018
  3. 07 3月, 2014 1 次提交
  4. 06 3月, 2014 1 次提交
    • N
      net: fix for a race condition in the inet frag code · 24b9bf43
      Nikolay Aleksandrov 提交于
      I stumbled upon this very serious bug while hunting for another one,
      it's a very subtle race condition between inet_frag_evictor,
      inet_frag_intern and the IPv4/6 frag_queue and expire functions
      (basically the users of inet_frag_kill/inet_frag_put).
      
      What happens is that after a fragment has been added to the hash chain
      but before it's been added to the lru_list (inet_frag_lru_add) in
      inet_frag_intern, it may get deleted (either by an expired timer if
      the system load is high or the timer sufficiently low, or by the
      fraq_queue function for different reasons) before it's added to the
      lru_list, then after it gets added it's a matter of time for the
      evictor to get to a piece of memory which has been freed leading to a
      number of different bugs depending on what's left there.
      
      I've been able to trigger this on both IPv4 and IPv6 (which is normal
      as the frag code is the same), but it's been much more difficult to
      trigger on IPv4 due to the protocol differences about how fragments
      are treated.
      
      The setup I used to reproduce this is: 2 machines with 4 x 10G bonded
      in a RR bond, so the same flow can be seen on multiple cards at the
      same time. Then I used multiple instances of ping/ping6 to generate
      fragmented packets and flood the machines with them while running
      other processes to load the attacked machine.
      
      *It is very important to have the _same flow_ coming in on multiple CPUs
      concurrently. Usually the attacked machine would die in less than 30
      minutes, if configured properly to have many evictor calls and timeouts
      it could happen in 10 minutes or so.
      
      An important point to make is that any caller (frag_queue or timer) of
      inet_frag_kill will remove both the timer refcount and the
      original/guarding refcount thus removing everything that's keeping the
      frag from being freed at the next inet_frag_put.  All of this could
      happen before the frag was ever added to the LRU list, then it gets
      added and the evictor uses a freed fragment.
      
      An example for IPv6 would be if a fragment is being added and is at
      the stage of being inserted in the hash after the hash lock is
      released, but before inet_frag_lru_add executes (or is able to obtain
      the lru lock) another overlapping fragment for the same flow arrives
      at a different CPU which finds it in the hash, but since it's
      overlapping it drops it invoking inet_frag_kill and thus removing all
      guarding refcounts, and afterwards freeing it by invoking
      inet_frag_put which removes the last refcount added previously by
      inet_frag_find, then inet_frag_lru_add gets executed by
      inet_frag_intern and we have a freed fragment in the lru_list.
      
      The fix is simple, just move the lru_add under the hash chain locked
      region so when a removing function is called it'll have to wait for
      the fragment to be added to the lru_list, and then it'll remove it (it
      works because the hash chain removal is done before the lru_list one
      and there's no window between the two list adds when the frag can get
      dropped). With this fix applied I couldn't kill the same machine in 24
      hours with the same setup.
      
      Fixes: 3ef0eb0d ("net: frag, move LRU list maintenance outside of
      rwlock")
      
      CC: Florian Westphal <fw@strlen.de>
      CC: Jesper Dangaard Brouer <brouer@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      24b9bf43
  5. 04 3月, 2014 2 次提交
  6. 27 2月, 2014 1 次提交
    • E
      net: tcp: use NET_INC_STATS() · 9a9bfd03
      Eric Dumazet 提交于
      While LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES can only be incremented
      in tcp_transmit_skb() from softirq (incoming message or timer
      activation), it is better to use NET_INC_STATS() instead of
      NET_INC_STATS_BH() as tcp_transmit_skb() can be called from process
      context.
      
      This will avoid copy/paste confusion when/if we want to add
      other SNMP counters in tcp_transmit_skb()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9a9bfd03
  7. 26 2月, 2014 1 次提交
    • H
      ipv4: ipv6: better estimate tunnel header cut for correct ufo handling · 91a48a2e
      Hannes Frederic Sowa 提交于
      Currently the UFO fragmentation process does not correctly handle inner
      UDP frames.
      
      (The following tcpdumps are captured on the parent interface with ufo
      disabled while tunnel has ufo enabled, 2000 bytes payload, mtu 1280,
      both sit device):
      
      IPv6:
      16:39:10.031613 IP (tos 0x0, ttl 64, id 3208, offset 0, flags [DF], proto IPv6 (41), length 1300)
          192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 1240) 2001::1 > 2001::8: frag (0x00000001:0|1232) 44883 > distinct: UDP, length 2000
      16:39:10.031709 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPv6 (41), length 844)
          192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 784) 2001::1 > 2001::8: frag (0x00000001:0|776) 58979 > 46366: UDP, length 5471
      
      We can see that fragmentation header offset is not correctly updated.
      (fragmentation id handling is corrected by 916e4cf4 ("ipv6: reuse
      ip6_frag_id from ip6_ufo_append_data")).
      
      IPv4:
      16:39:57.737761 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPIP (4), length 1296)
          192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57034, offset 0, flags [none], proto UDP (17), length 1276)
          192.168.99.1.35961 > 192.168.99.2.distinct: UDP, length 2000
      16:39:57.738028 IP (tos 0x0, ttl 64, id 3210, offset 0, flags [DF], proto IPIP (4), length 792)
          192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57035, offset 0, flags [none], proto UDP (17), length 772)
          192.168.99.1.13531 > 192.168.99.2.20653: UDP, length 51109
      
      In this case fragmentation id is incremented and offset is not updated.
      
      First, I aligned inet_gso_segment and ipv6_gso_segment:
      * align naming of flags
      * ipv6_gso_segment: setting skb->encapsulation is unnecessary, as we
        always ensure that the state of this flag is left untouched when
        returning from upper gso segmenation function
      * ipv6_gso_segment: move skb_reset_inner_headers below updating the
        fragmentation header data, we don't care for updating fragmentation
        header data
      * remove currently unneeded comment indicating skb->encapsulation might
        get changed by upper gso_segment callback (gre and udp-tunnel reset
        encapsulation after segmentation on each fragment)
      
      If we encounter an IPIP or SIT gso skb we now check for the protocol ==
      IPPROTO_UDP and that we at least have already traversed another ip(6)
      protocol header.
      
      The reason why we have to special case GSO_IPIP and GSO_SIT is that
      we reset skb->encapsulation to 0 while skb_mac_gso_segment the inner
      protocol of GSO_UDP_TUNNEL or GSO_GRE packets.
      Reported-by: NWolfgang Walter <linux@stwm.de>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      91a48a2e
  8. 25 2月, 2014 1 次提交
    • E
      tcp: reduce the bloat caused by tcp_is_cwnd_limited() · d10473d4
      Eric Dumazet 提交于
      tcp_is_cwnd_limited() allows GSO/TSO enabled flows to increase
      their cwnd to allow a full size (64KB) TSO packet to be sent.
      
      Non GSO flows only allow an extra room of 3 MSS.
      
      For most flows with a BDP below 10 MSS, this results in a bloat
      of cwnd reaching 90, and an inflate of RTT.
      
      Thanks to TSO auto sizing, we can restrict the bloat to the number
      of MSS contained in a TSO packet (tp->xmit_size_goal_segs), to keep
      original intent without performance impact.
      
      Because we keep cwnd small, it helps to keep TSO packet size to their
      optimal value.
      
      Example for a 10Mbit flow, with low TCP Small queue limits (no more than
      2 skb in qdisc/device tx ring)
      
      Before patch :
      
      lpk51:~# ./ss -i dst lpk52:44862 | grep cwnd
               cubic wscale:6,6 rto:215 rtt:15.875/2.5 mss:1448 cwnd:96
      ssthresh:96
      send 70.1Mbps unacked:14 rcv_space:29200
      
      After patch :
      
      lpk51:~# ./ss -i dst lpk52:52916 | grep cwnd
               cubic wscale:6,6 rto:206 rtt:5.206/0.036 mss:1448 cwnd:15
      ssthresh:14
      send 33.4Mbps unacked:4 rcv_space:29200
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d10473d4
  9. 22 2月, 2014 1 次提交
  10. 21 2月, 2014 1 次提交
  11. 20 2月, 2014 1 次提交
  12. 18 2月, 2014 1 次提交
  13. 17 2月, 2014 2 次提交
  14. 14 2月, 2014 3 次提交
    • F
      netfilter: nf_nat_snmp_basic: fix duplicates in if/else branches · 2b7a79ba
      FX Le Bail 提交于
      The solution was found by Patrick in 2.4 kernel sources.
      
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NFrancois-Xavier Le Bail <fx.lebail@yahoo.com>
      Acked-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      2b7a79ba
    • F
      ipv4: ipconfig.c: add parentheses in an if statement · 357137a4
      FX Le Bail 提交于
      Even if the 'time_before' macro expand with parentheses, the look is bad.
      Signed-off-by: NFrancois-Xavier Le Bail <fx.lebail@yahoo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      357137a4
    • F
      net: ip, ipv6: handle gso skbs in forwarding path · fe6cc55f
      Florian Westphal 提交于
      Marcelo Ricardo Leitner reported problems when the forwarding link path
      has a lower mtu than the incoming one if the inbound interface supports GRO.
      
      Given:
      Host <mtu1500> R1 <mtu1200> R2
      
      Host sends tcp stream which is routed via R1 and R2.  R1 performs GRO.
      
      In this case, the kernel will fail to send ICMP fragmentation needed
      messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
      checks in forward path. Instead, Linux tries to send out packets exceeding
      the mtu.
      
      When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
      not fragment the packets when forwarding, and again tries to send out
      packets exceeding R1-R2 link mtu.
      
      This alters the forwarding dstmtu checks to take the individual gso
      segment lengths into account.
      
      For ipv6, we send out pkt too big error for gso if the individual
      segments are too big.
      
      For ipv4, we either send icmp fragmentation needed, or, if the DF bit
      is not set, perform software segmentation and let the output path
      create fragments when the packet is leaving the machine.
      It is not 100% correct as the error message will contain the headers of
      the GRO skb instead of the original/segmented one, but it seems to
      work fine in my (limited) tests.
      
      Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
      sofware segmentation.
      
      However it turns out that skb_segment() assumes skb nr_frags is related
      to mss size so we would BUG there.  I don't want to mess with it considering
      Herbert and Eric disagree on what the correct behavior should be.
      
      Hannes Frederic Sowa notes that when we would shrink gso_size
      skb_segment would then also need to deal with the case where
      SKB_MAX_FRAGS would be exceeded.
      
      This uses sofware segmentation in the forward path when we hit ipv4
      non-DF packets and the outgoing link mtu is too small.  Its not perfect,
      but given the lack of bug reports wrt. GRO fwd being broken this is a
      rare case anyway.  Also its not like this could not be improved later
      once the dust settles.
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Reported-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fe6cc55f
  15. 11 2月, 2014 1 次提交
    • J
      tcp: tsq: fix nonagle handling · bf06200e
      John Ogness 提交于
      Commit 46d3ceab ("tcp: TCP Small Queues") introduced a possible
      regression for applications using TCP_NODELAY.
      
      If TCP session is throttled because of tsq, we should consult
      tp->nonagle when TX completion is done and allow us to send additional
      segment, especially if this segment is not a full MSS.
      Otherwise this segment is sent after an RTO.
      
      [edumazet] : Cooked the changelog, added another fix about testing
      sk_wmem_alloc twice because TX completion can happen right before
      setting TSQ_THROTTLED bit.
      
      This problem is particularly visible with recent auto corking,
      but might also be triggered with low tcp_limit_output_bytes
      values or NIC drivers delaying TX completion by hundred of usec,
      and very low rtt.
      
      Thomas Glanzmann for example reported an iscsi regression, caused
      by tcp auto corking making this bug quite visible.
      
      Fixes: 46d3ceab ("tcp: TCP Small Queues")
      Signed-off-by: NJohn Ogness <john.ogness@linutronix.de>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NThomas Glanzmann <thomas@glanzmann.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bf06200e
  16. 10 2月, 2014 1 次提交
  17. 07 2月, 2014 2 次提交
    • E
      tcp: remove 1ms offset in srtt computation · 4a5ab4e2
      Eric Dumazet 提交于
      TCP pacing depends on an accurate srtt estimation.
      
      Current srtt estimation is using jiffie resolution,
      and has an artificial offset of at least 1 ms, which can produce
      slowdowns when FQ/pacing is used, especially in DC world,
      where typical rtt is below 1 ms.
      
      We are planning a switch to usec resolution for linux-3.15,
      but in the meantime, this patch removes the 1 ms offset.
      
      All we need is to have tp->srtt minimal value of 1 to differentiate
      the case of srtt being initialized or not, not 8.
      
      The problematic behavior was observed on a 40Gbit testbed,
      where 32 concurrent netperf were reaching 12Gbps of aggregate
      speed, instead of line speed.
      
      This patch also has the effect of reporting more accurate srtt and send
      rates to iproute2 ss command as in :
      
      $ ss -i dst cca2
      Netid  State      Recv-Q Send-Q          Local Address:Port
      Peer Address:Port
      tcp    ESTAB      0      0                10.244.129.1:56984
      10.244.129.2:12865
      	 cubic wscale:6,6 rto:200 rtt:0.25/0.25 ato:40 mss:1448 cwnd:10 send
      463.4Mbps rcv_rtt:1 rcv_space:29200
      tcp    ESTAB      0      390960           10.244.129.1:60247
      10.244.129.2:50204
      	 cubic wscale:6,6 rto:200 rtt:0.875/0.75 mss:1448 cwnd:73 ssthresh:51
      send 966.4Mbps unacked:73 retrans:0/121 rcv_space:29200
      Reported-by: NVytautas Valancius <valas@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a5ab4e2
    • G
      ipv4: Fix runtime WARNING in rtmsg_ifa() · 63b5f152
      Geert Uytterhoeven 提交于
      On m68k/ARAnyM:
      
      WARNING: CPU: 0 PID: 407 at net/ipv4/devinet.c:1599 0x316a99()
      Modules linked in:
      CPU: 0 PID: 407 Comm: ifconfig Not tainted
      3.13.0-atari-09263-g0c71d68014d1 #1378
      Stack from 10c4fdf0:
              10c4fdf0 002ffabb 000243e8 00000000 008ced6c 00024416 00316a99 0000063f
              00316a99 00000009 00000000 002501b4 00316a99 0000063f c0a86117 00000080
              c0a86117 00ad0c90 00250a5a 00000014 00ad0c90 00000000 00000000 00000001
              00b02dd0 00356594 00000000 00356594 c0a86117 eff6c9e4 008ced6c 00000002
              008ced60 0024f9b4 00250b52 00ad0c90 00000000 00000000 00252390 00ad0c90
              eff6c9e4 0000004f 00000000 00000000 eff6c9e4 8000e25c eff6c9e4 80001020
      Call Trace: [<000243e8>] warn_slowpath_common+0x52/0x6c
       [<00024416>] warn_slowpath_null+0x14/0x1a
       [<002501b4>] rtmsg_ifa+0xdc/0xf0
       [<00250a5a>] __inet_insert_ifa+0xd6/0x1c2
       [<0024f9b4>] inet_abc_len+0x0/0x42
       [<00250b52>] inet_insert_ifa+0xc/0x12
       [<00252390>] devinet_ioctl+0x2ae/0x5d6
      
      Adding some debugging code reveals that net_fill_ifaddr() fails in
      
          put_cacheinfo(skb, ifa->ifa_cstamp, ifa->ifa_tstamp,
                                    preferred, valid))
      
      nla_put complains:
      
          lib/nlattr.c:454: skb_tailroom(skb) = 12, nla_total_size(attrlen) = 20
      
      Apparently commit 5c766d64 ("ipv4:
      introduce address lifetime") forgot to take into account the addition of
      struct ifa_cacheinfo in inet_nlmsg_size(). Hence add it, like is already
      done for ipv6.
      Suggested-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63b5f152
  18. 06 2月, 2014 3 次提交
    • P
      netfilter: nf_tables: add reject module for NFPROTO_INET · 05513e9e
      Patrick McHardy 提交于
      Add a reject module for NFPROTO_INET. It does nothing but dispatch
      to the AF-specific modules based on the hook family.
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      05513e9e
    • P
      netfilter: nft_reject: split up reject module into IPv4 and IPv6 specifc parts · cc4723ca
      Patrick McHardy 提交于
      Currently the nft_reject module depends on symbols from ipv6. This is
      wrong since no generic module should force IPv6 support to be loaded.
      Split up the module into AF-specific and a generic part.
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      cc4723ca
    • A
      netfilter: nf_nat_h323: fix crash in nf_ct_unlink_expect_report() · 829d9315
      Alexey Dobriyan 提交于
      Similar bug fixed in SIP module in 3f509c68 ("netfilter: nf_nat_sip: fix
      incorrect handling of EBUSY for RTCP expectation").
      
      BUG: unable to handle kernel paging request at 00100104
      IP: [<f8214f07>] nf_ct_unlink_expect_report+0x57/0xf0 [nf_conntrack]
      ...
      Call Trace:
        [<c0244bd8>] ? del_timer+0x48/0x70
        [<f8215687>] nf_ct_remove_expectations+0x47/0x60 [nf_conntrack]
        [<f8211c99>] nf_ct_delete_from_lists+0x59/0x90 [nf_conntrack]
        [<f8212e5e>] death_by_timeout+0x14e/0x1c0 [nf_conntrack]
        [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
        [<c024442d>] call_timer_fn+0x1d/0x80
        [<c024461e>] run_timer_softirq+0x18e/0x1a0
        [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
        [<c023e6f3>] __do_softirq+0xa3/0x170
        [<c023e650>] ? __local_bh_enable+0x70/0x70
        <IRQ>
        [<c023e587>] ? irq_exit+0x67/0xa0
        [<c0202af6>] ? do_IRQ+0x46/0xb0
        [<c027ad05>] ? clockevents_notify+0x35/0x110
        [<c066ac6c>] ? common_interrupt+0x2c/0x40
        [<c056e3c1>] ? cpuidle_enter_state+0x41/0xf0
        [<c056e6fb>] ? cpuidle_idle_call+0x8b/0x100
        [<c02085f8>] ? arch_cpu_idle+0x8/0x30
        [<c027314b>] ? cpu_idle_loop+0x4b/0x140
        [<c0273258>] ? cpu_startup_entry+0x18/0x20
        [<c066056d>] ? rest_init+0x5d/0x70
        [<c0813ac8>] ? start_kernel+0x2ec/0x2f2
        [<c081364f>] ? repair_env_string+0x5b/0x5b
        [<c0813269>] ? i386_start_kernel+0x33/0x35
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      829d9315
  19. 05 2月, 2014 1 次提交
  20. 04 2月, 2014 1 次提交
  21. 31 1月, 2014 1 次提交
  22. 28 1月, 2014 3 次提交
    • D
      net: gre: use icmp_hdr() to get inner ip header · c0c0c50f
      Duan Jiong 提交于
      When dealing with icmp messages, the skb->data points the
      ip header that triggered the sending of the icmp message.
      
      In gre_cisco_err(), the parse_gre_header() is called, and the
      iptunnel_pull_header() is called to pull the skb at the end of
      the parse_gre_header(), so the skb->data doesn't point the
      inner ip header.
      
      Unfortunately, the ipgre_err still needs those ip addresses in
      inner ip header to look up tunnel by ip_tunnel_lookup().
      
      So just use icmp_hdr() to get inner ip header instead of skb->data.
      Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0c0c50f
    • H
      net: Fix memory leak if TPROXY used with TCP early demux · a452ce34
      Holger Eitzenberger 提交于
      I see a memory leak when using a transparent HTTP proxy using TPROXY
      together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):
      
      unreferenced object 0xffff88008cba4a40 (size 1696):
        comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
        hex dump (first 32 bytes):
          0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00  .. j@..7..2.....
          02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9
          [<ffffffff81270185>] sk_prot_alloc+0x29/0xc5
          [<ffffffff812702cf>] sk_clone_lock+0x14/0x283
          [<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b
          [<ffffffff8129a893>] netlink_broadcast+0x14/0x16
          [<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3
          [<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d
          [<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0
          [<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e
          [<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55
          [<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725
          [<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154
          [<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514
          [<ffffffff8127aa77>] process_backlog+0xee/0x1c5
          [<ffffffff8127c949>] net_rx_action+0xa7/0x200
          [<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157
      
      But there are many more, resulting in the machine going OOM after some
      days.
      
      From looking at the TPROXY code, and with help from Florian, I see
      that the memory leak is introduced in tcp_v4_early_demux():
      
        void tcp_v4_early_demux(struct sk_buff *skb)
        {
          /* ... */
      
          iph = ip_hdr(skb);
          th = tcp_hdr(skb);
      
          if (th->doff < sizeof(struct tcphdr) / 4)
              return;
      
          sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
                             iph->saddr, th->source,
                             iph->daddr, ntohs(th->dest),
                             skb->skb_iif);
          if (sk) {
              skb->sk = sk;
      
      where the socket is assigned unconditionally to skb->sk, also bumping
      the refcnt on it.  This is problematic, because in our case the skb
      has already a socket assigned in the TPROXY target.  This then results
      in the leak I see.
      
      The very same issue seems to be with IPv6, but haven't tested.
      Reviewed-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NHolger Eitzenberger <holger@eitzenberger.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a452ce34
    • S
      net: ipv4: Use PTR_ERR_OR_ZERO · 27d79f3b
      Sachin Kamat 提交于
      PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it
      also include missing err.h header.
      Signed-off-by: NSachin Kamat <sachin.kamat@linaro.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27d79f3b
  23. 25 1月, 2014 1 次提交
  24. 24 1月, 2014 5 次提交
  25. 23 1月, 2014 2 次提交
    • V
      net/ipv4: queue work on power efficient wq · 906e073f
      viresh kumar 提交于
      Workqueue used in ipv4 layer have no real dependency of scheduling these on the
      cpu which scheduled them.
      
      On a idle system, it is observed that an idle cpu wakes up many times just to
      service this work. It would be better if we can schedule it on a cpu which the
      scheduler believes to be the most appropriate one.
      
      This patch replaces normal workqueues with power efficient versions. This
      doesn't change existing behavior of code unless CONFIG_WQ_POWER_EFFICIENT is
      enabled.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      906e073f
    • C
      tcp: metrics: Fix rcu-race when deleting multiple entries · 00ca9c5b
      Christoph Paasch 提交于
      In bbf852b9 I introduced the tmlist, which allows to delete
      multiple entries from the cache that match a specified destination if no
      source-IP is specified.
      
      However, as the cache is an RCU-list, we should not create this tmlist, as
      it will change the tcpm_next pointer of the element that will be deleted
      and so a thread iterating over the cache's entries while holding the
      RCU-lock might get "redirected" to this tmlist.
      
      This patch fixes this, by reverting back to the old behavior prior to
      bbf852b9, which means that we simply change the tcpm_next
      pointer of the previous element (pp) to jump over the one we are
      deleting.
      The difference is that we call kfree_rcu() directly on the cache entry,
      which allows us to delete multiple entries from the list.
      
      Fixes: bbf852b9 (tcp: metrics: Delete all entries matching a certain destination)
      Signed-off-by: NChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      00ca9c5b
  26. 22 1月, 2014 1 次提交
    • O
      net: Add GRO support for UDP encapsulating protocols · b582ef09
      Or Gerlitz 提交于
      Add GRO handlers for protocols that do UDP encapsulation, with the intent of
      being able to coalesce packets which encapsulate packets belonging to
      the same TCP session.
      
      For GRO purposes, the destination UDP port takes the role of the ether type
      field in the ethernet header or the next protocol in the IP header.
      
      The UDP GRO handler will only attempt to coalesce packets whose destination
      port is registered to have gro handler.
      
      Use a mark on the skb GRO CB data to disallow (flush) running the udp gro receive
      code twice on a packet. This solves the problem of udp encapsulated packets whose
      inner VM packet is udp and happen to carry a port which has registered offloads.
      Signed-off-by: NShlomo Pongratz <shlomop@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b582ef09
反馈
建议
客服 返回
顶部