1. 15 4月, 2014 4 次提交
    • V
      net: Start with correct mac_len in skb_network_protocol · 1e785f48
      Vlad Yasevich 提交于
      Sometimes, when the packet arrives at skb_mac_gso_segment()
      its skb->mac_len already accounts for some of the mac lenght
      headers in the packet.  This seems to happen when forwarding
      through and OpenSSL tunnel.
      
      When we start looking for any vlan headers in skb_network_protocol()
      we seem to ignore any of the already known mac headers and start
      with an ETH_HLEN.  This results in an incorrect offset, dropped
      TSO frames and general slowness of the connection.
      
      We can start counting from the known skb->mac_len
      and return at least that much if all mac level headers
      are known and accounted for.
      
      Fixes: 53d6471c (net: Account for all vlan headers in skb_mac_gso_segment)
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Daniel Borkman <dborkman@redhat.com>
      Tested-by: NMartin Filip <nexus+kernel@smoula.net>
      Signed-off-by: NVlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e785f48
    • D
      Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" · 362d5204
      Daniel Borkmann 提交于
      This reverts commit ef2820a7 ("net: sctp: Fix a_rwnd/rwnd management
      to reflect real state of the receiver's buffer") as it introduced a
      serious performance regression on SCTP over IPv4 and IPv6, though a not
      as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.
      
      Current state:
      
      [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
      Time: Fri, 11 Apr 2014 17:56:21 GMT
      Connecting to host 192.168.241.3, port 5201
            Cookie: Lab200slot2.1397238981.812898.548918
      [  4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.09   sec  20.8 MBytes   161 Mbits/sec
      [  4]   1.09-2.13   sec  10.8 MBytes  86.8 Mbits/sec
      [  4]   2.13-3.15   sec  3.57 MBytes  29.5 Mbits/sec
      [  4]   3.15-4.16   sec  4.33 MBytes  35.7 Mbits/sec
      [  4]   4.16-6.21   sec  10.4 MBytes  42.7 Mbits/sec
      [  4]   6.21-6.21   sec  0.00 Bytes    0.00 bits/sec
      [  4]   6.21-7.35   sec  34.6 MBytes   253 Mbits/sec
      [  4]   7.35-11.45  sec  22.0 MBytes  45.0 Mbits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-12.51  sec  16.0 MBytes   126 Mbits/sec
      [  4]  12.51-13.59  sec  20.3 MBytes   158 Mbits/sec
      [  4]  13.59-14.65  sec  13.4 MBytes   107 Mbits/sec
      [  4]  14.65-16.79  sec  33.3 MBytes   130 Mbits/sec
      [  4]  16.79-16.79  sec  0.00 Bytes    0.00 bits/sec
      [  4]  16.79-17.82  sec  5.94 MBytes  48.7 Mbits/sec
      (etc)
      
      [root@Lab200slot2 ~]#  iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
      Time: Fri, 11 Apr 2014 19:08:41 GMT
      Connecting to host 2001:db8:0:f101::1, port 5201
            Cookie: Lab200slot2.1397243321.714295.2b3f7c
      [  4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.00   sec   169 MBytes  1.42 Gbits/sec
      [  4]   1.00-2.00   sec   201 MBytes  1.69 Gbits/sec
      [  4]   2.00-3.00   sec   188 MBytes  1.58 Gbits/sec
      [  4]   3.00-4.00   sec   174 MBytes  1.46 Gbits/sec
      [  4]   4.00-5.00   sec   165 MBytes  1.39 Gbits/sec
      [  4]   5.00-6.00   sec   199 MBytes  1.67 Gbits/sec
      [  4]   6.00-7.00   sec   163 MBytes  1.36 Gbits/sec
      [  4]   7.00-8.00   sec   174 MBytes  1.46 Gbits/sec
      [  4]   8.00-9.00   sec   193 MBytes  1.62 Gbits/sec
      [  4]   9.00-10.00  sec   196 MBytes  1.65 Gbits/sec
      [  4]  10.00-11.00  sec   157 MBytes  1.31 Gbits/sec
      [  4]  11.00-12.00  sec   175 MBytes  1.47 Gbits/sec
      [  4]  12.00-13.00  sec   192 MBytes  1.61 Gbits/sec
      [  4]  13.00-14.00  sec   199 MBytes  1.67 Gbits/sec
      (etc)
      
      After patch:
      
      [root@Lab200slot2 ~]#  iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
      Time: Mon, 14 Apr 2014 16:40:48 GMT
      Connecting to host 192.168.240.3, port 5201
            Cookie: Lab200slot2.1397493648.413274.65e131
      [  4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.00   sec   240 MBytes  2.02 Gbits/sec
      [  4]   1.00-2.00   sec   239 MBytes  2.01 Gbits/sec
      [  4]   2.00-3.00   sec   240 MBytes  2.01 Gbits/sec
      [  4]   3.00-4.00   sec   239 MBytes  2.00 Gbits/sec
      [  4]   4.00-5.00   sec   245 MBytes  2.05 Gbits/sec
      [  4]   5.00-6.00   sec   240 MBytes  2.01 Gbits/sec
      [  4]   6.00-7.00   sec   240 MBytes  2.02 Gbits/sec
      [  4]   7.00-8.00   sec   239 MBytes  2.01 Gbits/sec
      
      With the reverted patch applied, the SCTP/IPv4 performance is back
      to normal on latest upstream for IPv4 and IPv6 and has same throughput
      as 3.4.2 test kernel, steady and interval reports are smooth again.
      
      Fixes: ef2820a7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
      Reported-by: NPeter Butler <pbutler@sonusnet.com>
      Reported-by: NDongsheng Song <dongsheng.song@gmail.com>
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Tested-by: NPeter Butler <pbutler@sonusnet.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
      Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      362d5204
    • D
      net: filter: seccomp: fix wrong decoding of BPF_S_ANC_SECCOMP_LD_W · 8c482cdc
      Daniel Borkmann 提交于
      While reviewing seccomp code, we found that BPF_S_ANC_SECCOMP_LD_W has
      been wrongly decoded by commit a8fc9277 ("sk-filter: Add ability to
      get socket filter program (v2)") into the opcode BPF_LD|BPF_B|BPF_ABS
      although it should have been decoded as BPF_LD|BPF_W|BPF_ABS.
      
      In practice, this should not have much side-effect though, as such
      conversion is/was being done through prctl(2) PR_SET_SECCOMP. Reverse
      operation PR_GET_SECCOMP will only return the current seccomp mode, but
      not the filter itself. Since the transition to the new BPF infrastructure,
      it's also not used anymore, so we can simply remove this as it's
      unreachable.
      
      Fixes: a8fc9277 ("sk-filter: Add ability to get socket filter program (v2)")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c482cdc
    • E
      ipv6: Limit mtu to 65575 bytes · 30f78d8e
      Eric Dumazet 提交于
      Francois reported that setting big mtu on loopback device could prevent
      tcp sessions making progress.
      
      We do not support (yet ?) IPv6 Jumbograms and cook corrupted packets.
      
      We must limit the IPv6 MTU to (65535 + 40) bytes in theory.
      
      Tested:
      
      ifconfig lo mtu 70000
      netperf -H ::1
      
      Before patch : Throughput :   0.05 Mbits
      
      After patch : Throughput : 35484 Mbits
      Reported-by: NFrancois WELLENREITER <f.wellenreiter@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30f78d8e
  2. 14 4月, 2014 5 次提交
    • P
      netfilter: nf_tables: fix nft_cmp_fast failure on big endian for size < 4 · b855d416
      Patrick McHardy 提交于
      nft_cmp_fast is used for equality comparisions of size <= 4. For
      comparisions of size < 4 byte a mask is calculated that is applied to
      both the data from userspace (during initialization) and the register
      value (during runtime). Both values are stored using (in effect) memcpy
      to a memory area that is then interpreted as u32 by nft_cmp_fast.
      
      This works fine on little endian since smaller types have the same base
      address, however on big endian this is not true and the smaller types
      are interpreted as a big number with trailing zero bytes.
      
      The mask therefore must not include the lower bytes, but the higher bytes
      on big endian. Add a helper function that does a cpu_to_le32 to switch
      the bytes on big endian. Since we're dealing with a mask of just consequitive
      bits, this works out fine.
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      b855d416
    • A
      netfilter: nf_conntrack: initialize net.ct.generation · ee214d54
      Andrey Vagin 提交于
      [  251.920788] INFO: trying to register non-static key.
      [  251.921386] the code is fine but needs lockdep annotation.
      [  251.921386] turning off the locking correctness validator.
      [  251.921386] CPU: 2 PID: 15715 Comm: socket_listen Not tainted 3.14.0+ #294
      [  251.921386] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  251.921386]  0000000000000000 000000009d18c210 ffff880075f039b8 ffffffff816b7ecd
      [  251.921386]  ffffffff822c3b10 ffff880075f039c8 ffffffff816b36f4 ffff880075f03aa0
      [  251.921386]  ffffffff810c65ff ffffffff810c4a85 00000000fffffe01 ffffffffa0075172
      [  251.921386] Call Trace:
      [  251.921386]  [<ffffffff816b7ecd>] dump_stack+0x45/0x56
      [  251.921386]  [<ffffffff816b36f4>] register_lock_class.part.24+0x38/0x3c
      [  251.921386]  [<ffffffff810c65ff>] __lock_acquire+0x168f/0x1b40
      [  251.921386]  [<ffffffff810c4a85>] ? trace_hardirqs_on_caller+0x105/0x1d0
      [  251.921386]  [<ffffffffa0075172>] ? nf_nat_setup_info+0x252/0x3a0 [nf_nat]
      [  251.921386]  [<ffffffff816c1215>] ? _raw_spin_unlock_bh+0x35/0x40
      [  251.921386]  [<ffffffffa0075172>] ? nf_nat_setup_info+0x252/0x3a0 [nf_nat]
      [  251.921386]  [<ffffffff810c7272>] lock_acquire+0xa2/0x120
      [  251.921386]  [<ffffffffa008ab90>] ? ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
      [  251.921386]  [<ffffffffa0055989>] __nf_conntrack_confirm+0x129/0x410 [nf_conntrack]
      [  251.921386]  [<ffffffffa008ab90>] ? ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
      [  251.921386]  [<ffffffffa008ab90>] ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
      [  251.921386]  [<ffffffff815e7b00>] ? ip_fragment+0x9f0/0x9f0
      [  251.921386]  [<ffffffff815d8c5a>] nf_iterate+0xaa/0xc0
      [  251.921386]  [<ffffffff815e7b00>] ? ip_fragment+0x9f0/0x9f0
      [  251.921386]  [<ffffffff815d8d14>] nf_hook_slow+0xa4/0x190
      [  251.921386]  [<ffffffff815e7b00>] ? ip_fragment+0x9f0/0x9f0
      [  251.921386]  [<ffffffff815e98f2>] ip_output+0x92/0x100
      [  251.921386]  [<ffffffff815e8df9>] ip_local_out+0x29/0x90
      [  251.921386]  [<ffffffff815e9240>] ip_queue_xmit+0x170/0x4c0
      [  251.921386]  [<ffffffff815e90d5>] ? ip_queue_xmit+0x5/0x4c0
      [  251.921386]  [<ffffffff81601208>] tcp_transmit_skb+0x498/0x960
      [  251.921386]  [<ffffffff81602d82>] tcp_connect+0x812/0x960
      [  251.921386]  [<ffffffff810e3dc5>] ? ktime_get_real+0x25/0x70
      [  251.921386]  [<ffffffff8159ea2a>] ? secure_tcp_sequence_number+0x6a/0xc0
      [  251.921386]  [<ffffffff81606f57>] tcp_v4_connect+0x317/0x470
      [  251.921386]  [<ffffffff8161f645>] __inet_stream_connect+0xb5/0x330
      [  251.921386]  [<ffffffff8158dfc3>] ? lock_sock_nested+0x33/0xa0
      [  251.921386]  [<ffffffff810c4b5d>] ? trace_hardirqs_on+0xd/0x10
      [  251.921386]  [<ffffffff81078885>] ? __local_bh_enable_ip+0x75/0xe0
      [  251.921386]  [<ffffffff8161f8f8>] inet_stream_connect+0x38/0x50
      [  251.921386]  [<ffffffff8158b157>] SYSC_connect+0xe7/0x120
      [  251.921386]  [<ffffffff810e3789>] ? current_kernel_time+0x69/0xd0
      [  251.921386]  [<ffffffff810c4a85>] ? trace_hardirqs_on_caller+0x105/0x1d0
      [  251.921386]  [<ffffffff810c4b5d>] ? trace_hardirqs_on+0xd/0x10
      [  251.921386]  [<ffffffff8158c36e>] SyS_connect+0xe/0x10
      [  251.921386]  [<ffffffff816caf69>] system_call_fastpath+0x16/0x1b
      [  312.014104] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=60003 jiffies, g=42359, c=42358, q=333)
      [  312.015097] INFO: Stall ended before state dump start
      
      Fixes: 93bb0ceb ("netfilter: conntrack: remove central spinlock nf_conntrack_lock")
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      ee214d54
    • M
      filter: prevent nla extensions to peek beyond the end of the message · 05ab8f26
      Mathias Krause 提交于
      The BPF_S_ANC_NLATTR and BPF_S_ANC_NLATTR_NEST extensions fail to check
      for a minimal message length before testing the supplied offset to be
      within the bounds of the message. This allows the subtraction of the nla
      header to underflow and therefore -- as the data type is unsigned --
      allowing far to big offset and length values for the search of the
      netlink attribute.
      
      The remainder calculation for the BPF_S_ANC_NLATTR_NEST extension is
      also wrong. It has the minuend and subtrahend mixed up, therefore
      calculates a huge length value, allowing to overrun the end of the
      message while looking for the netlink attribute.
      
      The following three BPF snippets will trigger the bugs when attached to
      a UNIX datagram socket and parsing a message with length 1, 2 or 3.
      
       ,-[ PoC for missing size check in BPF_S_ANC_NLATTR ]--
       | ld	#0x87654321
       | ldx	#42
       | ld	#nla
       | ret	a
       `---
      
       ,-[ PoC for the same bug in BPF_S_ANC_NLATTR_NEST ]--
       | ld	#0x87654321
       | ldx	#42
       | ld	#nlan
       | ret	a
       `---
      
       ,-[ PoC for wrong remainder calculation in BPF_S_ANC_NLATTR_NEST ]--
       | ; (needs a fake netlink header at offset 0)
       | ld	#0
       | ldx	#42
       | ld	#nlan
       | ret	a
       `---
      
      Fix the first issue by ensuring the message length fulfills the minimal
      size constrains of a nla header. Fix the second bug by getting the math
      for the remainder calculation right.
      
      Fixes: 4738c1db ("[SKFILTER]: Add SKF_ADF_NLATTR instruction")
      Fixes: d214c753 ("filter: add SKF_AD_NLATTR_NEST to look for nested..")
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NMathias Krause <minipli@googlemail.com>
      Acked-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      05ab8f26
    • J
      ipv4: return valid RTA_IIF on ip route get · 91146153
      Julian Anastasov 提交于
      Extend commit 13378cad
      ("ipv4: Change rt->rt_iif encoding.") from 3.6 to return valid
      RTA_IIF on 'ip route get ... iif DEVICE' instead of rt_iif 0
      which is displayed as 'iif *'.
      
      inet_iif is not appropriate to use because skb_iif is not set.
      Use the skb->dev->ifindex instead.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      91146153
    • W
      net: ipv4: current group_info should be put after using. · b04c4619
      Wang, Xiaoming 提交于
      Plug a group_info refcount leak in ping_init.
      group_info is only needed during initialization and
      the code failed to release the reference on exit.
      While here move grabbing the reference to a place
      where it is actually needed.
      Signed-off-by: NChuansheng Liu <chuansheng.liu@intel.com>
      Signed-off-by: NZhang Dongxing <dongxing.zhang@intel.com>
      Signed-off-by: Nxiaoming wang <xiaoming.wang@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b04c4619
  3. 13 4月, 2014 2 次提交
    • N
      vti: don't allow to add the same tunnel twice · 8d89dcdf
      Nicolas Dichtel 提交于
      Before the patch, it was possible to add two times the same tunnel:
      ip l a vti1 type vti remote 10.16.0.121 local 10.16.0.249 key 41
      ip l a vti2 type vti remote 10.16.0.121 local 10.16.0.249 key 41
      
      It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
      argument dev->type, which was set only later (when calling ndo_init handler
      in register_netdevice()). Let's set this type in the setup handler, which is
      called before newlink handler.
      
      Introduced by commit b9959fd3 ("vti: switch to new ip tunnel code").
      
      CC: Cong Wang <amwang@redhat.com>
      CC: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8d89dcdf
    • N
      gre: don't allow to add the same tunnel twice · 5a455275
      Nicolas Dichtel 提交于
      Before the patch, it was possible to add two times the same tunnel:
      ip l a gre1 type gre remote 10.16.0.121 local 10.16.0.249
      ip l a gre2 type gre remote 10.16.0.121 local 10.16.0.249
      
      It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
      argument dev->type, which was set only later (when calling ndo_init handler
      in register_netdevice()). Let's set this type in the setup handler, which is
      called before newlink handler.
      
      Introduced by commit c5441932 ("GRE: Refactor GRE tunneling code.").
      
      CC: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a455275
  4. 12 4月, 2014 4 次提交
    • D
      pktgen: be friendly to LLTX devices · 0f2eea4b
      Daniel Borkmann 提交于
      Similarly to commit 43279500 ("packet: respect devices with
      LLTX flag in direct xmit"), we can basically apply the very same
      to pktgen. This will help testing against LLTX devices such as
      dummy driver (or others), which only have a single netdevice txq
      and would otherwise require locking their txq from pktgen side
      while e.g. in dummy case, we would not need any locking. Fix this
      by making use of HARD_TX_{UN,}LOCK API, so that NETIF_F_LLTX will
      be respected.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f2eea4b
    • L
      net: ipv6: Fix oif in TCP SYN+ACK route lookup. · a36dbdb2
      Lorenzo Colitti 提交于
      net-next commit 9c76a114, ipv6: tcp_ipv6 policy route issue, had
      a boolean logic error that caused incorrect behaviour for TCP
      SYN+ACK when oif-based rules are in use. Specifically:
      
      1. If a SYN comes in from a global address, and sk_bound_dev_if
         is not set, the routing lookup has oif set to the interface
         the SYN came in on. Instead, it should have oif unset,
         because for global addresses, the incoming interface doesn't
         necessarily have any bearing on the interface the SYN+ACK is
         sent out on.
      2. If a SYN comes in from a link-local address, and
         sk_bound_dev_if is set, the routing lookup has oif set to the
         interface the SYN came in on. Instead, it should have oif set
         to sk_bound_dev_if, because that's what the application
         requested.
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a36dbdb2
    • D
      net: Fix use after free by removing length arg from sk_data_ready callbacks. · 676d2369
      David S. Miller 提交于
      Several spots in the kernel perform a sequence like:
      
      	skb_queue_tail(&sk->s_receive_queue, skb);
      	sk->sk_data_ready(sk, skb->len);
      
      But at the moment we place the SKB onto the socket receive queue it
      can be consumed and freed up.  So this skb->len access is potentially
      to freed up memory.
      
      Furthermore, the skb->len can be modified by the consumer so it is
      possible that the value isn't accurate.
      
      And finally, no actual implementation of this callback actually uses
      the length argument.  And since nobody actually cared about it's
      value, lots of call sites pass arbitrary values in such as '0' and
      even '1'.
      
      So just remove the length argument from the callback, that way there
      is no confusion whatsoever and all of these use-after-free cases get
      fixed as a side effect.
      
      Based upon a patch by Eric Dumazet and his suggestion to audit this
      issue tree-wide.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      676d2369
    • T
      bridge: Fix double free and memory leak around br_allowed_ingress · eb707618
      Toshiaki Makita 提交于
      br_allowed_ingress() has two problems.
      
      1. If br_allowed_ingress() is called by br_handle_frame_finish() and
      vlan_untag() in br_allowed_ingress() fails, skb will be freed by both
      vlan_untag() and br_handle_frame_finish().
      
      2. If br_allowed_ingress() is called by br_dev_xmit() and
      br_allowed_ingress() fails, the skb will not be freed.
      
      Fix these two problems by freeing the skb in br_allowed_ingress()
      if it fails.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eb707618
  5. 11 4月, 2014 1 次提交
  6. 10 4月, 2014 2 次提交
    • D
      l2tp: take PMTU from tunnel UDP socket · f34c4a35
      Dmitry Petukhov 提交于
      When l2tp driver tries to get PMTU for the tunnel destination, it uses
      the pointer to struct sock that represents PPPoX socket, while it
      should use the pointer that represents UDP socket of the tunnel.
      Signed-off-by: NDmitry Petukhov <dmgenp@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f34c4a35
    • D
      net: sctp: test if association is dead in sctp_wake_up_waiters · 1e1cdf8a
      Daniel Borkmann 提交于
      In function sctp_wake_up_waiters(), we need to involve a test
      if the association is declared dead. If so, we don't have any
      reference to a possible sibling association anymore and need
      to invoke sctp_write_space() instead, and normally walk the
      socket's associations and notify them of new wmem space. The
      reason for special casing is that otherwise, we could run
      into the following issue when a sctp_primitive_SEND() call
      from sctp_sendmsg() fails, and tries to flush an association's
      outq, i.e. in the following way:
      
      sctp_association_free()
      `-> list_del(&asoc->asocs)         <-- poisons list pointer
          asoc->base.dead = true
          sctp_outq_free(&asoc->outqueue)
          `-> __sctp_outq_teardown()
           `-> sctp_chunk_free()
            `-> consume_skb()
             `-> sctp_wfree()
              `-> sctp_wake_up_waiters() <-- dereferences poisoned pointers
                                             if asoc->ep->sndbuf_policy=0
      
      Therefore, only walk the list in an 'optimized' way if we find
      that the current association is still active. We could also use
      list_del_init() in addition when we call sctp_association_free(),
      but as Vlad suggests, we want to trap such bugs and thus leave
      it poisoned as is.
      
      Why is it safe to resolve the issue by testing for asoc->base.dead?
      Parallel calls to sctp_sendmsg() are protected under socket lock,
      that is lock_sock()/release_sock(). Only within that path under
      lock held, we're setting skb/chunk owner via sctp_set_owner_w().
      Eventually, chunks are freed directly by an association still
      under that lock. So when traversing association list on destruction
      time from sctp_wake_up_waiters() via sctp_wfree(), a different
      CPU can't be running sctp_wfree() while another one calls
      sctp_association_free() as both happens under the same lock.
      Therefore, this can also not race with setting/testing against
      asoc->base.dead as we are guaranteed for this to happen in order,
      under lock. Further, Vlad says: the times we check asoc->base.dead
      is when we've cached an association pointer for later processing.
      In between cache and processing, the association may have been
      freed and is simply still around due to reference counts. We check
      asoc->base.dead under a lock, so it should always be safe to check
      and not race against sctp_association_free(). Stress-testing seems
      fine now, too.
      
      Fixes: cd253f9f357d ("net: sctp: wake up all assocs if sndbuf policy is per socket")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Vlad Yasevich <vyasevic@redhat.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NVlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e1cdf8a
  7. 09 4月, 2014 1 次提交
    • D
      net: sctp: wake up all assocs if sndbuf policy is per socket · 52c35bef
      Daniel Borkmann 提交于
      SCTP charges chunks for wmem accounting via skb->truesize in
      sctp_set_owner_w(), and sctp_wfree() respectively as the
      reverse operation. If a sender runs out of wmem, it needs to
      wait via sctp_wait_for_sndbuf(), and gets woken up by a call
      to __sctp_write_space() mostly via sctp_wfree().
      
      __sctp_write_space() is being called per association. Although
      we assign sk->sk_write_space() to sctp_write_space(), which
      is then being done per socket, it is only used if send space
      is increased per socket option (SO_SNDBUF), as SOCK_USE_WRITE_QUEUE
      is set and therefore not invoked in sock_wfree().
      
      Commit 4c3a5bda ("sctp: Don't charge for data in sndbuf
      again when transmitting packet") fixed an issue where in case
      sctp_packet_transmit() manages to queue up more than sndbuf
      bytes, sctp_wait_for_sndbuf() will never be woken up again
      unless it is interrupted by a signal. However, a still
      remaining issue is that if net.sctp.sndbuf_policy=0, that is
      accounting per socket, and one-to-many sockets are in use,
      the reclaimed write space from sctp_wfree() is 'unfairly'
      handed back on the server to the association that is the lucky
      one to be woken up again via __sctp_write_space(), while
      the remaining associations are never be woken up again
      (unless by a signal).
      
      The effect disappears with net.sctp.sndbuf_policy=1, that
      is wmem accounting per association, as it guarantees a fair
      share of wmem among associations.
      
      Therefore, if we have reclaimed memory in case of per socket
      accounting, wake all related associations to a socket in a
      fair manner, that is, traverse the socket association list
      starting from the current neighbour of the association and
      issue a __sctp_write_space() to everyone until we end up
      waking ourselves. This guarantees that no association is
      preferred over another and even if more associations are
      taken into the one-to-many session, all receivers will get
      messages from the server and are not stalled forever on
      high load. This setting still leaves the advantage of per
      socket accounting in touch as an association can still use
      up global limits if unused by others.
      
      Fixes: 4eb701df ("[SCTP] Fix SCTP sendbuffer accouting.")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Vlad Yasevich <vyasevic@redhat.com>
      Acked-by: NVlad Yasevich <vyasevic@redhat.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52c35bef
  8. 08 4月, 2014 7 次提交
    • A
      netfilter: nf_conntrack: flush net_gre->keymap_list only from gre helper · 8142b227
      Andrey Vagin 提交于
      nf_ct_gre_keymap_flush() removes a nf_ct_gre_keymap object from
      net_gre->keymap_list and frees the object. But it doesn't clean
      a reference on this object from ct_pptp_info->keymap[dir].
      Then nf_ct_gre_keymap_destroy() may release the same object again.
      
      So nf_ct_gre_keymap_flush() can be called only when we are sure that
      when nf_ct_gre_keymap_destroy will not be called.
      
      nf_ct_gre_keymap is created by nf_ct_gre_keymap_add() and the right way
      to destroy it is to call nf_ct_gre_keymap_destroy().
      
      This patch marks nf_ct_gre_keymap_flush() as static, so this patch can
      break compilation of third party modules, which use
      nf_ct_gre_keymap_flush. I'm not sure this is the right way to deprecate
      this function.
      
      [  226.540793] general protection fault: 0000 [#1] SMP
      [  226.541750] Modules linked in: nf_nat_pptp nf_nat_proto_gre
      nf_conntrack_pptp nf_conntrack_proto_gre ip_gre ip_tunnel gre
      ppp_deflate bsd_comp ppp_async crc_ccitt ppp_generic slhc xt_nat
      iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
      nf_conntrack veth tun bridge stp llc ppdev microcode joydev pcspkr
      serio_raw virtio_console virtio_balloon floppy parport_pc parport
      pvpanic i2c_piix4 virtio_net drm_kms_helper ttm ata_generic virtio_pci
      virtio_ring virtio drm i2c_core pata_acpi [last unloaded: ip_tunnel]
      [  226.541776] CPU: 0 PID: 49 Comm: kworker/u4:2 Not tainted 3.14.0-rc8+ #101
      [  226.541776] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  226.541776] Workqueue: netns cleanup_net
      [  226.541776] task: ffff8800371e0000 ti: ffff88003730c000 task.ti: ffff88003730c000
      [  226.541776] RIP: 0010:[<ffffffff81389ba9>]  [<ffffffff81389ba9>] __list_del_entry+0x29/0xd0
      [  226.541776] RSP: 0018:ffff88003730dbd0  EFLAGS: 00010a83
      [  226.541776] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8800374e6c40 RCX: dead000000200200
      [  226.541776] RDX: 6b6b6b6b6b6b6b6b RSI: ffff8800371e07d0 RDI: ffff8800374e6c40
      [  226.541776] RBP: ffff88003730dbd0 R08: 0000000000000000 R09: 0000000000000000
      [  226.541776] R10: 0000000000000001 R11: ffff88003730d92e R12: 0000000000000002
      [  226.541776] R13: ffff88007a4c42d0 R14: ffff88007aef0000 R15: ffff880036cf0018
      [  226.541776] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
      [  226.541776] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  226.541776] CR2: 00007f07f643f7d0 CR3: 0000000036fd2000 CR4: 00000000000006f0
      [  226.541776] Stack:
      [  226.541776]  ffff88003730dbe8 ffffffff81389c5d ffff8800374ffbe4 ffff88003730dc28
      [  226.541776]  ffffffffa0162a43 ffffffffa01627c5 ffff88007a4c42d0 ffff88007aef0000
      [  226.541776]  ffffffffa01651c0 ffff88007a4c45e0 ffff88007aef0000 ffff88003730dc40
      [  226.541776] Call Trace:
      [  226.541776]  [<ffffffff81389c5d>] list_del+0xd/0x30
      [  226.541776]  [<ffffffffa0162a43>] nf_ct_gre_keymap_destroy+0x283/0x2d0 [nf_conntrack_proto_gre]
      [  226.541776]  [<ffffffffa01627c5>] ? nf_ct_gre_keymap_destroy+0x5/0x2d0 [nf_conntrack_proto_gre]
      [  226.541776]  [<ffffffffa0162ab7>] gre_destroy+0x27/0x70 [nf_conntrack_proto_gre]
      [  226.541776]  [<ffffffffa0117de3>] destroy_conntrack+0x83/0x200 [nf_conntrack]
      [  226.541776]  [<ffffffffa0117d87>] ? destroy_conntrack+0x27/0x200 [nf_conntrack]
      [  226.541776]  [<ffffffffa0117d60>] ? nf_conntrack_hash_check_insert+0x2e0/0x2e0 [nf_conntrack]
      [  226.541776]  [<ffffffff81630142>] nf_conntrack_destroy+0x72/0x180
      [  226.541776]  [<ffffffff816300d5>] ? nf_conntrack_destroy+0x5/0x180
      [  226.541776]  [<ffffffffa011ef80>] ? kill_l3proto+0x20/0x20 [nf_conntrack]
      [  226.541776]  [<ffffffffa011847e>] nf_ct_iterate_cleanup+0x14e/0x170 [nf_conntrack]
      [  226.541776]  [<ffffffffa011f74b>] nf_ct_l4proto_pernet_unregister+0x5b/0x90 [nf_conntrack]
      [  226.541776]  [<ffffffffa0162409>] proto_gre_net_exit+0x19/0x30 [nf_conntrack_proto_gre]
      [  226.541776]  [<ffffffff815edf89>] ops_exit_list.isra.1+0x39/0x60
      [  226.541776]  [<ffffffff815eecc0>] cleanup_net+0x100/0x1d0
      [  226.541776]  [<ffffffff810a608a>] process_one_work+0x1ea/0x4f0
      [  226.541776]  [<ffffffff810a6028>] ? process_one_work+0x188/0x4f0
      [  226.541776]  [<ffffffff810a64ab>] worker_thread+0x11b/0x3a0
      [  226.541776]  [<ffffffff810a6390>] ? process_one_work+0x4f0/0x4f0
      [  226.541776]  [<ffffffff810af42d>] kthread+0xed/0x110
      [  226.541776]  [<ffffffff8173d4dc>] ? _raw_spin_unlock_irq+0x2c/0x40
      [  226.541776]  [<ffffffff810af340>] ? kthread_create_on_node+0x200/0x200
      [  226.541776]  [<ffffffff8174747c>] ret_from_fork+0x7c/0xb0
      [  226.541776]  [<ffffffff810af340>] ? kthread_create_on_node+0x200/0x200
      [  226.541776] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de
      48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48
      39 c8 74 7a <4c> 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89
      42 08
      [  226.541776] RIP  [<ffffffff81389ba9>] __list_del_entry+0x29/0xd0
      [  226.541776]  RSP <ffff88003730dbd0>
      [  226.612193] ---[ end trace 985ae23ddfcc357c ]---
      
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      8142b227
    • C
      net: replace __this_cpu_inc in route.c with raw_cpu_inc · 3ed66e91
      Christoph Lameter 提交于
      The RT_CACHE_STAT_INC macro triggers the new preemption checks
      for __this_cpu ops.
      
      I do not see any other synchronization that would allow the use of a
      __this_cpu operation here however in commit dbd2915c ("[IPV4]:
      RT_CACHE_STAT_INC() warning fix") Andrew justifies the use of
      raw_smp_processor_id() here because "we do not care" about races.  In
      the past we agreed that the price of disabling interrupts here to get
      consistent counters would be too high.  These counters may be inaccurate
      due to race conditions.
      
      The use of __this_cpu op improves the situation already from what commit
      dbd2915c did since the single instruction emitted on x86 does not
      allow the race to occur anymore.  However, non x86 platforms could still
      experience a race here.
      
      Trace:
      
        __this_cpu_add operation in preemptible [00000000] code: avahi-daemon/1193
        caller is __this_cpu_preempt_check+0x38/0x60
        CPU: 1 PID: 1193 Comm: avahi-daemon Tainted: GF            3.12.0-rc4+ #187
        Call Trace:
          check_preemption_disabled+0xec/0x110
          __this_cpu_preempt_check+0x38/0x60
          __ip_route_output_key+0x575/0x8c0
          ip_route_output_flow+0x27/0x70
          udp_sendmsg+0x825/0xa20
          inet_sendmsg+0x85/0xc0
          sock_sendmsg+0x9c/0xd0
          ___sys_sendmsg+0x37c/0x390
          __sys_sendmsg+0x49/0x90
          SyS_sendmsg+0x12/0x20
          tracesys+0xe1/0xe6
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ed66e91
    • V
      netdev: remove potentially harmful checks · 6859e7df
      Veaceslav Falico 提交于
      Currently we're checking a variable for != NULL after actually
      dereferencing it, in netdev_lower_get_next_private*().
      
      It's counter-intuitive at best, and can lead to faulty usage (as it implies
      that the variable can be NULL), so fix it by removing the useless checks.
      Reported-by: NDaniel Borkmann <dborkman@redhat.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: stephen hemminger <stephen@networkplumber.org>
      CC: Jerry Chu <hkchu@google.com>
      Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6859e7df
    • D
      pktgen: fix xmit test for BQL enabled devices · 6f25cd47
      Daniel Borkmann 提交于
      Similarly as in commit 8e2f1a63 ("packet: fix packet_direct_xmit
      for BQL enabled drivers"), we test for __QUEUE_STATE_STACK_XOFF bit
      in pktgen's xmit, which would not fully fill the device's TX ring for
      BQL drivers that use netdev_tx_sent_queue(). Fix is to use, similarly
      as we do in packet sockets, netif_xmit_frozen_or_drv_stopped() test.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6f25cd47
    • G
      tipc: Let tipc_release() return 0 · 065d7e39
      Geert Uytterhoeven 提交于
      net/tipc/socket.c: In function ‘tipc_release’:
      net/tipc/socket.c:352: warning: ‘res’ is used uninitialized in this function
      
      Introduced by commit 24be34b5 ("tipc:
      eliminate upcall function pointers between port and socket"), which
      removed the sole initializer of "res".
      
      Just return 0 to fix it.
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      065d7e39
    • J
      mac802154: fix duplicate #include headers · 6c6a9855
      Jean Sacren 提交于
      The commit e6278d92 ("mac802154: use header operations to
      create/parse headers") included the header
      
      		net/ieee802154_netdev.h
      
      which had been included by the commit b70ab2e8 ("ieee802154:
      enforce consistent endianness in the 802.15.4 stack"). Fix this
      duplicate #include by deleting the latter one as the required header
      has already been in place.
      Signed-off-by: NJean Sacren <sakiwit@gmail.com>
      Cc: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
      Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Cc: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
      Cc: linux-zigbee-devel@lists.sourceforge.net
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6c6a9855
    • D
      net: filter: be more defensive on div/mod by X==0 · 5f9fde5f
      Daniel Borkmann 提交于
      The old interpreter behaviour was that we returned with 0
      whenever we found a division by 0 would take place. In the new
      interpreter we would currently just skip that instead and
      continue execution.
      
      It's true that a value of 0 as return might not be appropriate
      in all cases, but current users (socket filters -> drop
      packet, seccomp -> SECCOMP_RET_KILL, cls_bpf -> unclassified,
      etc) seem fine with that behaviour. Better this than undefined
      BPF program behaviour as it's expected that A contains the
      result of the division. In future, as more use cases open up,
      we could further adapt this return value to our needs, if
      necessary.
      
      So reintroduce return of 0 for division by 0 as in the old
      interpreter. Also in case of K which is guaranteed to be 32bit
      wide, sk_chk_filter() already takes care of preventing division
      by 0 invoked through K, so we can generally spare us these tests.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Reviewed-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5f9fde5f
  9. 05 4月, 2014 14 次提交