1. 06 Feb 2014 · 4 commits
    • P
      netfilter: nf_tables: fix potential oops when dumping sets · ec2c9935
      Committed by Patrick McHardy
      Commit c9c8e485 (netfilter: nf_tables: dump sets in all existing families)
      changed nft_ctx_init_from_setattr() to only look up the address family if it
      is not NFPROTO_UNSPEC. However if it is NFPROTO_UNSPEC and a table attribute
      is given, nftables_afinfo_lookup() will dereference the NULL afi pointer.
      
      Fix this by checking for a non-NULL afi, and also move a check added by
      that commit to the proper position.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • P
      netfilter: nf_tables: fix overrun in nf_tables_set_alloc_name() · 53b70287
      Committed by Patrick McHardy
      The map that is used to allocate anonymous sets is indeed
      BITS_PER_BYTE * PAGE_SIZE long.
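      
      The userspace sketch below shows why the bound matters. It is not the kernel code: the 4 KiB page size and the helper name alloc_anon_index() are assumptions for illustration. A one-page bitmap holds BITS_PER_BYTE * PAGE_SIZE bits, so any search over it must stop at that bound. A larger bound makes the search read past the page.

```c
#include <assert.h>

/* Assumption: a 4 KiB page; the kernel uses the architecture's PAGE_SIZE. */
#define SKETCH_PAGE_SIZE     4096
#define SKETCH_BITS_PER_BYTE 8
/* A one-page bitmap holds BITS_PER_BYTE * PAGE_SIZE bits. */
#define SKETCH_MAP_BITS      (SKETCH_BITS_PER_BYTE * SKETCH_PAGE_SIZE)

/* Hypothetical helper: find the first clear bit below nbits, set it and
 * return its index, or return -1 if every bit is in use. nbits must not
 * exceed the number of bits the map really holds. */
static int alloc_anon_index(unsigned char *map, int nbits)
{
    assert(nbits <= SKETCH_MAP_BITS);
    for (int i = 0; i < nbits; i++) {
        unsigned char mask = (unsigned char)(1u << (i % SKETCH_BITS_PER_BYTE));

        if (!(map[i / SKETCH_BITS_PER_BYTE] & mask)) {
            map[i / SKETCH_BITS_PER_BYTE] |= mask;
            return i;
        }
    }
    return -1;
}
```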
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • P
      netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt · e53376be
      Committed by Pablo Neira Ayuso
      With this patch, the conntrack refcount is initially set to zero and
      it is bumped once it is added to any of the lists, so we fulfill
      Eric's golden rule which is that all released objects always have a
      refcount that equals zero.
      
      Andrey Vagin reports that nf_conntrack_free can't be called for a
      conntrack with non-zero ref-counter, because it can race with
      nf_conntrack_find_get().
      
      A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
      ref-counter says that this conntrack is used. So when we release
      a conntrack with non-zero counter, we break this assumption.
      
      CPU1                                    CPU2
      ____nf_conntrack_find()
                                              nf_ct_put()
                                               destroy_conntrack()
                                              ...
                                              init_conntrack
                                               __nf_conntrack_alloc (set use = 1)
      atomic_inc_not_zero(&ct->use) (use = 2)
                                               if (!l4proto->new(ct, skb, dataoff, timeouts))
                                                nf_conntrack_free(ct); (use = 2 !!!)
                                              ...
                                              __nf_conntrack_alloc (set use = 1)
       if (!nf_ct_key_equal(h, tuple, zone))
        nf_ct_put(ct); (use = 0)
         destroy_conntrack()
                                              /* continue to work with CT */
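      
      The refcounting rule can be sketched in userspace C11. This illustrates the rule under assumptions and is not the conntrack code. struct obj and the obj_*() helpers are hypothetical, and inc_not_zero() mirrors the semantics of the kernel's atomic_inc_not_zero(). An object starts with a zero refcount, so a lockless reader cannot take a reference to it until it has been published.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct nf_conn; only the refcount matters. */
struct obj {
    atomic_int use;
};

/* Take a reference only if the object is still live (use != 0). */
static bool inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
        /* on failure `old` is reloaded; loop stops if it became 0 */
    }
    return false;
}

/* Allocation leaves use == 0: lockless readers cannot grab it yet. */
static void obj_init(struct obj *o) { atomic_init(&o->use, 0); }

/* The first reference is taken only when the object is added to a list. */
static void obj_publish(struct obj *o) { atomic_store(&o->use, 1); }

/* Eric's golden rule: only an object whose refcount is zero may be freed. */
static void obj_free(struct obj *o)
{
    assert(atomic_load(&o->use) == 0);
    /* ...return the object to the slab here... */
}
```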
      
      After applying the patch "[PATCH] netfilter: nf_conntrack: fix RCU
      race in nf_conntrack_find_get" another bug was triggered in
      destroy_conntrack():
      
      <4>[67096.759334] ------------[ cut here ]------------
      <2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
      ...
      <4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G         C ---------------    2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
      <4>[67096.759932] RIP: 0010:[<ffffffffa03d99ac>]  [<ffffffffa03d99ac>] destroy_conntrack+0x15c/0x190 [nf_conntrack]
      <4>[67096.760255] Call Trace:
      <4>[67096.760255]  [<ffffffff814844a7>] nf_conntrack_destroy+0x17/0x30
      <4>[67096.760255]  [<ffffffffa03d9bb5>] nf_conntrack_find_get+0x85/0x130 [nf_conntrack]
      <4>[67096.760255]  [<ffffffffa03d9fb2>] nf_conntrack_in+0x352/0xb60 [nf_conntrack]
      <4>[67096.760255]  [<ffffffffa048c771>] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
      <4>[67096.760255]  [<ffffffff81484419>] nf_iterate+0x69/0xb0
      <4>[67096.760255]  [<ffffffff814b5b00>] ? dst_output+0x0/0x20
      <4>[67096.760255]  [<ffffffff814845d4>] nf_hook_slow+0x74/0x110
      <4>[67096.760255]  [<ffffffff814b5b00>] ? dst_output+0x0/0x20
      <4>[67096.760255]  [<ffffffff814b66d5>] raw_sendmsg+0x775/0x910
      <4>[67096.760255]  [<ffffffff8104c5a8>] ? flush_tlb_others_ipi+0x128/0x130
      <4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
      <4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
      <4>[67096.760255]  [<ffffffff814c136a>] inet_sendmsg+0x4a/0xb0
      <4>[67096.760255]  [<ffffffff81444e93>] ? sock_sendmsg+0x13/0x140
      <4>[67096.760255]  [<ffffffff81444f97>] sock_sendmsg+0x117/0x140
      <4>[67096.760255]  [<ffffffff8102e299>] ? native_smp_send_reschedule+0x49/0x60
      <4>[67096.760255]  [<ffffffff81519beb>] ? _spin_unlock_bh+0x1b/0x20
      <4>[67096.760255]  [<ffffffff8109d930>] ? autoremove_wake_function+0x0/0x40
      <4>[67096.760255]  [<ffffffff814960f0>] ? do_ip_setsockopt+0x90/0xd80
      <4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
      <4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
      <4>[67096.760255]  [<ffffffff814457c9>] sys_sendto+0x139/0x190
      <4>[67096.760255]  [<ffffffff810efa77>] ? audit_syscall_entry+0x1d7/0x200
      <4>[67096.760255]  [<ffffffff810ef7c5>] ? __audit_syscall_exit+0x265/0x290
      <4>[67096.760255]  [<ffffffff81474daf>] compat_sys_socketcall+0x13f/0x210
      <4>[67096.760255]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
      
      I have reused the original title for the RFC patch that Andrey posted and
      most of the original patch description.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Andrew Vagin <avagin@parallels.com>
      Cc: Florian Westphal <fw@strlen.de>
      Reported-by: Andrew Vagin <avagin@parallels.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Andrew Vagin <avagin@parallels.com>
    • A
      netfilter: nf_nat_h323: fix crash in nf_ct_unlink_expect_report() · 829d9315
      Committed by Alexey Dobriyan
      Similar bug fixed in SIP module in 3f509c68 ("netfilter: nf_nat_sip: fix
      incorrect handling of EBUSY for RTCP expectation").
      
      BUG: unable to handle kernel paging request at 00100104
      IP: [<f8214f07>] nf_ct_unlink_expect_report+0x57/0xf0 [nf_conntrack]
      ...
      Call Trace:
        [<c0244bd8>] ? del_timer+0x48/0x70
        [<f8215687>] nf_ct_remove_expectations+0x47/0x60 [nf_conntrack]
        [<f8211c99>] nf_ct_delete_from_lists+0x59/0x90 [nf_conntrack]
        [<f8212e5e>] death_by_timeout+0x14e/0x1c0 [nf_conntrack]
        [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
        [<c024442d>] call_timer_fn+0x1d/0x80
        [<c024461e>] run_timer_softirq+0x18e/0x1a0
        [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
        [<c023e6f3>] __do_softirq+0xa3/0x170
        [<c023e650>] ? __local_bh_enable+0x70/0x70
        <IRQ>
        [<c023e587>] ? irq_exit+0x67/0xa0
        [<c0202af6>] ? do_IRQ+0x46/0xb0
        [<c027ad05>] ? clockevents_notify+0x35/0x110
        [<c066ac6c>] ? common_interrupt+0x2c/0x40
        [<c056e3c1>] ? cpuidle_enter_state+0x41/0xf0
        [<c056e6fb>] ? cpuidle_idle_call+0x8b/0x100
        [<c02085f8>] ? arch_cpu_idle+0x8/0x30
        [<c027314b>] ? cpu_idle_loop+0x4b/0x140
        [<c0273258>] ? cpu_startup_entry+0x18/0x20
        [<c066056d>] ? rest_init+0x5d/0x70
        [<c0813ac8>] ? start_kernel+0x2ec/0x2f2
        [<c081364f>] ? repair_env_string+0x5b/0x5b
        [<c0813269>] ? i386_start_kernel+0x33/0x35
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  2. 05 Feb 2014 · 3 commits
    • A
      netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get · c6825c09
      Committed by Andrey Vagin
      Let's look at destroy_conntrack:
      
      hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
      ...
      nf_conntrack_free(ct)
      	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
      
      net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.
      
      The hash is protected by RCU, so readers look up conntracks without
      locks. A conntrack is removed from the hash, but at this moment a few
      readers can still use it. Then this conntrack is released, and another
      thread creates a conntrack at the same address with an equal tuple.
      After this, a reader starts to validate the conntrack:
      * It's not dying, because a new conntrack was created
      * nf_ct_tuple_equal() returns true.
      
      But this conntrack is not initialized yet, so it cannot be used by two
      threads concurrently. In this case BUG_ON may be triggered from
      nf_nat_setup_info().
      
      Florian Westphal suggested checking the confirmed bit too. I think he's
      right.
      
      task 1			task 2			task 3
      			nf_conntrack_find_get
      			 ____nf_conntrack_find
      destroy_conntrack
       hlist_nulls_del_rcu
       nf_conntrack_free
       kmem_cache_free
      						__nf_conntrack_alloc
      						 kmem_cache_alloc
      						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
      			 if (nf_ct_is_dying(ct))
      			 if (!nf_ct_tuple_equal()
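      
      Below is a simplified sketch of the lookup-side validation with the extra confirmed check. The types are hypothetical: `key` stands in for the tuple and `confirmed` for the IPS_CONFIRMED status bit. The real fix lives in nf_conntrack_find_get() and differs in detail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified conntrack. */
struct ct {
    atomic_int use;
    bool dying;
    bool confirmed;   /* stands in for IPS_CONFIRMED */
    int key;          /* stands in for the tuple */
};

/* Take a reference only if the object is still live (use != 0). */
static bool inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}

/* Validate a candidate found by a lockless hash walk. With
 * SLAB_DESTROY_BY_RCU the memory may already be reused by a new,
 * half-initialized entry, so all checks run after the reference is
 * taken, and an unconfirmed entry is rejected as well. */
static struct ct *find_get(struct ct *candidate, int key)
{
    if (!inc_not_zero(&candidate->use))
        return NULL;
    if (candidate->dying || !candidate->confirmed || candidate->key != key) {
        /* Drop our reference; the kernel would use nf_ct_put() and
         * restart the lookup. */
        atomic_fetch_sub(&candidate->use, 1);
        return NULL;
    }
    return candidate;
}
```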
      
      I'm not sure that I have ever seen this race condition in real life.
      Currently we are investigating a bug that is reproduced on a few nodes.
      In our case one conntrack is initialized from a few tasks concurrently,
      and we don't have any other explanation for this.
      
      <2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
      ...
      <4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>]  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
      ...
      <4>[46267.085549] Call Trace:
      <4>[46267.085622]  [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
      <4>[46267.085697]  [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
      <4>[46267.085770]  [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
      <4>[46267.085843]  [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
      <4>[46267.085919]  [<ffffffff814841b9>] nf_iterate+0x69/0xb0
      <4>[46267.085991]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
      <4>[46267.086063]  [<ffffffff81484374>] nf_hook_slow+0x74/0x110
      <4>[46267.086133]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
      <4>[46267.086207]  [<ffffffff814b5890>] ? dst_output+0x0/0x20
      <4>[46267.086277]  [<ffffffff81495204>] ip_output+0xa4/0xc0
      <4>[46267.086346]  [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
      <4>[46267.086419]  [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
      <4>[46267.086491]  [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
      <4>[46267.086562]  [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
      <4>[46267.086638]  [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
      <4>[46267.086712]  [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
      <4>[46267.086785]  [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
      <4>[46267.086858]  [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
      <4>[46267.086936]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
      <4>[46267.087006]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
      <4>[46267.087081]  [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
      <4>[46267.087151]  [<ffffffff81445599>] sys_sendto+0x139/0x190
      <4>[46267.087229]  [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
      <4>[46267.087303]  [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
      <4>[46267.087378]  [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
      <4>[46267.087454]  [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
      <4>[46267.087531]  [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
      <4>[46267.087607]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
      <4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
      <1>[46267.088023] RIP  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • P
      netfilter: nf_tables: fix oops when deleting a chain with references · 3dd7279f
      Committed by Patrick McHardy
      The following commands trigger an oops:
      
       # nft -i
       nft> add table filter
       nft> add chain filter input { type filter hook input priority 0; }
       nft> add chain filter test
       nft> add rule filter input jump test
       nft> delete chain filter test
      
      We need to check the chain use counter before allowing destruction since
      we might have references from sets or jump rules.
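      
      Here is a minimal sketch of the check. It uses a hypothetical struct chain whose `use` field counts references from jump rules and sets. Returning -EBUSY is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical chain object: `use` counts references held by jump rules
 * and sets that point at this chain. */
struct chain {
    unsigned int use;
    bool destroyed;
};

/* Refuse to destroy a chain that is still referenced. */
static int chain_delete(struct chain *c)
{
    if (c->use > 0)
        return -EBUSY;
    c->destroyed = true;
    return 0;
}
```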
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=69341
      Reported-by: Matthew Ife <deleriux1@gmail.com>
      Tested-by: Matthew Ife <deleriux1@gmail.com>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • A
      netfilter: nft_ct: fix unconditional dump of 'dir' attr · 2a53bfb3
      Committed by Arturo Borrero
      We want to make sure that the information that we get from the kernel can
      be reinjected without trouble. The kernel shouldn't return an attribute
      that is not required, or is even prohibited.
      
      Unconditionally dumping NFTA_CT_DIRECTION could lead a userspace
      application to conclude that the attribute was originally set, even
      though it was not.
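      
      The sketch below is hypothetical. It emits an attribute only when that attribute was actually configured, so a dump can be fed back to the kernel unchanged. The `dir_set` flag and the small message buffer are assumptions that stand in for the real expression state and the netlink nla_put_*() helpers. The actual fix may decide based on the ct key instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical attribute IDs and message buffer. */
enum { ATTR_KEY = 1, ATTR_DIRECTION = 2 };

struct msg {
    int attrs[8];
    size_t n;
};

struct ct_expr {
    int key;
    bool dir_set;   /* assumption: true only if userspace gave a direction */
};

static void put_attr(struct msg *m, int type)
{
    if (m->n < sizeof(m->attrs) / sizeof(m->attrs[0]))
        m->attrs[m->n++] = type;
}

static bool msg_has(const struct msg *m, int type)
{
    for (size_t i = 0; i < m->n; i++)
        if (m->attrs[i] == type)
            return true;
    return false;
}

/* Dump only what was configured, so the result can be reinjected as-is. */
static void ct_dump(const struct ct_expr *e, struct msg *m)
{
    put_attr(m, ATTR_KEY);
    if (e->dir_set)
        put_attr(m, ATTR_DIRECTION);
}
```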
      Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  3. 04 Feb 2014 · 1 commit
  4. 28 Jan 2014 · 14 commits
    • M
      net: Document promote_secondaries · d922e1cb
      Committed by Martin Schwenke
      From 038a821667f62c496f2bbae27081b1b612122a97 Mon Sep 17 00:00:00 2001
      From: Martin Schwenke <martin@meltin.net>
      Date: Tue, 28 Jan 2014 15:16:49 +1100
      Subject: [PATCH] net: Document promote_secondaries
      
      This option was added a long time ago...
      
        commit 8f937c60
        Author: Harald Welte <laforge@gnumonks.org>
        Date:   Sun May 29 20:23:46 2005 -0700
      
          [IPV4]: Primary and secondary addresses
      Signed-off-by: Martin Schwenke <martin@meltin.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: gre: use icmp_hdr() to get inner ip header · c0c0c50f
      Committed by Duan Jiong
      When dealing with ICMP messages, skb->data points to the
      IP header that triggered the sending of the ICMP message.
      
      In gre_cisco_err(), parse_gre_header() is called, and
      iptunnel_pull_header() is called at the end of parse_gre_header()
      to pull the skb, so skb->data no longer points to the
      inner IP header.
      
      Unfortunately, ipgre_err() still needs the IP addresses in the
      inner IP header to look up the tunnel with ip_tunnel_lookup().
      
      So use icmp_hdr() to get the inner IP header instead of skb->data.
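      
      The userspace sketch below shows the pointer arithmetic. It assumes IPv4 and an 8-byte ICMP header, and the helper names are hypothetical. The IP header of the packet that triggered the error starts right after the ICMP header. It can therefore be located from icmp_hdr() no matter how far skb->data has been pulled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ICMP_HDR_LEN 8   /* type, code, checksum, 4 bytes rest-of-header */

/* Hypothetical helper: `icmp` plays the role of icmp_hdr(skb); the
 * embedded (inner) IP header follows the ICMP header. */
static const uint8_t *inner_ip_header(const uint8_t *icmp)
{
    return icmp + ICMP_HDR_LEN;
}

/* Copy out the inner IPv4 source/destination addresses (network order). */
static void inner_addrs(const uint8_t *icmp, uint32_t *saddr, uint32_t *daddr)
{
    const uint8_t *iph = inner_ip_header(icmp);

    memcpy(saddr, iph + 12, 4);   /* IPv4 source address at offset 12 */
    memcpy(daddr, iph + 16, 4);   /* IPv4 destination address at offset 16 */
}
```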
      Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      i40e: Add missing braces to i40e_dcb_need_reconfig() · 3d9667a9
      Committed by Dave Jones
      Indentation mismatch spotted with Coverity.
      Introduced in 4e3b35b0 ("i40e: add DCB and DCBNL support")
      Signed-off-by: Dave Jones <davej@fedoraproject.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      xen-netfront: fix resource leak in netfront · cefe0078
      Committed by Annie Li
      This patch removes grant transfer releasing code from netfront, and uses
      gnttab_end_foreign_access to end grant access since
      gnttab_end_foreign_access_ref may fail when the grant entry is
      currently used for reading or writing.
      
      * Clean up grant transfer code kept from the old netfront (2.6.18), which
      grants pages for access/map and transfer. Grant transfer is deprecated in
      the current netfront, so remove the corresponding release code for transfer.
      
      * Fix a resource leak: release grant access (through gnttab_end_foreign_access)
      and the skb on the tx/rx paths, and use get_page to ensure the page is released
      only once grant access has completed successfully.
      
      Xen-blkfront/xen-tpmfront/xen-pcifront have a similar issue, but patches
      for them will be created separately.
      
      V6: Correct subject line and commit message.
      
      V5: Remove unnecessary change in xennet_end_access.
      
      V4: Revert put_page in gnttab_end_foreign_access, and keep netfront change in
      single patch.
      
      V3: Changes as suggested by David Vrabel: ensure pages are not freed until
      grant access is ended.
      
      V2: Improve patch comments.
      Signed-off-by: Annie Li <annie.li@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      ce60e0c4
    • H
      hyperv: Add support for physically discontinuous receive buffer · b679ef73
      Committed by Haiyang Zhang
      This will allow us to use a bigger receive buffer and prevent allocation
      failures due to fragmented memory.
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      sky2: initialize napi before registering device · 731073b9
      Committed by Stanislaw Gruszka
      There is a race condition when netif_napi_add() is called after
      register_netdevice(): ->open() can be called before NAPI is initialized
      and trigger the BUG_ON() in napi_enable(), as in the messages below:
      
      [    9.699863] sky2: driver version 1.30
      [    9.699960] sky2 0000:02:00.0: Yukon-2 EC Ultra chip revision 2
      [    9.700020] sky2 0000:02:00.0: irq 45 for MSI/MSI-X
      [    9.700498] ------------[ cut here ]------------
      [    9.703391] kernel BUG at include/linux/netdevice.h:501!
      [    9.703391] invalid opcode: 0000 [#1] PREEMPT SMP
      <snip>
      [    9.830018] Call Trace:
      [    9.830018]  [<fa996169>] sky2_open+0x309/0x360 [sky2]
      [    9.830018]  [<c1007210>] ? via_no_dac+0x40/0x40
      [    9.830018]  [<c1007210>] ? via_no_dac+0x40/0x40
      [    9.830018]  [<c135ed4b>] __dev_open+0x9b/0x120
      [    9.830018]  [<c1431cbe>] ? _raw_spin_unlock_bh+0x1e/0x20
      [    9.830018]  [<c135efd9>] __dev_change_flags+0x89/0x150
      [    9.830018]  [<c135f148>] dev_change_flags+0x18/0x50
      [    9.830018]  [<c13bb8e0>] devinet_ioctl+0x5d0/0x6e0
      [    9.830018]  [<c13bcced>] inet_ioctl+0x6d/0xa0
      
      To fix the problem, this patch changes the order of initialization so
      that NAPI is initialized before the device is registered.
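      
      The toy model below shows the ordering problem. struct dev and its helpers are hypothetical, and the assert() stands in for the BUG_ON() in napi_enable(). Registration makes the device visible, so ->open() may run immediately. NAPI therefore has to be initialized before registration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device model: once registered, open() may run at once. */
struct dev {
    bool napi_initialized;
    bool registered;
    bool opened;
};

static void napi_init(struct dev *d) { d->napi_initialized = true; }

static void dev_open(struct dev *d)
{
    assert(d->napi_initialized);   /* models BUG_ON() in napi_enable() */
    d->opened = true;
}

/* Registration publishes the device; model an open() racing right after. */
static void dev_register(struct dev *d)
{
    d->registered = true;
    dev_open(d);
}

/* Fixed order: initialize NAPI first, then register. */
static void probe(struct dev *d)
{
    napi_init(d);
    dev_register(d);
}
```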
      
      Bug report:
      https://bugzilla.kernel.org/show_bug.cgi?id=67151
      
      Reported-and-tested-by: ebrahim.azarisooreh@gmail.com
      Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • H
      net: Fix memory leak if TPROXY used with TCP early demux · a452ce34
      Committed by Holger Eitzenberger
      I see a memory leak when using a transparent HTTP proxy using TPROXY
      together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):
      
      unreferenced object 0xffff88008cba4a40 (size 1696):
        comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
        hex dump (first 32 bytes):
          0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00  .. j@..7..2.....
          02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9
          [<ffffffff81270185>] sk_prot_alloc+0x29/0xc5
          [<ffffffff812702cf>] sk_clone_lock+0x14/0x283
          [<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b
          [<ffffffff8129a893>] netlink_broadcast+0x14/0x16
          [<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3
          [<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d
          [<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0
          [<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e
          [<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55
          [<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725
          [<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154
          [<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514
          [<ffffffff8127aa77>] process_backlog+0xee/0x1c5
          [<ffffffff8127c949>] net_rx_action+0xa7/0x200
          [<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157
      
      But there are many more, resulting in the machine going OOM after some
      days.
      
      From looking at the TPROXY code, and with help from Florian, I see
      that the memory leak is introduced in tcp_v4_early_demux():
      
        void tcp_v4_early_demux(struct sk_buff *skb)
        {
          /* ... */
      
          iph = ip_hdr(skb);
          th = tcp_hdr(skb);
      
          if (th->doff < sizeof(struct tcphdr) / 4)
              return;
      
          sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
                             iph->saddr, th->source,
                             iph->daddr, ntohs(th->dest),
                             skb->skb_iif);
          if (sk) {
              skb->sk = sk;
      
      where the socket is assigned unconditionally to skb->sk, also bumping
      the refcnt on it.  This is problematic, because in our case the skb
      already has a socket assigned by the TPROXY target.  This then results
      in the leak I see.
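      
      The sketch below shows one way to avoid the leak. It uses hypothetical struct sock and struct skb stand-ins and is not the actual patch. When a socket is already attached to the skb, it skips the early-demux assignment and the reference that assignment takes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct sock and struct sk_buff. */
struct sock { int refcnt; };
struct skb  { struct sock *sk; };

static void sock_hold(struct sock *sk) { sk->refcnt++; }

/* `found` is what the established-socket lookup returned (or NULL). */
static void early_demux(struct skb *skb, struct sock *found)
{
    if (skb->sk)              /* already assigned, e.g. by TPROXY */
        return;               /* no second reference, so nothing leaks */
    if (found) {
        sock_hold(found);     /* the lookup hands back a referenced socket */
        skb->sk = found;
    }
}
```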
      
      The very same issue seems to exist with IPv6, but I haven't tested it.
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'bonding' · 66dd1c07
      Committed by David S. Miller
      Veaceslav Falico says:
      
      ====================
      bonding: fix locking in bond_ab_arp_probe
      
      After the latest patches, on every call of bond_ab_arp_probe() without an
      active slave I see the following warning:
      
      [    7.912314] RTNL: assertion failed at net/core/dev.c (4494)
      ...
      [    7.922495]  [<ffffffff817acc6f>] dump_stack+0x51/0x72
      [    7.923714]  [<ffffffff8168795e>] netdev_master_upper_dev_get+0x6e/0x70
      [    7.924940]  [<ffffffff816a2a66>] rtnl_link_fill+0x116/0x260
      [    7.926143]  [<ffffffff817acc6f>] ? dump_stack+0x51/0x72
      [    7.927333]  [<ffffffff816a350c>] rtnl_fill_ifinfo+0x95c/0xb90
      [    7.928529]  [<ffffffff8167af2b>] ? __kmalloc_reserve+0x3b/0xa0
      [    7.929681]  [<ffffffff8167bfcf>] ? __alloc_skb+0x9f/0x1e0
      [    7.930827]  [<ffffffff816a3b64>] rtmsg_ifinfo+0x84/0x100
      [    7.931960]  [<ffffffffa00bca07>] bond_ab_arp_probe+0x1a7/0x370 [bonding]
      [    7.933133]  [<ffffffffa00bcd78>] bond_activebackup_arp_mon+0x1a8/0x2f0 [bonding]
      ...
      
      It happens because in bond_ab_arp_probe() we change the flags of a slave
      without holding the RTNL lock.
      
      To fix this - remove the useless curr_active_lock, RCUify it and lock RTNL
      while changing the slave's flags. Also, remove bond_ab_arp_probe() from
      under any locks in bond_ab_arp_mon().
      ====================
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      bonding: restructure locking of bond_ab_arp_probe() · f2ebd477
      Committed by Veaceslav Falico
      Currently we call it from RCU context; however, we use some functions
      that require RTNL to be held.
      
      Fix this by restructuring the locking: don't call it under any locks,
      acquire rcu_read_lock() if we're _only_ sending (i.e. the active slave
      is present), and use RTNL locking otherwise, when we need to modify the
      (in)active flags of a slave.
      
      CC: Jay Vosburgh <fubar@us.ibm.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      bonding: RCUify bond_ab_arp_probe · 98b90f26
      Committed by Veaceslav Falico
      Currently bond_ab_arp_probe() is always called under rcu_read_lock(),
      however to work with curr_active_slave we're still holding the
      curr_slave_lock.
      
      To remove that curr_slave_lock - rcu_dereference the bond's
      curr_active_slave and use it further - so that we're sure the slave won't
      go away, and we don't care if it will change in the meanwhile.
      
      CC: Jay Vosburgh <fubar@us.ibm.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      AF_PACKET: Add documentation for queue mapping fanout mode · bb9fbe2d
      Committed by Neil Horman
      Recently I added a new AF_PACKET fanout operation mode in commit
      2d36097d, but I forgot to document it.  Add PACKET_FANOUT_QM as an available mode
      in the af_packet documentation.  Applies to net-next.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Y
      bnx2x: More Shutdown revisions · 5f6db130
      Committed by Yuval Mintz
      Submission d9aee591 "bnx2x: Don't release PCI bars on shutdown" separated
      the PCI remove and shutdown flows, but pci_disable_device() is still
      being called on both.
      As a result, a dev_WARN_ONCE will be hit during shutdown for every bnx2x
      VF probed on a hypervisor (as its shutdown callback will be called and later
      pci_disable_sriov() will call its remove callback).
      
      This patch calls pci_disable_device() only in the remove flow.
      Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
      Signed-off-by: Ariel Elior <ariele@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      net: ipv4: Use PTR_ERR_OR_ZERO · 27d79f3b
      Committed by Sachin Kamat
      PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it
      also include missing err.h header.
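      
      For reference, below is a userspace re-creation of the error-pointer helpers involved. It is modeled on include/linux/err.h, where MAX_ERRNO is 4095, and is a sketch rather than the kernel header itself. PTR_ERR_OR_ZERO() returns the embedded error code for an error pointer and 0 for a valid pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace sketch modeled on include/linux/err.h. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }

/* Error pointers occupy the top MAX_ERRNO addresses. */
static inline bool IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* 0 for a valid pointer, the negative errno for an error pointer. */
static inline int PTR_ERR_OR_ZERO(const void *ptr)
{
    return IS_ERR(ptr) ? (int)PTR_ERR(ptr) : 0;
}
```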
      Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 27 Jan 2014 · 7 commits
  6. 26 Jan 2014 · 11 commits
    • J
      um: hostfs: make functions static · 9e443bc3
      Committed by James Hogan
      The hostfs_*() callback functions are all only used within
      hostfs_kern.c, so make them static.
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: user-mode-linux-devel@lists.sourceforge.net
      Signed-off-by: Richard Weinberger <richard@nod.at>
    • R
      um: Include generic barrier.h · 9af2452a
      Committed by Richard Weinberger
      ...to get smp_store_release().
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Richard Weinberger <richard@nod.at>
    • R
      um: Removed unused attributes from thread_struct · 61aad98a
      Committed by Richard Weinberger
      temp_stack and mm_count have no users and can be killed.
      Signed-off-by: Richard Weinberger <richard@nod.at>
    • L
      Merge branch 'ipmi' (ipmi patches from Corey Minyard) · b2e448ec
      Committed by Linus Torvalds
      Merge ipmi fixes from Corey Minyard:
       "Just some collected fixes for 3.14.  Nothing huge"
      
      * emailed patches from Corey Minyard <minyard@acm.org>:
        ipmi: Cleanup error return
        ipmi: fix timeout calculation when bmc is disconnected
        ipmi: use USEC_PER_SEC instead of 1000000 for more meaningful
        ipmi: remove deprecated IRQF_DISABLED
    • C
      ipmi: Cleanup error return · d02b3709
      Committed by Corey Minyard
      Return proper errors for a lot of IPMI failure cases.  Also call
      pci_disable_device when IPMI PCI devices are removed.
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • X
      ipmi: fix timeout calculation when bmc is disconnected · e21404dc
      Committed by Xie XiuQi
      When loading the ipmi_si module while the BMC is disconnected, we found
      that the timeout is much longer than 5 secs: it actually takes about
      3 mins and 20 secs (HZ=250).
      
      The error messages are as follows:
        Dec 12 19:08:59 linux kernel: IPMI BT: timeout in RD_WAIT [ ] 1 retries left
        Dec 12 19:08:59 linux kernel: BT: write 4 bytes seq=0x01 03 18 00 01
        [...]
        Dec 12 19:12:19 linux kernel: IPMI BT: timeout in RD_WAIT [ ]
        Dec 12 19:12:19 linux kernel: failed 2 retries, sending error response
        Dec 12 19:12:19 linux kernel: IPMI: BT reset (takes 5 secs)
        Dec 12 19:12:19 linux kernel: IPMI BT: flag reset [ ]
      
      The function wait_for_msg_done() uses schedule_timeout_uninterruptible(1)
      to sleep 1 tick, so we should subtract jiffies_to_usecs(1) instead of 100
      usecs from the timeout.
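      
      A simplified model of the wait loop reproduces the 3 mins and 20 secs. The model assumes a 5 sec timeout and assumes that each iteration sleeps one jiffy and subtracts a fixed step. With HZ=250 one jiffy is 4000 usecs. Subtracting only 100 usecs per iteration therefore stretches the wait by a factor of 40, to 200 secs.

```c
#include <assert.h>

/* Hypothetical model: each iteration sleeps one jiffy and subtracts
 * step_us from the remaining timeout. Returns the real wait in usecs. */
static long long real_wait_us(long long timeout_us, long long step_us, int hz)
{
    long long tick_us = 1000000LL / hz;                 /* jiffies_to_usecs(1) */
    long long iterations = (timeout_us + step_us - 1) / step_us;  /* ceil */

    return iterations * tick_us;
}
```

For example, real_wait_us(5000000, 100, 250) gives 200000000 usecs (200 secs). Using the tick length as the step, real_wait_us(5000000, 4000, 250) gives the intended 5000000 usecs.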
      Reported-by: Hu Shiyuan <hushiyuan@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • X
      ipmi: use USEC_PER_SEC instead of 1000000 for more meaningful · ccb3368c
      Committed by Xie XiuQi
      Use USEC_PER_SEC instead of 1000000, which makes the later bugfix
      clearer.
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      ipmi: remove deprecated IRQF_DISABLED · aa5b2bab
      Michael Opdenacker 提交于
      This patch proposes to remove the use of the IRQF_DISABLED flag.
      
      It has been a NOOP since 2.6.35 and will be removed one day.
      Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      Merge tag 'spi-v3.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi · 2d2e7d19
      Committed by Linus Torvalds
      Pull spi updates from Mark Brown:
       "A respun version of the merges for the pull request previously sent
        with a few additional fixes.  The last two merges were fixed up by
        hand since the branches have moved on and currently have the prior
        merge in them.
      
        Quite a busy release for the SPI subsystem, mostly in cleanups big and
        small scattered through the stack rather than anything else:
      
         - New driver for the Broadcom BC63xx HSSPI controller
         - Fix duplicate device registration for ACPI
         - Conversion of s3c64xx to DMAEngine (this pulls in platform and DMA
           changes upon which the transition depends)
         - Some small optimisations to reduce the amount of time we hold locks
           in the datapath, eliminate some redundant checks and the size of a
           spi_transfer
         - Lots of fixes, cleanups and general enhancements to drivers,
           especially the rspi and Atmel drivers"
      
      * tag 'spi-v3.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (112 commits)
        spi: core: Fix transfer failure when master->transfer_one returns positive value
        spi: Correct set_cs() documentation
        spi: Clarify transfer_one() w.r.t. spi_finalize_current_transfer()
        spi: Spelling s/finised/finished/
        spi: sc18is602: Convert to use bits_per_word_mask
        spi: Remove duplicate code to set default bits_per_word setting
        spi/pxa2xx: fix compilation warning when !CONFIG_PM_SLEEP
        spi: clps711x: Add MODULE_ALIAS to support module auto-loading
        spi: rspi: Add missing clk_disable() calls in error and cleanup paths
        spi: rspi: Spelling s/transmition/transmission/
        spi: rspi: Add support for specifying CPHA/CPOL
        spi/pxa2xx: initialize DMA channels to -1 to prevent inadvertent match
        spi: rspi: Add more QSPI register documentation
        spi: rspi: Add more RSPI register documentation
        spi: rspi: Remove dependency on DMAE for SHMOBILE
        spi/s3c64xx: Correct indentation
        spi: sh: Use spi_sh_clear_bit() instead of open-coded
        spi: bitbang: Grammar s/make to make/to make/
        spi: sh-hspi: Spelling s/recive/receive/
        spi: core: Improve tx/rx_nbits check comments
        ...
      2d2e7d19
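Several of the changes above (for example the sc18is602 conversion) rely on a controller advertising its supported word sizes through `bits_per_word_mask`. The C sketch below is modeled on the SPI core's `SPI_BPW_MASK()` convention (bit n-1 set means n-bit words are supported); `check_bits_per_word()` is a hypothetical helper illustrating the idea, not the core's actual validation code.

```c
#include <errno.h>
#include <stdint.h>

/* Modeled on SPI_BPW_MASK(): bit (bits - 1) set means the controller
 * supports words of that many bits. */
#define SPI_BPW_MASK(bits) (UINT32_C(1) << ((bits) - 1))

/* Hypothetical validation step: reject a transfer whose word size the
 * controller did not advertise. A zero mask means "no restriction declared". */
static int check_bits_per_word(uint32_t bits_per_word_mask,
			       unsigned int bits_per_word)
{
	if (bits_per_word < 1 || bits_per_word > 32)
		return -EINVAL;
	if (bits_per_word_mask &&
	    !(bits_per_word_mask & SPI_BPW_MASK(bits_per_word)))
		return -EINVAL;
	return 0;
}
```

A mask lets a driver declare any set of word sizes once, instead of open-coding per-driver range checks.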
    • L
      Merge tag 'regulator-v3.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator · 15333539
      Linus Torvalds committed
      Pull regulator updates from Mark Brown:
       "A respin of the merges in the previous pull request with one extra
        fix.
      
        A quiet release for the regulator API: quite a large number of small
        improvements all over, but other than the addition of new drivers for
        the AS3722 and MAX14577 there is nothing of substantial non-local
        impact"
      
      * tag 'regulator-v3.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (47 commits)
        regulator: pfuze100-regulator: Improve dev_info() message
        regulator: pfuze100-regulator: Fix some checkpatch complaints
        regulator: twl: Fix checkpatch issue
        regulator: core: Fix checkpatch issue
        regulator: anatop-regulator: Remove unneeded memset()
        regulator: s5m8767: Update LDO index in s5m8767-regulator.txt
        regulator: as3722: set enable time for SD0/1/6
        regulator: as3722: detect SD0 low-voltage mode
        regulator: tps62360: Fix up a pointer-integer size mismatch warning
        regulator: anatop-regulator: Remove unneeded kstrdup()
        regulator: act8865: Fix build error when !OF
        regulator: act8865: register all regulators regardless of how many are used
        regulator: wm831x-dcdc: Remove unneeded 'err' label
        regulator: anatop-regulator: Add MODULE_ALIAS()
        regulator: act8865: fix incorrect devm_kzalloc for act8865
        regulator: act8865: Remove set_suspend_[en|dis]able implementation
        regulator: act8865: Remove unneeded regulator_unregister() calls
        regulator: s2mps11: Clean up redundant code
        regulator: tps65910: Simplify setting enable_mask for regulators
        regulator: act8865: add device tree binding doc
        ...
      15333539
    • L
      Merge tag 'regmap-v3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap · bb1b6490
      Linus Torvalds committed
      Pull regmap updates from Mark Brown:
       "Nothing terribly exciting with regmap this release, mainly a few small
        extensions to allow more devices to be supported:
      
         - Allow the bulk I/O APIs to be used with no-bus regmaps
         - Support interrupt controllers with zero ack base
         - Warning and spelling fixes"
      
      * tag 'regmap-v3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
        regmap: fix a couple of typos
        regmap: Allow regmap_bulk_write() to work for "no-bus" regmaps
        regmap: Allow regmap_bulk_read() to work for "no-bus" regmaps
        regmap: irq: Allow using zero value for ack_base
        regmap: Fix 'ret' would return an uninitialized value
      bb1b6490
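The "no-bus" bulk I/O item above concerns register maps that have no raw bus to stream a block of data over, only per-register access callbacks. One plausible way to support a bulk write on such a map, and roughly what the shortlog entry suggests, is to fall back to one register write per value. The C sketch below is a heavily simplified model under that assumption; `struct nobus_map`, `nobus_bulk_write()` and `array_reg_write()` are hypothetical names, not regmap's API.

```c
#include <stddef.h>

/* Hypothetical model of a "no-bus" map: only a per-register write callback. */
struct nobus_map {
	int (*reg_write)(void *ctx, unsigned int reg, unsigned int val);
	void *ctx;
	unsigned int reg_stride;	/* address step between consecutive registers */
};

/* Sketch of a bulk write for such a map: issue one reg_write() per value,
 * stopping at the first error. */
static int nobus_bulk_write(struct nobus_map *map, unsigned int reg,
			    const unsigned int *vals, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		int ret = map->reg_write(map->ctx,
					 (unsigned int)(reg + i * map->reg_stride),
					 vals[i]);
		if (ret)
			return ret;
	}
	return 0;
}

/* Example in-memory backend (no bounds checking) used to exercise the sketch. */
static int array_reg_write(void *ctx, unsigned int reg, unsigned int val)
{
	unsigned int *regs = ctx;

	regs[reg] = val;
	return 0;
}
```

The trade-off is that the values are no longer written in a single bus transaction, so this fallback only makes sense where the device has no block-transfer path anyway.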