1. 23 8月, 2012 8 次提交
    • P
      packet: Protect packet sk list with mutex (v2) · 0fa7fa98
      Pavel Emelyanov 提交于
      Change since v1:
      
      * Fixed inuse counters access spotted by Eric
      
      In patch eea68e2f (packet: Report socket mclist info via diag module) I've
      introduced a "scheduling in atomic" problem in packet diag module -- the
      socket list is traversed under rcu_read_lock() while performed under it sk
      mclist access requires rtnl lock (i.e. -- mutex) to be taken.
      
      [152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
      [152363.820573] 4 locks held by crtools/12517:
      [152363.820581]  #0:  (sock_diag_mutex){+.+.+.}, at: [<ffffffff81a2dcb5>] sock_diag_rcv+0x1f/0x3e
      [152363.820613]  #1:  (sock_diag_table_mutex){+.+.+.}, at: [<ffffffff81a2de70>] sock_diag_rcv_msg+0xdb/0x11a
      [152363.820644]  #2:  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff81a67d01>] netlink_dump+0x23/0x1ab
      [152363.820693]  #3:  (rcu_read_lock){.+.+..}, at: [<ffffffff81b6a049>] packet_diag_dump+0x0/0x1af
      
      Similar thing was then re-introduced by further packet diag patches (fanount
      mutex and pgvec mutex for rings) :(
      
      Apart from being terribly sorry for the above, I propose to change the packet
      sk list protection from spinlock to mutex. This lock currently protects two
      modifications:
      
      * sklist
      * prot inuse counters
      
      The sklist modifications can be just reprotected with mutex since they already
      occur in a sleeping context. The inuse counters modifications are trickier -- the
      __this_cpu_-s are used inside, thus requiring the caller to handle the potential
      issues with contexts himself. Since packet sockets' counters are modified in two
      places only (packet_create and packet_release) we only need to protect the context
      from being preempted. BH disabling is not required in this case.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0fa7fa98
    • D
      af_packet: use define instead of constant · 9e67030a
      danborkmann@iogearbox.net 提交于
      Instead of using a hard-coded value for the status variable, it would make
      the code more readable to use its destined define from linux/if_packet.h.
      
      Signed-off-by: daniel.borkmann@tik.ee.ethz.ch
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9e67030a
    • Y
      rds: Don't disable BH on BH context · bfdc587c
      Ying Xue 提交于
      Since we have already in BH context when *_write_space(),
      *_data_ready() as well as *_state_change() are called, it's
      unnecessary to disable BH.
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bfdc587c
    • E
      ipv6: gre: fix ip6gre_err() · b87fb39e
      Eric Dumazet 提交于
      ip6gre_err() miscomputes grehlen (sizeof(ipv6h) is 4 or 8,
      not 40 as expected), and should take into account 'offset' parameter.
      
      Also uses pskb_may_pull() to cope with some fragged skbs
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Dmitry Kozlov <xeb@mail.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b87fb39e
    • E
      xfrm: fix RCU bugs · ef8531b6
      Eric Dumazet 提交于
      This patch reverts commit 56892261 (xfrm: Use rcu_dereference_bh to
      deference pointer protected by rcu_read_lock_bh), and fixes bugs
      introduced in commit 418a99ac ( Replace rwlock on xfrm_policy_afinfo
      with rcu )
      
      1) We properly use RCU variant in this file, not a mix of RCU/RCU_BH
      
      2) We must defer some writes after the synchronize_rcu() call or a reader
       can crash dereferencing NULL pointer.
      
      3) Now we use the xfrm_policy_afinfo_lock spinlock only from process
      context, we no longer need to block BH in xfrm_policy_register_afinfo()
      and xfrm_policy_unregister_afinfo()
      
      4) Can use RCU_INIT_POINTER() instead of rcu_assign_pointer() in
      xfrm_policy_unregister_afinfo()
      
      5) Remove a forward inline declaration (xfrm_policy_put_afinfo()),
        and also move xfrm_policy_get_afinfo() declaration.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Fan Du <fan.du@windriver.com>
      Cc: Priyanka Jain <Priyanka.Jain@freescale.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ef8531b6
    • E
      net: remove delay at device dismantle · 0115e8e3
      Eric Dumazet 提交于
      I noticed extra one second delay in device dismantle, tracked down to
      a call to dst_dev_event() while some call_rcu() are still in RCU queues.
      
      These call_rcu() were posted by rt_free(struct rtable *rt) calls.
      
      We then wait a little (but one second) in netdev_wait_allrefs() before
      kicking again NETDEV_UNREGISTER.
      
      As the call_rcu() are now completed, dst_dev_event() can do the needed
      device swap on busy dst.
      
      To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
      after a rcu_barrier(), but outside of RTNL lock.
      
      Use NETDEV_UNREGISTER_FINAL with care !
      
      Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL
      
      Also remove NETDEV_UNREGISTER_BATCH, as its not used anymore after
      IP cache removal.
      
      With help from Gao feng
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0115e8e3
    • J
      netfilter: remove unnecessary goto statement for error recovery · 90efbed1
      Jean Sacren 提交于
      Usually it's a good practice to use goto statement for error recovery
      when initializing the module. This approach could be an overkill if:
      
       1) there is only one fail case;
       2) success and failure use the same return statement.
      
      For a cleaner approach, remove the unnecessary goto statement and
      directly implement error recovery.
      Signed-off-by: NJean Sacren <sakiwit@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      90efbed1
    • M
      netfilter: replace list_for_each_continue_rcu with new interface · 6705e867
      Michael Wang 提交于
      This patch replaces list_for_each_continue_rcu() with
      list_for_each_entry_continue_rcu() to allow removing
      list_for_each_continue_rcu().
      Signed-off-by: NMichael Wang <wangyun@linux.vnet.ibm.com>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      6705e867
  2. 22 8月, 2012 4 次提交
    • E
      af_netlink: force credentials passing [CVE-2012-3520] · e0e3cea4
      Eric Dumazet 提交于
      Pablo Neira Ayuso discovered that avahi and
      potentially NetworkManager accept spoofed Netlink messages because of a
      kernel bug.  The kernel passes all-zero SCM_CREDENTIALS ancillary data
      to the receiver if the sender did not provide such data, instead of not
      including any such data at all or including the correct data from the
      peer (as it is the case with AF_UNIX).
      
      This bug was introduced in commit 16e57262
      (af_unix: dont send SCM_CREDENTIALS by default)
      
      This patch forces passing credentials for netlink, as
      before the regression.
      
      Another fix would be to not add SCM_CREDENTIALS in
      netlink messages if not provided by the sender, but it
      might break some programs.
      
      With help from Florian Weimer & Petr Matousek
      
      This issue is designated as CVE-2012-3520
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e0e3cea4
    • E
      ipv4: fix ip header ident selection in __ip_make_skb() · a9915a1b
      Eric Dumazet 提交于
      Christian Casteyde reported a kmemcheck 32-bit read from uninitialized
      memory in __ip_select_ident().
      
      It turns out that __ip_make_skb() called ip_select_ident() before
      properly initializing iph->daddr.
      
      This is a bug uncovered by commit 1d861aa4 (inet: Minimize use of
      cached route inetpeer.)
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=46131Reported-by: NChristian Casteyde <casteyde.christian@free.fr>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9915a1b
    • C
      ipv4: Use newinet->inet_opt in inet_csk_route_child_sock() · 1a7b27c9
      Christoph Paasch 提交于
      Since 0e734419 ("ipv4: Use inet_csk_route_child_sock() in DCCP and
      TCP."), inet_csk_route_child_sock() is called instead of
      inet_csk_route_req().
      
      However, after creating the child-sock in tcp/dccp_v4_syn_recv_sock(),
      ireq->opt is set to NULL, before calling inet_csk_route_child_sock().
      Thus, inside inet_csk_route_child_sock() opt is always NULL and the
      SRR-options are not respected anymore.
      Packets sent by the server won't have the correct destination-IP.
      
      This patch fixes it by accessing newinet->inet_opt instead of ireq->opt
      inside inet_csk_route_child_sock().
      Reported-by: NLuca Boccassi <luca.boccassi@gmail.com>
      Signed-off-by: NChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1a7b27c9
    • E
      tcp: fix possible socket refcount problem · 144d56e9
      Eric Dumazet 提交于
      Commit 6f458dfb (tcp: improve latencies of timer triggered events)
      added bug leading to following trace :
      
      [ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
      [ 2866.131726]
      [ 2866.132188] =========================
      [ 2866.132281] [ BUG: held lock freed! ]
      [ 2866.132281] 3.6.0-rc1+ #622 Not tainted
      [ 2866.132281] -------------------------
      [ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
      [ 2866.132281]  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
      [ 2866.132281] 4 locks held by kworker/0:1/652:
      [ 2866.132281]  #0:  (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
      [ 2866.132281]  #1:  ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
      [ 2866.132281]  #2:  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
      [ 2866.132281]  #3:  (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
      [ 2866.132281]
      [ 2866.132281] stack backtrace:
      [ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
      [ 2866.132281] Call Trace:
      [ 2866.132281]  <IRQ>  [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
      [ 2866.132281]  [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
      [ 2866.132281]  [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
      [ 2866.132281]  [<ffffffff818a0839>] __sk_free+0xfd/0x114
      [ 2866.132281]  [<ffffffff818a08c0>] sk_free+0x1c/0x1e
      [ 2866.132281]  [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
      [ 2866.132281]  [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
      [ 2866.132281]  [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
      [ 2866.132281]  [<ffffffff810f5831>] ? rb_commit+0x58/0x85
      [ 2866.132281]  [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
      [ 2866.132281]  [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
      [ 2866.132281]  [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
      [ 2866.132281]  [<ffffffff81a1227c>] call_softirq+0x1c/0x30
      [ 2866.132281]  [<ffffffff81039f38>] do_softirq+0x4a/0xa6
      [ 2866.132281]  [<ffffffff81070f2b>] irq_exit+0x51/0xad
      [ 2866.132281]  [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
      [ 2866.132281]  [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
      [ 2866.132281]  <EOI>  [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
      [ 2866.132281]  [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
      [ 2866.132281]  [<ffffffff81078692>] mod_timer+0x178/0x1a9
      [ 2866.132281]  [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
      [ 2866.132281]  [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
      [ 2866.132281]  [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
      [ 2866.132281]  [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
      [ 2866.132281]  [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
      [ 2866.132281]  [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
      [ 2866.132281]  [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
      [ 2866.132281]  [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
      [ 2866.132281]  [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
      [ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
      [ 2866.132281]  [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
      [ 2866.132281]  [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
      [ 2866.132281]  [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
      [ 2866.132281]  [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
      [ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
      [ 2866.132281]  [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
      [ 2866.132281]  [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
      [ 2866.132281]  [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
      [ 2866.132281]  [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
      [ 2866.132281]  [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
      [ 2866.132281]  [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
      [ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
      [ 2866.132281]  [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
      [ 2866.132281]  [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
      [ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
      [ 2866.132281]  [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
      [ 2866.132281]  [<ffffffff810835d6>] process_one_work+0x24d/0x47f
      [ 2866.132281]  [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
      [ 2866.132281]  [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
      [ 2866.132281]  [<ffffffff81083a6d>] worker_thread+0x236/0x317
      [ 2866.132281]  [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
      [ 2866.132281]  [<ffffffff8108b7b8>] kthread+0x9a/0xa2
      [ 2866.132281]  [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
      [ 2866.132281]  [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
      [ 2866.132281]  [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
      [ 2866.132281]  [<ffffffff81a12180>] ? gs_change+0x13/0x13
      [ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
      [ 2866.309689] =============================================================================
      [ 2866.310254] BUG TCP (Not tainted): Object already free
      [ 2866.310254] -----------------------------------------------------------------------------
      [ 2866.310254]
      
      The bug comes from the fact that timer set in sk_reset_timer() can run
      before we actually do the sock_hold(). socket refcount reaches zero and
      we free the socket too soon.
      
      timer handler is not allowed to reduce socket refcnt if socket is owned
      by the user, or we need to change sk_reset_timer() implementation.
      
      We should take a reference on the socket in case TCP_DELACK_TIMER_DEFERRED
      or TCP_DELACK_TIMER_DEFERRED bit are set in tsq_flags
      
      Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
      was used instead of TCP_DELACK_TIMER_DEFERRED.
      
      For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
      even if not fired from a timer.
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Tested-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      144d56e9
  3. 20 8月, 2012 19 次提交
  4. 17 8月, 2012 7 次提交
  5. 16 8月, 2012 2 次提交
    • P
      netfilter: nf_ct_expect: fix possible access to uninitialized timer · 2614f864
      Pablo Neira Ayuso 提交于
      In __nf_ct_expect_check, the function refresh_timer returns 1
      if a matching expectation is found and its timer is successfully
      refreshed. This results in nf_ct_expect_related returning 0.
      Note that at this point:
      
      - the passed expectation is not inserted in the expectation table
        and its timer was not initialized, since we have refreshed one
        matching/existing expectation.
      
      - nf_ct_expect_alloc uses kmem_cache_alloc, so the expectation
        timer is in some undefined state just after the allocation,
        until it is appropriately initialized.
      
      This can be a problem for the SIP helper during the expectation
      addition:
      
       ...
       if (nf_ct_expect_related(rtp_exp) == 0) {
               if (nf_ct_expect_related(rtcp_exp) != 0)
                       nf_ct_unexpect_related(rtp_exp);
       ...
      
      Note that nf_ct_expect_related(rtp_exp) may return 0 for the timer refresh
      case that is detailed above. Then, if nf_ct_unexpect_related(rtcp_exp)
      returns != 0, nf_ct_unexpect_related(rtp_exp) is called, which does:
      
       spin_lock_bh(&nf_conntrack_lock);
       if (del_timer(&exp->timeout)) {
               nf_ct_unlink_expect(exp);
               nf_ct_expect_put(exp);
       }
       spin_unlock_bh(&nf_conntrack_lock);
      
      Note that del_timer always returns false if the timer has been
      initialized.  However, the timer was not initialized since setup_timer
      was not called, therefore, the expectation timer remains in some
      undefined state. If I'm not missing anything, this may lead to the
      removal an unexistent expectation.
      
      To fix this, the optimization that allows refreshing an expectation
      is removed. Now nf_conntrack_expect_related looks more consistent
      to me since it always add the expectation in case that it returns
      success.
      
      Thanks to Patrick McHardy for participating in the discussion of
      this patch.
      
      I think this may be the source of the problem described by:
      http://marc.info/?l=netfilter-devel&m=134073514719421&w=2Reported-by: NRafal Fitt <rafalf@aplusc.com.pl>
      Acked-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      2614f864
    • M
      net: fix info leak in compat dev_ifconf() · 43da5f2e
      Mathias Krause 提交于
      The implementation of dev_ifconf() for the compat ioctl interface uses
      an intermediate ifc structure allocated in userland for the duration of
      the syscall. Though, it fails to initialize the padding bytes inserted
      for alignment and that for leaks four bytes of kernel stack. Add an
      explicit memset(0) before filling the structure to avoid the info leak.
      Signed-off-by: NMathias Krause <minipli@googlemail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43da5f2e