1. 01 9月, 2012 2 次提交
    • J
      tcp: TCP Fast Open Server - header & support functions · 10467163
      Jerry Chu 提交于
      This patch adds all the necessary data structure and support
      functions to implement TFO server side. It also documents a number
      of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
      extension MIBs.
      
      In addition, it includes the following:
      
      1. a new TCP_FASTOPEN socket option an application must call to
      supply a max backlog allowed in order to enable TFO on its listener.
      
      2. A number of key data structures:
      "fastopen_rsk" in tcp_sock - for a big socket to access its
      request_sock for retransmission and ack processing purpose. It is
      non-NULL iff 3WHS not completed.
      
      "fastopenq" in request_sock_queue - points to a per Fast Open
      listener data structure "fastopen_queue" to keep track of qlen (# of
      outstanding Fast Open requests) and max_qlen, among other things.
      
      "listener" in tcp_request_sock - to point to the original listener
      for book-keeping purpose, i.e., to maintain qlen against max_qlen
      as part of defense against IP spoofing attack.
      
      3. various data structure and functions, many in tcp_fastopen.c, to
      support server side Fast Open cookie operations, including
      /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10467163
    • A
      ipv4: Minor logic clean-up in ipv4_mtu · 98d75c37
      Alexander Duyck 提交于
      In ipv4_mtu there is some logic where we are testing for a non-zero value
      and a timer expiration, then setting the value to zero, and then testing if
      the value is zero we set it to a value based on the dst.  Instead of
      bothering with the extra steps it is easier to just cleanup the logic so
      that we set it to the dst based value if it is zero or if the timer has
      expired.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      98d75c37
  2. 31 8月, 2012 3 次提交
    • E
      ipv4: must use rcu protection while calling fib_lookup · c5ae7d41
      Eric Dumazet 提交于
      Following lockdep splat was reported by Pavel Roskin :
      
      [ 1570.586223] ===============================
      [ 1570.586225] [ INFO: suspicious RCU usage. ]
      [ 1570.586228] 3.6.0-rc3-wl-main #98 Not tainted
      [ 1570.586229] -------------------------------
      [ 1570.586231] /home/proski/src/linux/net/ipv4/route.c:645 suspicious rcu_dereference_check() usage!
      [ 1570.586233]
      [ 1570.586233] other info that might help us debug this:
      [ 1570.586233]
      [ 1570.586236]
      [ 1570.586236] rcu_scheduler_active = 1, debug_locks = 0
      [ 1570.586238] 2 locks held by Chrome_IOThread/4467:
      [ 1570.586240]  #0:  (slock-AF_INET){+.-...}, at: [<ffffffff814f2c0c>] release_sock+0x2c/0xa0
      [ 1570.586253]  #1:  (fnhe_lock){+.-...}, at: [<ffffffff815302fc>] update_or_create_fnhe+0x2c/0x270
      [ 1570.586260]
      [ 1570.586260] stack backtrace:
      [ 1570.586263] Pid: 4467, comm: Chrome_IOThread Not tainted 3.6.0-rc3-wl-main #98
      [ 1570.586265] Call Trace:
      [ 1570.586271]  [<ffffffff810976ed>] lockdep_rcu_suspicious+0xfd/0x130
      [ 1570.586275]  [<ffffffff8153042c>] update_or_create_fnhe+0x15c/0x270
      [ 1570.586278]  [<ffffffff815305b3>] __ip_rt_update_pmtu+0x73/0xb0
      [ 1570.586282]  [<ffffffff81530619>] ip_rt_update_pmtu+0x29/0x90
      [ 1570.586285]  [<ffffffff815411dc>] inet_csk_update_pmtu+0x2c/0x80
      [ 1570.586290]  [<ffffffff81558d1e>] tcp_v4_mtu_reduced+0x2e/0xc0
      [ 1570.586293]  [<ffffffff81553bc4>] tcp_release_cb+0xa4/0xb0
      [ 1570.586296]  [<ffffffff814f2c35>] release_sock+0x55/0xa0
      [ 1570.586300]  [<ffffffff815442ef>] tcp_sendmsg+0x4af/0xf50
      [ 1570.586305]  [<ffffffff8156fc60>] inet_sendmsg+0x120/0x230
      [ 1570.586308]  [<ffffffff8156fb40>] ? inet_sk_rebuild_header+0x40/0x40
      [ 1570.586312]  [<ffffffff814f4bdd>] ? sock_update_classid+0xbd/0x3b0
      [ 1570.586315]  [<ffffffff814f4c50>] ? sock_update_classid+0x130/0x3b0
      [ 1570.586320]  [<ffffffff814ec435>] do_sock_write+0xc5/0xe0
      [ 1570.586323]  [<ffffffff814ec4a3>] sock_aio_write+0x53/0x80
      [ 1570.586328]  [<ffffffff8114bc83>] do_sync_write+0xa3/0xe0
      [ 1570.586332]  [<ffffffff8114c5a5>] vfs_write+0x165/0x180
      [ 1570.586335]  [<ffffffff8114c805>] sys_write+0x45/0x90
      [ 1570.586340]  [<ffffffff815d2722>] system_call_fastpath+0x16/0x1b
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NPavel Roskin <proski@gnu.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c5ae7d41
    • F
      net: ipv4: ipmr_expire_timer causes crash when removing net namespace · acbb219d
      Francesco Ruggeri 提交于
      When tearing down a net namespace, ipv4 mr_table structures are freed
      without first deactivating their timers. This can result in a crash in
      run_timer_softirq.
      This patch mimics the corresponding behaviour in ipv6.
      Locking and synchronization seem to be adequate.
      We are about to kfree mrt, so existing code should already make sure that
      no other references to mrt are pending or can be created by incoming traffic.
      The functions invoked here do not cause new references to mrt or other
      race conditions to be created.
      Invoking del_timer_sync guarantees that ipmr_expire_timer is inactive.
      Both ipmr_expire_process (whose completion we may have to wait in
      del_timer_sync) and mroute_clean_tables internally use mfc_unres_lock
      or other synchronizations when needed, and they both only modify mrt.
      
      Tested in Linux 3.4.8.
      Signed-off-by: NFrancesco Ruggeri <fruggeri@aristanetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      acbb219d
    • P
      netfilter: nf_nat_sip: fix incorrect handling of EBUSY for RTCP expectation · 3f509c68
      Pablo Neira Ayuso 提交于
      We're hitting bug while trying to reinsert an already existing
      expectation:
      
      kernel BUG at kernel/timer.c:895!
      invalid opcode: 0000 [#1] SMP
      [...]
      Call Trace:
       <IRQ>
       [<ffffffffa0069563>] nf_ct_expect_related_report+0x4a0/0x57a [nf_conntrack]
       [<ffffffff812d423a>] ? in4_pton+0x72/0x131
       [<ffffffffa00ca69e>] ip_nat_sdp_media+0xeb/0x185 [nf_nat_sip]
       [<ffffffffa00b5b9b>] set_expected_rtp_rtcp+0x32d/0x39b [nf_conntrack_sip]
       [<ffffffffa00b5f15>] process_sdp+0x30c/0x3ec [nf_conntrack_sip]
       [<ffffffff8103f1eb>] ? irq_exit+0x9a/0x9c
       [<ffffffffa00ca738>] ? ip_nat_sdp_media+0x185/0x185 [nf_nat_sip]
      
      We have to remove the RTP expectation if the RTCP expectation hits EBUSY
      since we keep trying with other ports until we succeed.
      Reported-by: NRafal Fitt <rafalf@aplusc.com.pl>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      3f509c68
  3. 25 8月, 2012 1 次提交
  4. 24 8月, 2012 2 次提交
  5. 23 8月, 2012 3 次提交
    • E
      net: remove delay at device dismantle · 0115e8e3
      Eric Dumazet 提交于
      I noticed extra one second delay in device dismantle, tracked down to
      a call to dst_dev_event() while some call_rcu() are still in RCU queues.
      
      These call_rcu() were posted by rt_free(struct rtable *rt) calls.
      
      We then wait a little (but one second) in netdev_wait_allrefs() before
      kicking again NETDEV_UNREGISTER.
      
      As the call_rcu() are now completed, dst_dev_event() can do the needed
      device swap on busy dst.
      
      To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
      after a rcu_barrier(), but outside of RTNL lock.
      
      Use NETDEV_UNREGISTER_FINAL with care !
      
      Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL
      
      Also remove NETDEV_UNREGISTER_BATCH, as its not used anymore after
      IP cache removal.
      
      With help from Gao feng
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0115e8e3
    • E
      ipv4: properly update pmtu · 9b04f350
      Eric Dumazet 提交于
      Sylvain Munault reported following info :
      
       - TCP connection get "stuck" with data in send queue when doing
         "large" transfers ( like typing 'ps ax' on a ssh connection )
       - Only happens on path where the PMTU is lower than the MTU of
         the interface
       - Is not present right after boot, it only appears 10-20min after
         boot or so. (and that's inside the _same_ TCP connection, it works
         fine at first and then in the same ssh session, it'll get stuck)
       - Definitely seems related to fragments somehow since I see a router
         sending ICMP message saying fragmentation is needed.
       - Exact same setup works fine with kernel 3.5.1
      
      Problem happens when the 10 minutes (ip_rt_mtu_expires) expiration
      period is over.
      
      ip_rt_update_pmtu() calls dst_set_expires() to rearm a new expiration,
      but dst_set_expires() does nothing because dst.expires is already set.
      
      It seems we want to set the expires field to a new value, regardless
      of prior one.
      
      With help from Julian Anastasov.
      Reported-by: NSylvain Munaut <s.munaut@whatever-company.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      CC: Julian Anastasov <ja@ssi.bg>
      Tested-by: NSylvain Munaut <s.munaut@whatever-company.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b04f350
    • J
      netfilter: remove unnecessary goto statement for error recovery · 90efbed1
      Jean Sacren 提交于
      Usually it's a good practice to use goto statement for error recovery
      when initializing the module. This approach could be an overkill if:
      
       1) there is only one fail case;
       2) success and failure use the same return statement.
      
      For a cleaner approach, remove the unnecessary goto statement and
      directly implement error recovery.
      Signed-off-by: NJean Sacren <sakiwit@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      90efbed1
  6. 22 8月, 2012 3 次提交
    • E
      ipv4: fix ip header ident selection in __ip_make_skb() · a9915a1b
      Eric Dumazet 提交于
      Christian Casteyde reported a kmemcheck 32-bit read from uninitialized
      memory in __ip_select_ident().
      
      It turns out that __ip_make_skb() called ip_select_ident() before
      properly initializing iph->daddr.
      
      This is a bug uncovered by commit 1d861aa4 (inet: Minimize use of
      cached route inetpeer.)
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=46131Reported-by: NChristian Casteyde <casteyde.christian@free.fr>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9915a1b
    • C
      ipv4: Use newinet->inet_opt in inet_csk_route_child_sock() · 1a7b27c9
      Christoph Paasch 提交于
      Since 0e734419 ("ipv4: Use inet_csk_route_child_sock() in DCCP and
      TCP."), inet_csk_route_child_sock() is called instead of
      inet_csk_route_req().
      
      However, after creating the child-sock in tcp/dccp_v4_syn_recv_sock(),
      ireq->opt is set to NULL, before calling inet_csk_route_child_sock().
      Thus, inside inet_csk_route_child_sock() opt is always NULL and the
      SRR-options are not respected anymore.
      Packets sent by the server won't have the correct destination-IP.
      
      This patch fixes it by accessing newinet->inet_opt instead of ireq->opt
      inside inet_csk_route_child_sock().
      Reported-by: NLuca Boccassi <luca.boccassi@gmail.com>
      Signed-off-by: NChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1a7b27c9
    • E
      tcp: fix possible socket refcount problem · 144d56e9
      Eric Dumazet 提交于
      Commit 6f458dfb (tcp: improve latencies of timer triggered events)
      added bug leading to following trace :
      
      [ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
      [ 2866.131726]
      [ 2866.132188] =========================
      [ 2866.132281] [ BUG: held lock freed! ]
      [ 2866.132281] 3.6.0-rc1+ #622 Not tainted
      [ 2866.132281] -------------------------
      [ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
      [ 2866.132281]  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
      [ 2866.132281] 4 locks held by kworker/0:1/652:
      [ 2866.132281]  #0:  (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
      [ 2866.132281]  #1:  ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
      [ 2866.132281]  #2:  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
      [ 2866.132281]  #3:  (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
      [ 2866.132281]
      [ 2866.132281] stack backtrace:
      [ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
      [ 2866.132281] Call Trace:
      [ 2866.132281]  <IRQ>  [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
      [ 2866.132281]  [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
      [ 2866.132281]  [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
      [ 2866.132281]  [<ffffffff818a0839>] __sk_free+0xfd/0x114
      [ 2866.132281]  [<ffffffff818a08c0>] sk_free+0x1c/0x1e
      [ 2866.132281]  [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
      [ 2866.132281]  [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
      [ 2866.132281]  [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
      [ 2866.132281]  [<ffffffff810f5831>] ? rb_commit+0x58/0x85
      [ 2866.132281]  [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
      [ 2866.132281]  [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
      [ 2866.132281]  [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
      [ 2866.132281]  [<ffffffff81a1227c>] call_softirq+0x1c/0x30
      [ 2866.132281]  [<ffffffff81039f38>] do_softirq+0x4a/0xa6
      [ 2866.132281]  [<ffffffff81070f2b>] irq_exit+0x51/0xad
      [ 2866.132281]  [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
      [ 2866.132281]  [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
      [ 2866.132281]  <EOI>  [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
      [ 2866.132281]  [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
      [ 2866.132281]  [<ffffffff81078692>] mod_timer+0x178/0x1a9
      [ 2866.132281]  [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
      [ 2866.132281]  [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
      [ 2866.132281]  [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
      [ 2866.132281]  [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
      [ 2866.132281]  [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
      [ 2866.132281]  [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
      [ 2866.132281]  [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
      [ 2866.132281]  [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
      [ 2866.132281]  [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
      [ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
      [ 2866.132281]  [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
      [ 2866.132281]  [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
      [ 2866.132281]  [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
      [ 2866.132281]  [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
      [ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
      [ 2866.132281]  [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
      [ 2866.132281]  [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
      [ 2866.132281]  [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
      [ 2866.132281]  [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
      [ 2866.132281]  [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
      [ 2866.132281]  [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
      [ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
      [ 2866.132281]  [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
      [ 2866.132281]  [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
      [ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
      [ 2866.132281]  [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
      [ 2866.132281]  [<ffffffff810835d6>] process_one_work+0x24d/0x47f
      [ 2866.132281]  [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
      [ 2866.132281]  [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
      [ 2866.132281]  [<ffffffff81083a6d>] worker_thread+0x236/0x317
      [ 2866.132281]  [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
      [ 2866.132281]  [<ffffffff8108b7b8>] kthread+0x9a/0xa2
      [ 2866.132281]  [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
      [ 2866.132281]  [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
      [ 2866.132281]  [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
      [ 2866.132281]  [<ffffffff81a12180>] ? gs_change+0x13/0x13
      [ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
      [ 2866.309689] =============================================================================
      [ 2866.310254] BUG TCP (Not tainted): Object already free
      [ 2866.310254] -----------------------------------------------------------------------------
      [ 2866.310254]
      
      The bug comes from the fact that timer set in sk_reset_timer() can run
      before we actually do the sock_hold(). socket refcount reaches zero and
      we free the socket too soon.
      
      timer handler is not allowed to reduce socket refcnt if socket is owned
      by the user, or we need to change sk_reset_timer() implementation.
      
      We should take a reference on the socket in case TCP_DELACK_TIMER_DEFERRED
      or TCP_DELACK_TIMER_DEFERRED bit are set in tsq_flags
      
      Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
      was used instead of TCP_DELACK_TIMER_DEFERRED.
      
      For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
      even if not fired from a timer.
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Tested-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      144d56e9
  7. 20 8月, 2012 1 次提交
  8. 16 8月, 2012 1 次提交
  9. 15 8月, 2012 5 次提交
  10. 14 8月, 2012 1 次提交
    • W
      netfilter: PTR_RET can be used · 19e303d6
      Wu Fengguang 提交于
      This quiets the coccinelle warnings:
      
      net/bridge/netfilter/ebtable_filter.c:107:1-3: WARNING: PTR_RET can be used
      net/bridge/netfilter/ebtable_nat.c:107:1-3: WARNING: PTR_RET can be used
      net/ipv6/netfilter/ip6table_filter.c:65:1-3: WARNING: PTR_RET can be used
      net/ipv6/netfilter/ip6table_mangle.c:100:1-3: WARNING: PTR_RET can be used
      net/ipv6/netfilter/ip6table_raw.c:44:1-3: WARNING: PTR_RET can be used
      net/ipv6/netfilter/ip6table_security.c:62:1-3: WARNING: PTR_RET can be used
      net/ipv4/netfilter/iptable_filter.c:72:1-3: WARNING: PTR_RET can be used
      net/ipv4/netfilter/iptable_mangle.c:107:1-3: WARNING: PTR_RET can be used
      net/ipv4/netfilter/iptable_raw.c:51:1-3: WARNING: PTR_RET can be used
      net/ipv4/netfilter/iptable_security.c:70:1-3: WARNING: PTR_RET can be used
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      19e303d6
  11. 11 8月, 2012 1 次提交
  12. 10 8月, 2012 6 次提交
    • P
      netfilter: nf_nat_sip: fix via header translation with multiple parameters · f22eb25c
      Patrick McHardy 提交于
      Via-headers are parsed beginning at the first character after the Via-address.
      When the address is translated first and its length decreases, the offset to
      start parsing at is incorrect and header parameters might be missed.
      
      Update the offset after translating the Via-address to fix this.
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      f22eb25c
    • P
      netfilter: nf_ct_sip: fix IPv6 address parsing · 02b69cbd
      Patrick McHardy 提交于
      Within SIP messages IPv6 addresses are enclosed in square brackets in most
      cases, with the exception of the "received=" header parameter. Currently
      the helper fails to parse enclosed addresses.
      
      This patch:
      
      - changes the SIP address parsing function to enforce square brackets
        when required, and accept them when not required but present, as
        recommended by RFC 5118.
      
      - adds a new SDP address parsing function that never accepts square
        brackets since SDP doesn't use them.
      
      With these changes, the SIP helper correctly parses all test messages
      from RFC 5118 (Session Initiation Protocol (SIP) Torture Test Messages
      for Internet Protocol Version 6 (IPv6)).
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      02b69cbd
    • E
      net: tcp: ipv6_mapped needs sk_rx_dst_set method · 63d02d15
      Eric Dumazet 提交于
      commit 5d299f3d (net: ipv6: fix TCP early demux) added a
      regression for ipv6_mapped case.
      
      [   67.422369] SELinux: initialized (dev autofs, type autofs), uses
      genfs_contexts
      [   67.449678] SELinux: initialized (dev autofs, type autofs), uses
      genfs_contexts
      [   92.631060] BUG: unable to handle kernel NULL pointer dereference at
      (null)
      [   92.631435] IP: [<          (null)>]           (null)
      [   92.631645] PGD 0
      [   92.631846] Oops: 0010 [#1] SMP
      [   92.632095] Modules linked in: autofs4 sunrpc ipv6 dm_mirror
      dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp
      parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event
      snd_seq snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss snd_pcm
      snd_timer serio_raw button floppy snd i2c_i801 i2c_core soundcore
      snd_page_alloc shpchp ide_cd_mod cdrom microcode ehci_hcd ohci_hcd
      uhci_hcd
      [   92.634294] CPU 0
      [   92.634294] Pid: 4469, comm: sendmail Not tainted 3.6.0-rc1 #3
      [   92.634294] RIP: 0010:[<0000000000000000>]  [<          (null)>]
      (null)
      [   92.634294] RSP: 0018:ffff880245fc7cb0  EFLAGS: 00010282
      [   92.634294] RAX: ffffffffa01985f0 RBX: ffff88024827ad00 RCX:
      0000000000000000
      [   92.634294] RDX: 0000000000000218 RSI: ffff880254735380 RDI:
      ffff88024827ad00
      [   92.634294] RBP: ffff880245fc7cc8 R08: 0000000000000001 R09:
      0000000000000000
      [   92.634294] R10: 0000000000000000 R11: ffff880245fc7bf8 R12:
      ffff880254735380
      [   92.634294] R13: ffff880254735380 R14: 0000000000000000 R15:
      7fffffffffff0218
      [   92.634294] FS:  00007f4516ccd6f0(0000) GS:ffff880256600000(0000)
      knlGS:0000000000000000
      [   92.634294] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   92.634294] CR2: 0000000000000000 CR3: 0000000245ed1000 CR4:
      00000000000007f0
      [   92.634294] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      0000000000000000
      [   92.634294] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
      0000000000000400
      [   92.634294] Process sendmail (pid: 4469, threadinfo ffff880245fc6000,
      task ffff880254b8cac0)
      [   92.634294] Stack:
      [   92.634294]  ffffffff813837a7 ffff88024827ad00 ffff880254b6b0e8
      ffff880245fc7d68
      [   92.634294]  ffffffff81385083 00000000001d2680 ffff8802547353a8
      ffff880245fc7d18
      [   92.634294]  ffffffff8105903a ffff88024827ad60 0000000000000002
      00000000000000ff
      [   92.634294] Call Trace:
      [   92.634294]  [<ffffffff813837a7>] ? tcp_finish_connect+0x2c/0xfa
      [   92.634294]  [<ffffffff81385083>] tcp_rcv_state_process+0x2b6/0x9c6
      [   92.634294]  [<ffffffff8105903a>] ? sched_clock_cpu+0xc3/0xd1
      [   92.634294]  [<ffffffff81059073>] ? local_clock+0x2b/0x3c
      [   92.634294]  [<ffffffff8138caf3>] tcp_v4_do_rcv+0x63a/0x670
      [   92.634294]  [<ffffffff8133278e>] release_sock+0x128/0x1bd
      [   92.634294]  [<ffffffff8139f060>] __inet_stream_connect+0x1b1/0x352
      [   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
      [   92.634294]  [<ffffffff8104b333>] ? wake_up_bit+0x25/0x25
      [   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
      [   92.634294]  [<ffffffff8139f223>] ? inet_stream_connect+0x22/0x4b
      [   92.634294]  [<ffffffff8139f234>] inet_stream_connect+0x33/0x4b
      [   92.634294]  [<ffffffff8132e8cf>] sys_connect+0x78/0x9e
      [   92.634294]  [<ffffffff813fd407>] ? sysret_check+0x1b/0x56
      [   92.634294]  [<ffffffff81088503>] ? __audit_syscall_entry+0x195/0x1c8
      [   92.634294]  [<ffffffff811cc26e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [   92.634294]  [<ffffffff813fd3e2>] system_call_fastpath+0x16/0x1b
      [   92.634294] Code:  Bad RIP value.
      [   92.634294] RIP  [<          (null)>]           (null)
      [   92.634294]  RSP <ffff880245fc7cb0>
      [   92.634294] CR2: 0000000000000000
      [   92.648982] ---[ end trace 24e2bed94314c8d9 ]---
      [   92.649146] Kernel panic - not syncing: Fatal exception in interrupt
      
      Fix this using inet_sk_rx_dst_set(), and export this function in case
      IPv6 is modular.
      Reported-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63d02d15
    • E
      ipv4: tcp: unicast_sock should not land outside of TCP stack · 3a7c384f
      Eric Dumazet 提交于
      commit be9f4a44 (ipv4: tcp: remove per net tcp_sock) added a
      selinux regression, reported and bisected by John Stultz
      
      selinux_ip_postroute_compat() expect to find a valid sk->sk_security
      pointer, but this field is NULL for unicast_sock
      
      It turns out that unicast_sock are really temporary stuff to be able
      to reuse  part of IP stack (ip_append_data()/ip_push_pending_frames())
      
      Fact is that frames sent by ip_send_unicast_reply() should be orphaned
      to not fool LSM.
      
      Note IPv6 never had this problem, as tcp_v6_send_response() doesnt use a
      fake socket at all. I'll probably implement tcp_v4_send_response() to
      remove these unicast_sock in linux-3.7
      Reported-by: NJohn Stultz <johnstul@us.ibm.com>
      Bisected-by: NJohn Stultz <johnstul@us.ibm.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a7c384f
    • P
      net: Loopback ifindex is constant now · 1fb9489b
      Pavel Emelyanov 提交于
      As pointed out, there are places, that access net->loopback_dev->ifindex
      and after ifindex generation is made per-net this value becomes constant
      equals 1. So go ahead and introduce the LOOPBACK_IFINDEX constant and use
      it where appropriate.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1fb9489b
    • E
      time: jiffies_delta_to_clock_t() helper to the rescue · a399a805
      Eric Dumazet 提交于
      Various /proc/net files sometimes report crazy timer values, expressed
      in clock_t units.
      
      This happens when an expired timer delta (expires - jiffies) is passed
      to jiffies_to_clock_t().
      
      This function has an overflow in :
      
      return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);
      
      commit cbbc719f (time: Change jiffies_to_clock_t() argument type
      to unsigned long) only got around the problem.
      
      As we cant output negative values in /proc/net/tcp without breaking
      various tools, I suggest adding a jiffies_delta_to_clock_t() wrapper
      that caps the negative delta to a 0 value.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: hank <pyu@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a399a805
  13. 09 8月, 2012 2 次提交
  14. 08 8月, 2012 1 次提交
  15. 07 8月, 2012 5 次提交
    • E
      tcp: ecn: dont delay ACKS after CE · aae06bf5
      Eric Dumazet 提交于
      While playing with CoDel and ECN marking, I discovered a
      non optimal behavior of receiver of CE (Congestion Encountered)
      segments.
      
      In pathological cases, sender has reduced its cwnd to low values,
      and receiver delays its ACK (by 40 ms).
      
      While RFC 3168 6.1.3 (The TCP Receiver) doesn't explicitly recommend
      to send immediate ACKS, we believe its better to not delay ACKS, because
      a CE segment should give same signal than a dropped segment, and its
      quite important to reduce RTT to give ECE/CWR signals as fast as
      possible.
      
      Note we already call tcp_enter_quickack_mode() from TCP_ECN_check_ce()
      if we receive a retransmit, for the same reason.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aae06bf5
    • E
      net: tcp: GRO should be ECN friendly · a9e050f4
      Eric Dumazet 提交于
      While doing TCP ECN tests, I discovered GRO was reordering packets if it
      receives one packet with CE set, while previous packets in same NAPI run
      have ECT(0) for the same flow :
      
      09:25:25.857620 IP (tos 0x2,ECT(0), ttl 64, id 27893, offset 0, flags
      [DF], proto TCP (6), length 4396)
          172.30.42.19.54550 > 172.30.42.13.44139: Flags [.], seq
      233801:238145, ack 1, win 115, options [nop,nop,TS val 3397779 ecr
      1990627], length 4344
      
      09:25:25.857626 IP (tos 0x3,CE, ttl 64, id 27892, offset 0, flags [DF],
      proto TCP (6), length 1500)
          172.30.42.19.54550 > 172.30.42.13.44139: Flags [.], seq
      232353:233801, ack 1, win 115, options [nop,nop,TS val 3397779 ecr
      1990627], length 1448
      
      09:25:25.857638 IP (tos 0x0, ttl 64, id 34581, offset 0, flags [DF],
      proto TCP (6), length 64)
          172.30.42.13.44139 > 172.30.42.19.54550: Flags [.], cksum 0xac8f
      (incorrect -> 0xca69), ack 232353, win 1271, options [nop,nop,TS val
      1990627 ecr 3397779,nop,nop,sack 1 {233801:238145}], length 0
      
      We have two problems here :
      
      1) GRO reorders packets
      
        If NIC gave packet1, then packet2, which happen to be from "different
      flows"  GRO feeds stack with packet2, then packet1. I have yet to
      understand how to solve this problem.
      
      2) GRO is not ECN friendly
      
      Delivering packets out of order makes TCP stack not as fast as it could
      be.
      
      In this patch I suggest we make the tos test not part of the 'same_flow'
      determination, but part of the 'should flush' logic
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9e050f4
    • E
      net: ipv6: fix TCP early demux · 5d299f3d
      Eric Dumazet 提交于
      IPv6 needs a cookie in dst_check() call.
      
      We need to add rx_dst_cookie and provide a family independent
      sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5d299f3d
    • V
      ip: fix error handling in ip_finish_output2() · 9871f1ad
      Vasiliy Kulikov 提交于
      __neigh_create() returns either a pointer to struct neighbour or PTR_ERR().
      But the caller expects it to return either a pointer or NULL.  Replace
      the NULL check with IS_ERR() check.
      
      The bug was introduced in a263b309
      ("ipv4: Make neigh lookups directly in output packet path.").
      Signed-off-by: NVasily Kulikov <segoon@openwall.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9871f1ad
    • S
      tcp_output: fix sparse warning for tcp_wfree · 8e7dfbc8
      Silviu-Mihai Popescu 提交于
      Fix sparse warning:
      	* symbol 'tcp_wfree' was not declared. Should it be static?
      Signed-off-by: NSilviu-Mihai Popescu <silviupopescu1990@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e7dfbc8
  16. 04 8月, 2012 2 次提交
    • E
      ipv4: Introduce IN_DEV_NET_ROUTE_LOCALNET · 9eb43e76
      Eric Dumazet 提交于
      performance profiles show a high cost in the IN_DEV_ROUTE_LOCALNET()
      call done in ip_route_input_slow(), because of multiple dereferences,
      even if cache lines are clean and available in cpu caches.
      
      Since we already have the 'net' pointer, introduce
      IN_DEV_NET_ROUTE_LOCALNET() macro avoiding two dereferences
      (dev_net(in_dev->dev))
      
      Also change the tests to use IN_DEV_NET_ROUTE_LOCALNET() only if saddr
      or/and daddr are loopback addresse.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9eb43e76
    • E
      ipv4: change inet_addr_hash() · 40384999
      Eric Dumazet 提交于
      Use net_hash_mix(net) instead of hash_ptr(net, 8), and use
      hash_32() instead of using a serie of XOR
      
      Define IN4_ADDR_HSIZE_SHIFT for clarity
      
      __ip_dev_find() can perform the net_eq() call only if ifa_local
      matches the key, to avoid unneeded dereferences.
      
      remove inline attributes
      
      # size net/ipv4/devinet.o.before net/ipv4/devinet.o
         text	   data	    bss	    dec	    hex	filename
        17471	   2545	   2048	  22064	   5630	net/ipv4/devinet.o.before
        17335	   2545	   2048	  21928	   55a8	net/ipv4/devinet.o
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      40384999
  17. 02 8月, 2012 1 次提交