1. 03 4月, 2015 3 次提交
  2. 01 4月, 2015 5 次提交
    • J
      netlink: implement nla_get_in_addr and nla_get_in6_addr · 67b61f6c
      Jiri Benc 提交于
      Those are counterparts to nla_put_in_addr and nla_put_in6_addr.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      67b61f6c
    • J
      netlink: implement nla_put_in_addr and nla_put_in6_addr · 930345ea
      Jiri Benc 提交于
      IP addresses are often stored in netlink attributes. Add generic functions
      to do that.
      
      For nla_put_in_addr, it would be nicer to pass struct in_addr but this is
      not used universally throughout the kernel, in way too many places __be32 is
      used to store IPv4 address.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      930345ea
    • J
      tcp: simplify inetpeer_addr_base use · 8f55db48
      Jiri Benc 提交于
      In many places, the a6 field is typecasted to struct in6_addr. As the
      fields are in union anyway, just add in6_addr type to the union and get rid
      of the typecasting.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8f55db48
    • A
      fib_trie: Cleanup ip_fib_net_exit code path · 6e47d6ca
      Alexander Duyck 提交于
      While fixing a recent issue I noticed that we are doing some unnecessary
      work inside the loop for ip_fib_net_exit.  As such I am pulling out the
      initialization to NULL for the locally stored fib_local, fib_main, and
      fib_default.
      
      In addition I am restoring the original code for flushing the table as
      there is no need to split up the fib_table_flush and hlist_del work since
      the code for packing the tnodes with multiple key vectors was dropped.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e47d6ca
    • A
      fib_trie: Fix warning on fib4_rules_exit · ad88d051
      Alexander Duyck 提交于
      This fixes the following warning:
      
       BUG: sleeping function called from invalid context at mm/slub.c:1268
       in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u8:0
       INFO: lockdep is turned off.
       CPU: 3 PID: 6 Comm: kworker/u8:0 Tainted: G        W       4.0.0-rc5+ #895
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       Workqueue: netns cleanup_net
        0000000000000006 ffff88011953fa68 ffffffff81a203b6 000000002c3a2c39
        ffff88011952a680 ffff88011953fa98 ffffffff8109daf0 ffff8801186c6aa8
        ffffffff81fbc9e5 00000000000004f4 0000000000000000 ffff88011953fac8
       Call Trace:
        [<ffffffff81a203b6>] dump_stack+0x4c/0x65
        [<ffffffff8109daf0>] ___might_sleep+0x1c3/0x1cb
        [<ffffffff8109db70>] __might_sleep+0x78/0x80
        [<ffffffff8117a60e>] slab_pre_alloc_hook+0x31/0x8f
        [<ffffffff8117d4f6>] __kmalloc+0x69/0x14e
        [<ffffffff818ed0e1>] ? kzalloc.constprop.20+0xe/0x10
        [<ffffffff818ed0e1>] kzalloc.constprop.20+0xe/0x10
        [<ffffffff818ef622>] fib_trie_table+0x27/0x8b
        [<ffffffff818ef6bd>] fib_trie_unmerge+0x37/0x2a6
        [<ffffffff810b06e1>] ? arch_local_irq_save+0x9/0xc
        [<ffffffff818e9793>] fib_unmerge+0x2d/0xb3
        [<ffffffff818f5f56>] fib4_rule_delete+0x1f/0x52
        [<ffffffff817f1c3f>] ? fib_rules_unregister+0x30/0xb2
        [<ffffffff817f1c8b>] fib_rules_unregister+0x7c/0xb2
        [<ffffffff818f64a1>] fib4_rules_exit+0x15/0x18
        [<ffffffff818e8c0a>] ip_fib_net_exit+0x23/0xf2
        [<ffffffff818e91f8>] fib_net_exit+0x32/0x36
        [<ffffffff817c8352>] ops_exit_list+0x45/0x57
        [<ffffffff817c8d3d>] cleanup_net+0x13c/0x1cd
        [<ffffffff8108b05d>] process_one_work+0x255/0x4ad
        [<ffffffff8108af69>] ? process_one_work+0x161/0x4ad
        [<ffffffff8108b4b1>] worker_thread+0x1cd/0x2ab
        [<ffffffff8108b2e4>] ? process_scheduled_works+0x2f/0x2f
        [<ffffffff81090686>] kthread+0xd4/0xdc
        [<ffffffff8109ec8f>] ? local_clock+0x19/0x22
        [<ffffffff810905b2>] ? __kthread_parkme+0x83/0x83
        [<ffffffff81a2c0c8>] ret_from_fork+0x58/0x90
        [<ffffffff810905b2>] ? __kthread_parkme+0x83/0x83
      
      The issue was that as a part of exiting the default rules were being
      deleted which resulted in the local trie being unmerged.  By moving the
      freeing of the FIB tables up we can avoid the unmerge since there is no
      local table left when we call the fib4_rules_exit function.
      
      Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
      Reported-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ad88d051
  3. 30 3月, 2015 2 次提交
  4. 26 3月, 2015 1 次提交
  5. 25 3月, 2015 7 次提交
  6. 24 3月, 2015 8 次提交
    • M
      tcp: prevent fetching dst twice in early demux code · d0c294c5
      Michal Kubeček 提交于
      On s390x, gcc 4.8 compiles this part of tcp_v6_early_demux()
      
              struct dst_entry *dst = sk->sk_rx_dst;
      
              if (dst)
                      dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);
      
      to code reading sk->sk_rx_dst twice, once for the test and once for
      the argument of ip6_dst_check() (dst_check() is inline). This allows
      ip6_dst_check() to be called with null first argument, causing a crash.
      
      Protect sk->sk_rx_dst access by READ_ONCE() both in IPv4 and IPv6
      TCP early demux code.
      
      Fixes: 41063e9d ("ipv4: Early TCP socket demux.")
      Fixes: c7109986 ("ipv6: Early TCP socket demux")
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d0c294c5
    • F
      inet: fix double request socket freeing · c6973669
      Fan Du 提交于
      Eric Hugne reported following error :
      
      I'm hitting this warning on latest net-next when i try to SSH into a machine
      with eth0 added to a bridge (but i think the problem is older than that)
      
      Steps to reproduce:
      node2 ~ # brctl addif br0 eth0
      [  223.758785] device eth0 entered promiscuous mode
      node2 ~ # ip link set br0 up
      [  244.503614] br0: port 1(eth0) entered forwarding state
      [  244.505108] br0: port 1(eth0) entered forwarding state
      node2 ~ # [  251.160159] ------------[ cut here ]------------
      [  251.160831] WARNING: CPU: 0 PID: 3 at include/net/request_sock.h:102 tcp_v4_err+0x6b1/0x720()
      [  251.162077] Modules linked in:
      [  251.162496] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc3+ #18
      [  251.163334] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  251.164078]  ffffffff81a8365c ffff880038a6ba18 ffffffff8162ace4 0000000000009898
      [  251.165084]  0000000000000000 ffff880038a6ba58 ffffffff8104da85 ffff88003fa437c0
      [  251.166195]  ffff88003fa437c0 ffff88003fa74e00 ffff88003fa43bb8 ffff88003fad99a0
      [  251.167203] Call Trace:
      [  251.167533]  [<ffffffff8162ace4>] dump_stack+0x45/0x57
      [  251.168206]  [<ffffffff8104da85>] warn_slowpath_common+0x85/0xc0
      [  251.169239]  [<ffffffff8104db65>] warn_slowpath_null+0x15/0x20
      [  251.170271]  [<ffffffff81559d51>] tcp_v4_err+0x6b1/0x720
      [  251.171408]  [<ffffffff81630d03>] ? _raw_read_lock_irq+0x3/0x10
      [  251.172589]  [<ffffffff81534e20>] ? inet_del_offload+0x40/0x40
      [  251.173366]  [<ffffffff81569295>] icmp_socket_deliver+0x65/0xb0
      [  251.174134]  [<ffffffff815693a2>] icmp_unreach+0xc2/0x280
      [  251.174820]  [<ffffffff8156a82d>] icmp_rcv+0x2bd/0x3a0
      [  251.175473]  [<ffffffff81534ea2>] ip_local_deliver_finish+0x82/0x1e0
      [  251.176282]  [<ffffffff815354d8>] ip_local_deliver+0x88/0x90
      [  251.177004]  [<ffffffff815350f0>] ip_rcv_finish+0xf0/0x310
      [  251.177693]  [<ffffffff815357bc>] ip_rcv+0x2dc/0x390
      [  251.178336]  [<ffffffff814f5da3>] __netif_receive_skb_core+0x713/0xa20
      [  251.179170]  [<ffffffff814f7fca>] __netif_receive_skb+0x1a/0x80
      [  251.179922]  [<ffffffff814f97d4>] process_backlog+0x94/0x120
      [  251.180639]  [<ffffffff814f9612>] net_rx_action+0x1e2/0x310
      [  251.181356]  [<ffffffff81051267>] __do_softirq+0xa7/0x290
      [  251.182046]  [<ffffffff81051469>] run_ksoftirqd+0x19/0x30
      [  251.182726]  [<ffffffff8106cc23>] smpboot_thread_fn+0x153/0x1d0
      [  251.183485]  [<ffffffff8106cad0>] ? SyS_setgroups+0x130/0x130
      [  251.184228]  [<ffffffff8106935e>] kthread+0xee/0x110
      [  251.184871]  [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
      [  251.185690]  [<ffffffff81631108>] ret_from_fork+0x58/0x90
      [  251.186385]  [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
      [  251.187216] ---[ end trace c947fc7b24e42ea1 ]---
      [  259.542268] br0: port 1(eth0) entered forwarding state
      
      Remove the double calls to reqsk_put()
      
      [edumazet] :
      
      I got confused because reqsk_timer_handler() _has_ to call
      reqsk_put(req) after calling inet_csk_reqsk_queue_drop(), as
      the timer handler holds a reference on req.
      Signed-off-by: NFan Du <fan.du@intel.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NErik Hugne <erik.hugne@ericsson.com>
      Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c6973669
    • A
      fib_trie: Fix regression in handling of inflate/halve failure · b6f15f82
      Alexander Duyck 提交于
      When I updated the code to address a possible null pointer dereference in
      resize I ended up reverting an exception handling fix for the suffix length
      in the event that inflate or halve failed.  This change is meant to correct
      that by reverting the earlier fix and instead simply getting the parent
      again after inflate has been completed to avoid the possible null pointer
      issue.
      
      Fixes: ddb4b9a1 ("fib_trie: Address possible NULL pointer dereference in resize")
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b6f15f82
    • E
      ipv4: tcp: handle ICMP messages on TCP_NEW_SYN_RECV request sockets · 26e37360
      Eric Dumazet 提交于
      tcp_v4_err() can restrict lookups to ehash table, and not to listeners.
      
      Note this patch creates the infrastructure, but this means that ICMP
      messages for request sockets are ignored until complete conversion.
      
      New tcp_req_err() helper is exported so that we can use it in IPv6
      in following patch.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26e37360
    • E
      net: convert syn_wait_lock to a spinlock · b2827053
      Eric Dumazet 提交于
      This is a low hanging fruit, as we'll get rid of syn_wait_lock eventually.
      
      We hold syn_wait_lock for such small sections, that it makes no sense to use
      a read/write lock. A spin lock is simply faster.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2827053
    • E
      inet: remove some sk_listener dependencies · 8b929ab1
      Eric Dumazet 提交于
      listener can be source of false sharing. request sock has some
      useful information like : ireq->ir_iif, ireq->ir_num, ireq->ireq_net
      
      This patch does not solve the major problem of having to read
      sk->sk_protocol which is sharing a cache line with sk->sk_wmem_alloc.
      (This same field is read later in ip_build_and_send_pkt())
      
      One idea would be to move sk_protocol close to sk_family
      (using 8 bits instead of 16 for sk_family seems enough)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b929ab1
    • E
      inet: remove sk_listener parameter from syn_ack_timeout() · 42cb80a2
      Eric Dumazet 提交于
      It is not needed, and req->sk_listener points to the listener anyway.
      request_sock argument can be const.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      42cb80a2
    • E
      inet: cache listen_sock_qlen() and read rskq_defer_accept once · 2b41fab7
      Eric Dumazet 提交于
      Cache listen_sock_qlen() to limit false sharing, and read
      rskq_defer_accept once as it might change under us.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b41fab7
  7. 23 3月, 2015 1 次提交
  8. 21 3月, 2015 4 次提交
    • E
      Revert "selinux: add a skb_owned_by() hook" · d3593b5c
      Eric Dumazet 提交于
      This reverts commit ca10b9e9.
      
      No longer needed after commit eb8895de
      ("tcp: tcp_make_synack() should use sock_wmalloc")
      
      When under SYNFLOOD, we build lot of SYNACK and hit false sharing
      because of multiple modifications done on sk_listener->sk_wmem_alloc
      
      Since tcp_make_synack() uses sock_wmalloc(), there is no need
      to call skb_set_owner_w() again, as this adds two atomic operations.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3593b5c
    • J
      tcp: fix tcp fin memory accounting · d22e1537
      Josh Hunt 提交于
      tcp_send_fin() does not account for the memory it allocates properly, so
      sk_forward_alloc can be negative in cases where we've sent a FIN:
      
      ss example output (ss -amn | grep -B1 f4294):
      tcp    FIN-WAIT-1 0      1            192.168.0.1:45520         192.0.2.1:8080
      	skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0)
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d22e1537
    • E
      inet: get rid of central tcp/dccp listener timer · fa76ce73
      Eric Dumazet 提交于
      One of the major issue for TCP is the SYNACK rtx handling,
      done by inet_csk_reqsk_queue_prune(), fired by the keepalive
      timer of a TCP_LISTEN socket.
      
      This function runs for awful long times, with socket lock held,
      meaning that other cpus needing this lock have to spin for hundred of ms.
      
      SYNACK are sent in huge bursts, likely to cause severe drops anyway.
      
      This model was OK 15 years ago when memory was very tight.
      
      We now can afford to have a timer per request sock.
      
      Timer invocations no longer need to lock the listener,
      and can be run from all cpus in parallel.
      
      With following patch increasing somaxconn width to 32 bits,
      I tested a listener with more than 4 million active request sockets,
      and a steady SYNFLOOD of ~200,000 SYN per second.
      Host was sending ~830,000 SYNACK per second.
      
      This is ~100 times more what we could achieve before this patch.
      
      Later, we will get rid of the listener hash and use ehash instead.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa76ce73
    • E
      inet: drop prev pointer handling in request sock · 52452c54
      Eric Dumazet 提交于
      When request sock are put in ehash table, the whole notion
      of having a previous request to update dl_next is pointless.
      
      Also, following patch will get rid of big purge timer,
      so we want to delete a request sock without holding listener lock.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52452c54
  9. 19 3月, 2015 9 次提交