1. 21 Oct, 2015 21 commits
  2. 19 Oct, 2015 5 commits
    • RDS: fix rds-ping deadlock over TCP transport · 7b4b0009
      santosh.shilimkar@oracle.com committed
      Sowmini found a hang with rds-ping while testing RDS over TCP. It's
      a corner case and doesn't always happen. The issue is not reproducible
      with IB transport. It's clear from the dump below why we see it with RDS TCP.
      
       [<ffffffff8153b7e5>] do_tcp_setsockopt+0xb5/0x740
       [<ffffffff8153bec4>] tcp_setsockopt+0x24/0x30
       [<ffffffff814d57d4>] sock_common_setsockopt+0x14/0x20
       [<ffffffffa096071d>] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp]
       [<ffffffffa093b5f7>] rds_send_xmit+0xd7/0x740 [rds]
       [<ffffffffa093bda2>] rds_send_pong+0x142/0x180 [rds]
       [<ffffffffa0939d34>] rds_recv_incoming+0x274/0x330 [rds]
       [<ffffffff810815ae>] ? ttwu_queue+0x11e/0x130
       [<ffffffff814dcacd>] ? skb_copy_bits+0x6d/0x2c0
       [<ffffffffa0960350>] rds_tcp_data_recv+0x2f0/0x3d0 [rds_tcp]
       [<ffffffff8153d836>] tcp_read_sock+0x96/0x1c0
       [<ffffffffa0960060>] ? rds_tcp_recv_init+0x40/0x40 [rds_tcp]
       [<ffffffff814d6a90>] ? sock_def_write_space+0xa0/0xa0
       [<ffffffffa09604d1>] rds_tcp_data_ready+0xa1/0xf0 [rds_tcp]
       [<ffffffff81545249>] tcp_data_queue+0x379/0x5b0
       [<ffffffffa0960cdb>] ? rds_tcp_write_space+0xbb/0x110 [rds_tcp]
       [<ffffffff81547fd2>] tcp_rcv_established+0x2e2/0x6e0
       [<ffffffff81552602>] tcp_v4_do_rcv+0x122/0x220
       [<ffffffff81553627>] tcp_v4_rcv+0x867/0x880
       [<ffffffff8152e0b3>] ip_local_deliver_finish+0xa3/0x220
      
      This happens because the rds_send_xmit() chain wants to take
      sock_lock, which is already taken by tcp_v4_rcv() on its
      way to rds_tcp_data_ready(). Commit db6526dc ("RDS: use
      rds_send_xmit() state instead of RDS_LL_SEND_FULL") was
      trying to opportunistically finish the send request
      in the same thread context.

      But because of the above recursive lock hang with RDS TCP,
      the send work from rds_send_pong() needs to be deferred to a
      worker to avoid the lockup. Given that RDS ping is more of a
      connectivity test than a performance-critical path, it should be
      OK even for transports like IB.
      Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
      Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
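The recursion in the trace above can be mimicked with a toy Python model. This is not kernel code: `sock_lock`, `work_queue`, and the function names are illustrative stand-ins. Replying inline tries to retake a non-reentrant lock that the receive path already holds, while queueing the reply for a worker lets the lock be released first.

```python
import threading
from collections import deque

sock_lock = threading.Lock()  # non-reentrant, stands in for the socket lock
work_queue = deque()          # stands in for a kernel workqueue
sent = []                     # messages that actually went out


def send_xmit(msg):
    """Transmit path: needs sock_lock. Returns False where the kernel would hang."""
    if not sock_lock.acquire(blocking=False):
        return False  # lock is held by our own caller: recursive deadlock
    try:
        sent.append(msg)
        return True
    finally:
        sock_lock.release()


def recv_ping_inline():
    """Receive path holds sock_lock and replies in the same context."""
    with sock_lock:
        return send_xmit("pong")


def recv_ping_deferred():
    """Receive path holds sock_lock and only queues the reply."""
    with sock_lock:
        work_queue.append(lambda: send_xmit("pong"))


def run_worker():
    """Worker runs later, after the receive path has dropped sock_lock."""
    while work_queue:
        work_queue.popleft()()
```

In this model the inline variant never sends the pong, whereas the deferred variant sends it once `run_worker()` runs outside the lock.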
    • tcp: do not set queue_mapping on SYNACK · dc6ef6be
      Eric Dumazet committed
      At the time of commit fff32699 ("tcp: reflect SYN queue_mapping into
      SYNACK packets") we had few ways to cope with SYN floods.
      
      We no longer need to reflect incoming skb queue mappings, and instead
      can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
      affinities.
      
      Note that all SYNACK retransmits were picking TX queue 0; this is no
      longer a win given that SYNACK rtx are now distributed on all cpus.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: Scrub skb between namespaces · 740dbc28
      Joe Stringer committed
      If OVS receives a packet from another namespace, then the packet should
      be scrubbed. However, people have already begun to rely on the behaviour
      that skb->mark is preserved across namespaces, so retain this one field.
      
      This is mainly to address information leakage between namespaces when
      using OVS internal ports, but by placing it in ovs_vport_receive() it is
      more generally applicable, meaning it should not be overlooked if other
      port types are allowed to be moved into namespaces in future.
      Signed-off-by: Joe Stringer <joestringer@nicira.com>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
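The "scrub everything except the mark" rule can be sketched as a small Python model. It is only an illustration: apart from `mark`, the packet fields (`priority`, `conntrack`) and the helper names are hypothetical and do not mirror the real skb layout or scrub routine.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Packet:
    payload: bytes
    mark: int = 0                    # the one field people already rely on
    priority: int = 0                # hypothetical per-namespace state
    conntrack: Optional[str] = None  # hypothetical per-namespace state


def scrub(pkt):
    """Return a copy that keeps the payload and resets all other metadata."""
    return Packet(payload=pkt.payload)


def vport_receive(pkt, src_ns, dst_ns):
    """Scrub only when the packet crosses a namespace boundary, keeping mark."""
    if src_ns == dst_ns:
        return pkt
    return replace(scrub(pkt), mark=pkt.mark)
```

A packet that stays inside one namespace is passed through untouched; one that crosses namespaces arrives with default metadata but its original `mark`.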
    • netlink: Trim skb to alloc size to avoid MSG_TRUNC · db65a3aa
      Arad, Ronen committed
      netlink_dump() allocates skb based on the calculated min_dump_alloc or
      a per socket max_recvmsg_len.
      min_alloc_size is maximum space required for any single netdev
      attributes as calculated by rtnl_calcit().
      max_recvmsg_len tracks the user provided buffer to netlink_recvmsg.
      It is capped at 16KiB.
      The intention is to avoid small allocations and to minimize the number
      of calls required to obtain dump information for all net devices.
      
      netlink_dump packs as many small messages as can fit within an skb
      that was sized for the largest single netdev information. The actual
      space available within an skb is larger than what is requested. It can
      be much larger, up to nearly 2x with the align-to-next-power-of-2 approach.

      Allowing netlink_dump to use all the space available within the
      allocated skb increases the buffer size a user has to provide to avoid
      truncation (i.e. MSG_TRUNC flag set).

      It was observed that with many VLANs configured on at least one netdev,
      a larger buffer of nearly 64KiB was necessary to avoid the "Message truncated"
      error in "ip link" or "bridge [-c[ompressvlans]] vlan show" when
      min_alloc_size was only a little over 32KiB.
      
      This patch trims skb to allocated size in order to allow the user to
      avoid truncation with more reasonable buffer size.
      Signed-off-by: Ronen Arad <ronen.arad@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
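The arithmetic behind the 32KiB versus 64KiB observation can be worked through in a few lines of Python. This is a simplified sketch that assumes the allocator rounds the requested size up to the next power of two and ignores skb header overhead; the function names are made up for illustration.

```python
def next_pow2(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def dump_capacity(alloc_size, trim):
    """Bytes a dump may pack into one skb, with or without trimming.

    Assumption: the real buffer is next_pow2(alloc_size); overhead is ignored.
    """
    actual = next_pow2(alloc_size)
    return alloc_size if trim else actual


requested = 33 * 1024  # "a little over 32KiB"
untrimmed = dump_capacity(requested, trim=False)
trimmed = dump_capacity(requested, trim=True)
```

Under these assumptions the untrimmed skb can carry 65536 bytes, so the user buffer must be close to 64KiB, while the trimmed skb carries at most the 33792 bytes that were requested.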
    • ipconfig: send Client-identifier in DHCP requests · 26fb342c
      Li RongQing committed
      A dhcp server may provide parameters to a client from a pool of IP
      addresses and using a shared rootfs, or provide a specific set of
      parameters for a specific client, usually using the MAC address to
      identify each client individually. The dhcp protocol also specifies
      a client-id field which can be used to determine the correct
      parameters to supply when no MAC address is available. There is
      currently no way to tell the kernel to supply a specific client-id;
      only the userspace dhcp clients support this feature, but they can
      not be used when the network is needed before userspace is available,
      such as when the root filesystem is on NFS.

      This patch makes it possible to pass something like
      "ip=dhcp,client_id_type,client_id_value" as a kernel parameter so that
      the kernel can identify itself to the server.
      Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
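How such a parameter splits into its parts can be shown with a short Python sketch. It is not the kernel's parser: `parse_ip_param` is a hypothetical helper, the sample value `node-42` is invented, and the 0..255 range check assumes the client-id type travels as a single octet.

```python
def parse_ip_param(param):
    """Split "ip=dhcp,<type>,<value>" into (type, value).

    Returns None when no client-id is given; raises ValueError on a bad type.
    """
    prefix = "ip="
    if not param.startswith(prefix):
        return None
    # Split at most twice so the value itself may contain commas.
    fields = param[len(prefix):].split(",", 2)
    if fields[0] != "dhcp" or len(fields) != 3:
        return None
    id_type, id_value = fields[1], fields[2]
    if not id_type.isdecimal() or not 0 <= int(id_type) <= 255:
        raise ValueError("client-id type must be an integer in 0..255")
    return int(id_type), id_value
```

For example, `parse_ip_param("ip=dhcp,1,node-42")` yields the type `1` and the value `"node-42"`, while a plain `"ip=dhcp"` yields `None`.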
  3. 17 Oct, 2015 9 commits
  4. 16 Oct, 2015 5 commits
    • rbd: use writefull op for object size writes · e30b7577
      Ilya Dryomov committed
      This covers only the simplest case - an object-sized write, but
      it's still useful in tiering setups when EC is used for the base tier
      as writefull op can be proxied, saving an object promotion.
      
      Even though updating ceph_osdc_new_request() to allow writefull should
      just be a matter of fixing an assert, I didn't do it because its only
      user is cephfs.  All other sites were updated.
      
      Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
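The "simplest case" test can be pictured as a one-line decision. The Python below is only a sketch of that condition with a hypothetical `pick_write_op` helper and plain strings for the op names; it does not reproduce the rbd request code.

```python
def pick_write_op(offset, length, object_size):
    """Choose writefull only when the write covers exactly one whole object."""
    if offset == 0 and length == object_size:
        return "writefull"
    return "write"


OBJECT_SIZE = 4 << 20  # illustrative object size of 4 MiB

whole = pick_write_op(0, OBJECT_SIZE, OBJECT_SIZE)
partial = pick_write_op(4096, 4096, OBJECT_SIZE)
```

Any write that starts inside the object or is shorter than the object stays a regular write in this model.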
    • net: introduce pre-change upper device notifier · 573c7ba0
      Jiri Pirko committed
      This newly introduced netdevice notifier is called before the actual
      upper change happens. That gives notifier handlers a chance to know
      that an upper change is about to happen and to react to it, including
      forbidding the change. That is valuable for drivers which can check if
      the upper device linkage is supported and forbid it if it is not.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
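The veto pattern described above can be sketched as a tiny notifier chain in Python. Everything here is illustrative: the event names, the `NotifierChain` class, the device names, and the `picky_driver` handler are assumptions made for the example, not the kernel's notifier API.

```python
import errno

PRE_CHANGE = "pre_change_upper"  # illustrative event names
CHANGED = "changed_upper"


class NotifierChain:
    def __init__(self):
        self.handlers = []

    def register(self, handler):
        self.handlers.append(handler)

    def call(self, event, dev, upper):
        """Run handlers in order; the first non-zero result stops the chain."""
        for handler in self.handlers:
            err = handler(event, dev, upper)
            if err:
                return err
        return 0


def link_upper(chain, uppers, dev, upper):
    """Ask handlers first; only link dev under upper if nobody objects."""
    err = chain.call(PRE_CHANGE, dev, upper)
    if err:
        return err           # change forbidden, nothing was modified
    uppers[dev] = upper
    chain.call(CHANGED, dev, upper)
    return 0


def picky_driver(event, dev, upper):
    """Hypothetical driver that cannot offload bonding on its port."""
    if event == PRE_CHANGE and dev == "sw0p1" and upper.startswith("bond"):
        return -errno.EOPNOTSUPP
    return 0
```

Because the check runs before any state is touched, a refused linkage needs no rollback in this model.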
    • net: Fix suspicious RCU usage in fib_rebalance · 51161aa9
      David Ahern committed
      This command:
        ip route add 192.168.1.0/24 nexthop via 10.2.1.5 dev eth1 nexthop via 10.2.2.5 dev eth2
      
      generated this suspicious RCU usage message:
      
      [ 63.249262]
      [ 63.249939] ===============================
      [ 63.251571] [ INFO: suspicious RCU usage. ]
      [ 63.253250] 4.3.0-rc3+ #298 Not tainted
      [ 63.254724] -------------------------------
      [ 63.256401] ../include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage!
      [ 63.259450]
      [ 63.259450] other info that might help us debug this:
      [ 63.259450]
      [ 63.262297]
      [ 63.262297] rcu_scheduler_active = 1, debug_locks = 1
      [ 63.264647] 1 lock held by ip/2870:
      [ 63.265896] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff813ebfb7>] rtnl_lock+0x12/0x14
      [ 63.268858]
      [ 63.268858] stack backtrace:
      [ 63.270409] CPU: 4 PID: 2870 Comm: ip Not tainted 4.3.0-rc3+ #298
      [ 63.272478] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [ 63.275745] 0000000000000001 ffff8800b8c9f8b8 ffffffff8125f73c ffff88013afcf301
      [ 63.278185] ffff8800bab7a380 ffff8800b8c9f8e8 ffffffff8107bf30 ffff8800bb728000
      [ 63.280634] ffff880139fe9a60 0000000000000000 ffff880139fe9a00 ffff8800b8c9f908
      [ 63.283177] Call Trace:
      [ 63.283959] [<ffffffff8125f73c>] dump_stack+0x4c/0x68
      [ 63.285593] [<ffffffff8107bf30>] lockdep_rcu_suspicious+0xfa/0x103
      [ 63.287500] [<ffffffff8144d752>] __in_dev_get_rcu+0x48/0x4f
      [ 63.289169] [<ffffffff8144d797>] fib_rebalance+0x3e/0x127
      [ 63.290753] [<ffffffff8144d986>] ? rcu_read_unlock+0x3e/0x5f
      [ 63.292442] [<ffffffff8144ea45>] fib_create_info+0xaf9/0xdcc
      [ 63.294093] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
      [ 63.295791] [<ffffffff8145236a>] fib_table_insert+0x8c/0x451
      [ 63.297493] [<ffffffff8144bf9c>] ? fib_get_table+0x36/0x43
      [ 63.299109] [<ffffffff8144c3ca>] inet_rtm_newroute+0x43/0x51
      [ 63.300709] [<ffffffff813ef684>] rtnetlink_rcv_msg+0x182/0x195
      [ 63.302334] [<ffffffff8107d04c>] ? trace_hardirqs_on+0xd/0xf
      [ 63.303888] [<ffffffff813ebfb7>] ? rtnl_lock+0x12/0x14
      [ 63.305346] [<ffffffff813ef502>] ? __rtnl_unlock+0x12/0x12
      [ 63.306878] [<ffffffff81407c4c>] netlink_rcv_skb+0x3d/0x90
      [ 63.308437] [<ffffffff813ec00e>] rtnetlink_rcv+0x21/0x28
      [ 63.309916] [<ffffffff81407742>] netlink_unicast+0xfa/0x17f
      [ 63.311447] [<ffffffff81407a5e>] netlink_sendmsg+0x297/0x2dc
      [ 63.313029] [<ffffffff813c6cd4>] sock_sendmsg_nosec+0x12/0x1d
      [ 63.314597] [<ffffffff813c835b>] ___sys_sendmsg+0x196/0x21b
      [ 63.316125] [<ffffffff8100bf9f>] ? native_sched_clock+0x1f/0x3c
      [ 63.317671] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
      [ 63.319185] [<ffffffff8106c397>] ? sched_clock_cpu+0x9d/0xb6
      [ 63.320693] [<ffffffff8107e2d7>] ? __lock_is_held+0x32/0x54
      [ 63.322145] [<ffffffff81159fcb>] ? __fget_light+0x4b/0x77
      [ 63.323541] [<ffffffff813c8726>] __sys_sendmsg+0x3d/0x5b
      [ 63.324947] [<ffffffff813c8751>] SyS_sendmsg+0xd/0x19
      [ 63.326274] [<ffffffff814c8f57>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      It looks like all of the code paths to fib_rebalance are under rtnl.
      
      Fixes: 0e884c78 ("ipv4: L3 hash-based multipath")
      Cc: Peter Nørlund <pch@ordbogen.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp/dccp: fix race at listener dismantle phase · ebb516af
      Eric Dumazet committed
      Under stress, a close() on a listener can trigger the
      WARN_ON(sk->sk_ack_backlog) in inet_csk_listen_stop()
      
      We need to test if listener is still active before queueing
      a child in inet_csk_reqsk_queue_add()
      
      Create a common inet_child_forget() helper, and use it
      from inet_csk_reqsk_queue_add() and inet_csk_listen_stop()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
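The "test if the listener is still active before queueing" idea can be modelled in Python as follows. This is a simplified sketch, not the kernel implementation: the `Listener` class, its fields, and the shortened function names are illustrative, and a single lock stands in for whatever synchronization the real code uses.

```python
import threading


class Listener:
    def __init__(self):
        self.lock = threading.Lock()
        self.listening = True
        self.accept_queue = []
        self.forgotten = []  # children that were disposed of


def child_forget(listener, child):
    """Common helper: dispose of a child that will never be accepted."""
    listener.forgotten.append(child)


def reqsk_queue_add(listener, child):
    """Queue a child only if the listener is still active."""
    with listener.lock:
        if not listener.listening:
            child_forget(listener, child)
            return False
        listener.accept_queue.append(child)
        return True


def listen_stop(listener):
    """Dismantle: stop accepting, then forget everything still queued."""
    with listener.lock:
        listener.listening = False
        pending, listener.accept_queue = listener.accept_queue, []
    for child in pending:
        child_forget(listener, child)
```

A child that arrives after `listen_stop()` is forgotten immediately instead of being left on a queue nobody will drain, so the backlog stays empty.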
    • tcp/dccp: add inet_csk_reqsk_queue_drop_and_put() helper · f03f2e15
      Eric Dumazet committed
      Let's reduce the confusion about inet_csk_reqsk_queue_drop():
      in many cases we also need to release the reference on the request
      socket, so add a helper to do this, reducing code size and complexity.
      
      Fixes: 4bdc3d66 ("tcp/dccp: fix behavior of stale SYN_RECV request sockets")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
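The shape of such a combined helper can be sketched with a toy reference count in Python. The model rests on stated assumptions: a request socket starts with two references (one owned by the lookup table, one by the caller), and the shortened function names are illustrative rather than the kernel's.

```python
class RequestSock:
    def __init__(self):
        self.refcnt = 2     # assumption: one ref for the table, one for the caller
        self.hashed = True  # still reachable through the table


def reqsk_put(req):
    """Release one reference."""
    req.refcnt -= 1


def reqsk_queue_drop(req):
    """Unlink from the table and release the table's reference."""
    if req.hashed:
        req.hashed = False
        reqsk_put(req)


def reqsk_queue_drop_and_put(req):
    """Drop from the table, then release the caller's own reference too."""
    reqsk_queue_drop(req)
    reqsk_put(req)
```

Call sites that previously paired a drop with a separate put can make one call instead, which is where the reduction in code size comes from.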