1. 21 10月, 2015 6 次提交
  2. 19 10月, 2015 5 次提交
    • S
      RDS: fix rds-ping deadlock over TCP transport · 7b4b0009
      santosh.shilimkar@oracle.com 提交于
      Sowmini found hang with rds-ping while testing RDS over TCP. Its
      a corner case and doesn't happen always. The issue is not reproducible
      with IB transport. Its clear from below dump why we see it with RDS TCP.
      
       [<ffffffff8153b7e5>] do_tcp_setsockopt+0xb5/0x740
       [<ffffffff8153bec4>] tcp_setsockopt+0x24/0x30
       [<ffffffff814d57d4>] sock_common_setsockopt+0x14/0x20
       [<ffffffffa096071d>] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp]
       [<ffffffffa093b5f7>] rds_send_xmit+0xd7/0x740 [rds]
       [<ffffffffa093bda2>] rds_send_pong+0x142/0x180 [rds]
       [<ffffffffa0939d34>] rds_recv_incoming+0x274/0x330 [rds]
       [<ffffffff810815ae>] ? ttwu_queue+0x11e/0x130
       [<ffffffff814dcacd>] ? skb_copy_bits+0x6d/0x2c0
       [<ffffffffa0960350>] rds_tcp_data_recv+0x2f0/0x3d0 [rds_tcp]
       [<ffffffff8153d836>] tcp_read_sock+0x96/0x1c0
       [<ffffffffa0960060>] ? rds_tcp_recv_init+0x40/0x40 [rds_tcp]
       [<ffffffff814d6a90>] ? sock_def_write_space+0xa0/0xa0
       [<ffffffffa09604d1>] rds_tcp_data_ready+0xa1/0xf0 [rds_tcp]
       [<ffffffff81545249>] tcp_data_queue+0x379/0x5b0
       [<ffffffffa0960cdb>] ? rds_tcp_write_space+0xbb/0x110 [rds_tcp]
       [<ffffffff81547fd2>] tcp_rcv_established+0x2e2/0x6e0
       [<ffffffff81552602>] tcp_v4_do_rcv+0x122/0x220
       [<ffffffff81553627>] tcp_v4_rcv+0x867/0x880
       [<ffffffff8152e0b3>] ip_local_deliver_finish+0xa3/0x220
      
      This happens because rds_send_xmit() chain wants to take
      sock_lock which is already taken by tcp_v4_rcv() on its
      way to rds_tcp_data_ready(). Commit db6526dc ("RDS: use
      rds_send_xmit() state instead of RDS_LL_SEND_FULL") which
      was trying to opportunistically finish the send request
      in same thread context.
      
      But because of above recursive lock hang with RDS TCP,
      the send work from rds_send_pong() needs to deferred to
      worker to avoid lock up. Given RDS ping is more of connectivity
      test than performance critical path, its should be ok even
      for transport like IB.
      Reported-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NSantosh Shilimkar <ssantosh@kernel.org>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Acked-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7b4b0009
    • E
      tcp: do not set queue_mapping on SYNACK · dc6ef6be
      Eric Dumazet 提交于
      At the time of commit fff32699 ("tcp: reflect SYN queue_mapping into
      SYNACK packets") we had little ways to cope with SYN floods.
      
      We no longer need to reflect incoming skb queue mappings, and instead
      can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
      affinities.
      
      Note that all SYNACK retransmits were picking TX queue 0, this no longer
      is a win given that SYNACK rtx are now distributed on all cpus.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dc6ef6be
    • J
      openvswitch: Scrub skb between namespaces · 740dbc28
      Joe Stringer 提交于
      If OVS receives a packet from another namespace, then the packet should
      be scrubbed. However, people have already begun to rely on the behaviour
      that skb->mark is preserved across namespaces, so retain this one field.
      
      This is mainly to address information leakage between namespaces when
      using OVS internal ports, but by placing it in ovs_vport_receive() it is
      more generally applicable, meaning it should not be overlooked if other
      port types are allowed to be moved into namespaces in future.
      Signed-off-by: NJoe Stringer <joestringer@nicira.com>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      740dbc28
    • A
      netlink: Trim skb to alloc size to avoid MSG_TRUNC · db65a3aa
      Arad, Ronen 提交于
      netlink_dump() allocates skb based on the calculated min_dump_alloc or
      a per socket max_recvmsg_len.
      min_alloc_size is maximum space required for any single netdev
      attributes as calculated by rtnl_calcit().
      max_recvmsg_len tracks the user provided buffer to netlink_recvmsg.
      It is capped at 16KiB.
      The intention is to avoid small allocations and to minimize the number
      of calls required to obtain dump information for all net devices.
      
      netlink_dump packs as many small messages as could fit within an skb
      that was sized for the largest single netdev information. The actual
      space available within an skb is larger than what is requested. It could
      be much larger and up to near 2x with align to next power of 2 approach.
      
      Allowing netlink_dump to use all the space available within the
      allocated skb increases the buffer size a user has to provide to avoid
      truncaion (i.e. MSG_TRUNG flag set).
      
      It was observed that with many VLANs configured on at least one netdev,
      a larger buffer of near 64KiB was necessary to avoid "Message truncated"
      error in "ip link" or "bridge [-c[ompressvlans]] vlan show" when
      min_alloc_size was only little over 32KiB.
      
      This patch trims skb to allocated size in order to allow the user to
      avoid truncation with more reasonable buffer size.
      Signed-off-by: NRonen Arad <ronen.arad@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      db65a3aa
    • L
      ipconfig: send Client-identifier in DHCP requests · 26fb342c
      Li RongQing 提交于
      A dhcp server may provide parameters to a client from a pool of IP
      addresses and using a shared rootfs, or provide a specific set of
      parameters for a specific client, usually using the MAC address to
      identify each client individually. The dhcp protocol also specifies
      a client-id field which can be used to determine the correct
      parameters to supply when no MAC address is available. There is
      currently no way to tell the kernel to supply a specific client-id,
      only the userspace dhcp clients support this feature, but this can
      not be used when the network is needed before userspace is available
      such as when the root filesystem is on NFS.
      
      This patch is to be able to do something like "ip=dhcp,client_id_type,
      client_id_value", as a kernel parameter to enable the kernel to
      identify itself to the server.
      Signed-off-by: NLi RongQing <roy.qing.li@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26fb342c
  3. 17 10月, 2015 9 次提交
  4. 16 10月, 2015 20 次提交
    • I
      rbd: use writefull op for object size writes · e30b7577
      Ilya Dryomov 提交于
      This covers only the simplest case - an object size sized write, but
      it's still useful in tiering setups when EC is used for the base tier
      as writefull op can be proxied, saving an object promotion.
      
      Even though updating ceph_osdc_new_request() to allow writefull should
      just be a matter of fixing an assert, I didn't do it because its only
      user is cephfs.  All other sites were updated.
      
      Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      e30b7577
    • J
      net: introduce pre-change upper device notifier · 573c7ba0
      Jiri Pirko 提交于
      This newly introduced netdevice notifier is called before actual change
      upper happens. That provides a possibility for notifier handlers to
      know upper change will happen and react to it, including possibility to
      forbid the change. That is valuable for drivers which can check if the
      upper device linkage is supported and forbid that in case it is not.
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      573c7ba0
    • D
      net: Fix suspicious RCU usage in fib_rebalance · 51161aa9
      David Ahern 提交于
      This command:
        ip route add 192.168.1.0/24 nexthop via 10.2.1.5 dev eth1 nexthop via 10.2.2.5 dev eth2
      
      generated this suspicious RCU usage message:
      
      [ 63.249262]
      [ 63.249939] ===============================
      [ 63.251571] [ INFO: suspicious RCU usage. ]
      [ 63.253250] 4.3.0-rc3+ #298 Not tainted
      [ 63.254724] -------------------------------
      [ 63.256401] ../include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage!
      [ 63.259450]
      [ 63.259450] other info that might help us debug this:
      [ 63.259450]
      [ 63.262297]
      [ 63.262297] rcu_scheduler_active = 1, debug_locks = 1
      [ 63.264647] 1 lock held by ip/2870:
      [ 63.265896] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff813ebfb7>] rtnl_lock+0x12/0x14
      [ 63.268858]
      [ 63.268858] stack backtrace:
      [ 63.270409] CPU: 4 PID: 2870 Comm: ip Not tainted 4.3.0-rc3+ #298
      [ 63.272478] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [ 63.275745] 0000000000000001 ffff8800b8c9f8b8 ffffffff8125f73c ffff88013afcf301
      [ 63.278185] ffff8800bab7a380 ffff8800b8c9f8e8 ffffffff8107bf30 ffff8800bb728000
      [ 63.280634] ffff880139fe9a60 0000000000000000 ffff880139fe9a00 ffff8800b8c9f908
      [ 63.283177] Call Trace:
      [ 63.283959] [<ffffffff8125f73c>] dump_stack+0x4c/0x68
      [ 63.285593] [<ffffffff8107bf30>] lockdep_rcu_suspicious+0xfa/0x103
      [ 63.287500] [<ffffffff8144d752>] __in_dev_get_rcu+0x48/0x4f
      [ 63.289169] [<ffffffff8144d797>] fib_rebalance+0x3e/0x127
      [ 63.290753] [<ffffffff8144d986>] ? rcu_read_unlock+0x3e/0x5f
      [ 63.292442] [<ffffffff8144ea45>] fib_create_info+0xaf9/0xdcc
      [ 63.294093] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
      [ 63.295791] [<ffffffff8145236a>] fib_table_insert+0x8c/0x451
      [ 63.297493] [<ffffffff8144bf9c>] ? fib_get_table+0x36/0x43
      [ 63.299109] [<ffffffff8144c3ca>] inet_rtm_newroute+0x43/0x51
      [ 63.300709] [<ffffffff813ef684>] rtnetlink_rcv_msg+0x182/0x195
      [ 63.302334] [<ffffffff8107d04c>] ? trace_hardirqs_on+0xd/0xf
      [ 63.303888] [<ffffffff813ebfb7>] ? rtnl_lock+0x12/0x14
      [ 63.305346] [<ffffffff813ef502>] ? __rtnl_unlock+0x12/0x12
      [ 63.306878] [<ffffffff81407c4c>] netlink_rcv_skb+0x3d/0x90
      [ 63.308437] [<ffffffff813ec00e>] rtnetlink_rcv+0x21/0x28
      [ 63.309916] [<ffffffff81407742>] netlink_unicast+0xfa/0x17f
      [ 63.311447] [<ffffffff81407a5e>] netlink_sendmsg+0x297/0x2dc
      [ 63.313029] [<ffffffff813c6cd4>] sock_sendmsg_nosec+0x12/0x1d
      [ 63.314597] [<ffffffff813c835b>] ___sys_sendmsg+0x196/0x21b
      [ 63.316125] [<ffffffff8100bf9f>] ? native_sched_clock+0x1f/0x3c
      [ 63.317671] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
      [ 63.319185] [<ffffffff8106c397>] ? sched_clock_cpu+0x9d/0xb6
      [ 63.320693] [<ffffffff8107e2d7>] ? __lock_is_held+0x32/0x54
      [ 63.322145] [<ffffffff81159fcb>] ? __fget_light+0x4b/0x77
      [ 63.323541] [<ffffffff813c8726>] __sys_sendmsg+0x3d/0x5b
      [ 63.324947] [<ffffffff813c8751>] SyS_sendmsg+0xd/0x19
      [ 63.326274] [<ffffffff814c8f57>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      It looks like all of the code paths to fib_rebalance are under rtnl.
      
      Fixes: 0e884c78 ("ipv4: L3 hash-based multipath")
      Cc: Peter Nørlund <pch@ordbogen.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51161aa9
    • E
      tcp/dccp: fix race at listener dismantle phase · ebb516af
      Eric Dumazet 提交于
      Under stress, a close() on a listener can trigger the
      WARN_ON(sk->sk_ack_backlog) in inet_csk_listen_stop()
      
      We need to test if listener is still active before queueing
      a child in inet_csk_reqsk_queue_add()
      
      Create a common inet_child_forget() helper, and use it
      from inet_csk_reqsk_queue_add() and inet_csk_listen_stop()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ebb516af
    • E
      tcp/dccp: add inet_csk_reqsk_queue_drop_and_put() helper · f03f2e15
      Eric Dumazet 提交于
      Let's reduce the confusion about inet_csk_reqsk_queue_drop() :
      In many cases we also need to release reference on request socket,
      so add a helper to do this, reducing code size and complexity.
      
      Fixes: 4bdc3d66 ("tcp/dccp: fix behavior of stale SYN_RECV request sockets")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f03f2e15
    • E
      Revert "inet: fix double request socket freeing" · ef84d8ce
      Eric Dumazet 提交于
      This reverts commit c6973669.
      
      At the time of above commit, tcp_req_err() and dccp_req_err()
      were dead code, as SYN_RECV request sockets were not yet in ehash table.
      
      Real bug was fixed later in a different commit.
      
      We need to revert to not leak a refcount on request socket.
      
      inet_csk_reqsk_queue_drop_and_put() will be added
      in following commit to make clean inet_csk_reqsk_queue_drop()
      does not release the reference owned by caller.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ef84d8ce
    • M
      ipv6: Initialize rt6_info properly in ip6_blackhole_route() · 0a1f5962
      Martin KaFai Lau 提交于
      ip6_blackhole_route() does not initialize the newly allocated
      rt6_info properly.  This patch:
      1. Call rt6_info_init() to initialize rt6i_siblings and rt6i_uncached
      
      2. The current rt->dst._metrics init code is incorrect:
         - 'rt->dst._metrics = ort->dst._metris' is not always safe
         - Not sure what dst_copy_metrics() is trying to do here
           considering ip6_rt_blackhole_cow_metrics() always returns
           NULL
      
         Fix:
         - Always do dst_copy_metrics()
         - Replace ip6_rt_blackhole_cow_metrics() with
           dst_cow_metrics_generic()
      
      3. Mask out the RTF_PCPU bit from the newly allocated blackhole route.
         This bug triggers an oops (reported by Phil Sutter) in rt6_get_cookie().
         It is because RTF_PCPU is set while rt->dst.from is NULL.
      
      Fixes: d52d3997 ("ipv6: Create percpu rt6_info")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Reported-by: NPhil Sutter <phil@nwl.cc>
      Tested-by: NPhil Sutter <phil@nwl.cc>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Phil Sutter <phil@nwl.cc>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a1f5962
    • M
      ipv6: Move common init code for rt6_info to a new function rt6_info_init() · ebfa45f0
      Martin KaFai Lau 提交于
      Introduce rt6_info_init() to do the common init work for
      'struct rt6_info' (after calling dst_alloc).
      
      It is a prep work to fix the rt6_info init logic in the
      ip6_blackhole_route().
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Phil Sutter <phil@nwl.cc>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ebfa45f0
    • J
      Bluetooth: Fix initializing conn_params in scan phase · 5157b8a5
      Jakub Pawlowski 提交于
      This patch makes sure that conn_params that were created just for
      explicit_connect, will get properly deleted during cleanup.
      Signed-off-by: NJakub Pawlowski <jpawlowski@google.com>
      Acked-by: NJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      5157b8a5
    • J
      Bluetooth: Fix conn_params list update in hci_connect_le_scan_cleanup · 9ad3e6ff
      Johan Hedberg 提交于
      After clearing the params->explicit_connect variable the parameters
      may need to be either added back to the right list or potentially left
      absent from both the le_reports and the le_conns lists.
      Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      9ad3e6ff
    • J
      Bluetooth: Fix remove_device behavior for explicit connects · 679d2b6f
      Johan Hedberg 提交于
      Devices undergoing an explicit connect should not have their
      conn_params struct removed by the mgmt Remove Device command. This
      patch fixes the necessary checks in the command handler to correct the
      behavior.
      Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      679d2b6f
    • J
      Bluetooth: Fix LE reconnection logic · 49c50922
      Johan Hedberg 提交于
      We can't use hci_explicit_connect_lookup() since that would only cover
      explicit connections, leaving normal reconnections completely
      untouched. Not using it in turn means leaving out entries in
      pend_le_reports.
      
      To fix this and simplify the logic move conn params from the reports
      list to the pend_le_conns list for the duration of an explicit
      connect. Once the connect is complete move the params back to the
      pend_le_reports list. This also means that the explicit connect lookup
      function only needs to look into the pend_le_conns list.
      Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      49c50922
    • J
      Bluetooth: Fix reference counting for LE-scan based connections · b958f9a3
      Johan Hedberg 提交于
      The code should never directly call hci_conn_hash_del since many
      cleanup & reference counting updates would be lost. Normally
      hci_conn_del is the right thing to do, but in the case of a connection
      doing LE scanning this could cause a deadlock due to doing a
      cancel_delayed_work_sync() on the same work callback that we were
      called from.
      
      Connections in the LE scanning state actually need very little cleanup
      - just a small subset of hci_conn_del. To solve the issue, refactor
      out these essential pieces into a new hci_conn_cleanup() function and
      call that from the two necessary places.
      Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      b958f9a3
    • J
      Bluetooth: Fix double scan updates · 168b8a25
      Jakub Pawlowski 提交于
      When disable/enable scan command is issued twice, some controllers
      will return an error for the second request, i.e. requests with this
      command will fail on some controllers, and succeed on others.
      
      This patch makes sure that unnecessary scan disable/enable commands
      are not issued.
      
      When adding device to the auto connect whitelist when there is pending
      connect attempt, there is no need to update scan.
      
      hci_connect_le_scan_cleanup is conditionally executing
      hci_conn_params_del, that is calling hci_update_background_scan. Make
      the other case also update scan, and remove reduntand call from
      hci_connect_le_scan_remove.
      
      When stopping interleaved discovery the state should be set to stopped
      only when both LE scanning and discovery has stopped.
      Signed-off-by: NJakub Pawlowski <jpawlowski@google.com>
      Acked-by: NJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      168b8a25
    • J
      tipc: update node FSM when peer RESET message is received · c8199300
      Jon Paul Maloy 提交于
      The change made in the previous commit revealed a small flaw in the way
      the node FSM is updated. When the function tipc_node_link_down() is
      called for the last link to a node, we should check whether this was
      caused by a local reset or by a received RESET message from the peer.
      In the latter case, we can directly issue a PEER_LOST_CONTACT_EVT to
      the node FSM, so that it is ready to re-establish contact. If this is
      not done, the peer node will sometimes have to go through a second
      establish cycle before the link becomes stable.
      
      We fix this in this commit by conditionally issuing the mentioned
      event in the function tipc_node_link_down(). We also move LINK_RESET
      FSM even away from the link_reset() function and into the caller
      function, partially because it is easier to follow the code when state
      changes are gathered at a limited number of locations, partially
      because there will be cases in future commits where we don't want the
      link to go RESET mode when link_reset() is called.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c8199300
    • J
      tipc: send out RESET immediately when link goes down · 282b3a05
      Jon Paul Maloy 提交于
      When a link is taken down because of a node local event, such as
      disabling of a bearer or an interface, we currently leave it to the
      peer node to discover the broken communication. The default time for
      such failure discovery is 1.5-2 seconds.
      
      If we instead allow the terminating link endpoint to send out a RESET
      message at the moment it is reset, we can achieve the impression that
      both endpoints are going down instantly. Since this is a very common
      scenario, we find it worthwhile to make this small modification.
      
      Apart from letting the link produce the said message, we also have to
      ensure that the interface is able to transmit it before TIPC is
      detached. We do this by performing the disabling of a bearer in three
      steps:
      
      1) Disable reception of TIPC packets from the interface in question.
      2) Take down the links, while allowing them so send out a RESET message.
      3) Disable transmission of TIPC packets on the interface.
      
      Apart from this, we now have to react on the NETDEV_GOING_DOWN event,
      instead of as currently the NEDEV_DOWN event, to ensure that such
      transmission is possible during the teardown phase.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      282b3a05
    • J
      tipc: delay ESTABLISH state event when link is established · 73f646ce
      Jon Paul Maloy 提交于
      Link establishing, just like link teardown, is a non-atomic action, in
      the sense that discovering that conditions are right to establish a link,
      and the actual adding of the link to one of the node's send slots is done
      in two different lock contexts. The link FSM is designed to help bridging
      the gap between the two contexts in a safe manner.
      
      We have now discovered a weakness in the implementaton of this FSM.
      Because we directly let the link go from state LINK_ESTABLISHING to
      state LINK_ESTABLISHED already in the first lock context, we are unable
      to distinguish between a fully established link, i.e., a link that has
      been added to its slot, and a link that has not yet reached the second
      lock context. It may hence happen that a manual intervention, e.g., when
      disabling an interface, causes the function tipc_node_link_down() to try
      removing the link from the node slots, decrementing its active link
      counter etc, although the link was never added there in the first place.
      
      We solve this by delaying the actual state change until we reach the
      second lock context, inside the function tipc_node_link_up(). This
      makes it possible for potentail callers of __tipc_node_link_down() to
      know if they should proceed or not, and the problem is solved.
      
      Unforunately, the situation described above also has a second problem.
      Since there by necessity is a tipc_node_link_up() call pending once
      the node lock has been released, we must defuse that call by setting
      the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
      us to make a slight modification to the link FSM, which will now look
      as follows.
      
       +------------------------------------+
       |RESET_EVT                           |
       |                                    |
       |                             +--------------+
       |           +-----------------|   SYNCHING   |-----------------+
       |           |FAILURE_EVT      +--------------+   PEER_RESET_EVT|
       |           |                  A            |                  |
       |           |                  |            |                  |
       |           |                  |            |                  |
       |           |                  |SYNCH_      |SYNCH_            |
       |           |                  |BEGIN_EVT   |END_EVT           |
       |           |                  |            |                  |
       |           V                  |            V                  V
       |    +-------------+          +--------------+          +------------+
       |    |  RESETTING  |<---------|  ESTABLISHED |--------->| PEER_RESET |
       |    +-------------+ FAILURE_ +--------------+ PEER_    +------------+
       |           |        EVT        |    A         RESET_EVT       |
       |           |                   |    |                         |
       |           |  +----------------+    |                         |
       |  RESET_EVT|  |RESET_EVT            |                         |
       |           |  |                     |                         |
       |           |  |                     |ESTABLISH_EVT            |
       |           |  |  +-------------+    |                         |
       |           |  |  | RESET_EVT   |    |                         |
       |           |  |  |             |    |                         |
       |           V  V  V             |    |                         |
       |    +-------------+          +--------------+        RESET_EVT|
       +--->|    RESET    |--------->| ESTABLISHING |<----------------+
            +-------------+ PEER_    +--------------+
             |           A  RESET_EVT       |
             |           |                  |
             |           |                  |
             |FAILOVER_  |FAILOVER_         |FAILOVER_
             |BEGIN_EVT  |END_EVT           |BEGIN_EVT
             |           |                  |
             V           |                  |
            +-------------+                 |
            | FAILINGOVER |<----------------+
            +-------------+
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73f646ce
    • J
      tipc: disallow packet duplicates in link deferred queue · 8306f99a
      Jon Paul Maloy 提交于
      After the previous commits, we are guaranteed that no packets
      of type LINK_PROTOCOL or with illegal sequence numbers will be
      attempted added to the link deferred queue. This makes it possible to
      make some simplifications to the sorting algorithm in the function
      tipc_skb_queue_sorted().
      
      We also alter the function so that it will drop packets if one with
      the same seqeunce number is already present in the queue. This is
      necessary because we have identified weird packet sequences, involving
      duplicate packets, where a legitimate in-sequence packet may advance to
      the head of the queue without being detected and de-queued.
      
      Finally, we make this function outline, since it will now be called only
      in exceptional cases.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8306f99a
    • J
      tipc: improve sequence number checking · 81204c49
      Jon Paul Maloy 提交于
      The sequence number of an incoming packet is currently only checked
      for less than, equality to, or bigger than the next expected number,
      meaning that the receive window in practice becomes one half sequence
      number cycle, or U16_MAX/2. This does not make sense, and may not even
      be safe if there are extreme delays in the network. Any packet sent by
      the peer during the ongoing cycle must belong inside his current send
      window, or should otherwise be dropped if possible.
      
      Since a link endpoint cannot know its peer's current send window, it
      has to base this sanity check on a worst-case assumption, i.e., that
      the peer is using a maximum sized window of 8191 packets. Using this
      assumption, we now add a check that the sequence number is not bigger
      than next_expected + TIPC_MAX_LINK_WIN. We also re-order the checks
      done, so that the receive window test is performed before the gap test.
      This way, we are guaranteed that no packet with illegal sequence numbers
      are ever added to the deferred queue.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      81204c49
    • J
      tipc: simplify tipc_link_rcv() reception loop · f9aa358a
      Jon Paul Maloy 提交于
      Currently, all packets received in tipc_link_rcv() are unconditionally
      added to the packet deferred queue, whereafter that queue is walked and
      all its buffers evaluated for delivery. This is both non-optimal and
      and makes the queue sorting function unnecessary complex.
      
      This commit changes the loop so that an arrived packet is evaluated
      first, and added to the deferred queue only when a sequence number gap
      is discovered. A non-empty deferred queue is walked until it is empty
      or until its head's sequence number doesn't fit.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f9aa358a