1. 06 Mar, 2014 1 commit
    • net: fix for a race condition in the inet frag code · 24b9bf43
      Nikolay Aleksandrov committed
      I stumbled upon this very serious bug while hunting for another one,
      it's a very subtle race condition between inet_frag_evictor,
      inet_frag_intern and the IPv4/6 frag_queue and expire functions
      (basically the users of inet_frag_kill/inet_frag_put).
      
      What happens is that after a fragment has been added to the hash chain
      in inet_frag_intern, but before it has been added to the lru_list
      (inet_frag_lru_add), it may get deleted, either by an expired timer (if
      the system load is high or the timer sufficiently low) or by the
      frag_queue function for different reasons. Once the freed fragment is
      then added to the lru_list, it is only a matter of time until the
      evictor reaches that freed memory, leading to a number of different
      bugs depending on what's left there.
      
      I've been able to trigger this on both IPv4 and IPv6 (which is normal
      as the frag code is the same), but it's been much more difficult to
      trigger on IPv4 due to the protocol differences about how fragments
      are treated.
      
      The setup I used to reproduce this is: 2 machines with 4 x 10G bonded
      in a RR bond, so the same flow can be seen on multiple cards at the
      same time. Then I used multiple instances of ping/ping6 to generate
      fragmented packets and flood the machines with them while running
      other processes to load the attacked machine.
      
      It is very important to have the _same flow_ coming in on multiple CPUs
      concurrently. Usually the attacked machine would die in less than 30
      minutes; if configured properly to have many evictor calls and timeouts,
      it could happen in 10 minutes or so.
      
      An important point to make is that any caller (frag_queue or timer) of
      inet_frag_kill will remove both the timer refcount and the
      original/guarding refcount thus removing everything that's keeping the
      frag from being freed at the next inet_frag_put.  All of this could
      happen before the frag was ever added to the LRU list, then it gets
      added and the evictor uses a freed fragment.
      
      An example for IPv6: a fragment is being added and has just been
      inserted in the hash, the hash lock has been released, but
      inet_frag_lru_add has not yet executed (or has not yet obtained the
      lru lock). Another overlapping fragment for the same flow arrives at a
      different CPU, which finds the queue in the hash. Since the fragment
      is overlapping, that CPU drops the queue by invoking inet_frag_kill,
      removing all guarding refcounts, and then frees it by invoking
      inet_frag_put, which removes the last refcount added previously by
      inet_frag_find. Then inet_frag_lru_add gets executed by
      inet_frag_intern and we have a freed fragment in the lru_list.
      
      The fix is simple: move the lru_add under the hash chain locked
      region so when a removing function is called it'll have to wait for
      the fragment to be added to the lru_list, and then it'll remove it (it
      works because the hash chain removal is done before the lru_list one
      and there's no window between the two list adds when the frag can get
      dropped). With this fix applied I couldn't kill the same machine in 24
      hours with the same setup.
      
      Fixes: 3ef0eb0d ("net: frag, move LRU list maintenance outside of
      rwlock")
      
      CC: Florian Westphal <fw@strlen.de>
      CC: Jesper Dangaard Brouer <brouer@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
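The interleaving described in the inet frag commit above can be replayed deterministically on a single thread. The following is a toy sketch, not kernel code: `struct frag_model`, `kill_and_put()` and `lru_holds_freed_frag()` are hypothetical names, and the three booleans only stand in for hash-chain membership, lru_list membership and the freed state.

```c
#include <stdbool.h>

/* Toy model of one fragment queue; all names here are illustrative. */
struct frag_model {
    bool in_hash;   /* linked into the hash chain */
    bool in_lru;    /* linked into the lru_list */
    bool freed;     /* memory has been released */
};

/* Models inet_frag_kill() followed by the final inet_frag_put():
 * unlink from whatever lists the queue is on right now, then free it. */
static void kill_and_put(struct frag_model *q)
{
    q->in_hash = false;
    q->in_lru = false;
    q->freed = true;
}

/* Replays the two-CPU interleaving in a fixed order.
 * Returns true if the lru_list ends up holding a freed queue. */
bool lru_holds_freed_frag(bool lru_add_under_hash_lock)
{
    struct frag_model q = { false, false, false };

    /* CPU A: inet_frag_intern(), hash chain lock held */
    q.in_hash = true;
    if (lru_add_under_hash_lock)
        q.in_lru = true;    /* fixed ordering: both adds in one locked region */
    /* CPU A: hash chain lock released */

    /* CPU B: finds the queue in the hash and drops it */
    kill_and_put(&q);

    /* CPU A: the old ordering adds to the LRU only after the unlock */
    if (!lru_add_under_hash_lock)
        q.in_lru = true;

    return q.in_lru && q.freed;
}
```

With the old ordering the function reports a freed queue left on the LRU; with the lru add moved into the locked region, the removing side sees the queue on both lists and unlinks it from both.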
  2. 04 Mar, 2014 4 commits
    • net: sctp: fix sctp_sf_do_5_1D_ce to verify if we/peer is AUTH capable · ec0223ec
      Daniel Borkmann committed
      RFC4895 introduced AUTH chunks for SCTP; during the SCTP
      handshake RANDOM; CHUNKS; HMAC-ALGO are negotiated (CHUNKS
      being optional though):
      
        ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
        <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
      
      A special case is when an endpoint requires COOKIE-ECHO
      chunks to be authenticated:
      
        ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
        <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
        ------------------ AUTH; COOKIE-ECHO ---------------->
        <-------------------- COOKIE-ACK ---------------------
      
      RFC4895, section 6.3. Receiving Authenticated Chunks says:
      
        The receiver MUST use the HMAC algorithm indicated in
        the HMAC Identifier field. If this algorithm was not
        specified by the receiver in the HMAC-ALGO parameter in
        the INIT or INIT-ACK chunk during association setup, the
        AUTH chunk and all the chunks after it MUST be discarded
        and an ERROR chunk SHOULD be sent with the error cause
        defined in Section 4.1. [...] If no endpoint pair shared
        key has been configured for that Shared Key Identifier,
        all authenticated chunks MUST be silently discarded. [...]
      
        When an endpoint requires COOKIE-ECHO chunks to be
        authenticated, some special procedures have to be followed
        because the reception of a COOKIE-ECHO chunk might result
        in the creation of an SCTP association. If a packet arrives
        containing an AUTH chunk as a first chunk, a COOKIE-ECHO
        chunk as the second chunk, and possibly more chunks after
        them, and the receiver does not have an STCB for that
        packet, then authentication is based on the contents of
        the COOKIE-ECHO chunk. In this situation, the receiver MUST
        authenticate the chunks in the packet by using the RANDOM
        parameters, CHUNKS parameters and HMAC_ALGO parameters
        obtained from the COOKIE-ECHO chunk, and possibly a local
        shared secret as inputs to the authentication procedure
        specified in Section 6.3. If authentication fails, then
        the packet is discarded. If the authentication is successful,
        the COOKIE-ECHO and all the chunks after the COOKIE-ECHO
        MUST be processed. If the receiver has an STCB, it MUST
        process the AUTH chunk as described above using the STCB
        from the existing association to authenticate the
        COOKIE-ECHO chunk and all the chunks after it. [...]
      
      Commit bbd0d598 introduced the reception and
      verification of AUTH chunks, including the edge case for
      authenticated COOKIE-ECHO. On reception of COOKIE-ECHO,
      the function sctp_sf_do_5_1D_ce() handles processing,
      unpacks and creates a new association if it passed sanity
      checks and also tests for authentication chunks being
      present. After a new association has been processed, it
      invokes sctp_process_init() on the new association and
      walks through the parameter list it received from the INIT
      chunk. It checks SCTP_PARAM_RANDOM, SCTP_PARAM_HMAC_ALGO
      and SCTP_PARAM_CHUNKS, and copies them into asoc->peer
      meta data (peer_random, peer_hmacs, peer_chunks) in case
      sysctl -w net.sctp.auth_enable=1 is set. If in INIT's
      SCTP_PARAM_SUPPORTED_EXT parameter SCTP_CID_AUTH is set,
      peer_random != NULL and peer_hmacs != NULL the peer is to be
      assumed asoc->peer.auth_capable=1, in any other case
      asoc->peer.auth_capable=0.
      
      Now, if in sctp_sf_do_5_1D_ce() chunk->auth_chunk is
      available, we set up a fake auth chunk and pass that on to
      sctp_sf_authenticate(), which at latest in
      sctp_auth_calculate_hmac() reliably dereferences a NULL pointer
      at position 0..0008 when setting up the crypto key in
      crypto_hash_setkey() by using asoc->asoc_shared_key that is
      NULL as condition key_id == asoc->active_key_id is true if
      the AUTH chunk was injected correctly from remote. This
      happens no matter what net.sctp.auth_enable sysctl says.
      
      The fix is to check for net->sctp.auth_enable and for
      asoc->peer.auth_capable before doing any operations like
      sctp_sf_authenticate() as no key is activated in
      sctp_auth_asoc_init_active_key() for each case.
      
      Now, RFC4895 section 6.3 states that if the HMAC algorithm used
      in the AUTH chunk was not specified in the HMAC-ALGO parameter
      of the INIT chunk, we SHOULD send an error; however, in this
      case it would be better to just silently discard such a
      maliciously prepared handshake, as we didn't even receive a
      parameter at all. Also, as our endpoint has no shared key
      configured, section 6.3 says that we MUST silently discard,
      which we are doing from now onwards.
      
      Before calling sctp_sf_pdiscard(), we need not only to free
      the association, but also the chunk->auth_chunk skb, as
      commit bbd0d598 created a skb clone in that case.
      
      I have tested this locally by using netfilter's nfqueue and
      re-injecting packets into the local stack after maliciously
      modifying the INIT chunk (removing RANDOM; HMAC-ALGO param)
      and the SCTP packet containing the COOKIE_ECHO (injecting
      AUTH chunk before COOKIE_ECHO). Fixed with this patch applied.
      
      Fixes: bbd0d598 ("[SCTP]: Implement the receive and verification of AUTH chunk")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Vlad Yasevich <yasevich@gmail.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
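The guard described in the SCTP AUTH commit above boils down to a small decision: an AUTH chunk in front of COOKIE-ECHO is only authenticated when AUTH is enabled locally and the peer negotiated it; otherwise the packet is silently discarded. The sketch below models that decision only. `enum ce_verdict` and `cookie_echo_auth_verdict()` are hypothetical names, and the three booleans stand in for `chunk->auth_chunk`, `net->sctp.auth_enable` and `asoc->peer.auth_capable`.

```c
#include <stdbool.h>

/* Possible outcomes for an incoming COOKIE-ECHO (illustrative only). */
enum ce_verdict {
    CE_PROCESS,       /* no AUTH chunk present: continue as usual */
    CE_AUTHENTICATE,  /* AUTH chunk present and usable: verify it */
    CE_DISCARD        /* AUTH chunk present but AUTH was never negotiated */
};

/* Decide what to do before any HMAC key would be touched. */
enum ce_verdict cookie_echo_auth_verdict(bool has_auth_chunk,
                                         bool auth_enable,
                                         bool peer_auth_capable)
{
    if (!has_auth_chunk)
        return CE_PROCESS;
    /* Without both conditions no active key was set up, so verifying
     * the chunk would dereference a missing key. Drop silently. */
    if (!auth_enable || !peer_auth_capable)
        return CE_DISCARD;
    return CE_AUTHENTICATE;
}
```

The ordering matters: the capability check comes before any authentication work, so an injected AUTH chunk never reaches the HMAC code on an association that has no key.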
    • ip_tunnel:multicast process cause panic due to skb->_skb_refdst NULL pointer · 10ddceb2
      Xin Long committed
      When ip_tunnel processes multicast packets, it may check whether the packet
      is a looped-back packet through 'rt_is_output_route(skb_rtable(skb))' in
      ip_tunnel_rcv(). But before that, skb->_skb_refdst has already been dropped
      in iptunnel_pull_header(), which leads to a panic.

      fix the bug: https://bugzilla.kernel.org/show_bug.cgi?id=70681
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix bogus RTT on special retransmission · c84a5711
      Yuchung Cheng committed
      RTT may be bogus with tail loss probe (TLP) when a packet
      is retransmitted and later (s)acked without TCPCB_SACKED_RETRANS flag.
      
      For example, TLP calls __tcp_retransmit_skb() instead of
      tcp_retransmit_skb(). The skb timestamps are updated but the sacked
      flag is not marked with TCPCB_SACKED_RETRANS. As a result we'll
      get bogus RTT in tcp_clean_rtx_queue() or in tcp_sacktag_one() on
      spurious retransmission.
      
      The fix is to apply the sticky flag TCP_EVER_RETRANS to enforce Karn's
      check on RTT sampling. However this will disable F-RTO if timeout occurs
      after TLP, by resetting undo_marker in tcp_enter_loss(). We relax this
      check so that it applies only if any pending retransmits are still in flight.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
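Karn's check, as referred to in the TCP RTT commit above, means that a segment which was ever retransmitted must not produce an RTT sample, because the ACK cannot be matched to one particular transmission. A minimal sketch of that rule follows; `SKB_EVER_RETRANS` and `rtt_sample_ms()` are hypothetical names chosen for illustration and are not the kernel's identifiers.

```c
#include <stdint.h>

/* Hypothetical sticky flag: set on the first retransmission, never cleared. */
#define SKB_EVER_RETRANS 0x80u

/* Returns an RTT sample in milliseconds, or -1 when Karn's check
 * forbids sampling because the segment was retransmitted at some point. */
long rtt_sample_ms(uint8_t sacked_flags, uint32_t sent_ms, uint32_t acked_ms)
{
    if (sacked_flags & SKB_EVER_RETRANS)
        return -1;
    /* Unsigned subtraction keeps the result correct across clock wrap. */
    return (long)(uint32_t)(acked_ms - sent_ms);
}
```

Because the flag is sticky, a probe that updates the timestamp without marking the segment as currently retransmitted still suppresses the sample.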
    • hsr: off by one sanity check in hsr_register_frame_in() · de39d7a4
      Dan Carpenter committed
      This is a sanity check and we never pass invalid values so this patch
      doesn't change anything.  However the node->time_in[] array has
      HSR_MAX_SLAVE (2) elements and not HSR_MAX_DEV (3).
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
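The off-by-one in the hsr commit above is the classic case of bounding an index by the wrong constant. The sketch below assumes a model struct with a two-element `time_in[]` array, mirroring the sizes quoted in the commit message (2 slaves, 3 devices); `MODEL_MAX_SLAVE`, `struct node_model` and `register_frame_in()` are hypothetical stand-ins.

```c
#include <stdbool.h>

#define MODEL_MAX_SLAVE 2   /* mirrors HSR_MAX_SLAVE (2); HSR_MAX_DEV is 3 */

struct node_model {
    unsigned long time_in[MODEL_MAX_SLAVE];
};

/* Records the arrival time for a slave index. Returns false and leaves
 * the node untouched when the index does not fit the array. */
bool register_frame_in(struct node_model *node, int dev_idx, unsigned long now)
{
    /* Bound by the array size, not by the larger device count:
     * index 2 would pass a check against 3 but overrun time_in[]. */
    if (dev_idx < 0 || dev_idx >= MODEL_MAX_SLAVE)
        return false;
    node->time_in[dev_idx] = now;
    return true;
}
```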
  3. 03 Mar, 2014 1 commit
    • can: remove CAN FD compatibility for CAN 2.0 sockets · 821047c4
      Oliver Hartkopp committed
      In commit e2d265d3 (canfd: add support for CAN FD in CAN_RAW sockets)
      CAN FD frames with a payload length of up to 8 bytes are passed to legacy
      sockets where the CAN FD support was not enabled by the application.

      After some discussions with developers at a fair, it turned out that this
      well-meant feature leads to confusion, as no clean switch for CAN / CAN FD
      is provided to the application programmer. Additionally, a compatibility
      like this for legacy CAN_RAW sockets requires some compatibility handling
      for the sending, e.g. make CAN2.0 frames a CAN FD frame with BRS at
      transmission time (?!?).
      
      This will become a mess when people start to develop applications with
      real CAN FD hardware. This patch reverts the bad compatibility code
      together with the documentation describing the removed feature.
      Acked-by: Stephane Grosjean <s.grosjean@peak-system.com>
      Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
  4. 28 Feb, 2014 5 commits
  5. 27 Feb, 2014 1 commit
    • net: tcp: use NET_INC_STATS() · 9a9bfd03
      Eric Dumazet committed
      While LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES can only be incremented
      in tcp_transmit_skb() from softirq (incoming message or timer
      activation), it is better to use NET_INC_STATS() instead of
      NET_INC_STATS_BH() as tcp_transmit_skb() can be called from process
      context.
      
      This will avoid copy/paste confusion when/if we want to add
      other SNMP counters in tcp_transmit_skb().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 26 Feb, 2014 3 commits
    • xfrm: Fix unlink race when policies are deleted. · 3a9016f9
      Steffen Klassert committed
      When a policy is unlinked from the lists in thread context,
      the xfrm timer can fire before we can mark this policy as dead.
      So reinitialize the bydst hlist; hlist_unhashed() will then
      notice that this policy is not linked and will avoid a
      double unlink of that policy.
      Reported-by: Xianpeng Zhao <673321875@qq.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
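The idea behind the xfrm unlink fix above is that a list node which is reinitialized on removal can tell a second remover that it is already unlinked. The following userspace sketch imitates that pattern with a minimal singly-headed list. The `hnode_*` names are hypothetical and only loosely modelled on the kernel's hlist helpers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal hash-list node: pprev points at whatever pointer refers to us. */
struct hnode { struct hnode *next, **pprev; };
struct hhead { struct hnode *first; };

void hnode_init(struct hnode *n)
{
    n->next = NULL;
    n->pprev = NULL;
}

/* A node with no back pointer is not on any list. */
bool hnode_unhashed(const struct hnode *n)
{
    return n->pprev == NULL;
}

void hnode_add_head(struct hnode *n, struct hhead *h)
{
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Unlink and reinitialize. Returns true only if the node was linked,
 * so a second caller (for example a timer) sees false and does nothing. */
bool hnode_del_init(struct hnode *n)
{
    if (hnode_unhashed(n))
        return false;
    *n->pprev = n->next;
    if (n->next)
        n->next->pprev = n->pprev;
    hnode_init(n);
    return true;
}
```

A plain unlink that leaves stale pointers behind cannot offer this guarantee, which is why the reinitializing variant closes the double-unlink window.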
    • net: Fix permission check in netlink_connect() · 46833a86
      Mike Pecovnik committed
      netlink_sendmsg() was changed to prevent non-root processes from sending
      messages with dst_pid != 0.
      netlink_connect() however still only checks if nladdr->nl_groups is set.
      This patch modifies netlink_connect() to check for the same condition.
      Signed-off-by: Mike Pecovnik <mike.pecovnik@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
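The netlink commit above aligns the connect-time check with the send-time check: a destination with either a multicast group or a non-zero pid requires the privilege. A small sketch of that condition, under the assumption that the privilege can be reduced to one boolean; `struct nl_addr_model`, `connect_permission()` and `may_send_nonroot` are hypothetical names.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Only the two address fields relevant to the check are modelled. */
struct nl_addr_model {
    uint32_t nl_pid;
    uint32_t nl_groups;
};

/* Returns 0 when the connect is allowed, -EPERM otherwise.
 * Before the fix described above, only nl_groups was considered. */
int connect_permission(const struct nl_addr_model *addr, bool may_send_nonroot)
{
    if ((addr->nl_groups || addr->nl_pid) && !may_send_nonroot)
        return -EPERM;
    return 0;
}
```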
    • ipv4: ipv6: better estimate tunnel header cut for correct ufo handling · 91a48a2e
      Hannes Frederic Sowa committed
      Currently the UFO fragmentation process does not correctly handle inner
      UDP frames.
      
      (The following tcpdumps are captured on the parent interface with ufo
      disabled while tunnel has ufo enabled, 2000 bytes payload, mtu 1280,
      both sit device):
      
      IPv6:
      16:39:10.031613 IP (tos 0x0, ttl 64, id 3208, offset 0, flags [DF], proto IPv6 (41), length 1300)
          192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 1240) 2001::1 > 2001::8: frag (0x00000001:0|1232) 44883 > distinct: UDP, length 2000
      16:39:10.031709 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPv6 (41), length 844)
          192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 784) 2001::1 > 2001::8: frag (0x00000001:0|776) 58979 > 46366: UDP, length 5471
      
      We can see that fragmentation header offset is not correctly updated.
      (fragmentation id handling is corrected by 916e4cf4 ("ipv6: reuse
      ip6_frag_id from ip6_ufo_append_data")).
      
      IPv4:
      16:39:57.737761 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPIP (4), length 1296)
          192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57034, offset 0, flags [none], proto UDP (17), length 1276)
          192.168.99.1.35961 > 192.168.99.2.distinct: UDP, length 2000
      16:39:57.738028 IP (tos 0x0, ttl 64, id 3210, offset 0, flags [DF], proto IPIP (4), length 792)
          192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57035, offset 0, flags [none], proto UDP (17), length 772)
          192.168.99.1.13531 > 192.168.99.2.20653: UDP, length 51109
      
      In this case fragmentation id is incremented and offset is not updated.
      
      First, I aligned inet_gso_segment and ipv6_gso_segment:
      * align naming of flags
      * ipv6_gso_segment: setting skb->encapsulation is unnecessary, as we
        always ensure that the state of this flag is left untouched when
        returning from upper gso segmentation function
      * ipv6_gso_segment: move skb_reset_inner_headers below updating the
        fragmentation header data, we don't care for updating fragmentation
        header data
      * remove currently unneeded comment indicating skb->encapsulation might
        get changed by upper gso_segment callback (gre and udp-tunnel reset
        encapsulation after segmentation on each fragment)
      
      If we encounter an IPIP or SIT gso skb we now check for the protocol ==
      IPPROTO_UDP and that we at least have already traversed another ip(6)
      protocol header.
      
      The reason why we have to special case GSO_IPIP and GSO_SIT is that
      we reset skb->encapsulation to 0 while skb_mac_gso_segment the inner
      protocol of GSO_UDP_TUNNEL or GSO_GRE packets.
      Reported-by: Wolfgang Walter <linux@stwm.de>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 25 Feb, 2014 2 commits
    • cfg80211: regulatory: reset regdomain in case of error · 092008ab
      Janusz Dziedzic committed
      Reset regdomain to world regdomain in case
      of errors in set_regdom() function.
      
      This fixes a problem in the following scenario:
      - iw reg set US
      - iw reg set 00
      - iw reg set US
      The last step always fails and we get a deadlock
      in the kernel regulatory code. Setting a new regulatory
      domain afterwards was not possible due to:
      
      Pending regulatory request, waiting for it to be processed...
      Signed-off-by: Janusz Dziedzic <janusz.dziedzic@tieto.com>
      Acked-by: Luis R. Rodriguez <mcgrof@do-not-panic.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • tcp: reduce the bloat caused by tcp_is_cwnd_limited() · d10473d4
      Eric Dumazet committed
      tcp_is_cwnd_limited() allows GSO/TSO enabled flows to increase
      their cwnd to allow a full size (64KB) TSO packet to be sent.
      
      Non GSO flows only allow an extra room of 3 MSS.
      
      For most flows with a BDP below 10 MSS, this results in a bloat
      of cwnd reaching 90, and an inflated RTT.
      
      Thanks to TSO auto sizing, we can restrict the bloat to the number
      of MSS contained in a TSO packet (tp->xmit_size_goal_segs), to keep
      original intent without performance impact.
      
      Because we keep cwnd small, it helps to keep TSO packet size to their
      optimal value.
      
      Example for a 10Mbit flow, with low TCP Small queue limits (no more than
      2 skb in qdisc/device tx ring)
      
      Before patch :
      
      lpk51:~# ./ss -i dst lpk52:44862 | grep cwnd
               cubic wscale:6,6 rto:215 rtt:15.875/2.5 mss:1448 cwnd:96
      ssthresh:96
      send 70.1Mbps unacked:14 rcv_space:29200
      
      After patch :
      
      lpk51:~# ./ss -i dst lpk52:52916 | grep cwnd
               cubic wscale:6,6 rto:206 rtt:5.206/0.036 mss:1448 cwnd:15
      ssthresh:14
      send 33.4Mbps unacked:4 rcv_space:29200
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 24 Feb, 2014 2 commits
  9. 22 Feb, 2014 6 commits
    • ipv6: reuse ip6_frag_id from ip6_ufo_append_data · 916e4cf4
      Hannes Frederic Sowa committed
      Currently we generate a new fragmentation id on UFO segmentation. It
      is pretty hairy to identify the correct net namespace and dst there.
      Especially tunnels use IFF_XMIT_DST_RELEASE and thus have no skb_dst
      available at all.
      
      This causes unreliable or very predictable ipv6 fragmentation id
      generation during segmentation.

      Luckily we have already pregenerated the ip6_frag_id in
      ip6_ufo_append_data and can use it here.
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sctp: rework multihoming retransmission path selection to rfc4960 · 4c47af4d
      Daniel Borkmann committed
      Problem statement: 1) both paths (primary path1 and alternate
      path2) are up after the association has been established i.e.,
      HB packets are normally exchanged, 2) path2 gets inactive after
      path_max_retrans * max_rto timed out (i.e. path2 is down completely),
      3) now, if a transmission times out on the only surviving/active
      path1 (any ~1sec network service impact could cause this like
      a channel bonding failover), then the retransmitted packets are
      sent over the inactive path2; this happens with partial failover
      and without it.
      
      Besides not being optimal in the above scenario, a small failure
      or timeout in the only existing path has the potential to cause
      long delays in the retransmission (depending on RTO_MAX) until
      the still active path is reselected. Further, when the T3-timeout
      occurs, we have active_path == retrans_path, and even though the
      timeout occurred on the initial transmission of data, not a
      retransmit, we end up updating the retransmit path.
      
      RFC4960, section 6.4. "Multi-Homed SCTP Endpoints" states under
      6.4.1. "Failover from an Inactive Destination Address" the
      following:
      
        Some of the transport addresses of a multi-homed SCTP endpoint
        may become inactive due to either the occurrence of certain
        error conditions (see Section 8.2) or adjustments from the
        SCTP user.
      
        When there is outbound data to send and the primary path
        becomes inactive (e.g., due to failures), or where the SCTP
        user explicitly requests to send data to an inactive
        destination transport address, before reporting an error to
        its ULP, the SCTP endpoint should try to send the data to an
        alternate __active__ destination transport address if one
        exists.
      
        When retransmitting data that timed out, if the endpoint is
        multihomed, it should consider each source-destination address
        pair in its retransmission selection policy. When retransmitting
        timed-out data, the endpoint should attempt to pick the most
        divergent source-destination pair from the original
        source-destination pair to which the packet was transmitted.
      
        Note: Rules for picking the most divergent source-destination
        pair are an implementation decision and are not specified
        within this document.
      
      So, we should first consider keeping the current active
      retransmission transport if we cannot find an alternative
      active one. If all of that fails, we can still round robin
      through unknown, partial failover, and inactive ones in the
      hope to find something still suitable.
      
      Commit 4141ddc0 ("sctp: retran_path update bug fix") broke
      that behaviour by selecting the next inactive transport when
      no other active transport was found besides the current assoc's
      peer.retran_path. Before commit 4141ddc0, we would have
      traversed through the list until we reach our peer.retran_path
      again, and in case that is still in state SCTP_ACTIVE, we would
      take it and return. Only if that is not the case either, we
      take the next inactive transport.
      
      Besides all that, another issue is that transports in state
      SCTP_UNKNOWN could be preferred over transports in state
      SCTP_ACTIVE in case a SCTP_ACTIVE transport appears after
      SCTP_UNKNOWN in the transport list yielding a weaker transport
      state to be used in retransmission.
      
      This patch mostly reverts 4141ddc0, but also rewrites
      this function to introduce more clarity and strictness into
      the code. A strict priority of transport states is enforced
      in this patch, hence selection is active > unknown > partial
      failover > inactive.
      
      Fixes: 4141ddc0 ("sctp: retran_path update bug fix")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Acked-by: Vlad Yasevich <yasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
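The strict priority stated in the SCTP multihoming commit above (active > unknown > partial failover > inactive) can be illustrated with a small round-robin selection. This is a simplified sketch under stated assumptions, not the kernel function: `enum tstate`, `pick_retran_path()` and the array-of-states representation are hypothetical, and ties are resolved in favour of the first candidate found after the current path so that an equally good alternative is preferred over staying put.

```c
#include <stddef.h>

/* Transport states ordered from weakest to strongest (illustrative). */
enum tstate { T_INACTIVE, T_PF, T_UNKNOWN, T_ACTIVE };

/* Walks the transports round robin, starting after the current path
 * `cur` and visiting `cur` itself last. Returns the index of the
 * strongest state; the current path is kept only if it is strictly
 * better than every alternative. Requires n > 0 and cur < n. */
size_t pick_retran_path(const enum tstate *states, size_t n, size_t cur)
{
    size_t best = (cur + 1) % n;

    for (size_t step = 2; step <= n; step++) {
        size_t i = (cur + step) % n;

        if (states[i] > states[best])
            best = i;
    }
    return best;
}
```

For the scenario in the commit message, with path1 active and path2 inactive, the walk first considers path2 and then returns to path1 because it is strictly stronger, so retransmissions stay on the surviving path.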
    • neigh: fix setting of default gc_* values · b194c1f1
      Jiri Pirko committed
      This patch fixes a bug introduced by:
      commit 1d4c8c29
      "neigh: restore old behaviour of default parms values"
      
      The thing is that in neigh_sysctl_register, extra1 and extra2 which were
      previously set for NEIGH_VAR_GC_* are overwritten. That leads to
      nonsense int limits for gc_* variables. So fix this by not touching
      extra* fields for gc_* variables.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-tcp: fastopen: fix high order allocations · f5ddcbbb
      Eric Dumazet committed
      This patch fixes two bugs in fastopen:
      
      1) The tcp_sendmsg(...,  @size) argument was ignored.
      
         Code was relying on user not fooling the kernel with iovec mismatches
      
      2) When MTU is about 64KB, tcp_send_syn_data() attempts order-5
      allocations, which are likely to fail when memory gets fragmented.
      
      Fixes: 783237e8 ("net-tcp: Fast Open client - sending SYN-data")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Tested-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
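To see why a buffer of about 64KB turns into an order-5 request in the fastopen commit above, it helps to compute the allocation order by hand. The sketch below assumes 4 KiB pages and assumes that the buffer needs somewhat more than 64 KiB once header space is added (both are assumptions, not statements from the commit); `alloc_order()` is a hypothetical helper that mimics power-of-two page rounding.

```c
#include <stddef.h>

/* Smallest order such that (page_size << order) >= size.
 * An order-k allocation is 2^k physically contiguous pages. */
unsigned int alloc_order(size_t size, size_t page_size)
{
    unsigned int order = 0;
    size_t span = page_size;

    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}
```

Exactly 64 KiB fits in order 4 (16 pages), but a single extra byte pushes the request to order 5, that is 32 contiguous pages or 128 KiB, which is hard to satisfy once memory is fragmented.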
    • tipc: make bearer set up in module insertion stage · 970122fd
      Ying Xue committed
      Commit 6e967adf ("tipc: relocate common functions from media to
      bearer") accidentally introduced a side effect. The tipc stack
      handler for receiving packets from netdevices and the netdevice
      notification handler are now registered when a bearer is enabled
      rather than at tipc module initialization, but both handlers are
      still unregistered in the tipc module exit phase. If the tipc module
      is inserted and then immediately removed, the following warning
      message will appear:
      
      "dev_remove_pack: ffffffffa0380940 not found"
      
      This is because the tipc stack packet handler is not registered at
      all in the module insertion stage, yet dev_remove_pack() needs to
      remove it in the module exit phase. dev_remove_pack() cannot find
      the tipc protocol handler in the kernel protocol handler list, so
      the warning message is printed.

      If registration of the two handlers is moved from the bearer
      enabling phase to the module insertion stage, the warning message
      is eliminated. Due to this change, tipc_core_start_net() and
      tipc_core_stop_net() can be deleted as well.
      Reported-by: Wang Weidong <wangweidong1@huawei.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Cc: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: remove all enabled flags from all tipc components · 9fe7ed47
      Ying Xue committed
      When the tipc module is inserted, many tipc components are initialized
      one by one. If one of them fails during initialization,
      tipc_core_stop() is called to stop all components, whether or not the
      corresponding components have been created. To avoid releasing
      uncreated ones, the relevant components had to add enabled flags
      indicating whether they had been created.

      But in the initialization stage, if one component fails to be
      created, we just destroy the components successfully created
      before the failed one instead of all components. All enabled flags
      defined in components, in turn, become redundant. Additionally, it is
      unnecessary to check whether table.types is NULL in
      tipc_nametbl_stop(), because the name table has definitely been
      created successfully when tipc_nametbl_stop() is called.
      
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Cc: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 21 Feb, 2014 2 commits
    • net: sctp: Potentially-Failed state should not be reached from unconfirmed state · 7cce3b75
      Matija Glavinic Pecotic committed
      In the current implementation it is possible to reach the PF state from
      unconfirmed. We can interpret sctp-failover-02 in a way that the PF state is
      meant to be reached only from the active state; in the end, this is when
      entering the PF state makes sense. Here are a few quotes from
      sctp-failover-02, but regardless of these, the same understanding can be
      reached from the whole of section 5:
      
      Section 5.1, quickfailover guide:
          "The PF state is an intermediate state between Active and Failed states."
      
          "Each time the T3-rtx timer expires on an active or idle
          destination, the error counter of that destination address will
          be incremented.  When the value in the error counter exceeds
          PFMR, the endpoint should mark the destination transport address as PF."
      
      There are several concrete reasons for such an interpretation. For a start,
      rfc4960 does not take the quickfailover algorithm into account. Therefore,
      quickfailover must comply with 4960. The point where this compliance can be
      argued is the following behavior:
      When PF is entered, the association overall error counter is incremented for
      each missed HB. This contradicts rfc4960, as an address in the unconfirmed
      state is subjected to probing, and while it is probed, it should not
      increment the association overall error counter. As a consequence, we might
      end up dropping the association due to path failure on an unconfirmed
      address, in case we have a wrong configuration such that:
      Association.Max.Retrans == Path.Max.Retrans.
      
      Another reason is that entering PF from unconfirmed causes the loss of the
      address confirmed event when (and if) the address is eventually confirmed.
      This is fine from the failover guide's point of view, but it is inconsistent
      with the behavior preceding the failover implementation and with the
      recommendation from 4960:
      
      5.4.  Path Verification
         Whenever a path is confirmed, an indication MAY be given to the upper
         layer.
      Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7cce3b75
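      The state rule argued for above can be sketched as a small user-space
      model. This is only an illustration under stated assumptions: the enum,
      struct and function names are hypothetical and are not the kernel's
      definitions, and the handling of the overall error counter reflects the
      reading of rfc4960 given in the commit message.

```c
/* Hypothetical, simplified transport states (not the kernel's definitions). */
enum path_state { PATH_UNCONFIRMED, PATH_ACTIVE, PATH_PF, PATH_FAILED };

struct path {
    enum path_state state;
    int error_count;          /* consecutive timeouts on this destination */
};

struct assoc {
    int overall_error_count;  /* association-wide error counter */
    int pf_max_retrans;       /* PFMR */
    int path_max_retrans;     /* Path.Max.Retrans */
};

/* Called for each T3-rtx expiry or missed heartbeat on path p. */
void path_timeout(struct assoc *a, struct path *p)
{
    p->error_count++;

    /* Assumption: probing an unconfirmed address must not count
     * against the association as a whole. */
    if (p->state != PATH_UNCONFIRMED)
        a->overall_error_count++;

    /* Only an active destination may become Potentially-Failed. */
    if (p->state == PATH_ACTIVE && p->error_count > a->pf_max_retrans)
        p->state = PATH_PF;

    /* Any destination fails once Path.Max.Retrans is exceeded. */
    if (p->error_count > a->path_max_retrans)
        p->state = PATH_FAILED;
}
```

      With PFMR = 1, two timeouts move an active path to PF, while an
      unconfirmed path stays unconfirmed and leaves the association counter
      untouched.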
    • N
      sit: fix panic with route cache in ip tunnels · cf71d2bc
      Nicolas Dichtel committed
      Bug introduced by commit 7d442fab ("ipv4: Cache dst in tunnels").
      
      Because sit code does not call ip_tunnel_init(), the dst_cache was not
      initialized.
      
      CC: Tom Herbert <therbert@google.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf71d2bc
  11. 20 Feb, 2014 8 commits
    • S
      xfrm: Clone states properly on migration · ee5c2317
      Steffen Klassert committed
      We lose a lot of information from the original state if we
      clone it with xfrm_state_clone(). In particular, there is
      no crypto algorithm attached if the original state uses
      an aead algorithm. This patch adds the missing information
      to the cloned state.
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      ee5c2317
    • S
      xfrm: Take xfrm_state_lock in xfrm_migrate_state_find · 8c0cba22
      Steffen Klassert committed
      A comment on xfrm_migrate_state_find() says that xfrm_state_lock
      is held. This is apparently not the case, but we need it to
      traverse through the state lists.
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      8c0cba22
    • S
      xfrm: Fix NULL pointer dereference on sub policy usage · 35ea790d
      Steffen Klassert committed
      xfrm_state_sort() takes the unsorted states from the src array
      and stores them into the dst array. We try to get the namespace
      from the dst array which is empty at this time, so take the
      namespace from the src array instead.
      
      Fixes: 283bc9f3 ("xfrm: Namespacify xfrm state/policy locks")
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      35ea790d
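      The src/dst mix-up described above can be illustrated with a reduced
      model. The types and the function sort_states() are hypothetical
      stand-ins, not the xfrm code; the sketch only shows why the namespace
      has to be read from the source array while the destination is still
      empty.

```c
#include <stddef.h>

struct net { int id; };                      /* stand-in for a namespace */
struct state { struct net *net; int prio; }; /* stand-in for a state */

/*
 * Copy n states from src into dst in ascending prio order and return the
 * namespace they belong to. dst holds no valid entries on entry, so the
 * namespace has to be taken from src.
 */
struct net *sort_states(struct state **dst, struct state **src, size_t n)
{
    struct net *net;
    size_t i, j;

    if (n == 0)
        return NULL;

    net = src[0]->net;   /* dst[0] is not filled in yet at this point */

    for (i = 0; i < n; i++) {   /* insertion sort into dst */
        struct state *s = src[i];

        for (j = i; j > 0 && dst[j - 1]->prio > s->prio; j--)
            dst[j] = dst[j - 1];
        dst[j] = s;
    }
    return net;
}
```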
    • S
      ip6_vti: Fix build when NET_IP_TUNNEL is not set. · 876fc03a
      Steffen Klassert committed
      Since commit 469bdcef ip6_vti uses ip_tunnel_get_stats64(),
      so we need to select NET_IP_TUNNEL to have this function available.
      
      Fixes: 469bdcef ("ipv6: fix the use of pcpu_tstats in ip6_vti.c")
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      876fc03a
    • J
      mac80211: fix station wakeup powersave race · e3685e03
      Johannes Berg committed
      Consider the following (relatively unlikely) scenario:
       1) station goes to sleep while frames are buffered in driver
       2) driver blocks wakeup (until no more frames are buffered)
       3) station wakes up again
       4) driver unblocks wakeup
      
      In this case, the current mac80211 code will do the following:
       1) WLAN_STA_PS_STA set
       2) WLAN_STA_PS_DRIVER set
       3) - nothing -
       4) WLAN_STA_PS_DRIVER cleared
      
      As a result, no frames will be delivered to the client, even
      though it is awake, until it sends another frame to us that
      triggers ieee80211_sta_ps_deliver_wakeup() in sta_ps_end().
      
      Since we now take the PS spinlock, we can fix this while at
      the same time removing the complexity with the pending skb
      queue function. This was broken since my commit 50a9432d
      ("mac80211: fix powersaving clients races") due to removing
      the clearing of WLAN_STA_PS_STA in the RX path.
      
      While at it, fix a cleanup path issue when a station is
      removed while the driver is still blocking its wakeup.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      e3685e03
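      The four-step scenario above can be modelled with two flag bits. This is
      a sketch under assumptions, not the mac80211 patch: PS_STA and PS_DRIVER
      are hypothetical stand-ins for WLAN_STA_PS_STA and WLAN_STA_PS_DRIVER,
      and the helper names are invented for illustration. The point is that
      step 3 has to record the wakeup even while the driver blocks it, so that
      step 4 can deliver the buffered frames.

```c
/* Hypothetical flag bits modelled on WLAN_STA_PS_STA / WLAN_STA_PS_DRIVER. */
#define PS_STA    (1u << 0)   /* station told us it is asleep */
#define PS_DRIVER (1u << 1)   /* driver is blocking the wakeup */

struct sta {
    unsigned int flags;
    int buffered;    /* frames held back for the station */
    int delivered;   /* frames handed over after wakeup */
};

void deliver_buffered(struct sta *s)
{
    s->delivered += s->buffered;
    s->buffered = 0;
}

void sta_sleeps(struct sta *s)    { s->flags |= PS_STA; }     /* step 1 */
void driver_blocks(struct sta *s) { s->flags |= PS_DRIVER; }  /* step 2 */

/* Step 3: record the wakeup even while the driver still blocks it. */
void sta_wakes(struct sta *s)
{
    s->flags &= ~PS_STA;
    if (!(s->flags & PS_DRIVER))
        deliver_buffered(s);
}

/* Step 4: deliver only if the station is known to be awake by now. */
void driver_unblocks(struct sta *s)
{
    s->flags &= ~PS_DRIVER;
    if (!(s->flags & PS_STA))
        deliver_buffered(s);
}
```

      If sta_wakes() did nothing while PS_DRIVER was set, PS_STA would stay
      set after step 4 and the buffered frames would never be delivered.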
    • J
      mac80211: insert stations before adding to driver · 5108ca82
      Johannes Berg committed
      There's a race condition in mac80211 because we add stations
      to the internal lists after adding them to the driver, which
      means that (for example) the following can happen:
       1. a station connects and is added
       2. first, it is added to the driver
       3. then, it is added to the mac80211 lists
      
      If the station goes to sleep between steps 2 and 3, and the
      firmware/hardware records it as being asleep, mac80211 will
      never instruct the driver to wake it up again. It never
      realized the station went to sleep, because the RX path
      discarded the frame as a "spurious class 3 frame" while no
      station entry was present yet.
      
      Fix this by adding the station in software first, and only
      then adding it to the driver. That way, any state that the
      driver changes will be reflected properly in mac80211's
      station state. The problematic part is the roll-back if the
      driver fails to add the station; in that case a bit more is
      needed. To avoid making that overly complex, prevent starting
      BA sessions in the meantime.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      5108ca82
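      The reordering described above, including the roll-back, might look
      roughly like this in a reduced user-space model. All names here
      (sta_list, sta_insert(), drv_add_fn and the two example driver stubs)
      are hypothetical and only illustrate the ordering; they are not
      mac80211 interfaces.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_STA 8

struct sta { int id; bool asleep; };

struct sta *sta_list[MAX_STA];   /* software station list */
size_t sta_count;

/* Hypothetical driver hook; returns 0 on success, negative on failure. */
typedef int (*drv_add_fn)(struct sta *s);

struct sta *sta_find(int id)
{
    for (size_t i = 0; i < sta_count; i++)
        if (sta_list[i]->id == id)
            return sta_list[i];
    return NULL;
}

int sta_insert(struct sta *s, drv_add_fn drv_add)
{
    int err;

    if (sta_count == MAX_STA)
        return -1;

    /* 1. Make the station visible in software first ... */
    sta_list[sta_count++] = s;

    /* 2. ... and only then hand it to the driver. */
    err = drv_add(s);
    if (err) {
        sta_count--;                 /* roll back the software insert */
        sta_list[sta_count] = NULL;
        return err;
    }
    return 0;
}

/* Example stubs: one driver accepts the station, the other rejects it. */
int drv_ok(struct sta *s)   { (void)s; return 0; }
int drv_fail(struct sta *s) { (void)s; return -1; }
```

      Because the entry exists before the driver is called, a lookup made
      while the driver is still being set up already finds the station.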
    • E
      mac80211: fix AP powersave TX vs. wakeup race · 1d147bfa
      Emmanuel Grumbach committed
      There is a race between the TX path and the STA wakeup: while
      a station is sleeping, mac80211 buffers frames until it wakes
      up, then the frames are transmitted. However, the RX and TX
      paths are concurrent, so the packet indicating wakeup can be
      processed while a packet is being transmitted.
      
      This can lead to a situation where the buffered frames list
      is emptied on the one side, while a frame is being added on
      the other side, as the station is still seen as sleeping in
      the TX path.
      
      As a result, the newly added frame will not be sent anytime
      soon. It might be sent much later (and out of order) when the
      station goes to sleep and wakes up the next time.
      
      Additionally, it can lead to the crash below.
      
      Fix all this by synchronising both paths with a new lock.
      Neither path is a fast path, since both handle PS situations.
      
      In a later patch we'll remove the extra skb queue locks to
      reduce locking overhead.
      
      BUG: unable to handle kernel
      NULL pointer dereference at 000000b0
      IP: [<ff6f1791>] ieee80211_report_used_skb+0x11/0x3e0 [mac80211]
      *pde = 00000000
      Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      EIP: 0060:[<ff6f1791>] EFLAGS: 00210282 CPU: 1
      EIP is at ieee80211_report_used_skb+0x11/0x3e0 [mac80211]
      EAX: e5900da0 EBX: 00000000 ECX: 00000001 EDX: 00000000
      ESI: e41d00c0 EDI: e5900da0 EBP: ebe458e4 ESP: ebe458b0
       DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
      CR0: 8005003b CR2: 000000b0 CR3: 25a78000 CR4: 000407d0
      DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
      DR6: ffff0ff0 DR7: 00000400
      Process iperf (pid: 3934, ti=ebe44000 task=e757c0b0 task.ti=ebe44000)
      iwlwifi 0000:02:00.0: I iwl_pcie_enqueue_hcmd Sending command LQ_CMD (#4e), seq: 0x0903, 92 bytes at 3[3]:9
      Stack:
       e403b32c ebe458c4 00200002 00200286 e403b338 ebe458cc c10960bb e5900da0
       ff76a6ec ebe458d8 00000000 e41d00c0 e5900da0 ebe458f0 ff6f1b75 e403b210
       ebe4598c ff723dc1 00000000 ff76a6ec e597c978 e403b758 00000002 00000002
      Call Trace:
       [<ff6f1b75>] ieee80211_free_txskb+0x15/0x20 [mac80211]
       [<ff723dc1>] invoke_tx_handlers+0x1661/0x1780 [mac80211]
       [<ff7248a5>] ieee80211_tx+0x75/0x100 [mac80211]
       [<ff7249bf>] ieee80211_xmit+0x8f/0xc0 [mac80211]
       [<ff72550e>] ieee80211_subif_start_xmit+0x4fe/0xe20 [mac80211]
       [<c149ef70>] dev_hard_start_xmit+0x450/0x950
       [<c14b9aa9>] sch_direct_xmit+0xa9/0x250
       [<c14b9c9b>] __qdisc_run+0x4b/0x150
       [<c149f732>] dev_queue_xmit+0x2c2/0xca0
      
      Cc: stable@vger.kernel.org
      Reported-by: Yaara Rozenblum <yaara.rozenblum@intel.com>
      Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
      Reviewed-by: Stanislaw Gruszka <sgruszka@redhat.com>
      [reword commit log, use a separate lock]
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      1d147bfa
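      The locking idea in this commit can be sketched with a POSIX mutex in
      user space. This is a model under assumptions, not the mac80211 code:
      struct sta, ps_lock, tx_frame() and sta_wakeup() are hypothetical names.
      What it shows is that the "is the station asleep?" check and the
      enqueue are done under the same lock as the wakeup flush, so a frame can
      no longer be added to a queue that has just been emptied.

```c
#include <pthread.h>
#include <stdbool.h>

struct sta {
    pthread_mutex_t ps_lock;  /* hypothetical lock shared by both paths */
    bool asleep;
    int buffered;             /* frames queued while the station sleeps */
    int sent;                 /* frames handed to the hardware */
};

/* TX path: the sleep check and the enqueue happen under one lock. */
void tx_frame(struct sta *s)
{
    pthread_mutex_lock(&s->ps_lock);
    if (s->asleep)
        s->buffered++;
    else
        s->sent++;
    pthread_mutex_unlock(&s->ps_lock);
}

/* RX path: mark the station awake and flush the queue under the same lock. */
void sta_wakeup(struct sta *s)
{
    pthread_mutex_lock(&s->ps_lock);
    s->asleep = false;
    s->sent += s->buffered;
    s->buffered = 0;
    pthread_mutex_unlock(&s->ps_lock);
}
```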
    • D
      ip_tunnel: Move ip_tunnel_get_stats64 into ip_tunnel_core.c · ebe44f35
      David S. Miller committed
      net/built-in.o:(.rodata+0x1707c): undefined reference to `ip_tunnel_get_stats64'
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ebe44f35
  12. 19 Feb, 2014 3 commits
  13. 18 Feb, 2014 2 commits
    • P
      netfilter: ctnetlink: force null nat binding on insert · 0eba801b
      Pablo Neira Ayuso committed
      Quoting Andrey Vagin:
        When a conntrack is created by the kernel, it is initialized (sets
        IPS_{DST,SRC}_NAT_DONE_BIT bits in nf_nat_setup_info) and only then it
        is added in hashes (__nf_conntrack_hash_insert), so one conntrack
        can't be initialized from a few threads concurrently.
      
        ctnetlink can add an uninitialized conntrack (w/o
        IPS_{DST,SRC}_NAT_DONE_BIT) in hashes, then a few threads can look up
        this conntrack and start initializing it concurrently. It's dangerous,
        because BUG can be triggered from nf_nat_setup_info.
      
      Fix this race by always setting up nat before inserting the ct into
      the hash table, even if no CTA_NAT_ attribute was requested. In the
      absence of a CTA_NAT_ attribute, a null binding is created.
      
      This alters current behaviour: Before this patch, the first packet
      matching the newly injected conntrack would be run through the nat
      table since nf_nat_initialized() returns false.  IOW, this forces
      ctnetlink users to specify the desired nat transformation on ct
      creation time.
      
      Thanks to Florian Westphal; this patch is based on his original
      patch to address this problem, including this patch description.
      Reported-By: Andrey Vagin <avagin@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Florian Westphal <fw@strlen.de>
      0eba801b
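      The ordering this fix enforces can be shown in a reduced model. The
      status bits, structs and functions below are hypothetical stand-ins
      (they are not the netfilter API); the sketch only illustrates that NAT
      setup, possibly as a null binding, is completed before the entry becomes
      visible to other threads.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status bits standing in for IPS_{SRC,DST}_NAT_DONE_BIT. */
#define SRC_NAT_DONE (1u << 0)
#define DST_NAT_DONE (1u << 1)

struct nat_range { unsigned int new_addr; };

struct conntrack {
    unsigned int status;
    unsigned int nat_addr;   /* 0 means "no translation" in this model */
    bool hashed;             /* true once other threads can find the entry */
};

/* Apply the requested mapping, or a null binding when range is NULL. */
void nat_setup(struct conntrack *ct, const struct nat_range *range)
{
    if (range)
        ct->nat_addr = range->new_addr;
    ct->status |= SRC_NAT_DONE | DST_NAT_DONE;
}

bool nat_initialized(const struct conntrack *ct)
{
    return (ct->status & (SRC_NAT_DONE | DST_NAT_DONE)) ==
           (SRC_NAT_DONE | DST_NAT_DONE);
}

/* Insert from user space: NAT is always set up before the entry is visible. */
void ct_insert(struct conntrack *ct, const struct nat_range *range)
{
    nat_setup(ct, range);    /* range == NULL gives a null binding */
    ct->hashed = true;       /* only now can lookups reach the entry */
}
```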
    • D
      ipv4: fix counter in_slow_tot · a6254864
      Duan Jiong committed
      Since commit 89aef892 ("ipv4: Delete routing cache."), the counter
      in_slow_tot has not worked correctly.
      
      The counter in_slow_tot is increased by one when fib_lookup() returns
      successfully in ip_route_input_slow(), but the dst struct may not actually be
      created and cached, so increase in_slow_tot only after the dst struct is created.
      Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6254864
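      The counter placement described above can be sketched as follows. This
      is a model, not ip_route_input_slow(): the function name, the struct and
      the two boolean parameters are hypothetical knobs that simulate a
      successful fib_lookup() and a successful dst allocation.

```c
#include <stdbool.h>
#include <stddef.h>

struct dst { int id; };

unsigned long in_slow_tot;   /* model of the statistics counter */

/*
 * Model of the slow input path. lookup_ok and alloc_ok are hypothetical
 * knobs that simulate the route lookup and the dst allocation succeeding.
 */
struct dst *route_input_slow(bool lookup_ok, bool alloc_ok, struct dst *storage)
{
    if (!lookup_ok)
        return NULL;     /* no route: nothing to count */

    if (!alloc_ok)
        return NULL;     /* lookup succeeded but no dst was created */

    in_slow_tot++;       /* count only once the dst exists */
    return storage;
}
```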